Here I would like to explain in a little more detail the similarity map found in the books area of the website. How it relates to infovis is quite simple: I have information about the books that I want to visualize. After the feature-definition phase, the construction of the similarity map can be generalized to other data.
First of all, it might have been obvious already, but I’ve only included books that I’ve recently read or re-read (so unless I re-read Harry Potter and …, or any of A Song of Ice and Fire, they will not be in the list). Also, the values for each of the books are subjective, i.e. affected to a large extent by my own biases.
Now that I have tempered the expectations, let’s continue. Simply put, I wanted to explore what I’ve read. I tend to get into genre phases, i.e. non-fiction, economy, etc., and for a while only those books get read. The problem is that my reading gets very biased towards some genres and writers (which is nice, as you know what you are getting and most likely it will be a good book too), but it doesn’t foster any growth, or any contact with, if not conflicting theories, at least thoughts that haven’t been very present in everyday life.
Taking all that into account, the first step (once I know what I want to do with the data) is in a way the most difficult: deciding what to take into account. I detail below the features I’ve selected, with some reasons for the non-obvious ones.
English Name: It goes without saying that not all books are written in English, but I’ll assume that you, the reader, are fluent in English. The original title of some books, even well-known ones such as Братья Карамазовы, might be difficult to understand unless you speak the language. In case of multiple English titles, e.g. when different ones are used in different countries, I’ll use the one from the country of origin of the writer.
Original Name: Quite straightforward. The title of the book in the original language.
Author(s): In works that rely on interviews, such as “Solito, Solita” or “On Palestine”, the main editor of the interviews or the main interviewee is used as the author. Multiple authors are allowed.
Country of origin: The country of origin of the author(s). If there are multiple authors from different countries and no primary author can be identified, then I’ll use the country of the publishing house.
Gender: Some books, such as “Conundrum”, “A Transgender History”, and “Man Alive”, have been written post-transition. Using the sex might not be appropriate on these occasions.
Type: Fiction or non-fiction; sometimes this is qualified as a genre. However, using it as a genre feels too much like a catch-all.
Date Published: Date of first publication. In case the book was published as a serial or in different volumes, such as “Vanity Fair”, I’ll use the last publication date. Also, in case I cannot find the precise date, I’ll default to the 1st of the publishing month, or the 1st day of the year if only the year is defined.
Genre: The main category. This is quite a difficult categorization IMHO, as several books do not fall neatly into one category, and, looking especially at non-fiction, the genre may be too broad.
Topics: A list of topics that are touched on in some way or another. Going back to the satire “Vanity Fair”: it can also be considered a book on feminism.
Pages: Length. A better measure would be the number of words (even when languages allow the creation of compound words). Regretfully, it is not such an easy value to find (it would require more effort than I’m willing to give).
Rating: The overall rating I give to the book.
Comment: Not a full review, but just a comment or two about the book. It might be anywhere from two sentences to several paragraphs long.
Intellectual: Whether the topic is intellectually exhausting, i.e. do I have to turn on my brain to read this book? A low value doesn’t mean that the book is stupidly written, but that one can read it even after an exhausting day at work.
Emotional: Whether it leaves a lasting emotional impression. A very emotional book can be a difficult read, deal with trauma, or motivate you towards something, e.g. “Half the Sky”.
Seriousness: This is a matter of style. An emotional book may not be that serious, such as “Perfect Sound Whatever”, where depression is discussed. “What If” may talk about physics, but it won’t put a frown on your face.
Engrossing: It happens that a book might have an interesting topic, the overall story is good, and there is some true growth in its characters… but it might just stall somewhere in the middle. Maybe it is trying to create more setup? Maybe it just ran out of ideas? Or maybe that is simply the style of the times? Whatever the reason, a non-engrossing book gets read in chunks of less than one hour.
In order to compute the similarity map, I apply a projection method called Multidimensional Scaling (MDS). There’s a lot written about it, so I won’t cover it here, but it is important to know a couple of things. First, let us consider each feature described above as a dimension. We therefore have a multidimensional sample (the book). Looking at all dimensions at the same time is next to impossible without human-hardware updates.
The aim of projection methods is to reduce the original number of dimensions to a lower one that can be better examined without much loss of information. Different projection methods focus on different things; MDS tries to preserve the distances between elements in the lower-dimensional space.
How do we define the distance, or dissimilarity, between two books? For features that are numerical by nature this is rather straightforward, and most visualization algorithms are focused on numerical values. However, features such as genre have no ordering unless you take personal preference into account. But even then, how would you do certain mathematical operations? Historical Novel - Coming-of-age = ? These variables are called categorical variables. A subset of categorical variables do have an order; we refer to these as ordinal variables. Regretfully, even with ordinal variables these mathematical operations are not well defined.
For numerical variables, I’ll use the Euclidean distance and assume that the variables are normalized.
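As a sketch of what this means in practice (a minimal Python example; the site itself uses R, and the page counts and ratings below are made up for illustration), each numerical feature is min-max scaled to [0, 1] and then the usual Euclidean distance is taken:

```python
import math

def normalize(column):
    """Min-max scale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# hypothetical books described by (pages, rating), scaled per dimension
pages  = normalize([180, 350, 1200])
rating = normalize([4.0, 3.5, 5.0])
books  = list(zip(pages, rating))

d01 = euclidean(books[0], books[1])  # distance between book 0 and book 1
d02 = euclidean(books[0], books[2])  # distance between book 0 and book 2
```

Without the normalization step, the pages dimension would dominate the distance simply because its raw values are orders of magnitude larger than the ratings.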
The feature topics is defined as a set of words, e.g. the topics for “Split Tooth” by Tanya Tagaq can be considered to be Spiritual, Nature, Surreal, and Motherhood. There are several ways to measure the distance from this book to another. We could, for example, look at the semantics of the words: try to understand the meaning of each word in the topics and how they relate to one another by looking at a corpus of sentences and seeing how they are connected.
But that would be overkill for what I want to accomplish, so I decided to go for a simple distance. In general, a similarity between two sets can be computed as the ratio of the number of elements in the intersection over the number of elements in the union of the sets (subtracting it from one gives a distance). If they have no elements in common, the intersection will be empty, and if and only if all the elements are the same will the size of the intersection equal the size of the union. We then have a very nicely defined metric whose values are in the range 0..1.
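This intersection-over-union ratio is the Jaccard index, and one minus it is the corresponding distance. A minimal Python sketch (the second book’s topic set is hypothetical):

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |intersection| / |union|: 0 for identical sets, 1 for disjoint ones."""
    if not a and not b:
        return 0.0  # two empty topic lists are treated as identical
    return 1.0 - len(a & b) / len(a | b)

split_tooth = {"Spiritual", "Nature", "Surreal", "Motherhood"}
other_book  = {"Nature", "Coming-of-age"}  # hypothetical topics

# intersection = {"Nature"} (1 element), union has 5 elements
d = jaccard_distance(split_tooth, other_book)  # 1 - 1/5 = 0.8
```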
There are different metrics that can be applied to categorical variables. You are most likely familiar with the overlap measure, i.e., as a measure of distance, 0 if the categorical values are the same and 1 otherwise. The issue arises when the categorical values are skewed in some direction: a large number of samples sharing a single categorical value may not give much information gain. It might still happen, we cannot dismiss that; we can only be certain once we examine the rest of the variables and how they behave together, checking for interaction and confounding effects.
Below I’ll describe a few of the measures used for categorical distance and see their effects on similarity maps. Assume we are comparing two samples \(X_i\) and \(X_j\) in the categorical dimension \(k\). Then the value of each sample at that dimension is \(X_{ik}\) and \(X_{jk}\), and being the same is defined as \(X_{ik}=X_{jk}\).
Overlap: 1 if the values are the same, 0 otherwise.
Eskin: 1 if the values are the same, \(\frac{n^2_k}{n^2_k+2}\) otherwise, where \(n_k\) is the number of values the categorical variable can take.
Goodall: 1 if the values are the same, \(\frac{1}{1 + \log(f(X_{ik}))\times \log(f(X_{jk}))}\) otherwise, where \(f()\) is the number of occurrences of that value.
Lin: \(2\log(p(X_{ik}))\) if the values are the same, \(2\log(p(X_{ik}) + p(X_{jk}))\) otherwise, where \(p()\) is the relative frequency of that value.
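The four measures can be sketched in Python as follows. This is my own straightforward reading of the formulas above (conventions vary slightly between references): \(f\) counts occurrences within the column being compared, \(p\) is the relative frequency, and natural logarithms are assumed.

```python
import math
from collections import Counter

def overlap(x, y):
    """1 if the values match, 0 otherwise."""
    return 1.0 if x == y else 0.0

def eskin(x, y, column):
    """Mismatch similarity depends on how many categories the variable can take."""
    nk = len(set(column))
    return 1.0 if x == y else nk ** 2 / (nk ** 2 + 2)

def goodall(x, y, column):
    """Mismatch similarity weighted by the log-frequencies of the two values."""
    f = Counter(column)
    if x == y:
        return 1.0
    return 1.0 / (1.0 + math.log(f[x]) * math.log(f[y]))

def lin(x, y, column):
    """Lin's measure; note the raw values are log-probabilities (non-positive)."""
    p = {v: c / len(column) for v, c in Counter(column).items()}
    if x == y:
        return 2 * math.log(p[x])
    return 2 * math.log(p[x] + p[y])

column = ["male", "male", "female", "female"]  # toy categorical column
s = eskin("male", "female", column)            # nk = 2, so 4 / (4 + 2) = 2/3
```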
Now that we have some of these measures defined, let’s look at concrete examples of their effects by examining the Titanic dataset. In order to simplify the analysis, let’s focus on the complete cases, a handful of categorical variables, namely gender, class, embarked, country, and survived, and a handful of cases.
titanic <- titanic[complete.cases(titanic),c("gender","class","embarked","country","survived")]
titanic <- titanic[1:20,]
kable(head(titanic,10))
| gender | class | embarked | country | survived |
|---|---|---|---|---|
| male | 3rd | Southampton | United States | no |
| male | 3rd | Southampton | United States | no |
| male | 3rd | Southampton | United States | no |
| female | 3rd | Southampton | England | yes |
| female | 3rd | Southampton | Norway | yes |
| male | 3rd | Southampton | United States | yes |
| male | 2nd | Cherbourg | France | no |
| female | 2nd | Cherbourg | France | yes |
| male | 3rd | Cherbourg | Lebanon | yes |
| male | 3rd | Southampton | Finland | yes |
Let’s compute the distances:
dis_overlap <- sm(titanic)
dis_eskin <- eskin(titanic)
dis_goodall <- good1(titanic)
dis_lin <- lin(titanic)
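If you don’t use R, the overlap (simple matching) distance is easy to reproduce by hand: it is just the fraction of columns on which two rows disagree. A Python sketch on three of the rows shown above:

```python
def overlap_distance(row_a, row_b):
    """Fraction of categorical columns on which two rows disagree."""
    mismatches = sum(a != b for a, b in zip(row_a, row_b))
    return mismatches / len(row_a)

# three rows from the table: (gender, class, embarked, country, survived)
rows = [
    ("male",   "3rd", "Southampton", "United States", "no"),
    ("female", "3rd", "Southampton", "England",       "yes"),
    ("male",   "2nd", "Cherbourg",   "France",        "no"),
]

# full pairwise distance matrix
dist = [[overlap_distance(a, b) for b in rows] for a in rows]
```

Rows 0 and 1 disagree on gender, country, and survived, so their distance is 3/5; rows 1 and 2 disagree on every column, so their distance is 1.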
The function cmdscale in R computes the metric MDS for us once we give it the distance matrix. Below you can observe the same projection method applied to the four measures defined above. Do not focus too much on the axes: for MDS and non-linear projection methods the axes cannot be easily interpreted. It is more interesting to focus on the relative placement of the elements.
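Under the hood, classical MDS (what cmdscale computes) is an eigendecomposition of the double-centred squared-distance matrix. A minimal NumPy sketch of the same algorithm; the four collinear points are illustrative, chosen so that a 1-D embedding recovers the distances exactly:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (metric) MDS, the algorithm behind R's cmdscale.

    D: (n, n) symmetric matrix of pairwise distances.
    Returns an (n, k) coordinate matrix whose pairwise Euclidean
    distances approximate D.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centring matrix
    B = -0.5 * J @ (D ** 2) @ J             # double-centred Gram matrix
    eigval, eigvec = np.linalg.eigh(B)      # ascending eigenvalues
    order = np.argsort(eigval)[::-1][:k]    # keep the k largest
    scale = np.sqrt(np.clip(eigval[order], 0, None))
    return eigvec[:, order] * scale

# four points on a line at 0, 1, 3, 6: their distance matrix
pts = [0.0, 1.0, 3.0, 6.0]
D = np.abs(np.subtract.outer(pts, pts))
coords = classical_mds(D, k=1)  # 1-D embedding, exact up to sign/translation
```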
This was just a small glimpse at how the similarity map was created, so that at least some insight can be given into why it behaves the way it does. I’ve used simple matching purely for simplicity’s sake, but that may change as I add more books to the data set. In other posts I’ll go into more detail on the measures above; here I just wanted to show that different measures exist and how they might affect the plots we see (and their hidden biases).