Here I would like to explain in a little more detail the similarity map found in the books area of the website. Its relation to infovis is simple: I have information about books that I want to visualize. Once the features are defined, the construction of the similarity map generalizes to other data.

First of all, and this may already be obvious, I’ve only included books that I’ve recently read or re-read. So unless I re-read Harry Potter and …, or any of A Song of Ice and Fire, they will not be in the list. Also, the values for each book are subjective, i.e. affected to a large extent by my own biases.

Now that I have tempered expectations, let’s continue. Simply put, I wanted to explore what I’ve read. I tend to go through genre phases (non-fiction, economics, etc.) and for a while I read only those books. The problem is that my reading becomes heavily biased towards a few genres and writers. That is nice, since you know what you are getting and it will most likely be a good book too, but it doesn’t foster growth or bring me into contact with, if not conflicting theories, at least ideas that haven’t been very present in my everyday life.

With all that in mind, the first step (once I know what I want to do with the data) is in a way the most difficult: deciding which features to take into account. Below I detail the features I’ve selected and, for the non-obvious ones, some of the reasons why.

Selected features

Similarity Map

In order to compute the similarity map, I apply a projection method called Multidimensional Scaling (MDS). There’s a lot written about it so I won’t cover it here. But it is important to know a couple of things. First, let us consider each feature described above as a dimension. Each book is therefore a multidimensional sample. Looking at all dimensions at the same time is next to impossible without human-hardware upgrades.

The aim of projection methods is to reduce the original number of dimensions to a lower one that can be examined more easily without much loss of information. Different projection methods focus on different things. MDS tries to preserve the pairwise distances between elements when placing them in the lower-dimensional space.
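As a rough illustration of what "keeping the distances" means, here is a minimal sketch with made-up numeric data (the matrix X is purely hypothetical and has nothing to do with the books). It projects four dimensions down to two with base R's cmdscale and then checks how closely the projected distances follow the original ones.

```r
# Toy data: 5 samples described by 4 numeric features (values are made up).
X <- matrix(c(1, 2, 0, 1,
              2, 1, 1, 0,
              8, 9, 7, 8,
              9, 8, 8, 9,
              5, 5, 4, 5),
            ncol = 4, byrow = TRUE)

d_orig <- dist(X)                   # pairwise Euclidean distances in 4 dimensions
coords <- cmdscale(d_orig, k = 2)   # classical (metric) MDS down to 2 dimensions
d_proj <- dist(coords)              # distances between the projected points

# Correlation between original and projected distances:
# values close to 1 mean the projection distorted the distances very little.
preservation <- cor(as.vector(d_orig), as.vector(d_proj))
print(preservation)
```

A single correlation is only one crude way to judge the projection; it is used here because it fits in one line.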

Defining distance

How do we define the distance or dissimilarity between two books? For features that are numerical by nature this is rather straightforward, and most visualization algorithms focus on numerical values. However, features such as genre have no ordering unless you take personal preference into account. And even then, how would you carry out certain mathematical operations? Historical Novel - Coming-of-age = ? Such variables are called categorical variables. A subset of categorical variables does have an order; we refer to these as ordinal variables. Regrettably, even for ordinal variables these mathematical operations are not well defined.

For numerical variables, I’ll use Euclidean distance and assume that the variables are normalized.
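A minimal sketch of that step, assuming min-max scaling as the normalization (the text does not say which normalization is meant, so this is only one option). The feature names and values below are hypothetical.

```r
# Hypothetical numeric features for three books (values are made up).
books <- data.frame(pages = c(200, 400, 800),
                    year  = c(1950, 2000, 2020),
                    row.names = c("A", "B", "C"))

# Min-max normalization (an assumption): rescale every column to [0, 1]
# so that no feature dominates the distance just because of its units.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
books_norm <- data.frame(lapply(books, normalize), row.names = rownames(books))

# Pairwise Euclidean distances between the normalized rows
d <- as.matrix(dist(books_norm, method = "euclidean"))
print(round(d, 3))
```

Without the normalization, the page count (hundreds) would swamp any feature measured on a smaller scale.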

Set Distance

The feature topic is defined as a set of words, e.g. the topics for “Split Tooth” by Tanya Tagaq can be considered to be Spiritual, Nature, Surreal, and Motherhood. There are several ways to measure the distance from this book to another book. We could, for example, look at the semantics of the words: try to understand the meaning of each word in the topics and how the words relate to one another by looking at a corpus of sentences and seeing how they are connected.

But that would be overkill for what I want to accomplish, so I decided to go for a simple distance. In general, the similarity between two sets can be computed as the ratio of the number of elements in their intersection to the number of elements in their union (the Jaccard index), and the distance is one minus that ratio. If the sets have no elements in common, the intersection is empty and the distance is 1; the distance is 0 if and only if all the elements are the same, because only then is the intersection as large as the union. We then have a very nicely defined metric whose values are in the range 0..1.
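The set distance can be sketched in a few lines. This is an illustrative helper (jaccard_distance is a name made up here), written as one minus the intersection-over-union ratio so that identical sets end up at distance 0. The topics of the second book are hypothetical.

```r
# Jaccard distance between two sets given as character vectors
jaccard_distance <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  n_union <- length(union(a, b))
  if (n_union == 0) return(0)   # two empty sets: treated as identical (a convention)
  1 - length(intersect(a, b)) / n_union
}

split_tooth <- c("Spiritual", "Nature", "Surreal", "Motherhood")
other_book  <- c("Nature", "Surreal", "Travel")   # hypothetical topics

# 2 shared topics out of 5 distinct ones: 1 - 2/5 = 0.6
print(jaccard_distance(split_tooth, other_book))
```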

Categorical Distance

There are different metrics that can be applied to categorical variables. You are most likely familiar with the overlap measure: as a distance, it is 0 if the categorical values are the same and 1 otherwise. The issue arises when the distribution of categorical values is skewed. A value shared by a large share of the samples may carry little information, so a match on it tells us less than a match on a rare value. It might still be informative; we cannot dismiss that. We can only be certain once we examine the rest of the variables and how they behave together, looking for interactions and confounding effects.

Below I’ll describe a few of the measures used for categorical data and look at their effects on similarity maps. Each one is stated as a similarity between two values, from which the corresponding distance is derived. Assume we are comparing two samples \(X_i\) and \(X_j\) in the categorical dimension \(k\). Then the value of each sample at that dimension is \(X_{ik}\) and \(X_{jk}\). Being the same is then defined as \(X_{ik}=X_{jk}\).

  • Overlap: 1 if the values are the same, 0 otherwise.

  • Eskin: 1 if the values are the same, \(\frac{n^2_k}{n^2_k+2}\) otherwise, where \(n_k\) is the number of distinct values the categorical variable can take.

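  • Goodall (the first variant, which is what good1 computes): \(1-\sum_{q \in Q} p^2_k(q)\) if the values are the same, 0 otherwise, where \(Q\) is the set of values of dimension \(k\) that occur at most as often as the matched value and \(p^2_k(q)=\frac{f_k(q)(f_k(q)-1)}{N(N-1)}\), with \(f_k(q)\) the number of occurrences of \(q\) and \(N\) the number of samples. Matches on rare values therefore count more than matches on frequent ones.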

  • Lin: \(2\log(p(X_{ik}))\) if the values are the same, \(2\log(p(X_{ik}) + p(X_{jk}))\) otherwise, where \(p()\) is the relative frequency of that value in dimension \(k\).
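To make the definitions in the list above concrete, here is a small base R sketch of three of them for a single categorical dimension. The function names are made up for this illustration and are not the nomclust implementations. Goodall is left out to keep the sketch short, and the Lin values are the raw per-dimension terms, before any rescaling across dimensions.

```r
# Per-dimension similarity between two values x and y of a categorical column `col`

overlap_sim <- function(x, y) as.numeric(x == y)

eskin_sim <- function(x, y, col) {
  n_k <- length(unique(col))          # number of distinct values in the column
  if (x == y) 1 else n_k^2 / (n_k^2 + 2)
}

lin_sim <- function(x, y, col) {
  p <- function(v) mean(col == v)     # relative frequency of value v
  if (x == y) 2 * log(p(x)) else 2 * log(p(x) + p(y))
}

# A skewed toy column: three passengers from Southampton, one from Cherbourg
embarked <- c("Southampton", "Southampton", "Southampton", "Cherbourg")

overlap_sim("Southampton", "Cherbourg")           # mismatch
eskin_sim("Southampton", "Cherbourg", embarked)   # mismatch, n_k = 2
lin_sim("Southampton", "Southampton", embarked)   # match on the frequent value
lin_sim("Cherbourg", "Cherbourg", embarked)       # match on the rare value
```

The last two calls show how Lin reacts to skew, which overlap ignores entirely: the two matches get different scores depending on how frequent the matched value is.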

Effect of different distances in the Map

Now that we have some of these measures defined, let’s look at concrete examples of their effects by examining the Titanic dataset. To simplify the analysis, let’s focus on the complete cases, on a handful of categorical variables (gender, class, embarked, country and survived), and on a handful of rows.

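library(knitr)  # provides kable()
# `titanic` is assumed to be loaded already as a data frame with these columns
titanic <- titanic[complete.cases(titanic), c("gender", "class", "embarked", "country", "survived")]
titanic <- titanic[1:20, ]  # keep a handful of rows
kable(head(titanic, 10))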
| gender | class | embarked | country | survived |
|--------|-------|-------------|---------------|----------|
| male | 3rd | Southampton | United States | no |
| male | 3rd | Southampton | United States | no |
| male | 3rd | Southampton | United States | no |
| female | 3rd | Southampton | England | yes |
| female | 3rd | Southampton | Norway | yes |
| male | 3rd | Southampton | United States | yes |
| male | 2nd | Cherbourg | France | no |
| female | 2nd | Cherbourg | France | yes |
| male | 3rd | Cherbourg | Lebanon | yes |
| male | 3rd | Southampton | Finland | yes |

Let’s compute the distances

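library(nomclust)  # provides sm(), eskin(), good1() and lin()
dis_overlap <- sm(titanic)     # simple matching (overlap)
dis_eskin <- eskin(titanic)    # Eskin
dis_goodall <- good1(titanic)  # Goodall 1
dis_lin <- lin(titanic)        # Lin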

The function cmdscale in R computes the metric MDS for us once we give it a distance matrix. Below you can observe the same projection method applied to the four measures defined above. Do not focus too much on the axes: for MDS and non-linear projection methods the axes cannot be easily interpreted. It is more informative to look at where the points are placed relative to each other.
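For readers who want to try the projection step without the packages above, here is a minimal base R sketch. It is not the code behind the plots: the rows in toy are typed in by hand for illustration, and the simple matching distance is computed with a plain loop instead of sm().

```r
# A few hand-typed rows with the same kind of categorical columns as the example
toy <- data.frame(
  gender   = c("male", "male", "female", "male", "female"),
  class    = c("3rd", "3rd", "3rd", "2nd", "2nd"),
  embarked = c("Southampton", "Southampton", "Southampton", "Cherbourg", "Cherbourg"),
  survived = c("no", "yes", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Simple matching distance: fraction of variables on which two rows differ
n <- nrow(toy)
d <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- mean(unlist(toy[i, ]) != unlist(toy[j, ]))
  }
}

coords <- cmdscale(as.dist(d), k = 2)   # metric MDS into two dimensions

# Draw row numbers at their projected positions; axes are hidden on purpose,
# since only the relative placement of the points is meaningful.
plot(coords, type = "n", xlab = "", ylab = "", axes = FALSE)
text(coords, labels = seq_len(n))
```

Swapping the loop for a different distance definition is all it takes to see how the choice of measure moves the points around.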

Summary

This was just a small glimpse of how the similarity map was created, to give at least some insight into why it behaves the way it does. I’ve used simple matching purely for its simplicity, but that may change as I add more books to the data set. In other posts I’ll go into more detail on the measures above; here I wanted to point out that different measures exist and that they can affect the plots we see (and their hidden biases).