t-SNE is a plotting, not a clustering, algorithm

Table of Contents

  1. SNE
  2. t-SNE
  3. Drawbacks
  4. Code

Recently, it seems that t-SNE plots have become all the rage in bioinformatics. The plots that result from this technique are admittedly beautiful, but due to their novelty in the field, very few people know what this tool does.

I decided to write some code to learn about the implementation of the t-distributed Stochastic Neighbor Embedding algorithm. There's a link to the Jupyter notebook at the bottom.

So, what is t-SNE? In a nutshell, t-SNE is an algorithm that takes points that are close together in high-dimensional space and places them close together in low-dimensional space, or at least tries hard to. To understand t-SNE, I first need to explain how plain SNE works.

SNE

How does it work? Standard SNE works by assuming that each data point i (henceforth, node) can see another node j roughly if the distance between them is σ_i or less and if j is closer to i than the average node k is, in which case i calls j a neighbor. For these two data points, the algorithm tries (struggles; this is a hard problem) to find coordinates in low-dimensional space that yield a similar probability of i calling j its neighbor. Notice that in this version of SNE, if i calls j a neighbor, it does not follow that j calls i a neighbor.
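
As a rough sketch of this neighborhood probability in code (the function name and the per-node sigmas argument are my own; the notebook linked at the bottom has the full implementation), a Gaussian kernel over pairwise distances gives each node i its own conditional distribution p(j|i):

```python
import numpy as np

def neighbor_probabilities(X, sigmas):
    """p(j|i): the probability that node i calls node j a neighbor.
    X is (n, d); sigmas holds each node's bandwidth sigma_i."""
    # squared Euclidean distances between all pairs of points
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian kernel with a per-row bandwidth, so rows are NOT symmetric
    P = np.exp(-sq_dists / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)           # a node is not its own neighbor
    P /= P.sum(axis=1, keepdims=True)  # each row sums to one
    return P
```

Because σ_i differs from node to node, P[i, j] generally differs from P[j, i], which is exactly the asymmetry described above.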

That last point seems obscure and technical, but it is important. It means that points are placed on the low-dimensional map not by a similarity measurement but by information content. By contrast, the most common clustering and dimensionality-reduction methods, such as NMF, K-means, and PCA, all work on what are known as metric spaces, where a distance means the same thing regardless of whether you measure it from j to i or from i to j. In this sense, SNE is not a clustering algorithm as we usually think of clustering.

t-SNE

Symmetrization of the neighborhood functions

t-SNE makes two modifications to the SNE algorithm. The first is that it attempts to deal with the symmetry problem in high- and low-dimensional spaces.

The first part of the solution to the symmetry problem in both dimensions is to redefine how two nodes are called neighbors. Instead of asking whether a given node j is closer than the average node to node i, the algorithm asks whether i and j are closer than the average pair of nodes in the space. For low-dimensional space, this fully solves the problem.

The authors of t-SNE claim that the symmetry problem is slightly more complicated in high-dimensional space, because large distances can become extremely large very quickly. This is problematic because the cost function becomes insensitive to whatever solution you may propose in low-dimensional space. Their fix is to add the probability that i calls j a neighbor, p(j|i), to the converse, p(i|j), and renormalize so that the probabilities still sum to one. Functionally, this gives the probabilities a minimal bound that doesn't break the cost function. Personally, I find this correction strange, because probabilities do not typically add but multiply, so the meaning of the renormalized probabilities is unclear to me. The solution is elegant, but it may be nothing more than a numerical trick to render the algorithm more sensitive to outliers.
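
A minimal sketch of that symmetrization step, reusing the conditional matrix from the SNE sketch above:

```python
def symmetrize(P_cond):
    """Joint probabilities p_ij = (p(j|i) + p(i|j)) / (2n).
    Dividing by 2n makes the whole matrix sum to one, so even an
    outlier i keeps sum_j p_ij >= 1 / (2n) and cannot vanish from
    the cost function."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)
```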

Student-t distributed neighbors in low-dimensional space exaggerate distances

The final modification required to go from SNE to t-SNE is to change the neighborhood function in low-dimensional space from a normal distribution to a Student-t distribution with a single degree of freedom. Because the Student-t has more density in the tails than the normal distribution (beyond a specific cutoff point), the low-dimensional embedding looks different. In general, points that are closer to each other than this cutoff will appear closer together on the map, whereas points that are farther apart will be driven even farther apart as they are projected into two dimensions, at least for small perplexity values. For large perplexity values, points that are closest together can be pushed apart, points at intermediate distances can be pulled in, and points that are far apart can be pushed out again.
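
For comparison with the Gaussian kernel above, here is a sketch of the Student-t similarities used in low-dimensional space (again, the function name is mine). Note that there is no bandwidth and a single normalization over all pairs, so the matrix is symmetric by construction:

```python
def map_similarities(Y):
    """q_ij from a Student-t kernel with one degree of freedom:
    q_ij is proportional to (1 + ||y_i - y_j||^2)^-1."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)  # heavy-tailed kernel, no bandwidth
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()        # one normalization for the whole matrix
```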

Drawbacks of t-SNE

t-SNE is rapidly emerging as a popular tool with which to visualize RNA-seq data. It generates beautiful plots with intriguing shapes. However, as a clustering method, t-SNE has several drawbacks.

  1. It is a non-linear method with a fairly challenging gradient. Choosing good parameters for the descent seems important. This technical point will probably become much easier over time (it’s already a tractable problem).
  2. The user specifies a perplexity value, and so far there is no principled way to choose it (a sketch of how the perplexity is turned into bandwidths appears after this list). Set the perplexity too low, and nothing has neighbors. Set it too high, and everything is a neighbor of everything else. Thankfully, since the perplexity affects the variances in log-space, this parameter isn't terribly sensitive.
  3. There is no concept of distance in t-SNE. This appears to me to be a major drawback, and one without a solution. The low-dimensional representation is achieved by taking nodes with different eyesight (recall that the average eyesight is set by the perplexity) and creating a map that acts as if everyone had the same eyesight. On top of that, because of the Student-t distribution, a map unit between close points does not mean the same thing everywhere: it depends on the intersection point between the Normal and Student-t distributions, which in turn depends on the perplexity.
  4. The maps need to be re-computed every single time, because t-SNE does not generate a basis onto which to project the points. It embeds points into a coordinate system in whatever way satisfies its gradient descent. At present we have no way of reconstructing a given map or placing new points onto an existing one.
  5. t-SNE cannot handle points of arbitrary dimension. Instead, people usually feed it the first n principal components of their data. This means that the coordinates going into t-SNE are already a) whitened through PCA and b) linear combinations of the original variables.
  6. There is no test for convergence, so people run the algorithm as long as they want.
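
To make drawback 2 concrete, here is a sketch of how implementations typically turn a user-supplied perplexity into each node's bandwidth: a binary search over beta_i = 1/(2 σ_i^2) until the entropy of p(.|i) matches log(perplexity). The details (tolerances, searching over beta rather than sigma) follow common reference implementations, not necessarily the notebook's exact code:

```python
import numpy as np

def beta_for_perplexity(sq_dists_i, perplexity, tol=1e-5, max_iter=50):
    """Binary-search beta_i = 1 / (2 sigma_i^2) so that the Shannon
    entropy of p(.|i) matches log(perplexity). sq_dists_i holds node
    i's squared distances to every other node."""
    beta, lo, hi = 1.0, 0.0, np.inf
    target = np.log(perplexity)
    for _ in range(max_iter):
        w = np.exp(-beta * sq_dists_i)  # unnormalized p(.|i)
        sum_w = w.sum()
        # entropy of p(.|i): H = log(sum_w) + beta * E_p[d^2]
        H = np.log(sum_w) + beta * (sq_dists_i * w).sum() / sum_w
        if abs(H - target) < tol:
            break
        if H > target:   # distribution too flat: sharpen the kernel
            lo = beta
            beta = beta * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
        else:            # distribution too peaked: flatten the kernel
            hi = beta
            beta = (lo + hi) / 2.0
    return beta
```

Because the entropy changes with the logarithm of sigma, even a rough search lands close to the target, which is why the perplexity is as forgiving as drawback 2 suggests.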

Although t-SNE is a neat algorithm that does have some clustering properties, I hesitate to assign it any meaning beyond its use as a pretty plotting method.

An implementation of t-SNE with comments:

See the Jupyter Notebook