## Sunday, August 30, 2009

### Spectral Clustering.

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)
I don't like you, and I like them, but maybe if we change, we can get along.
Spectral clustering is not clustering: it's actually a metric modification method. It's another way of capturing things that are close versus things that far, using a more global method.

We saw how correlation clustering could be used to capture both the I don't like you'' and I like them'' aspects of clustering by assigning positive and negative weights to individual element pairs. However, it's not easy to come by such a labelling of pairs. More often than not, all we have is a distance function. We could threshold it, declaring all distances within some parameter r to be positive, and all distances larger to be negative, but this is completely arbitrary and depends on the choice of r.

Let's think of the two-class problem for now: we merely want to divide the points into two clusters. Suppose we were able to guess a function that assigns one label (0) to points in one cluster and another label (1) to points in the other cluster. Then we could write down the cost of the clustering (in a correlation sense) by summing up, over all pairs, the difference

. In fact, just to make sure it's always positive, we'll square it, and thus will sum up the function . We'll also assume a weight function on the edge , so that what we're really summing up is .

This is where things start to get very interesting. It's not hard to see that this can be written in term of a variant of the graph Laplacian, treating the f values as a vector. If we now allow the f values to be reals, rather than 0 or 1, we get a classical eigenvalue problem on the Laplacian, in effect determining the second smallest eigenvalue of the Laplacian.

Once we do that, the corresponding eigenvector gives us an assignment for f. we can either merely take positive or negative values of f to group the data, or we can use the f values as a feature representation and recluster the data into two groups based on that.

This is the key insight of spectral clustering''. We perform a spectral decomposition of the Laplacian, and take the top k eigenvectors. Those vectors give us a k-dimensional feature represntation of the data in an inner product space, and we can now recluster. The real question is: why should this feature representation help us in any way at all ?

Here's my take on what's really going on. We know that the graph Laplacian is a discrete equivalent of the Laplace-Beltrami operator on a manifold, which in turn (via the heat equation) captures the flow'' of material in a manifold. Return for a second to our cut-based view of the distance function. In essence, two points are close if there are many ways of getting from one to the other (i.e the min cut between them is large) and they are far if not. This is a flow-based way of thinking about distances, and in effect, what the Laplacian does is map the original data into a new metric space in which distance is based on flow, rather than just on metric distance.

This intuition can be formalized: if we switch now to the stochastic viewpoint, in which we're doing random walks on the graph, then the commute time between two vertices turns out to be precisely the Euclidean distance (squared: thanks, dsivakumar) between the feature vectors constructed from the eigenvalues of the Laplacian.

... take a deep breath...

There are a number of different concepts swirling around here that are worth keeping track of. The cut-based viewpoint of node similarity in graphs lends itself naturally to a Laplacian-based treatment, which in turn leads us to random walks and the heat equation on manifolds. At a deeper level, the Laplacian of a graph, by virtue of how it acts on functions, tells us something very basic about the structure of the distances involved. For more on this particular aspect, you should check out the discussion 'Can you hear the shape of a drum'.

But it's very important to keep in mind that spectral clustering is not so much a clustering technique as a data transformation method. Although k-means is the clustering algorithm of choice once we get down to the feature representation, other methods could be used as well. Also, as relaxations go, this paricular one (relaxing from the hypercube to the reals), isn't that great: the integrality gap'' can be quite high. It turns out that this doesn't matter terribly though.

As an aside, the idea of using the graph Laplacian to define a stretched'' distance that looks Euclidean is not limited to this problem. The most "famous" version of this idea is the Laplacian eigenmaps work by Belkin and Niyogi in the realm of manifold learning. There's also work by Guibas, Ovsjanikov and Sun that uses similar ideas to create signatures of shapes. The key idea, once again, is that if you're on a manifold of some kind, the Laplacian gives you a handle on the shape of this manifold, as well as how to stretch it out'' with local coordinates to make it look flat (and therefore Euclidean). It's a useful trick to keep in your data analysis toolbox.

For further reading, Ulrich von Luxburg has a great survey on spectral clustering.

## Saturday, August 08, 2009

### Negative-type distances and kernels

This is a story of two constructions that actually end up being essentially the same thing, as I recently discovered.

1. Kernels

The story of kernels in machine learning goes somewhat like this:

Take a positive definite function $$K(x,y)$$ that captures some notion of similarity between objects (a standard example of such a kernel is the Gaussian $$K(x,y) = \exp(-\|x - y\|^2)$$). A positive definite function, btw, is like the generalization of a p.d matrix: the integral $$\int f(x)f(y)k(x,y)dx dy \ge 0$$.

You can then construct a Hilbert space that captures the structure of the kernel. Specifically, for a fixed set of points S, construct a vector space from the basis $$\{k_x | x \in S\}$$, where $$k_x(y) = K(x,y)$$, and then define an inner product of two vectors in this space in the usual way: If $$v_a = \sum a_x k_x$$ and $$v_b = \sum b_x k_x$$, then $$v_a \cdot v_b = \sum a_x b_y K(x,y)$$.

You get nice properties from this construction: the so-called reproducing property is one of them, which tells you that $$K(x,y) = k_x \cdot k_y$$. In other words, we can capture the similarity function K(x,y) by a "standard" inner product in a Hilbert space.

What's even neater is that by invoking Mercer's theorem, you can construct an orthogonal basis, and make sure that the Hilbert space is actually a Euclidean space. The squared Euclidean distance in this space can be written in kernel form, as
$d_K(x,y) = K(x,x) + K(y,y) - 2K(x,y)$
which is what you'd expect when treating K(.) as an inner product.

2. Negative-type distance.
The second story comes from Deza and Laurent. When can a distance function be embedded in Euclidean space ? It turns out that it's more convenient to talk about the square of the distance function, which we'll call D.

There's an elegant characterization of when D can be embedded isometrically into $$\ell^2_2$$: this can be done if and only if D satisfies the negative-type inequality:
$\sum b_i b_j D(x_i, x_j) \le 0, \sum b_i = 0$
for all possible assignments to $$b_i$$.

The proof works via a construction called a covariance mapping that takes $D$ to the function $$k_D$$ defined as:
$k_D(x,y) = (1/2)(d(x, x_0) + d(y, x_0) - d(x,y))$
which differential geometry folks will recognize as the Gromov product.

The proof completes by showing that the negative-type condition on D implies positive definiteness of $$k_D$$, and this in turn means that $$k_D$$ can be expressed as an R-covariance:
$k_D(x,y) = \int f_x(\omega)f_y(\omega) d\mu(\omega)$
for some measure space $\mu$.

Note that the RHS of the equation is an infinite-dimensional inner product.

3. Where the two stories come together
The mapping that takes a kernel to a distance is the inverse of the covariance mapping used to map a distance to a metric. In other words, if we take a kernel K, compute $$d_K$$, and then use the distance to kernel mapping to compute $$k_{d_K}$$, we get back K. Further, since we can show that the negative-type condition on D implies a positive-definite condition on $$k_D$$, we can start off with either "D satisfies negative-type inequalities" or "K is a positive definite kernel" and yield the same conclusion on $$\ell_2^2$$ embeddability.

What's interesting is that the two ways of thinking about this don't seem to have been connected together explicitly before.

p.s This is not a fully rigorous correspondence except possibly in the finite dimensional case, but I find that it's often convenient to replace the negative-type argument with a kernel positive-definiteness argument in my head.

## Sunday, August 02, 2009

### Correlation Clustering: I don't like you, but I like them...

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

Whether it's k-clustering, or any kind of hierarchical clustering, the world we live in is still the world of I don't like you'' clustering, where the geometric landscape is defined by distances between points. Consider the following example: There is an intuitive sense in which the first clustering is not real'' and the second one is: the idea of 'well-separatedness'' is a pervasive component of how a good clustering is perceived. But if we only measure distances between points, and only measure the cost of a clustering in terms of how costly each cluster is, we'll never be able to distinguish between these two examples.

What's needed is a way of declaring likes (similarity) as well as dislikes (distance), and then, critically:

penalizing similar items in different clusters AS WELL AS different items in the same cluster.

By that measure, we'd be able to distinguish the first and second clusterings, because in the first case, presumably elements close to each other that lie in different clusters will make the clustering look more expensive. This point is worth reiterating. Unless we have some way of penalizing mis-separations as well as mis-groupings, we'll always be at the mercy of the tradeoff between k and cost.

Continuing the "clustering via self-help metaphors" theme to these essays, I call this the "I don't like you, but I like them" way of modelling data.

Correlation Clustering

This is where correlation clustering enters the picture. The correlation clustering model is as follows: every pair of elements is assigned a 1 or -1, encoding a similarity or dissimilarity. The goal is to find a clustering (note: no k!) in which any pair of points in a cluster is penalized for being dissimilar, and any pair of points in two different clusters is penalized for being similar. This can be generalized to arbitrary weights: for each pair of elements you assign weights w+ and w-, with the possible caveat that w+ + w- = 1.

So now the goal is merely to minimize the cost of a clustering. The elegance of correlation clustering lies in the natural way that clusters merge or split depending on the number of similar or dissimilar pairs. Do note though that you need to have input data that can be written in that manner: thresholding a distance function will give you the desired input, but is an ad hoc way of doing it, since the thresholding is arbitrary.

There are a number of different algorithms for correlation clustering, and also a very simple one that yields a good approximation guarantee: pick a point at random, and pull in all its neighbors (all points similar to it). Repeat this process with a new unpicked point, until all points have been selected. This algorithm gives a 3-approximation for correlation clustering.

Correlation clustering is also useful as a way of combining clusterings. We'll talk about this later (ed: how many times have I said that !), but the problem of conensus clustering is to cluster the clusterings'', or aggregate them into a single average clustering. An easy way to see the connection is this: given a collection of clusterings of a data set, create an instance of correlation clustering with the positive weight for a pair corresponding to the fraction of clusterings that vote'' for that pair being in the same cluster, and the negative weight being the fraction of clusterings that vote'' for that pair being separated.