(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)
We've seen two generic methods for determining the number of clusters in a data set thus far:
the elbow method, and the
ROC-curve/diminishing returns method. Now I want to talk about a more "dynamic" approach to identifying the number of clusters.
These ideas center around a 'simulated annealing' view of the clustering process. Imagine, if you will, a set of highly excited atoms all bouncing around in a state in which all are indistinguishable. If you now start cooling the atoms, they will start falling into local energy traps over time. If the cooling is done on the right schedule, the atoms will find a minimum energy configuration at each temperature, all the way down to absolute zero, where each atom is in its "own state", so to speak.
What does this have to do with clustering?
The setup works something like this. Imagine a "soft" assignment of points to clusters (where each point assigns a fraction of its membership to each cluster). We can write down the distortion D associated with this clustering by computing the weighted distance of each point from the various cluster centers. We'd also like the clustering to be more or less deterministic, so we can introduce a penalty term for the level of disorder in the clustering (usually captured by the conditional entropy H associated with the assignments of points to clusters).
Now we want to minimize the distortion subject to some constraint on the entropy, so in good old Lagrangian form, we write down a function of the form
F = D - TH
where T is a Lagrange parameter and F is the overall function to be minimized.
But here's the kicker. You can think of F as the free energy of a statistical ensemble, where D represents its energy, and H represents its entropy, and T is the "temperature" of the system. The T=0 limit captures the idea that distortion is all that matters, and so each point is assigned a cluster of its own. The "T large" limit prioritizes the entropy over the distortion, encouraging us to place all points in a single cluster.
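To make this concrete, here's a minimal Python sketch of the free energy computation, assuming squared Euclidean distances for the distortion; the function and variable names are mine, not taken from any particular paper:

import numpy as np

def free_energy(X, centers, P, T):
    # X: (n, d) data points; centers: (k, d) cluster centers
    # P: (n, k) soft assignments, each row summing to 1; T: temperature
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    D = (P * d2).sum() / len(X)                  # assignment-weighted distortion
    H = -(P * np.log(P + 1e-12)).sum() / len(X)  # conditional entropy of the assignments
    return D - T * H                             # F = D - TH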
So the annealing process corresponds to minimizing F while decreasing T steadily. It turns out, using magic from the realm of statistical physics, that for any temperature T, the probability of assigning point x to cluster y is proportional to exp(-d(x,y)/T), and as T decreases, the probability of assigning a point to any cluster except the very nearest one decreases dramatically.
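Here's a toy sketch of that assignment rule (again, the names are mine), which also shows the assignments hardening toward the nearest center as T drops:

import numpy as np

def soft_assign(X, centers, T):
    # P(cluster y | point x) is proportional to exp(-d(x, y) / T)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logits = -(d2 - d2.min(axis=1, keepdims=True)) / T  # shift for numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

X = np.array([[0.4, 0.0]])                    # one point, slightly nearer the first center
centers = np.array([[0.0, 0.0], [1.0, 0.0]])
for T in (10.0, 1.0, 0.1, 0.01):
    # as T decreases, the soft assignment concentrates on the nearer center
    print(T, soft_assign(X, centers, T).round(3))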
All of this is very fascinating, and provides a "smooth landscape" in which to understand how points get assigned to (soft) clusters. Much of the work I'm exploring right now is in trying to understand the space of soft clusterings of data. But that's a digression.
What's really interesting is that if you look at the evolution of the probabilistic assignments, you start seeing phase transitions. When T starts off high, all the points are in the same cluster. Then, at a specific critical temperature, a phase transition occurs, and the data starts splitting (based on the assignments) into more than one cluster.
How can you tell when this happens? Here's a very elegant technique, first proposed by Kenneth Rose in his work on
deterministic annealing. In the high-T regime, where every point is in the same cluster, the cluster center location can be computed by solving a simple convex optimization. As the process evolves, the positive definite matrix defining this optimization starts losing its positive definiteness, until at some point one of its eigenvalues goes to zero. This point can be computed analytically, and yields a specific temperature at which the first clusters start to emerge.
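If I'm remembering Rose's analysis correctly, for squared Euclidean distortion this first critical temperature comes out to twice the largest eigenvalue of the data covariance matrix. A rough sketch (the factor of 2 and the normalization depend on conventions, so treat this as illustrative rather than definitive):

import numpy as np

def first_critical_temperature(X):
    # At high T there is one effective cluster sitting at the global centroid.
    # That single cluster first splits when T drops to roughly
    # 2 * (largest eigenvalue of the covariance matrix of the data).
    cov = np.cov(X, rowvar=False)
    return 2.0 * np.linalg.eigvalsh(cov).max()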
Kenneth Rose has some nice examples illustrating this behaviour. As time goes on, more and more phase transitions appear, as more and more clusters emerge. What you end up with is a hierarchy of clusterings that ends with the trivial clustering in which all data points lie in separate clusters.
This idea has been developed further with the
information bottleneck method, which replaces both the distortion and entropy terms by terms involving the mutual information of the assignments. The free energy paradigm works the same way, although now the phase transition points can't be computed analytically. (I'll have more to say about the information bottleneck method later.)
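For reference, here's a sketch of one round of the self-consistent information bottleneck updates, written in my own notation, with beta playing the role of an inverse temperature:

import numpy as np

def ib_update(p_t_given_x, p_x, p_y_given_x, beta):
    # p_t_given_x: (n_x, n_t) current soft assignments of X to clusters T
    # p_x: (n_x,) marginal over X; p_y_given_x: (n_x, n_y) relevance variable Y given X
    p_t = p_x @ p_t_given_x                                       # cluster marginal
    p_y_given_t = (p_t_given_x * p_x[:, None]).T @ p_y_given_x / p_t[:, None]

    # KL(p(y|x) || p(y|t)) for every pair (x, t)
    kl = (p_y_given_x[:, None, :] *
          np.log((p_y_given_x[:, None, :] + 1e-12) /
                 (p_y_given_t[None, :, :] + 1e-12))).sum(axis=2)

    # new assignments: p(t|x) proportional to p(t) * exp(-beta * KL)
    new = p_t[None, :] * np.exp(-beta * kl)
    return new / new.sum(axis=1, keepdims=True)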
What's entirely weird is that the phase transitions still happen, and (I want to stress this point) data sets will split up into different numbers of clusters at the same transition point!
We wrote a paper a few years ago that proposes a particular "temperature" at which to look for phase transitions, and lo and behold, we were able to recover the "correct" number of clusters from planted data sets by watching the data split at this point. I'm still not sure why this happens, but it was quite interesting, and yielded one way of identifying "natural clusters" in the data.
Phase transitions and the like are the coin of the realm in bifurcation theory. While a proper understanding of this area is well beyond my skill level, there has been recent work on trying to analyze the information bottleneck from a bifurcation theory perspective, specifically by trying to isolate and classify the different kinds of critical points that emerge in a bifurcation diagram of the clusters as the temperature changes. Albert Parker of the University of Montana
wrote a Ph.D. thesis on this topic, and I'd love to understand his work better.
Coming up next: Clustering as compression - the information bottleneck and Kolmogorov complexity.