## Sunday, March 07, 2010

### Choosing the number of clusters I: The Elbow Method

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

It's time to take a brief break from the different clustering paradigms, and ponder probably THE most vexing question in all of clustering.
How do we choose $k$, the number of clusters?
This topic is so annoying that I'm going to devote more than one post to it. While choosing $k$ has been the excuse for some of the most violent hacks in clustering, there are at least a few principled directions, and there's a lot of room for further development.

(ed. note. The Wikipedia page on this topic was written by John Meier as part of his assignment in my clustering seminar. I think he did a great job writing the page, and it was a good example of trying to contribute to the larger Wikipedia effort via classroom work)

With the exception of correlation clustering, all clustering formulations have an underconstrained optimization structure where the goal is to trade off quality for compactness of representation. Since it's always possible to go to one extreme, you always need a kind of "regularizer" to make a particular point on the tradeoff curve the most desirable one. The choice of $k$, the number of clusters, is one such regularizer - it fixes the complexity of the representation, and then asks you to optimize for quality.

Now one has to be careful to see whether "choosing $k$" even makes sense. Case in point: mixture-model clustering. Rather than asking for a grouping of data, it asks for a classification. The distinction is this: in a classification, you usually assume that you know what your classes are! Either they are positive and negative examples, or one of a set of groups describing intrinsic structures in the data, and so on. So it generally makes less sense to want to "choose" $k$ - $k$ usually arises from the nature of the domain and data.

But in general clustering, the choice of $k$ is often in the eyes of the beholder. After all, if you have three groups of objects, each of which can be further divided into three groups, is $k$ 3 or 9? Your answer usually depends on implicit assumptions about what it means for a clustering to be "reasonable", and I'll try to bring out these assumptions while reviewing different ways of determining $k$.

#### The Elbow Method

The oldest method for determining the true number of clusters in a data set is inelegantly called the elbow method. It's pure simplicity, and for that reason alone has probably been reinvented many times over (ed. note: This is a problem peculiar to clustering; since there are many intuitively plausible ways to cluster data, it's easy to reinvent techniques, and in fact one might argue that there are very few techniques in clustering that are complex enough to be 'owned' by any inventor). The idea is this:

Start with $k=1$, and keep increasing it, measuring the cost of the optimal quality solution. If at some point the cost of the solution drops dramatically, that's the true $k$.

The intuitive argument behind the elbow method is this: you're trying to shoehorn $k$ boxes of data into many fewer groups, so by the pigeonhole principle, at least one group will contain data from two different boxes, and the cost of this group will skyrocket. When you finally find the right number of groups, every box fits perfectly, and the cost drops.

Deceptively simple, no? It has that aspect that I've mentioned earlier - it defines the desired outcome as a transition, rather than a state. In practice of course, "optimal quality" becomes "whichever clustering algorithm you like to run", and "drops dramatically" becomes one of those gigantic hacks that make Principle and Rigor run away crying and hide under their bed.
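
To make the procedure concrete, here is a minimal Python sketch of the elbow heuristic. Everything in it is an illustrative assumption rather than part of the method itself: the cost is the k-means sum of squared distances, "optimal quality" is approximated by Lloyd's iterations seeded with a farthest-first traversal, the data are three synthetic blobs, and "drops dramatically" is read as the largest ratio between consecutive costs. The helper names (`kmeans_cost`, `elbow`, and so on) are made up for this example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(group) for col in zip(*group))

def farthest_first(points, k):
    """Deterministic seeding: repeatedly add the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans_cost(points, k, iters=25):
    """Run Lloyd's iterations and return the sum of squared distances to the nearest center.

    This is a heuristic stand-in for the 'optimal quality solution', not a true optimum.
    """
    centers = farthest_first(points, k)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it in place if the group is empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def elbow(costs):
    """Return the k at which the cost drops by the largest factor relative to k - 1.

    This ratio rule is one arbitrary reading of 'drops dramatically' (an assumption).
    """
    ks = sorted(costs)
    return max(ks[1:], key=lambda k: costs[k - 1] / max(costs[k], 1e-12))

# Synthetic data (an assumption for illustration): three tight, well-separated blobs.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in blob_centers for _ in range(30)]

costs = {k: kmeans_cost(points, k) for k in range(1, 7)}
best_k = elbow(costs)
print(best_k)
```

On data this cleanly separated, the ratio rule should single out $k=3$. On messier data there may be no clear elbow at all, and the choice of ratio rule, cost function, and clustering heuristic are all knobs of exactly the hacky kind described above.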

#### The Alphabet-Soup Criteria

So can we make the elbow method a little more rigorous? There have been a few attempts that work by changing the quantity that we look for the elbow in. A series of "information criteria" (AIC, BIC, DIC, and so on) attempt to measure some kind of shift in information that happens as we increase $k$, rather than merely looking at the cost of the solution.

While they are all slightly different, they basically work the same way. They create a generative model with some kind of term that measures the complexity of the model, and another term that captures the likelihood of the given clustering. Combining these in a single measure yields a function that can be optimized as $k$ changes. This is not unlike the "facility-location" formulation of the clustering problem, where each "facility" (or cluster) must be paid for with an "opening cost". The main advantage of the information criteria, though, is that the quantification is based on a principled cost function (log likelihood), and the two terms quantifying the complexity and the quality of the clustering have the same units.
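
For concreteness, the two best-known criteria have the following standard forms. The symbols are introduced here for illustration and do not appear elsewhere in this post: $\hat{L}_k$ is the maximized likelihood of the generative model with $k$ clusters, $p_k$ is its number of free parameters, and $n$ is the number of data points.

$$
\mathrm{AIC}(k) = 2\,p_k - 2 \ln \hat{L}_k,
\qquad
\mathrm{BIC}(k) = p_k \ln n - 2 \ln \hat{L}_k
$$

In both cases one picks the $k$ that minimizes the criterion. The first term is the complexity penalty (the analogue of the opening cost) and the second is the quality term; BIC penalizes each extra parameter more heavily than AIC once $\ln n > 2$.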

Coming up next: Diminishing returns, the ROC curve, and phase transitions.

1. Are Bayesian non-parametric methods outside the scope of what you want to discuss? They provide an elegant way to handle the "how many clusters" problem.

Also, I don't really understand your point about mixture-model clustering ("it makes less sense to want to choose k") given you then talk about AIC/BIC etc., which AFAIK assume a mixture-model clustering setup.

2. @Noel: I am not familiar with Bayesian nonparametric methods other than BIC. Do you mean things like LDA, the Indian buffet process and the like?

I guess when I was talking about mixture model methods, I was thinking about things like mixture-of-Gaussians models. The main question is whether or not you put a price on the complexity of the model.

3. I like this post and am looking forward to the others. That said, in the proud academic/Internet tradition, I will show my gratitude by quibbling.

1. At least in statistics, we often use Gaussian (etc.) mixture models without knowing how many components go into the mixture model. Then people use *IC, or cross-validation. (I prefer cross-validation.) My friends Chris Genovese and Larry Wasserman have an interesting paper on using the method of sieves for Gaussian mixtures, though I don't know if anyone's extended it beyond the Gaussian case.

2. How is BIC "nonparametric"? It's all about choosing between finitely-parametrized models. (Admittedly, real Bayesians never choose between models.)

4. Related to the elbow method are algorithms that compute a permutation of the data such that any prefix is a good clustering under a certain measure. The greedy permutation works well for k-center, and Plaxton has a paper showing how to do this for k-means (or is it k-median) clustering...

5. Yes, I mean things like the Indian buffet process. The most relevant work here is the Dirichlet process (DP), which is an extension of the Dirichlet distribution to infinite dimension. A sample from a DP is thus the mixing distribution for a mixture model with an infinite number of components (with most of the mass concentrated on a few components). With these types of models you don't have to choose k; k is infinite, but only a finite number of components will actually be used given finite data (and generally you get sensible results rather than one component per data point). Hal has published quite a bit in this area.

6. @cosma: the sieve idea is very interesting, and yes I was a tad careless in calling BIC a nonparametric method

@Noel I've learnt everything I know about DP from Hal's seminars :)

@both it seems like the sieve method and the DP are two examples of the same idea: using an infinite mixture model and letting the data decide the answer. I was trying to avoid generative-model-based clustering, but I think I will add another part to the series on letting the data pick the answer.

7. I've also read about an interesting Bayesian approach in Bishop's book, under the variational mixture of Gaussians model. A prior distribution over the mixture weights is initially specified with a predetermined number of clusters, which is assumed to be large enough. The EM algorithm then works in such a way that some clusters end up with very small weights and hence are "eliminated", which seems to be some sort of automatic relevance determination.
