It's time to take a brief break from the different clustering paradigms, and ponder probably

**THE**most vexing question in all of clustering.

How do we choose k, the number of clusters ?This topic is so annoying, that I'm going to devote more than one post to it. While choosing k has been the excuse for some of the most violent hacks in clustering, there are at least a few principled directions, and there's a lot of room for further development.

(

*ed. note.*

*The Wikipedia page on this topic*

*was written by John Meier as part of his assignment in my clustering seminar. I think he did a great job writing the page, and it was a good example of trying to contribute to the larger Wikipedia effort via classroom work*)

With the exception of correlation clustering, all clustering formulations have an underconstrained optimization structure where the goal is to trade off quality for compactness of representation. Since it's always possible to go to one extreme, you always need a kind of "`regularizer"' to make a particular point on the tradeoff curve the most desirable one. The choice of $k$, the number of clusters, is one such regularizer - it fixes the complexity of the representation, and then asks you to optimize for quality.

Now one has to be careful to see whether 'choosing k' even makes sense. Case in point: mixture-model clustering. Rather than asking for a

*grouping*of data, it asks for a

*classification*. The distinction is this: in a classification, you usually assume that you know what your classes are ! Either they are positive and negative examples, or one of a set of groups describing intrinsic structures in the data, and so on. So it generally makes less sense to want to "`choose"' $k$ - $k$ usually arises from the nature of the domain and data.

But in general clustering, the choice of $k$ is often in the eyes of the beholder. After all, if you have three groups of objects, each of which can be further divided into three groups, is $k$ 3 or 9 ? Your answer usually depends on implicit assumptions about what it means for a clustering to be "`reasonable"' and I'll try to bring out these assumptions while reviewing different ways of determining $k$.

__The Elbow Method__The oldest method for determining the true number of clusters in a data set is inelegantly called the elbow method. It's pure simplicity, and for that reason alone has probably been reinvented many times over (

*ed. note: This is a problem peculiar to clustering; since there are many intuitively plausible ways to cluster data, it's easy to reinvent techniques, and in fact one might argue that there are very few techniques in clustering that are complex enough to be 'owned' by any inventor*). The idea is this:

Start with $k=1$, and keep increasing it, measuring the cost of the optimal quality solution. If at some point the cost of the solution drops dramatically, that's the true $k$.

The intuitive argument behind the elbow method is this: you're trying to shoehorn $k$ boxes of data into many fewer groups, so by the pigeonhole principle, at least one group will contain data from two different boxes, and the cost of this group will skyrocket. When you finally find the right number of groups, every box fits perfectly, and the cost drops.

Deceptively simple, no ? It has that aspect that I've mentioned earlier - it defines the desired outcome as a transition, rather than a state. In practice of course, "`optimal quality"' becomes "`whichever clustering algorithm you like to run"', and "`drops dramatically"' becomes one of those gigantic hacks that make Principle and Rigor run away crying and hide under their bed.

**The Alphabet-Soup Criteria**

So can we make the elbow method a little more rigorous ? There have been a few attempts that work by changing the quantity that we look for the elbow in. A series of "`information criteria"' (AIC, BIC, DIC, and so on) attempt to measure some kind of shift in information that happens as we increase $k$, rather than merely looking at the cost of the solution.

While they are all slightly different, they basically work the same way. They create a generative model with some kind of term that measures the complexity of the model, and another term that captures the likelihood of the given clustering. Combining these in a single measure yields a function that can be optimized as $k$ changes. This is not unlike the 'facility-location' formulation of the clustering problem, where each "`facility"' (or cluster) must be paid for with an 'opening cost'. The main advantage of the information criteria though is that the quantification is based on a principled cost function (log likelihood) and the two terms quantifying the complexity and the quality of the clustering have the same units.

Coming up next: Diminishing returns, the ROC curve, and phase transitions.