In the last post, we looked at the elbow method for determining the "right" number of clusters in data. Today, we'll look at generalizations of the elbow method, all still based on the idea of examining the quality-compression tradeoff curve.
For the purpose of discussion, let's imagine that we have a plot in which the quality of the clustering (measured anyway you please, as long as 0 means all items in one cluster, and 1 means all items in separate clusters) is measured along the y-axis, and the representation complexity (i.e the space needed) is measured on the x axis, where again 0 corresponds to all items in a single cluster (least representation cost), and 1 corresponds to all items in separate clusters.
The ideal cluster is located at (0,1): cheapest representation and perfect quality. In general, as we use more and more clusters to organize the data, we can trace out a curve that starts at (0,0) and ends at (1,1). For most sane functions, this cuve is concave, and lives above the diagonal x=y.
This curve contains lots of useful information about the behavior of the data as the clustering evolves. It's often called the ROC curve, in reference to the curve used to capture the tradeoff between false positive and false negatives in classification. But what can we glean from it ?
We can ask what the curve would look like for "unclusterable" data, or data that has no definite sense of the "right number" of clusters. Such data would look self-similar: you could keep zooming in and not find any clear groupings that stood out. It's not too hard to see that data that looked like this would have a ROC curve that hugs the diagonal x=y, because there's a relatively smooth tradeoff between the quality and compressibility (so no elbow).
Conversely, data that does appear to have a definite set of clusters would try to get closer to the (0,1) point before veering off towards (1,1). This suggests a number of criteria, each trying to quantify this deviation from unclusterability.
- you could measure the area between the curve and the diagonal. This is (essentially) the AUC (area-under-curve) measure.
- You could measure the closest distance between (0,1) and the curve. This also gives you a specific point on the curve, which you could then associate with the "best" clustering.
- You could also find the point of diminishing returns - the point of slope 1, where the clustering starts costing more to write down, but yields less quality. I've used this as the start point of a more involved procedure for finding a good clustering - more on that later.
The ROC curve gives a slightly more general perspective on the elbow method. It's still fairly heuristicy, but at least you're trying to quantify the transition from bad clustering to good in somewhat more general terms. 
Ultimately, what both the ROC curve and the elbow method itself are trying to find is some kind of crucial transition (a phase transition almost) where the data "collapses" into its right state. If this sounds like simulated annealing and bifurcation theory to you, you're on the right track. In the next part, I'll talk about annealing and phase-transition-based ideas for finding the right clustering - all fairly ingenious in their own right. 
(ed. note: comments on the last post have convinced me to attempt a post on the nonparametric approaches to clustering, which all appear to involve eating ethnic food. Stay tuned...)
 
 
I think we should come up with some other food processes as well -- the New York Deli? The Sushi Boat?
ReplyDeleteDiminishing returns idea sounds interesting. I have to look into it more closely.
ReplyDeleteI am under impression that with information theory we can only get well grounded solution to the number of clusters problem. With the interpretation that clustering models the data and so total cost is the data encoded with the model and encoded model. So minimizing that total cost should give us best fit clustering model given original model assumptions (like spherical clusters in the k-means criterion).
Is there other possible interpretation? Elbow idea is ok also (and it is known to work ok). F-ratio statistic is a similar idea. But I don't see how it would be theoretically justified.
Ah, then there is idea of clustering series of random sampled subsets of the original data. Clusterings that are similar with some fixed k are assumed to be correct. Maybe this could be called cross-validation method. But it is a little bit too statistical to my liking. :-)
I am actually acutely interested in this now, as one of my current projects is to give solution of speaker clustering problem in speaker diarization problem. In speaker diarization we are given a meeting recording with unknown speakers and task is to segment the recording according to speakers (giving answer who speaks and when). Current solutions are fairly ad-hoc.