## Friday, July 03, 2009

### Clustering: k-means...

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

k-means (also known as Lloyd's algorithm) is probably the most well known clustering algorithm, and as such, deserves some discussion of its own. We've already seen that it's part of the "I don't like you" family of clustering formulations. How does the method work ?

We start with a set of points in R^d. The algorithm is really simple: Fix a collection of k centers. Now perform the following alternating optimization till you reach a steady state: (1) assign each point to its nearest center (2) for each group of points assigned to the same center, compute a new center by taking the centroid of the points.

More importantly, what problem does this algorithm solve ? The quick answer is: none ! Yes, that's right: k-means gives no answer with any provable guarantee for a generic set of points for any cost measure. Pretty dismal, isn't it ? And yet, it's a really popular algorithm, a state of affairs that generally drives theoreticians to tears. So what's the deal ? The answer, as always, lies in the details, and is a good story of how theoretical work has managed to make some sense of an algorithm that was notoriously hard to analyze properly.

The underlying problem that k-means attempts to solve is the one I mentioned in the last post: let a cluster cost be the sum of squared distances to the cluster center, and minimize the sum of cluster costs over all k-clusterings. Now here are some of the bad things that we CAN say about k-means:
• It might take exponential time to converge, even in the plane
• It can get stuck in a local minimum that has an arbitrarily bad cost.

Running Time:
Let's first consider the problem of guaranteeing running time. Although k-means can take exponential time to converge, a smoothed complexity result says that w.h.p, a perturbed input will converge in polynomial time (for those not familiar with the concept, the smoothed complexity of a problem is its (expected) running time after the inputs have been perturbed randomly by a small amount). The smoothed complexity results suggest one reason why k-means converges quickly "in practice". The kinds of careful constructions (high dimensional, and low dimensional) needed to force super polynomial convergence times are very artificial.

A crucial initialization:

So what about the quality of the results produced ? One unspecified aspect of the k-means algorithm is how the initial set of k centers is created. Two recent works have indicated that the key to extracting good performance from k-means is by being very careful with this initial choice.The idea is very neat, and really should be getting more play in the experimental community than I think it does.

Recall the Gonzalez algorithm for the k-center problem: pick a point, and then its furthest neighbor, and then the point whose closest neighbor in this set is as far as possible, and so on. Consider the following randomized variant:
Pick the next candidate center with probability proportional to its minimum distance squared from the current set of clusters.
Papers by Arthur and Vassilvitski, and by Ostrovksy, Rabani, Swamy and Schulman both show that this simple seeding strategy allows for provable guarantees on the quality of the solution returned by k-means. The actual results are slightly different, and give two different views of the behaviour of the algorithm:
1. The Ostrovsky et al paper starts with the assumption that the point set is epsilon-separated: the intuition behind this definition is that the cost of a k-clustering is significantly better than the cost of a (k-1)-clustering (controlled by a parameter epsilon). Under these conditions, they show that the initial seeding gives a PTAS for the clustering problem.

I'm eliding a number of details about their algorithm. One interesting feature of their algorithm is what they do after the initial seeding: rather than run Lloyd's iterative step, they fix a radius around each center (different for each center) and compute the centroid of the portion of the set within this ball. They also use a more complicated procedure to convert this into a PTAS, and in my mind that makes their approach a lot less practical.

2. The Arthur/Vassilvitski paper starts with the same initialization procedure, but they assume no epsilon-separation result. Their algorithm is simply: run the initialization, and then run regular k-means. The guarantee they get is weaker (a log k approximation ratio), but they also provide empirical results showing the superiority of their approach compared to standard k-means. I think it'd be interesting to compare their approach with the 'ball k-means' approach by Ostrovsky et al to see which work better on benchmark data sets.

Another pleasing property of the k-means algorithm is its relationship to mixture modelling and the EM algorithm. I'll have more to say about EM later on, but the gist of the matter is this: The twin steps of finding new centroids and assigning points to nearest neighbours have direct analogs in the world of maximum likelihood concepts and distribution-generated data. There is a general procedure that takes a given exponential family (Gaussian, Poisson and the like), and generates a distance function d such that the density described by the distribution can be expressed as exp(-d(x, mu)), where mu is the mean of the distribution, and x is the point whose probability we're evaluating. ML folks will recognize the Bregman distances here, and the punchline is that for Gaussian distributions, the corresponding distance function is squared Euclidean distance (the exact function used in the k-means evaluation).

What does this all mean ? It means that the k-means process has a semantic meaning in the context of clustering data drawn from a mixture of Gaussians: what's even neater is that a k-means-like algorithm works for data drawn from other distributions, as long as you replace squared Euclidean distance by the appropriate (Bregman) distance function.

Apart from the simplicity of the algorithm, there are some aspects that make it attractive, regardless of lack of guaranteed bounds. From a heuristic-design standpoint, k-means is a particularly atractive example of a technique that's often called alternating optimization.

Notes:
• I was asked if there were standard data sets for benchmarking clustering algorithms. I don't often see papers doing this in a systematic manner, but a good source of data sets for experimentation is the UCI Machine Learning Repository: there aren't too many clustering data sets, but they can be found here (look for clustering under the default task).
Next up: hierarchical methods...

1. I am enjoying reading this series. I've been working on clustering for a couple of years now and what has struck me is just how popular k-means is within the computer vision community. My guess is that it is easily implementable (and just available on MATLAB) and maybe there is a view that in the end it isn't a critical component of the bigger vision system. I disagree with this, of course, but it has been hard to find other people who feel the same.

I stumbled upon the Arthur and Vassilvitski paper too and am implementing it for a k-means implementation I am currently using. It seems like a very good heuristic though I feel it might tend to pick outliers in initial iterations. And this may be detrimental to clustering at the end.

One of the bigger problems I've had with k-means is in deciding how many clusters there are in the data set. Lot of people just pick a number. Some plot curves and follow it up with voodoo. Recently I was experimenting with some of these methods to pick a K and none of them really produced remotely satisfying results (on both artificial and real data). The general wisdom in using clustering seems to lean towards a preference to "over-segment" clusters and just go with it. I wonder how prevalent this usage of k-means really is.

Thanks for doing this series. It has been fun reading.

2. You're right about the problem with outliers. there are clustering variants that try to address this problem. One way I could imagine the initialization working on this is to bias the sampling by the density: if there's only one point far away, its probability of getting picked is lower than if there's a cloud of points far away.

as for choosing k, I'll have much more to say about that in a future post.

3. I tried out the intelligent seeding on few (adversarial: with outliers and stray points) synthetic data sets in 2D/3D and the UCI repository. It turned out that I get much better clusterings than if I just run the kmeans matlab code. As for the outliers being picked up, one idea that comes to my mind is to neglect points that are unclusterable (stray) and only cluster the clusterable(?) points.

4. hii.. i am new to this clustering concept.. and unable to understand even after getting clustering done , in which group should we put the clusters?
i mean the name of the group or some criterion to differentiate between groups !

5. It's an excellent point. the question you refer to is the problem of labelling the clusters with actual meaning.

Traditionally these two problems are viewed as somewhat separate. Or, the combined problem of labelling and clustering is generally termed multi-class classification. Maybe I'll try to discuss that a bit.

6. thanks in advance if you could do that !! neways thanks for this too..

7. This is to The Tobacconist and to Parasaran Raman. Do either of you have any code for choosing K ? I have a dataset for school that has alot of outliers. But I am interested in outliers. I just don't know how to pick K to start, or to seed the clusters to start.

8. Correct me if I am wrong, but the "outliers as centroids" problem isn't it minimized with k++ (David Arthur and Sergei Vassilvitskii)?
I thought their way of calculating the points probabilities of becoming centroids selection include the density as well.

Thanks for the blog, Really interesting!!!

9. Thanks for writing this article. I have written K-means couple of times but was always ignorant on its convergence property. So when you wrote
" .....
It might take exponential time to converge, even in the plane.", I just wonder where I can find the source.

Arthur Chan

10. Here's the link to Andrea Vattani's paper on this:

http://cseweb.ucsd.edu/~avattani/papers/kmeans.pdf