Friday, February 25, 2005

SODA Outtakes: Lloyd's algorithm and analysing heuristics

Uri Feige's invited talk at SODA 05 was on 'Rigorous Analysis of Heuristics for NP-hard Problems'. His main argument (at least in the initial portion of his talk) was that it's a good idea to analyze known heuristics for problems, because they work well in practice, and a rigorous analysis will shed light on why.

[Aside: in the theoryCS community, a "heuristic" usually refers to a procedure that has no provable performance guarantees; this could mean either no clear running time analysis, or no clear quality of optimization guarantee. The term "algorithm" is usually reserved for a procedure that has well-defined steps and ends in a well-defined state after a definite (finite) number of steps. One will occasionally find the term 'algorithm' used to refer to a heuristic in non-theory areas; beware !]

Doing such analysis, although maybe not as "sexy" as coming up with a new algorithm, is important also for "sociological" reasons. Although it might technically be correct to argue that a particular heuristic is not reliable and thus should not be used, the resolute theoretician is often faced with a mountain of empirical evidence in support of said heuristic, compared with the possibly complex and hard-to-understand algorithm that she proposes.

Some of the best known heuristics show up in clustering, especially since this is a problem that transcends many different application domains, including areas that are not as familiar with theoretical algorithmic methods. One of the best known methods for clustering with k centers is Lloyd's algorithm (or the k-means algorithm), which works essentially as a discrete variant of expectation-maximization. The algorithm itself is very simple:
  1. Choose k "centers"
  2. Assign each point to its nearest center.
  3. Compute the k centroids of points assigned to a given center. These are the new k "centers".
  4. Go back to step 2.
It is known that Lloyd's algorithm will hit a local minimum fairly quickly (though not a global one: k-means is NP-hard). However, there is no analysis of how quickly this happens. In a SODA 2005 paper, Har-Peled and Sadri examine the complexity of k-means. They show that:
  1. In 1D, and in d-D (on the grid), the number of iterations of the k-means algorithm is polynomial in the number of points, the dimension, and the "spread" (the ratio of max distance to min-distance).
  2. Even in 1D, there are example point sets on which the k-means algorithm takes a linear number of iterations to converge.
What they also show is that slight modifications to the k-means heuristic result in algorithms that converge in time independent of the dimension (and with much better upper bounds). The simpler of their two schemes can be described as follows: in Step 2 of the original approach, merely move one "misclassified" point to its true cluster, rather than all.

They go on to present empirical evidence in support of their claim that these algorithms generate similar quality answers (in similar time) to k-means. This part is still a work in progress: there are very fast implementations of k-means that they don't compare against, (but the data structures might be used to speed up their approach as well).

It is not clear to me that their paper answers to any degree the question of why Lloyd's algorithm works as well as it appears to. However, the fact that they can prove bounds for their alternate algorithms suggests that maybe this is a line of attack to take when analyzing Lloyd's method. These alternates are also quite simple, and so could conceivably be used as k-means replacements if one wished guaranteed running time bounds.

One final point: the authors have their code available online. Would that more would do the same !

Disqus for The Geomblog