[Aside: in the theoryCS community, a "heuristic" usually refers to a procedure that has no provable performance guarantees; this could mean either no clear running-time analysis or no clear guarantee on the quality of the optimization. The term "algorithm" is usually reserved for a procedure that has well-defined steps and ends in a well-defined state after a definite (finite) number of steps. One will occasionally find the term 'algorithm' used to refer to a heuristic in non-theory areas; beware!]
Doing such analysis, although maybe not as "sexy" as coming up with a new algorithm, is also important for "sociological" reasons. Although it might technically be correct to argue that a particular heuristic is not reliable and thus should not be used, the resolute theoretician is often faced with a mountain of empirical evidence in support of said heuristic, set against the possibly complex and hard-to-understand algorithm that she proposes.
Some of the best known heuristics show up in clustering, especially since this is a problem that transcends many different application domains, including areas that are not as familiar with theoretical algorithmic methods. One of the best-known methods for clustering with k centers is Lloyd's algorithm (or the k-means algorithm), which works essentially as a discrete variant of expectation-maximization. The algorithm itself is very simple (a rough code sketch follows the steps below):
1. Choose k initial "centers".
2. Assign each point to its nearest center.
3. Compute the k centroids of the points assigned to each center. These become the new k "centers".
4. Go back to step 2.
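To make the steps concrete, here is a minimal sketch in Python/NumPy. The random initialization, the stopping rule, and the empty-cluster handling are choices made for illustration only; they are not part of the description above.

```python
import numpy as np

def lloyd(points, k, rng=np.random.default_rng(0), max_iter=100):
    """Plain Lloyd's / k-means iteration: assign, recompute centroids, repeat."""
    # Step 1: choose k initial centers (here: k distinct input points at random).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the centroid of its assigned points
        # (an empty cluster simply keeps its old center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: repeat until the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```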
Among other things, they show:
- In one dimension, and in d dimensions for points on a grid, the number of iterations of the k-means algorithm is polynomial in the number of points, the dimension, and the "spread" (the ratio of the maximum to the minimum pairwise distance; a small snippet computing it follows this list).
- Even in 1D, there are example point sets on which the k-means algorithm takes a linear number of iterations to converge.
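For completeness, here is the spread written out in code; it is just the definition above, and the helper name is mine.

```python
from itertools import combinations
import numpy as np

def spread(points):
    """Ratio of the maximum to the minimum pairwise distance of a point set.
    Assumes the points are distinct (so the minimum distance is nonzero)."""
    dists = [np.linalg.norm(p - q) for p, q in combinations(points, 2)]
    return max(dists) / min(dists)
```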
They go on to present empirical evidence in support of their claim that these algorithms generate answers of similar quality (in similar time) to k-means. This part is still a work in progress: there are very fast implementations of k-means that they don't compare against (though the data structures involved might be used to speed up their approach as well).
It is not clear to me that their paper answers to any degree the question of why Lloyd's algorithm works as well as it appears to. However, the fact that they can prove bounds for their alternate algorithms suggests that maybe this is a line of attack to take when analyzing Lloyd's method. These alternates are also quite simple, and so could conceivably be used as k-means replacements if one wanted guaranteed running-time bounds.
One final point: the authors have their code available online. Would that more authors did the same!
Posted by Suresh
Suresh says "It is not clear to me that their paper answers to any degree the question of why Lloyd's algorithm works as well as it appears to."
It seems to me that your evaluation above is much more negative than warranted. Why isn't the analysis a large step towards answering things about Lloyd's algorithm? They are presenting a worst-case lower bound. The modified algorithm retains the intuition of the original algorithm while giving performance bounds. A more refined analysis would need understanding, in some sense, the average case, and here there can be many interpretations. At the end of the day one might still not "understand" why Lloyd's algorithm works.
Posted by Chandra Chekuri
The gap IMO rests in the experimental setup. I'm not saying that the paper is deficient in this regard, but any modification of k-means that claims to provide intuition about how it works should behave in the same way. They have some experimental evidence supporting this, but more is needed.
Maybe I should have relaxed my statement. Some progress has been made (which is why I mentioned this paper in the first place), but it is still unclear how to connect SinglePnt (which seems the more natural of the two alternates) and k-means.
At first cut, it appears that k-means is much more efficient because of the multiple misclassifications it fixes in each iteration. The results on SinglePnt suggest that this might be an illusion. Can some kind of dominance be proved? Is moving many points not much better than moving one, etc.?
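To make the contrast concrete, here is a rough sketch of what a one-point-at-a-time update might look like, under my reading of SinglePnt from the discussion above (reassign a single misclassified point, then update the two affected centroids); this is an illustration of the idea, not necessarily the exact procedure from the paper.

```python
import numpy as np

def single_point_step(points, labels, centers):
    """One illustrative step: move a single misclassified point to its nearest
    center's cluster and recompute the two affected centroids.
    (A reading of the 'move one point at a time' idea; not the paper's exact algorithm.)
    Returns False once no point is misclassified."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    misclassified = np.flatnonzero(nearest != labels)
    if misclassified.size == 0:
        return False  # every point already sits with its nearest center
    i = misclassified[0]
    old, new = labels[i], nearest[i]
    labels[i] = new
    # Only the two clusters that lost/gained a point need their centroids updated.
    for j in (old, new):
        if np.any(labels == j):
            centers[j] = points[labels == j].mean(axis=0)
    return True
```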
Posted by Suresh
BTW, this work is hopefully a first step on this problem. Although Bardia and I went with it as far as we could, there is still a huge gap between the lower bound and the upper bound. It would be really nice to close it or just improve it, but I think it would require a new idea. A super-linear (resp. quadratic) lower bound would be (resp. very) exciting even in high dimensions.
My intuition is that SinglePnt dominates the k-means method. So my intuition is that our analysis would in fact also hold in this case. However, proving this connection, I am afraid, is going to be very painful, since the behavior of k-means seems to be intricate.
In fact, in the talk, I said that I "conjecture" that there is such a relationship. The quotation marks are because we want credit for this conjecture only if it is correct.
In short, this paper is wide open for further improvement...
Posted by Sariel Har-Peled