``I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description; and perhaps I could never succeed in intelligibly doing so. But I know it when I see it...''
-- Justice Potter Stewart in {\em Jacobellis v. Ohio}.
What makes one clustering better than another? To answer that question we have previously assumed that a well-motivated objective function was given to us. Then the situation is easy: we compute the value of the objective and decide accordingly. In practice, however, clustering is often an intermediate step in a longer chain of computations, and there is no specific motivation for one objective function over another. Debating the pros and cons of the multitude of possible objective functions can easily take hours. Instead, we can pursue the ``I know it when I see it'' intuition by writing down the properties that any good clustering objective should satisfy, and then seeing whether this axiomatic approach guides us towards the right solution.
This approach was undertaken by Kleinberg in his work on clustering axioms. A clustering function is one that takes a set of points (and their pairwise distances) and returns a partition of the data. The function should abide by three simple axioms (a small code sketch probing them appears after the list):
- Scale-Invariance: If all distances are scaled by a constant factor, the clustering should not change. Put another way, measuring the distance in miles or kilometers (or nanometers) should not change the final clustering partition.
- Richness: Depending on the pairwise distances, any partition should be possible. For example, the clustering should not always put two specific points $x$ and $y$ together, regardless of the distance metric.
- Consistency: If the pairwise distances between points within a cluster decrease, while the distances between points in different clusters increase, the clustering should not change. Indeed, such a transformation makes clusters `tighter' and better separated from each other, so why should the clustering change?
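To make the axioms concrete, here is a minimal sketch in Python (the function names and the threshold value are mine, chosen purely for illustration). The candidate clustering function is single-linkage agglomeration with a fixed stopping distance $r$; a rule of this kind can satisfy Richness and Consistency, but, as the check below confirms, it is not Scale-Invariant:

```python
import itertools

def single_linkage_threshold(points, dist, r):
    """Merge the two closest clusters while their gap is at most r."""
    clusters = [{p} for p in points]
    while len(clusters) > 1:
        # Closest pair of clusters under the single-linkage distance.
        (i, j), gap = min(
            (((i, j), min(dist(a, b) for a in clusters[i] for b in clusters[j]))
             for i, j in itertools.combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1])
        if gap > r:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return {frozenset(c) for c in clusters}

def is_scale_invariant(cluster_fn, points, dist, alpha=10.0):
    """Scale-Invariance: multiplying every distance by alpha must not change the output."""
    return cluster_fn(points, dist) == cluster_fn(points, lambda a, b: alpha * dist(a, b))

pts = [0.0, 1.0, 10.0, 11.0]              # two tight pairs on a line
d = lambda a, b: abs(a - b)
fn = lambda p, dd: single_linkage_threshold(p, dd, r=2.0)
print(is_scale_invariant(fn, pts, d))     # False: the fixed threshold is unit-dependent
```

Swapping the stopping rule changes which axiom fails: stopping after a fixed number of merges restores Scale-Invariance but gives up Richness, which is exactly the kind of trade-off the axioms force.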
Kleinberg's surprising result is that no clustering function can satisfy all three axioms at once. The tension is easiest to see with Consistency: suppose the function groups some points into a single cluster, and we then shrink the distances within one part of that cluster until two well-separated groups visibly emerge. The axiom insists the clustering stay the same, even though a 2-clustering now looks far more natural. This suggests a slightly different tack on the problem. Instead of thinking about a specific clustering, or a specific partition of the data, we try to define an objective function so that the optimum clustering under that objective behaves in a reasonable manner. This is the approach adopted by Ackerman and Ben-David. Let $m$ be a function measuring the quality of a clustering. As before, Scale-Invariance requires the measure $m$ to be independent of the units on the distances, and Richness requires that every clustering can be the optimum (under $m$) clustering.
The Consistency axiom is the one undergoing a subtle change. Instead of requiring that the clustering stay the same when such a perturbation of the distances occurs, we only require that the score, or measure, of the clustering {\em improve} under the transformation. This breaks the counterexample above: while the score of the 1-clustering improves as the within-cluster distances shrink, the score of the 2-clustering improves even more and overtakes it, so the optimum clustering is allowed to change. Indeed, the authors exhibit a number of natural clustering measures that satisfy this relaxed set of axioms.
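As a toy illustration of the measure-based view (the score below is my own simple stand-in, not one of Ackerman and Ben-David's actual measures), grade a fixed partition by the ratio of the largest within-cluster distance to the smallest between-cluster distance, lower being better. A consistent perturbation, one that shrinks within-cluster distances and stretches between-cluster ones, can only improve this score:

```python
def separation_score(partition, dist):
    """Largest within-cluster distance over smallest between-cluster distance (lower is better)."""
    within = max((dist(a, b) for c in partition for a in c for b in c if a != b),
                 default=0.0)
    between = min(dist(a, b)
                  for c1 in partition for c2 in partition if c1 is not c2
                  for a in c1 for b in c2)
    return within / between

P = [[0.0, 1.0], [10.0, 11.0]]            # a 2-clustering of four points on a line
d = lambda a, b: abs(a - b)
print(separation_score(P, d))             # 1/9, about 0.111

# A consistent perturbation: halve within-cluster distances, double the rest.
together = lambda a, b: any(a in c and b in c for c in P)
d2 = lambda a, b: (0.5 if together(a, b) else 2.0) * d(a, b)
print(separation_score(P, d2))            # 0.5/18, about 0.028: strictly better
```

The interesting part of the Ackerman and Ben-David program is, of course, showing that well-behaved measures of this general flavor exist and characterizing them; the snippet only shows why a fixed partition's score moves in the right direction under a consistent change.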
-- Sergei Vassilvitskii
All three axioms, while they may seem natural at first, are ultimately questionable.
For example, almost any real-life clustering problem has an implicit notion of unit distance built into it.
Say we are tracking human diseases and have four points arranged in a square a few city blocks apart; this clearly forms a cluster. Now scale up the square so that the corners lie one each in North America, South America, Europe, and Africa. Clearly, the new configuration does not form a cluster. Out goes scale-invariance.
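A quick numerical version of the commenter's point (coordinates and radius invented for illustration): cluster by linking points that fall within a fixed radius. The city-block square comes out as a single cluster, while the scaled-up copy shatters into singletons; the ``right'' answer depends on the units.

```python
import math
from itertools import combinations

def radius_clusters(points, radius):
    """Connected components of the graph linking points within `radius` of each other."""
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= radius:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())

blocks = [(0.0, 0.0), (0.0, 0.2), (0.2, 0.0), (0.2, 0.2)]   # corners ~0.2 km apart
continents = [(4e4 * x, 4e4 * y) for x, y in blocks]        # scaled to ~8000 km apart
print(len(radius_clusters(blocks, radius=1.0)))       # 1: a single cluster
print(len(radius_clusters(continents, radius=1.0)))   # 4: all singletons
```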
Richness is perhaps the most unnatural of the axioms. If you are given two points and your clustering algorithm is scale- and rotation-invariant, it is not possible to achieve all partitions of the input set. There is nothing wrong with this, yet it violates the richness axiom.
In my opinion, there is a simple reason why the richness axiom should be modified. Namely, I see no intuitive reason why both the partition that lumps everything into a single cluster and the partition that puts everything into a singleton cluster of its own should be realized. Both of these are basically "failure modes" where the verdict is that there is no natural clustering. If we simply pick one of these two clusterings to represent our failure mode, and forbid the other one, then the inconsistency among the axioms disappears. Conversely, if you insist on richness as standardly defined, it's not really surprising that if you start with everything in a single cluster and then dilate everything, then there's no principled way to decide when you should shatter into singletons—and yet the axioms say you must.
I'm not an expert here, but you may want to consider level-set clustering: http://bit.ly/oJ8C2x
It satisfies two of the axioms (consistency and richness); scale-invariance would need a change of parameters.
Here is another, simpler algorithm that satisfies the two conditions:
http://bit.ly/nPl72W