k-center, k-median, k-means, k-medoids, ... The list is endless, but that pernicious k comes up everywhere. What can we do about it? There are really two ways to go:

- Figure out the "right" k for a problem. This is a complicated matter, and will be the topic of a later post.
- Don't choose: give the user a universal representation from which they can decide what k they want.

This latter formulation takes us into the world of hierarchical clustering, the topic of this post.

There are two different ways to think about hierarchical clustering: a representational view and an algorithmic view. The algorithmic view is quite simple: let's try to design a clustering via greedy operations. In the top-down view, we find a good split of the data into two parts, and then recurse on each side. In the bottom-up view, we select two clusters for merging (in the beginning, all items are in separate clusters), and merge our way up.

Perhaps the best known of the bottom-up methods is the "single-link" clustering algorithm, in which at each step you merge the two clusters that are closest to each other, measuring the distance between two clusters by their nearest pair of points. If this sounds familiar, it should: this is merely Kruskal's MST algorithm, stopped before it runs to completion.
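To make the connection to Kruskal concrete, here is a minimal sketch, assuming points are coordinate tuples and distances are Euclidean (neither is specified in the post). The function name `single_link` and the stopping parameter `k` are illustrative choices: edges are processed in sorted order exactly as in Kruskal's algorithm, and the loop stops once `k` components remain.

```python
import math
from itertools import combinations


def single_link(points, k):
    """Single-link clustering: Kruskal's algorithm stopped at k components."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, shortest first (assumption: Euclidean distance).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    components = n
    for _, i, j in edges:
        if components <= k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two different clusters: merge them
            parent[ri] = rj
            components -= 1

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Running it with `k = 1` processes every useful edge and is just Kruskal's MST construction; any larger `k` stops the same process early.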

The "single-link" method is optimistic: it creates clusters via connected components, assuming that transitivity is enough to construct a cluster. The "complete-link" approach is more pessimistic: rather than defining the distance between two clusters as their nearest pair distance, it defines it as the furthest pair distance, ensuring a clique-like structure for clusters. Merging happens as before.

Other variants that are more robust to noise average the pairwise distances to obtain the distance between two clusters. With the right choice of distance, these amount to comparing the centroids of clusters.
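The linkage rules described above differ only in how they summarize the pairwise distances between two clusters, so one naive routine can cover all of them. The following is a sketch under my own assumptions (Euclidean distance, a hypothetical `agglomerate` function, and a `linkage` parameter that is any function reducing an iterable of distances to one number); it runs in roughly cubic time and is meant for illustration only.

```python
import math
from itertools import product
from statistics import mean


def agglomerate(points, k, linkage=max):
    """Bottom-up clustering down to k clusters.

    linkage=min  -> single-link (nearest pair)
    linkage=max  -> complete-link (furthest pair)
    linkage=mean -> average of all pairwise distances
    """
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        # Summarize all cross-cluster pairwise distances with the linkage rule.
        return linkage(math.dist(points[i], points[j]) for i, j in product(a, b))

    while len(clusters) > k:
        # Find the pair of clusters that is closest under the chosen linkage.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b

    return sorted(sorted(c) for c in clusters)
```

On points spread along a line with slowly growing gaps, `linkage=min` tends to chain them into one long cluster, while `linkage=max` prefers compact groups; passing `linkage=mean` (imported above) gives the averaging variant.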

You'll notice that I didn't actually define a problem that these algorithms solve, in keeping with the grand tradition of clustering :). There is a way of defining a general optimization function that single-link clustering solves optimally. For the others though, although we can define a cost function, we can't show that they're optimal.

So much for the algorithmic view of HC. The representational view goes somewhat deeper.

I've had at least one person (you know who you are) tell me that one of the only two clustering algorithms that worked for them is hierarchical clustering. And indeed, the idea is very seductive. Rather than present the user with a single k-clustering - a snapshot, if you will, of the data - we give them a tree of merges, in which there is one leaf for each object. The central idea here, and one that bears emphasizing, because it's so different from how we've thought about clustering thus far, is this:

*we understand the structure of data from its transitions, rather than from its states.*

What I mean is this: rather than defining a clustering as a static partitioning of the data, we watch the data "evolve" as we start merging points together, and use the merge history to figure out what the real structure is. There are a couple of ways of doing this: the most common way is to imagine drawing the tree from top to bottom, and then *cutting it* by an x-monotone curve (one that intersects any vertical line in exactly one point). This can be done to ensure we have k clusters, or it can be done at points where the merge appears to be "stable" (the subtree doesn't merge with any other subtrees for a long time).
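Both kinds of cut can be sketched on top of a recorded merge history. The code below is an illustration under several assumptions of mine: it uses single-link merges with Euclidean distance, it only considers straight horizontal cuts (the simplest monotone curve), and it reads "stable" as "the largest gap between consecutive merge heights". All function names are hypothetical.

```python
import math
from itertools import combinations


def find(parent, i):
    """Union-find lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def single_link_merges(points):
    """Return the n-1 single-link merges as (height, i, j), in merge order."""
    parent = list(range(len(points)))
    merges = []
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    for height, i, j in edges:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[ri] = rj
            merges.append((height, i, j))
    return merges


def replay(n, merges):
    """Apply a prefix of the merge history and return the resulting clusters."""
    parent = list(range(n))
    for _, i, j in merges:
        parent[find(parent, i)] = find(parent, j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())


def cut_for_k(points, k):
    """Horizontal cut that leaves exactly k clusters: keep the first n-k merges."""
    merges = single_link_merges(points)
    return replay(len(points), merges[: len(points) - k])


def cut_at_largest_gap(points):
    """Horizontal cut placed in the widest gap between consecutive merge heights."""
    merges = single_link_merges(points)
    heights = [h for h, _, _ in merges]
    if len(heights) < 2:
        return replay(len(points), [])  # too few merges to compare gaps
    gaps = [later - earlier for earlier, later in zip(heights, heights[1:])]
    last_kept = max(range(len(gaps)), key=gaps.__getitem__)
    return replay(len(points), merges[: last_kept + 1])
```

A general monotone cut could pick a different height in each subtree; the horizontal cut here is the special case where one height is used everywhere.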

A particularly effective visual metaphor for this is the *dendrogram*, which is very popular in evolutionary tree analysis.

The dendrogram is visually appealing because it does two things: firstly, it depicts the tree of merges, permuting the nodes so that there aren't crossings. Secondly, and more importantly, it uses the lengths of edges as a visual marker to indicate the 'time of merge' for clusters. If we think of starting at time 0 and merging clusters, then a cluster that sticks around for a long time will have a tall edge connecting it to its parent, in comparison with a cluster that gets merged quickly. (As an aside, visual metaphors are possibly underrated when it comes to thinking about clustering algorithms. My firm belief is that one of the reasons soft clusterings aren't used as much as hard clusterings is that they're hard to visualize.)

**Sidebar:**

And this is the key operational idea behind "clustering from the transitions": clusters that stay unmerged longer are likely to be more interesting than clusters that get merged quickly. Till recently, this was merely a heuristic way of thinking about hierarchical clusterings. However, we now have the idea of *persistence* that comes from the realm of computational topology. In short, persistence is a way of quantifying topological features of a shape that persist across different scales. Persistence has been studied extensively as a rigorous way of quantifying the triviality (or non-triviality) of features in a shape, and there's a recent paper that applies persistence-like concepts to clustering as well. It's still early days for this direction, but I think it's a promising one.
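The heuristic version of this idea (how long a cluster stays unmerged) is easy to compute from the merge history. The sketch below is not the persistence machinery from computational topology, nor the method of the paper mentioned above; it is only a simple illustration, assuming single-link merges, Euclidean distance, and singletons born at time 0. The function name `cluster_lifetimes` is hypothetical.

```python
import math
from itertools import combinations


def cluster_lifetimes(points):
    """List (lifetime, birth, death, members) for every non-root cluster.

    A cluster is born at the height of the merge that created it (0 for
    singletons) and dies at the height at which it is merged into its parent.
    Results are sorted with the longest-lived clusters first.
    """
    n = len(points)
    cluster_of = {i: frozenset([i]) for i in range(n)}  # point -> current cluster
    birth = {frozenset([i]): 0.0 for i in range(n)}     # cluster -> birth height
    records = []

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    for height, i, j in edges:
        a, b = cluster_of[i], cluster_of[j]
        if a == b:
            continue  # both endpoints already share a cluster
        for c in (a, b):
            # Both children die at this merge height.
            records.append((height - birth[c], birth[c], height, sorted(c)))
        merged = a | b
        birth[merged] = height
        for member in merged:
            cluster_of[member] = merged

    return sorted(records, reverse=True)
```

On data with two well-separated groups, the two groups come out at the top of the list, since each is created early and survives until the final merge.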

Returning to hierarchical clustering, one major problem with this approach is that it's local: make the wrong choice of merge early on, and you'll never get to the optimal solution for a k-clustering problem. And since there's no easy way to change your mind (i.e., split a cluster), it's hard to reverse bad decisions. If you're so inclined (and I am nowadays), this is the problem of finding monotone paths in the merge-split lattice on partitions of [1..n], but I digress...

Ultimately, the value of hierarchical clustering comes from its ability to represent an entire history of clusterings, rather than just one, and the relative ease with which a bottom-up algorithm can be written. That, coupled with its uses where you really do want to find a tree (evolutionary or otherwise), is what makes it a popular implement in the clustering toolbag.

Nice post, looking forward to more in the series.

I'm no expert, but I've found the neglect of soft clustering in the literature to be rather bizarre.

You say:

"Returning to hierarchical clustering, one major problem with this approach is that it's local: make the wrong choice of merge early on, and you'll never get to the optimal solution for a k-clustering problem." I don't get it: aren't we talking about hierarchical clustering? If so, why do you say k-clustering, which is non-hierarchical and also not comparable to a hierarchical one?

Anyway, I like the idea of measuring the whole process of clustering: not just keeping the resulting clustering, but also looking at the merging times and probably other stuff.

My comment was referring to the idea of using a hierarchical algorithm to find a k-clustering. You're right that otherwise, they're incompatible concepts.

Re: "You'll notice that I didn't actually define a problem that these algorithms solve, keeping in with the grand tradition of clustering :)."

Nei's "Neighbor Joining" is a flavor of hierarchical clustering that solves a more well-defined problem: if the distance matrix can be embedded in a tree, then this will be the tree computed by neighbor joining. This takes you straight to the question of which matrices can be embedded in a tree, leading you to four-point criteria, split systems, and a lot of beautiful combinatorics behind this - along with splits networks as a way to represent matrices that don't fit into a tree.

All this comes from the world of phylogeny reconstruction in Biology, but should be of equal interest for clustering.

Hi Geomblog,

Nice posts. I enjoy them. I do have a question: how does one quantitatively compare two hierarchical clusterings of the same items? What I mean is, how does one say with statistical significance that the items are more closely clustered in one than in the other? The items (let's say genes) are the same in both trees. It's just that there are two different experimental conditions producing two different dendrogram structures. Some genes cluster together in both dendrograms, but with different distances. So how do you say that those two clusters of the same set of genes differ, with statistical significance, in the distances between the genes in each cluster?

Thanks

Nish