k-center, k-median, k-means, k-medoids, .... The list is endless, but that pernicious k comes up everywhere. What can we do about it ? There are really two ways to go:

- Figure out the "right" k for a problem. This is a complicated matter, and will be the topic of a later post
- Don't choose: give the user a universal representation from which they can decide what k they want.

This latter formulation takes us into the world of hierarchical clustering, the topic of this post.

There are two different ways to think about hierarchical clustering: a representational view and an algorithmic view. The algorithmic view is quite simple: let's try to design a clustering via greedy operations. In the top-down view, we find a good split of the data into two parts, and then recurse on each side. In the bottom-up view, we select two clusters for merging (in the beginning, all items are in separate clusters), and merge our way up.

Perhaps the most well known of the bottom-up methods is the 'single link' clustering algorithm, in which at each step, you merge the two clusters that are closest to each other. If this sounds familiar, it should: this is merely Kruskal's MST algorithm run partially to completion.

The "single-link" method is optimistic: it creates clusters via connected components, assuming that transitivity is enough to construct a cluster. The "complete-link" approach is more pessimistic: rather than defining the distance between two clusters as their nearest pair distance, it defines it as the furthest pair distance, ensuring a clique-like structure for clusters. Merging happens as before.

Other variants that are more robust to noise average the pairwise distances to obtain the clustering distance. With the right choice of distance, these amount to comparing the centroids of clusters.

You'll notice that I didn't actually define a problem that these algorithms solve, keeping in with the grand tradition of clustering :). There is a way of defining a general optimization function that single-link clustering solves optimally. For the others though, although we can define a cost function, we can't show that they're optimal.

So much for the algorithmic view of HC. The representational view goes somewhat deeper.

I've had at least one person (you know who you are) tell me that one of the only two clustering algorithms that worked for them is hierarchical clustering. And indeed, the idea is very seductive. Rather than present the user with a single k-clustering - a snapshot, if you will, of the data - we give them a tree of merges, in which there is one leaf for each object. The central idea here, and one that bears emphasizing, beacuse it's so different to how we've thought about clustering thus far, is this:

*we understand the structure of data from its transitions, rather than from its states.*

What I mean is this: rather than defining a clustering as a static partitioning of the data, we watch the data ``evolve'' as we start merging points together, and use the merge history to figure out what the real structure is. There are a couple of ways of doing this: the most common way is to imagine drawing the tree from top to bottom, and then *cutting it* by a y-monotone curve (one that intersects any vertical line in exactly one point). This can be done to ensure we have k clusters, or it can be done at points where the merge appears to be ``stable'' (the subtree doesn't merge with any other subtrees for a long time).

A particularly effective visual metaphor for this is the *dendrogram*, which is very popular in evolutionary tree analysis.

The dendogram is visually appealing because it does two things: firstly, it depicts the tree of merges, permuting the nodes so that there aren't crossings. Secondly, and more importantly, it uses the lengths of edges as a visual marker to indicate the 'time of merge' for clusters. If we think of starting at time 0 and merging clusters, then a cluster that sticks around for a long time will have a tall edge connecting it to its parent, in comparison with a cluster that gets merged quickly. (as an aside, visual metaphors are possibly underrated when it comes to thinking about clustering algorithms. My firm belief is that one of the reasons soft clusterings aren't used as much as hard clusterings is because it's hard to visualize them)**Sidebar:**

And this is the key operational idea behind ``clustering from the transitions'': clusters that stay unmerged longer are likely to be more interesting than clusters that get merged quickly. Till recently, this was merely a heuristic way of thinking about hierarchical clusterings. However, we now have the idea of *persistence* that comes from the realm of computational topology. In short, persistence is a way of quantifying topological features of a shape that persist across different scales. Persistence has been studied extensively as a rigorous way of quantifying the triviality (or non-triviality) of features in a shape, and there's a recent paper that applies persistence-like concepts to clustering as well. It's still early days for this direction, but I think it's a promising one.

Returning to hierarchical clustering, one major problem with this approach is that it's local: make the wrong choice of merge early on, and you'll never get to the optimal solution for a k-clustering problem. And since there's no easy way to change your mind (i.e split a clustering), it's hard to reverse bad decisions. If you're so inclined (and I am nowadays), this is the problem of finding monotone paths in the merge-split lattice on partitions of [1..n], but I digress...

Ultimately, the value of hierarchical clustering comes in its ability to represent an entire history of clusterings, rather than just one, and the relative ease with which a bottom-up algorithm can be written. That, coupled with its uses where you really do want to find a tree (evolutionary or otherwise) is what makes it a popular implement in the clustering toolbag.