The Geomblog: 03/01/2010

Tuesday, March 30, 2010

Why Conference Review Must End !

Exhibit A: Matt Welch's post on how PC deliberations happen.

Notes:

Yes, I realize it's partly tongue-in-cheek. But it's not far from the truth !
No, going to all-electronic meetings doesn't solve the problem. It merely replaces one set of group dynamics by another
Yes, we can't hope to remove irrational biases in the review process. That's why all we can hope for is to force them to be exposed and questioned. A back-and-forth between author and reviewer can help do that.
And no, it's not true that "theory PCs are much better".

I've been to NSF panels where the program manager does an excellent job of forcing people to state precisely what they mean by "interesting", "infeasible", "novel contribution" and other such weasel-words. When that happens, it's a bit easier to assess the contribution. One could imagine enlightened PC chairs doing this at paper review time, but there's really no time, given the number of papers that need to be processed in 2 days or so.

Saturday, March 13, 2010

Choosing the number of clusters II: Diminishing Returns and the ROC curve

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

In the last post, we looked at the elbow method for determining the "right" number of clusters in data. Today, we'll look at generalizations of the elbow method, all still based on the idea of examining the quality-compression tradeoff curve.

For the purpose of discussion, let's imagine that we have a plot in which the quality of the clustering (measured anyway you please, as long as 0 means all items in one cluster, and 1 means all items in separate clusters) is measured along the y-axis, and the representation complexity (i.e the space needed) is measured on the x axis, where again 0 corresponds to all items in a single cluster (least representation cost), and 1 corresponds to all items in separate clusters.

The ideal cluster is located at (0,1): cheapest representation and perfect quality. In general, as we use more and more clusters to organize the data, we can trace out a curve that starts at (0,0) and ends at (1,1). For most sane functions, this cuve is concave, and lives above the diagonal x=y.

This curve contains lots of useful information about the behavior of the data as the clustering evolves. It's often called the ROC curve, in reference to the curve used to capture the tradeoff between false positive and false negatives in classification. But what can we glean from it ?

We can ask what the curve would look like for "unclusterable" data, or data that has no definite sense of the "right number" of clusters. Such data would look self-similar: you could keep zooming in and not find any clear groupings that stood out. It's not too hard to see that data that looked like this would have a ROC curve that hugs the diagonal x=y, because there's a relatively smooth tradeoff between the quality and compressibility (so no elbow).

Conversely, data that does appear to have a definite set of clusters would try to get closer to the (0,1) point before veering off towards (1,1). This suggests a number of criteria, each trying to quantify this deviation from unclusterability.

you could measure the area between the curve and the diagonal. This is (essentially) the AUC (area-under-curve) measure.
You could measure the closest distance between (0,1) and the curve. This also gives you a specific point on the curve, which you could then associate with the "best" clustering.
You could also find the point of diminishing returns - the point of slope 1, where the clustering starts costing more to write down, but yields less quality. I've used this as the start point of a more involved procedure for finding a good clustering - more on that later.

The ROC curve gives a slightly more general perspective on the elbow method. It's still fairly heuristicy, but at least you're trying to quantify the transition from bad clustering to good in somewhat more general terms.

Ultimately, what both the ROC curve and the elbow method itself are trying to find is some kind of crucial transition (a phase transition almost) where the data "collapses" into its right state. If this sounds like simulated annealing and bifurcation theory to you, you're on the right track. In the next part, I'll talk about annealing and phase-transition-based ideas for finding the right clustering - all fairly ingenious in their own right.

(ed. note: comments on the last post have convinced me to attempt a post on the nonparametric approaches to clustering, which all appear to involve eating ethnic food. Stay tuned...)

Monday, March 08, 2010

Who pays for submissions ?

Writing a paper takes a tremendous amount of time. So, one of the frequent complaints that authors make is when PC members submit half-baked, clearly below-threshold reviews on a paper just to get the resume bullet and claim to have done their reviewing duties. Personally, I feel intense anger when receiving crappy reviews that come with not the slightest bit of insight, and then am expected to rebut them or accept them. Not to mention the long-term psychological damage incurred by having papers rejected one after another.

The problem is that reviewing a paper for a conference is free: all it takes is a few clicks of the mouse to upload your PDF file. (Of course, I'm not accounting for the cost of doing the research (ha!) and actually reviewing the paper.)

Let's estimate the costs associated with doing research and submitting papers to conferences. I spend many months working, writing and submitting papers to conferences. A highly competitive conference will assign three reviewers to my paper, and with a lot of luck one of them might even tangentially be aware of my research area. After I make up a bunch of numbers, the cost of rejection of my paper amounts to over 3 gazillion dollars, none of which I can recoup. It's clear that conferences, which only survive if people submit, should be paying me to submit !

Of course, imposing this kind of a fee would no doubt drastically reduce the number of papers that are submitted. But this seems like a good thing: it would probably reduce the number of conferences, and remove the fiction that conferences actually do "quality control", leaving them with their original purpose of networking and creating a sense of community. Conferences could generate revenue by charging reviewers for the opportunity to preview the new works being submitted: this would potentially also improve the quality of the reviews as well. Although the financial incentive is not that great, getting paid should encourage TPC members to take the process more seriously.

The only downside I can see is people who submit a ton of papers everywhere and become "professional paper writers", but TPC chairs would clearly have to balance this against the research credentials of the people submitting papers. Note that many journals impose author fees for publication of the paper, so this provides a nice offset against that cost.

It just seems crazy to me that the research community provides this free paper previewing service for committees with no negative ramifications for writing totally bogus reviews.

Disclaimers for the sarcasm-challenged:

Yes, I am obviously aware of Matt Welsh's post on this topic
Yes, this is a total ripoff/parody of his post
Yes in fact, I disagree with his point.

Sunday, March 07, 2010

Choosing the number of clusters I: The Elbow Method

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

It's time to take a brief break from the different clustering paradigms, and ponder probably THE most vexing question in all of clustering.

How do we choose k, the number of clusters ?

This topic is so annoying, that I'm going to devote more than one post to it. While choosing k has been the excuse for some of the most violent hacks in clustering, there are at least a few principled directions, and there's a lot of room for further development.

(ed. note. The Wikipedia page on this topic was written by John Meier as part of his assignment in my clustering seminar. I think he did a great job writing the page, and it was a good example of trying to contribute to the larger Wikipedia effort via classroom work)

With the exception of correlation clustering, all clustering formulations have an underconstrained optimization structure where the goal is to trade off quality for compactness of representation. Since it's always possible to go to one extreme, you always need a kind of "`regularizer"' to make a particular point on the tradeoff curve the most desirable one. The choice of $k$, the number of clusters, is one such regularizer - it fixes the complexity of the representation, and then asks you to optimize for quality.

Now one has to be careful to see whether 'choosing k' even makes sense. Case in point: mixture-model clustering. Rather than asking for a grouping of data, it asks for a classification. The distinction is this: in a classification, you usually assume that you know what your classes are ! Either they are positive and negative examples, or one of a set of groups describing intrinsic structures in the data, and so on. So it generally makes less sense to want to "`choose"' $k$ - $k$ usually arises from the nature of the domain and data.

But in general clustering, the choice of $k$ is often in the eyes of the beholder. After all, if you have three groups of objects, each of which can be further divided into three groups, is $k$ 3 or 9 ? Your answer usually depends on implicit assumptions about what it means for a clustering to be "`reasonable"' and I'll try to bring out these assumptions while reviewing different ways of determining $k$.

The Elbow Method

The oldest method for determining the true number of clusters in a data set is inelegantly called the elbow method. It's pure simplicity, and for that reason alone has probably been reinvented many times over (ed. note: This is a problem peculiar to clustering; since there are many intuitively plausible ways to cluster data, it's easy to reinvent techniques, and in fact one might argue that there are very few techniques in clustering that are complex enough to be 'owned' by any inventor). The idea is this:

Start with $k=1$, and keep increasing it, measuring the cost of the optimal quality solution. If at some point the cost of the solution drops dramatically, that's the true $k$.

The intuitive argument behind the elbow method is this: you're trying to shoehorn $k$ boxes of data into many fewer groups, so by the pigeonhole principle, at least one group will contain data from two different boxes, and the cost of this group will skyrocket. When you finally find the right number of groups, every box fits perfectly, and the cost drops.

Deceptively simple, no ? It has that aspect that I've mentioned earlier - it defines the desired outcome as a transition, rather than a state. In practice of course, "`optimal quality"' becomes "`whichever clustering algorithm you like to run"', and "`drops dramatically"' becomes one of those gigantic hacks that make Principle and Rigor run away crying and hide under their bed.

The Alphabet-Soup Criteria

So can we make the elbow method a little more rigorous ? There have been a few attempts that work by changing the quantity that we look for the elbow in. A series of "`information criteria"' (AIC, BIC, DIC, and so on) attempt to measure some kind of shift in information that happens as we increase $k$, rather than merely looking at the cost of the solution.

While they are all slightly different, they basically work the same way. They create a generative model with some kind of term that measures the complexity of the model, and another term that captures the likelihood of the given clustering. Combining these in a single measure yields a function that can be optimized as $k$ changes. This is not unlike the 'facility-location' formulation of the clustering problem, where each "`facility"' (or cluster) must be paid for with an 'opening cost'. The main advantage of the information criteria though is that the quantification is based on a principled cost function (log likelihood) and the two terms quantifying the complexity and the quality of the clustering have the same units.

Coming up next: Diminishing returns, the ROC curve, and phase transitions.

The Geomblog

Pages