Tuesday, January 26, 2010

Author Feedback, or "Conference Review process considered harmful"

Author feedback is the latest attempt to put a band-aid on the bleeding carcass of the conference review process. We had author feedback at SoCG, and it's a common feature at many other conferences. The ostensible purpose of author feedback is to allow authors to clarify any misconceptions/confusions the reviewer might have so as to make the review process a bit more orderly (or less random?).

Usually, the process works like this: reviewers submit their reviews and have the option of requesting clarification on specific points from the authors. Authors get the questions, are required to submit a rebuttal/response by a certain date, and then deliberation continues. Variations on this include:
  • Length of the author response
  • When it's asked for (before discussions start, or after)
  • Whether it's called a 'rebuttal' or a 'response' or even just 'feedback' - I think the choice of word is significant
  • Whether the reviewers' current scoring for the paper is revealed or not.
While a good idea in principle, it can cause some headache for program committees, and often devolves into a game of cat and mouse: the reviewer carefully encrypts their questions so as not to tip their hand, the author tries to glean the reviewers' true intent from the questions, while trying to estimate which reviewer has the knife in, and so on and so forth.

What I want to rant about is the author feedback system for a conference I recently submitted to. The reviews came back long and vicious: as far as one reviewer is concerned, we should probably go and hide under a rock for the rest of our pathetic (and hopefully short) lives.

That doesn't bother me as much as it used to - I've grown a thick hide for these sorts of things ;). However, a combination of things has sent me into a fury:
  • The reviewer is actually wrong on most counts. This is isn't a matter of disagreeing over motivation, relevance etc. It's just a basic "please read section 5, column 1, sentence 3" type problem.
  • The author feedback limit is 2048 characters (which is a rather tiny amount if you're counting at home)
There's a basic issue of fairness here. Why does a reviewer get to go off on a rant for pages, while we have to limit our response to essentially sentences of the form "Yes. No. Maybe" ? Especially when the reviewer is basically wrong on a number of points, it takes a while to document the inaccuracies. At the very least, we should get as many characters in our response as the reviewers got in theirs ! (point of note: the set of reviews were 11225 characters long, and the specific reviewer I'm complaining about had a 2500 character long review)

This paper is not getting in, no matter what we say: that much is clear. I've almost never heard of a paper successfully rebutting the reviews, and in all fairness the other reviewers have issues that are matters of opinion and can't be resolved easily. That is a little disappointing, but perfectly fine within the way the review process works. But I'm annoyed that there's no good way to express my dissatisfaction with the reviewing short of emailing the PC chair, and it's not clear to me that this does any good anyway.

Overall, I think that author feedback in the limit gets us to journal reviews, which is a good thing (and my colleague actually suggested that conference reviewing should have more rounds of author feedback and less time for actual paper reviewing). But the way it's done right now, it's hard to see it as anything other than 'reviewing theater', to borrow a Bruce Schneier term. It looks nice, and might make authors and PCs feel good, but has little value overall.

Update: in case it was implied, this conference is NOT SoCG :)

Monday, January 18, 2010

SODA Day 1

Priceless: seeing the look on a speaker's face when they're asked for the exact constant in their 'constant factor approximation'

Yes, it's SODA time. And having an iphone means I can leave my laptop in my room, which means blogging waits till the day is over.

It was a good first day. I chaired the session on (as I put it), "geometry, graphs, and the geometry of graphs", and really enjoyed all the papers in the session (David has a summary of the Lee-Sidiropoulos talk). The phrase of the session was 'Poincare inequality'.

Alex Andoni talked about work with T. S. Jayram and Mihai Pătraşcu on lower bounds for computing the edit distance (and related distances like the Ulam distance). The program of attack was via the use of information complexity - a technique first developed for use in streaming lower bounds, but which has general applicability to communication complexity problems. Here's roughly speaking how the argument goes:

The direct-sum family of results says that the communication complexity of a function f expressible as an AND of n functions g is at least n * the information complexity of g. The overall plan is therefore to break down the strings being compared into pieces, and lower bound the information complexity of each piece.

Now let g be a threshold distance function that correctly reports if the distance is "too small" or "too large", but not inbetween. It turns out that the information complexity of g can be lower bounded by a constant relating to the Poincare inequality. So what is this inequality ?

In a sense, it captures the difficulty of distinguishing short distances from long. Specifically, look at the average distance of pairs of points sampled over all "short" pairs, and do the same for "long pairs". If the two resulting numbers are within some constant, then that's the constant used above, and intuitively tells us that we can't tell the pairs apart distributionally speaking.

It's not easy in general to prove Poincare inequalities, although these appear to be at the heart of non-embeddability results. What the authors go on to do is show that if the distance metric being used is "complex" i.e has a reasonably large communication complexity, then this can be used to show a Poincare-type inequality.

Neat stuff.

Saturday, January 16, 2010

Author rebuttal for SoCG 2010

There appears to be an author rebuttal phase for socg 2010. This is causing some confusion since many of the reviews are blank. I suspect based on my experience with DB reviewing that this means the reviewers didn't want clarification.

I got one nontrivial "review" back, and am assuming that some response is called for. The text is phrased not so much as a request for clarification but as an actual review. It's a bit easier when there's a specific question to be answered.

I'm glad the committee is doing this though, even with the extra effort involved. It's a good way of clarifying things that could otherwise have consequences beyond their importance.

Friday, January 08, 2010

Guest Post: Question on posting to the arxiv

ed. note: this post is by Jeff Phillips. For another recent post on arxiv publishing issues, see Hal Daume on the arxiv, NLP and ML.

It seems that over the last few months, the number of papers posted to the arXiv has been noticeably increasing, especially in the categories of Computational Geometry and Data Structures and Algorithms.

I have posted several (but not all) of my papers on the arXiv. I still do not have a consistent set of rule under which I post the papers. Here are a couple circumstances under which I have posted paper to the arXiv.

A: Along with Proceedings Version:
When conference version does not have space for full proofs, so in conjunction with proceedings version, post full version to arXiv. This is a placeholder for the full version until the journal version appears. Additionally, the arXiv paper can be updated when the final journal version appears if it has changed.

Sometimes, I link to the arXiv version in the proceedings version. This makes it easy for a reader of the proceedings to find the full proofs.

If more conferences move to the SODA model where proceedings versions can be much longer (~20 pages), then this situation may not often be necessary.

B: Along with Submitted Version:
When you want to advertise a piece of work, but it has only been submitted, post a version to arXiv. This is useful if you are giving talks on the work, and want a documented time stamp so you can't get scooped, or say, you are applying for jobs and want to make your work very available and public.

This is closer to the math philosophy where many (most?) people submit a version of a paper to arXiv as soon as they submit it to a journal. I think it would be great if CS adapted this policy, as it would be a lot easier to track results. I have a friend who as a math graduate student would start every day by perusing the dozen or so new arXiv post in his area and choosing one paper to read. He told me that almost every paper he read as a grad student was on the arXiv. Wouldn't a world like that be extremely convenient?

However, I have had an issue following this rule. Last year I submitted a paper to a conference and concurrently, submitted a longer version to the arXiv. The paper was unfortunately, not accepted to the conference. My coauthor and I extended the results to the point where it made sense to split the paper. Half was then submitted and accepted to another conference, and full proofs were made available through a tech report at my coauthor's institution, as he was required to do. The second half which has also been extended is now under submission.

I might like to post the (full) second half to the arXiv, but do not want to double the part from the previous post. I am not sure if it make sense to merge the papers at this point either. And I would also like to note on the arXiv page that that version has been extended and part appears as a tech report.

What is the proper arXiv etiquette for this situation?

Thursday, January 07, 2010

Mixture Models: Classification vs clustering

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)
We're all family.

k-means is arguably the most popular clustering algorithm (and one of the top 10 data mining algorithms to boot). It has a close relative though (some might say a generalization) in the expectation-maximization algorithm (or EM).

The EM algorithm represents yet another paradigm shift in how we think about clustering, to the point where it's arguable whether we are clustering anything at all. What the EM algorithm really does is density estimation: it assumes the data being clustered is generated by picking one of k distributions according to some probability distribution (the mixture) and then sampling from this distribution.

In other words, the definition of a cluster is not in terms of relationships betweens pairs of data points any more: it's in terms of how a collection of points behave together as a group, as representative samples from a distribution.

It's not particularly hard to see that the k-means algorithm is really a special case of EM on Gaussians(if I might be permitted an indulgence, a "tropical" limiting case of EM). In general though, the EM algorithm is an alternating optimization that finds the maximum likelihood parameters for distributions that capture the data. In the "E" step, it fixes the current parameters (the current hypothesis for the distributions) and computes the expected (log) likelihood of the data by first estimating the "cluster membership probabilities" of assigning a point to a "cluster" or distribution, and then using this to estimate the overall likelihood function. The "M" step then finds the set of parameters that maximizes the likelihood function. In k-means language, the first step figures out which clusters to assign points to, and the second step assigns new centers to each cluster.

The EM algorithm is very powerful. It's generally known that if the distributions under consideration are from an exponential family, then this method is easy to implement, because the E and M steps reduce to simple operations that have closed forms. Via the usual Bregman connection between divergence functions and distributions, it's also possible to give a purely geometric interpretation to the EM method akin to k-means, but with a different distance structure.

What's even more fascinating is the information-geometric perspective. There's a wonderful Riemannian world in which distributions are points on a manifold with coordinate systems given by their (natural) parameters, and where various information distances can be related to affine connections on said manifold. Too much to go into here, but the key insight is this: the E and M steps of the EM algorithm can be seen as projection operations in primal and dual manifolds, so the EM algorithm really looks like a primal-dual procedure. Way cool !

So is this not clustering ? For a detailed argument along these lines, you should read Hal Daume's analysis (with a nifty example). My take on this is that while density estimation is indeed different to the problem of clustering, I think of mixture-model-based clustering as just a much more "volume-based" approach to doing clustering, you expect not that point associate with each other per se, but that the ensemble of points itself satisfies some global properties.

I'll develop this idea further when I talk about information-theoretic clustering: to me, density-based clustering suggests a new abstraction for thinking about what a clustering ought to mean in the absence of even metric structure, and allows us to make pleasant detours into discrepancy, quasi-randomness and kolmogorov complexity. Stay tuned....

p.s apologies for the long hiatus.

Disqus for The Geomblog