Tuesday, March 30, 2010

Why Conference Review Must End !

Exhibit A: Matt Welsh's post on how PC deliberations happen.

Notes:
  • Yes, I realize it's partly tongue-in-cheek. But it's not far from the truth !
  • No, going to all-electronic meetings doesn't solve the problem. It merely replaces one set of group dynamics with another.
  • Yes, we can't hope to remove irrational biases in the review process. That's why all we can hope for is to force them to be exposed and questioned. A back-and-forth between author and reviewer can help do that. 
  • And no, it's not true that "theory PCs are much better". 
I've been to NSF panels where the program manager does an excellent job of forcing people to state precisely what they mean by "interesting", "infeasible", "novel contribution" and other such weasel-words. When that happens, it's a bit easier to assess the contribution. One could imagine enlightened PC chairs doing this at paper review time, but there's really no time, given the number of papers that need to be processed in 2 days or so. 

Saturday, March 13, 2010

Choosing the number of clusters II: Diminishing Returns and the ROC curve

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

In the last post, we looked at the elbow method for determining the "right" number of clusters in data. Today, we'll look at generalizations of the elbow method, all still based on the idea of examining the quality-compression tradeoff curve.

For the purpose of discussion, let's imagine a plot in which the quality of the clustering (measured any way you please, as long as 0 means all items in one cluster and 1 means all items in separate clusters) runs along the y-axis, and the representation complexity (i.e., the space needed) runs along the x-axis, where again 0 corresponds to all items in a single cluster (the cheapest representation) and 1 corresponds to all items in separate clusters.

The ideal clustering is located at (0,1): the cheapest representation and perfect quality. In general, as we use more and more clusters to organize the data, we trace out a curve that starts at (0,0) and ends at (1,1). For most sane quality functions, this curve is concave, and lives above the diagonal x=y.

This curve contains lots of useful information about the behavior of the data as the clustering evolves. It's often called the ROC curve, in reference to the curve used to capture the tradeoff between false positives and false negatives in classification. But what can we glean from it ?

We can ask what the curve would look like for "unclusterable" data, or data that has no definite sense of the "right number" of clusters. Such data would look self-similar: you could keep zooming in and not find any clear groupings that stood out. It's not too hard to see that data that looked like this would have a ROC curve that hugs the diagonal x=y, because there's a relatively smooth tradeoff between the quality and compressibility (so no elbow).

Conversely, for data that does appear to have a definite set of clusters, the curve would push closer to the (0,1) corner before veering off towards (1,1). This suggests a number of criteria, each trying to quantify this deviation from unclusterability.

  • You could measure the area between the curve and the diagonal. This is (essentially) the AUC (area-under-curve) measure. 
  • You could measure the distance from (0,1) to the closest point on the curve. This also gives you a specific point on the curve, which you could then associate with the "best" clustering. 
  • You could also find the point of diminishing returns - the point of slope 1, beyond which the clustering costs more to write down than the quality it buys. I've used this as the starting point of a more involved procedure for finding a good clustering - more on that later. (A small sketch of all three criteria follows this list.)
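To make the three criteria above concrete, here is a small numpy sketch (my own illustration, not a method from any particular paper). It assumes the curve has already been sampled at increasing values of k, with quality and complexity both normalized to [0,1] as described above.

```python
import numpy as np

def roc_criteria(x, y):
    """x: representation complexity, y: clustering quality, both in [0,1],
    sampled at increasing k. Returns the three criteria discussed above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)

    # 1. Area between the curve and the diagonal y = x (trapezoid rule).
    gap = y - x
    auc = float(np.sum(0.5 * (gap[1:] + gap[:-1]) * np.diff(x)))

    # 2. Index of the point on the curve closest to the ideal corner (0, 1).
    knee = int(np.argmin(np.hypot(x, y - 1.0)))

    # 3. Point of diminishing returns: the last point whose forward slope
    #    is still at least 1 (quality gained >= representation cost paid).
    slopes = np.diff(y) / np.maximum(np.diff(x), 1e-12)
    above = np.nonzero(slopes >= 1.0)[0]
    diminishing = int(above[-1] + 1) if above.size else 0

    return auc, knee, diminishing
```

The indices returned for the second and third criteria can then be mapped back to the value of k that produced that point on the curve.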
The ROC curve gives a slightly more general perspective on the elbow method. It's still fairly heuristicy, but at least you're trying to quantify the transition from bad clustering to good in somewhat more general terms. 

Ultimately, what both the ROC curve and the elbow method itself are trying to find is some kind of crucial transition (a phase transition almost) where the data "collapses" into its right state. If this sounds like simulated annealing and bifurcation theory to you, you're on the right track. In the next part, I'll talk about annealing and phase-transition-based ideas for finding the right clustering - all fairly ingenious in their own right. 

(ed. note: comments on the last post have convinced me to attempt a post on the nonparametric approaches to clustering, which all appear to involve eating ethnic food. Stay tuned...)

Monday, March 08, 2010

Who pays for submissions ?

Writing a paper takes a tremendous amount of time. So one of the frequent complaints authors make is that PC members submit half-baked, clearly below-threshold reviews just to collect the resume bullet and claim to have done their reviewing duties. Personally, I feel intense anger when I receive crappy reviews that come with not the slightest bit of insight, and am then expected to rebut them or accept them. Not to mention the long-term psychological damage incurred by having papers rejected one after another.

The problem is that reviewing a paper for a conference is free: all it takes is a few clicks of the mouse to submit your review. (Of course, I'm not accounting for the cost of doing the research (ha!) and actually reviewing the paper.)

Let's estimate the costs associated with doing research and submitting papers to conferences. I spend many months working, writing and submitting papers to conferences. A highly competitive conference will assign three reviewers to my paper, and with a lot of luck one of them might even tangentially be aware of my research area. After I make up a bunch of numbers, the cost of rejection of my paper amounts to over 3 gazillion dollars, none of which I can recoup. It's clear that conferences, which only survive if people submit, should be paying me to submit !

Of course, imposing this kind of fee would no doubt drastically reduce the number of papers that are submitted. But this seems like a good thing: it would probably reduce the number of conferences, and remove the fiction that conferences actually do "quality control", leaving them with their original purpose of networking and creating a sense of community. Conferences could generate revenue by charging reviewers for the opportunity to preview the new works being submitted: this could also improve the quality of the reviews. Although the financial incentive is not that great, getting paid should encourage TPC members to take the process more seriously.

The only downside I can see is people who submit a ton of papers everywhere and become "professional paper writers", but TPC chairs would clearly have to balance this against the research credentials of the people submitting papers. Note that many journals impose author fees for publication of the paper, so this provides a nice offset against that cost. 

It just seems crazy to me that the research community provides this free paper previewing service for committees with no negative ramifications for writing totally bogus reviews.

Disclaimers for the sarcasm-challenged:
  • Yes, I am obviously aware of Matt Welsh's post on this topic
  • Yes, this is a total ripoff/parody of his post
  • Yes in fact, I disagree with his point.

Sunday, March 07, 2010

Choosing the number of clusters I: The Elbow Method

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

It's time to take a brief break from the different clustering paradigms, and ponder probably THE most vexing question in all of clustering.
How do we choose k, the number of clusters ?
This topic is so annoying that I'm going to devote more than one post to it. While choosing k has been the excuse for some of the most violent hacks in clustering, there are at least a few principled directions, and there's a lot of room for further development.

(ed. note. The Wikipedia page on this topic was written by John Meier as part of his assignment in my clustering seminar. I think he did a great job writing the page, and it was a good example of trying to contribute to the larger Wikipedia effort via classroom work)

With the exception of correlation clustering, all clustering formulations have an underconstrained optimization structure where the goal is to trade off quality for compactness of representation. Since it's always possible to go to one extreme, you always need a kind of "regularizer" to make a particular point on the tradeoff curve the most desirable one. The choice of $k$, the number of clusters, is one such regularizer - it fixes the complexity of the representation, and then asks you to optimize for quality.
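As a concrete instance of this template (my example, using the familiar k-means objective rather than any particular formulation from this series): fix the complexity by insisting on exactly $k$ centers, and then optimize for quality,

$$\min_{c_1, \dots, c_k \in \mathbb{R}^d} \; \sum_{x \in X} \; \min_{1 \le i \le k} \|x - c_i\|^2 .$$

Without the constraint on $k$, the optimum is trivially zero (give every point its own center), which is exactly the degenerate extreme the regularizer is there to rule out.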

Now one has to be careful to see whether 'choosing k' even makes sense. Case in point: mixture-model clustering. Rather than asking for a grouping of data, it asks for a classification. The distinction is this: in a classification, you usually assume that you know what your classes are ! Either they are positive and negative examples, or one of a set of groups describing intrinsic structures in the data, and so on. So it generally makes less sense to want to "choose" $k$ - $k$ usually arises from the nature of the domain and data.


But in general clustering, the choice of $k$ is often in the eyes of the beholder. After all, if you have three groups of objects, each of which can be further divided into three groups, is $k$ 3 or 9 ? Your answer usually depends on implicit assumptions about what it means for a clustering to be "reasonable" and I'll try to bring out these assumptions while reviewing different ways of determining $k$.

The Elbow Method

The oldest method for determining the true number of clusters in a data set is inelegantly called the elbow method. It's pure simplicity, and for that reason alone has probably been reinvented many times over (ed. note: This is a problem peculiar to clustering; since there are many intuitively plausible ways to cluster data, it's easy to reinvent techniques, and in fact one might argue that there are very few techniques in clustering that are complex enough to be 'owned' by any inventor). The idea is this:

Start with $k=1$, and keep increasing it, measuring the cost of the optimal quality solution. If at some point the cost of the solution drops dramatically, that's the true $k$.

The intuitive argument behind the elbow method is this: suppose the data really comes in $k$ boxes, and you're trying to shoehorn it into many fewer groups. By the pigeonhole principle, at least one group will contain data from two different boxes, and the cost of this group will skyrocket. When you finally reach the right number of groups, every box fits into its own group, and the cost drops.

Deceptively simple, no ? It has that aspect that I've mentioned earlier - it defines the desired outcome as a transition, rather than a state. In practice of course, "optimal quality" becomes "whichever clustering algorithm you like to run", and "drops dramatically" becomes one of those gigantic hacks that make Principle and Rigor run away crying and hide under their bed.
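To be concrete, here's what the hack usually looks like in code - a minimal sketch assuming scikit-learn's k-means is available, using the within-cluster sum of squares as the "cost", and declaring the elbow at the largest relative drop (exactly the kind of arbitrary threshold the paragraph above pokes fun at).

```python
import numpy as np
from sklearn.cluster import KMeans  # assumed available

def elbow_k(X, k_max=10, random_state=0):
    """Run k-means for k = 1..k_max and report the k reached by the
    largest relative drop in cost (within-cluster sum of squares)."""
    ks = list(range(1, k_max + 1))
    costs = [KMeans(n_clusters=k, n_init=10, random_state=random_state)
             .fit(X).inertia_ for k in ks]
    # Relative drop in cost when going from k to k+1.
    drops = [(costs[i] - costs[i + 1]) / costs[i] for i in range(len(costs) - 1)]
    return ks[int(np.argmax(drops)) + 1], costs
```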


The Alphabet-Soup Criteria

So can we make the elbow method a little more rigorous ? There have been a few attempts, which work by changing the quantity in which we look for the elbow. A series of "information criteria" (AIC, BIC, DIC, and so on) attempt to measure some kind of shift in information that happens as we increase $k$, rather than merely looking at the cost of the solution.

While they are all slightly different, they basically work the same way. They create a generative model with one term that measures the complexity of the model, and another term that captures the likelihood of the data under that model. Combining these in a single measure yields a function that can be optimized as $k$ changes. This is not unlike the 'facility-location' formulation of the clustering problem, where each "facility" (or cluster) must be paid for with an 'opening cost'. The main advantage of the information criteria, though, is that the quantification is based on a principled cost function (the log-likelihood), and the two terms quantifying the complexity and the quality of the clustering have the same units.
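As a sketch of how this looks in practice (assuming scikit-learn's Gaussian mixture implementation, whose .bic() and .aic() methods compute two of the criteria above), the search over $k$ becomes a simple argmin rather than an eyeballed elbow.

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # assumed available

def choose_k_by_bic(X, k_max=10, random_state=0):
    """Fit a Gaussian mixture for each k and pick the k minimizing BIC,
    which penalizes model complexity against the data log-likelihood."""
    ks = list(range(1, k_max + 1))
    bics = [GaussianMixture(n_components=k, random_state=random_state)
            .fit(X).bic(X) for k in ks]
    return ks[int(np.argmin(bics))], bics
```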

Coming up next: Diminishing returns, the ROC curve, and phase transitions.

Tuesday, February 23, 2010

Guest Post: Update from the CRA Career Mentoring Workshop, Day II

(ed note: Jeff Phillips is at the CRA Career Mentoring Workshop. His Day 1 dispatch is here)

It is day two at the CRA Career Mentoring Workshop.

Today was all about funding, with speakers from NIH (Terry Yoo), DARPA (Peter Lee), the Laboratory for Telecommunications Science (Mark Segal), and NSF (Jan Cuny). Jeannette Wing, the assistant director of NSF for CISE, also made an appearance at the reception yesterday.

(ed. note: I just heard that Jeannette Wing is leaving CISE in July. This is sad news - she was a strong and dynamic presence at CISE)

NIH advertised having a lot of money (about $30 billion, compared to $7 billion at NSF). The NIH has many sub-institutes covering many different topics, but all applications are funneled through grants.gov. Terry Yoo was very enthusiastic about us applying for a piece of his large pie. It seems a bit tricky, however, to fit a pure computer science project into one of these institutes, though specific health-related applications are enough.

We all (CI Fellows) thanked Peter Lee, who helped spearhead the CI Fellows program. He recently joined DARPA to head the Transformational Convergence Technology Office (TCTO, or "tic-toe"), a new office that will oversee many funded computer science programs. See the DARPA_News twitter feed for information on DARPA solicitations. Among the goals of this office is eliminating the harsh "go or no go" conditions associated with DARPA grants.
For young researchers, look for CSG or YSA programs, similar in some ways to NSF CAREER awards.

The Laboratory for Telecommunications Science is part of the NSA. They hire many, many Ph.D.s for advanced computer science research. Mark Segal could not tell us specifics about what they do, but compared it to an industrial research lab (e.g., AT&T, Yahoo Research, etc.). Movement between the research and non-research parts is more fluid, and the work is pretty hands-on. Even the theoretical computer scientists and mathematicians they hire often build systems to implement their work.

Funding through them generally comes via close and specific collaborations with faculty. The best way to start a relationship is to send a student (maybe even an undergrad) on a summer internship, or to visit on a sabbatical.

Jan Cuny from NSF decided she did not need to convince us to apply to NSF; rather, she just assumed we would and gave a how-to on applying for NSF grants. Most important tip: **talk to program officers!** (before you submit). Otherwise, it is hard to give specific summaries from her talk (the slides will eventually be online--definitely look for them). The presentation nicely demystified some of the reviewing process, such as how grants are reviewed and why she might choose a certain proposal (for diversity) that scored slightly lower than another, unfunded proposal. The other key advice: follow the guidelines precisely and carefully, make it easy for reviewers, and focus the content section on proposed work, not existing work.


A parting thought. It has been great to see many friends who are recent faculty or postdocs, in areas whose people I might not meet at my normal conferences. But it was a bit odd to have such a large fraction of my competition for jobs in the next year or two in the same room. The funding agencies were definitely here advertising how to get their funding, but if we did not already realize that this was important, we would probably not have much luck getting jobs anyway. If you were a department looking to hire someone, perhaps it would have made sense to come here to recruit postdocs to apply ? Although I guess that is a bit optimistic, as it is a hirer's market.

Is there some way that having many people looking for jobs all in one place can facilitate the hiring process, or has the electronic age made this outdated? I would argue that personal interaction is underrated, and would help universities figure out not just who has a great resume on paper, but also who is great to interact with in person.

Monday, February 22, 2010

Guest Post: Update from the CRA Career Mentoring Workshop

(ed note: Jeff Phillips is at the CRA Career Mentoring Workshop today and tomorrow, and filed this dispatch from Day 1.)


I am reporting from the CRA Career Mentoring Workshop. So far it has been excellent.

Frankly, I would not even have thought of coming (I did not even know about it), but it was "highly recommended" for all CI Fellows. Having been here for the first day, though, I would now recommend it to postdocs, early faculty, and senior graduate students set on a career in academia. You could obtain all of the information presented here by asking the right people around your department or field, but this workshop has really stressed what the right questions are, and whom to ask.

The topics today were "Planning Your Research Career," "Career Networking," "Teaching," "Managing and Mentoring Students," "Preparing a Tenure Dossier," "Time Management and Family Life," and "Advice from Early Career Faculty." I thought an important aspect of the organization was that each topic had at least two speakers. This kept presentations short, and always provided at least two (often differing) perspectives. It ensured that there was not just someone lecturing us on their opinions on a topic, but instead a demonstration that there is no single right way to approach being a young faculty member. The slides from all of the talks should eventually be online, findable from one of the above links.

It's hard to pick out a handful of pieces of advice to share with you. Maybe 80% of the suggestions were ones I had heard before and 20% were new, but I would guess that for an average person in my position, a different 20% would be new. For instance, I was surprised by how different the tenure process can be from university to university. The solution: ask your department chair what the key factors are (journal vs. conference papers, whether there is a funding dollar threshold, student progress, who can write letters), and ask more senior faculty members in your department who recently got tenure for copies of their dossiers. In general, the advice was to "ask for advice", and sometimes, according to Kim Hazelwood, "if you don't like the answer, then keep asking until you get an answer you like."

(ed. note: I've also heard the counter-advice "focus on doing the work to get tenure at a high ranking place, rather than just your department" - the rationale being that having a generically strong tenure case makes you more mobile if you need to be)

Also, your department is making a 5-6 year investment in you, so they should be there to help you succeed; if you don't get tenure, then everyone loses. All of this has helped me realize the great demands of the tenure process, but also made it seem quite possible. Intimidating and comforting at the same time.

Wednesday, February 17, 2010

SoCG author feedback

David Eppstein has an analysis of the SoCG author feedback with respect to his papers. Worth perusing: his overall conclusion is that having the rebuttal was a good idea, but he'd like to hear from the committee (perhaps at the business meeting?).

I had two papers rejected. For one, no feedback was requested. The final reviews made it clear that the reviewers understood the main contributions of the paper - what was under contention (among other things) was how the material was presented, and that's obviously not something author feedback can help with.

The other paper had one request for feedback which was basically a long negative review, again focusing on the presentation. We tried to respond as best we could, but it didn't make too much of a difference.

He did concur that the quality of reviewing was very high.

SoCG Feedback...

It's my 1000th post !! Took longer than I expected, but then my average has been slipping. Here's hoping I get to 2000 !

Just got back my reviews from the Valentine's day massacre. I have to say that I'm stunned, and not for the reason you think.

I've been writing papers and getting reviews for a LONG time now, and these are the best reviews I've ever seen - way beyond the standard for the typical theory conference. Both papers were rejected, and so the reviews were necessarily negative, but in an extremely constructive way. They critiqued without being critical, were detailed and thorough, and clearly got to the heart of what each paper was about. The comments pointed out what was good in the paper, and more importantly, pointed out what the reviewers felt was missing and how best to address the problems. I actually felt better about the rejection after reading the reviews, because the reviewers came across as genuinely liking the work.

Now it wasn't all good. There was a basic premise at the heart of the rejection that I disagree with, but it's a reasonable point to disagree on, and I can at least see a way to resolve that problem.

At least one other author agrees with me on this assessment - you're welcome to share your experiences in the comments. Congratulations to the PC - theory conference reviews are often slammed, and rightly so, but these reviews stand out for their high quality.


Tuesday, February 16, 2010

SoCG accepted papers

After the Valentine's day massacre comes the list. I'll link to PDFs if I'm pointed to them (add them in the comments).
  • David Millman and Jack Snoeyink. Computing Planar Voronoi Diagrams in Double Precision: An Example of Degree-driven Algorithm Analysis
  • György Elekes and Micha Sharir. Incidences in Three Dimensions and Distinct Distances in the Plane
  • Micha Sharir, Adam Sheffer and Emo Welzl. On Degrees in Random Triangulations
  • Pankaj K. Agarwal, Rinat Ben Avraham and Micha Sharir. The 2-Center Problem in Three Dimensions
  • Florian Berger and Rolf Klein. A Traveller's Problem
  • Sergio Cabello and Bojan Mohar. Adding one edge to planar graphs makes crossing number hard
  • Tobias Christ, Dömötör Pálvölgyi and Miloš Stojaković. Consistent digital line segments
  • Dominique Attali and Andre Lieutier. Optimal reconstruction might be hard
  • Dominique Attali and Andre Lieutier. Reconstructing shapes with guarantees by unions of convex sets
  • Mark de Berg. Better Bounds on the Union Complexity of Locally Fat Objects
  • Marc Glisse and Sylvain Lazard. On the complexity of the sets of free lines and free line segments among balls in three dimensions
  • Tamal Dey, Jian Sun and Yusu Wang. Approximating loops in a shortest homology basis from point data
  • Mohammad Ali Abam and Sariel Har-Peled. New Constructions of SSPDs and their Applications
  • Sergio Cabello, Éric Colin de Verdière and Francis Lazarus. Output-sensitive algorithm for the edge-width of an embedded graph
  • Sergio Cabello, Éric Colin de Verdière and Francis Lazarus. Finding Shortest Non-Trivial Cycles in Directed Graphs on Surfaces
  • David Eppstein and Elena Mumford. Steinitz Theorems for Orthogonal Polyhedra
  • Akitoshi Kawamura, Jiri Matousek and Takeshi Tokuyama. Zone diagrams in Euclidean spaces and other normed spaces
  • Eryk Kopczynski, Igor Pak and Piotr Przytycki. Acute triangulations of polyhedra and the space $\mathbb{R}^N$
  • Keiko Imai, Akitoshi Kawamura, Jiri Matousek, Daniel Reem and Takeshi Tokuyama. Distance k-sectors exist
  • Joseph Mitchell. A Constant-Factor Approximation Algorithm for TSP with Neighborhoods in the Plane
  • Benjamin A. Burton. The complexity of the normal surface solution space
  • Karl Bringmann. Klee's Measure Problem on Fat Boxes in Time $O(n^{(d+2)/3})$
  • Natan Rubin. Lines Avoiding Balls in Three Dimensions Revisited
  • Pankaj K. Agarwal, Jie Gao, Leonidas Guibas, Haim Kaplan, Vladlen Koltun, Natan Rubin and Micha Sharir. Kinetic Stable Delaunay Graphs
  • Haim Kaplan, Micha Sharir and Natan Rubin. A Kinetic Triangulation Scheme For Moving Points in The Plane
  • Anne Driemel, Sariel Har-Peled and Carola Wenk. Approximating the Fréchet Distance for Realistic Curves in Near Linear Time
  • Joachim Gudmundsson and Pat Morin. Planar Visibility: Testing and Counting
  • Ken-ichi Kawarabayashi, Stephan Kreutzer and Bojan Mohar. Linkless and flat embeddings in 3-space and the Unknot problem
  • Roel Apfelbaum, Itay Ben-Dan, Stefan Felsner, Tillmann Miltzow, Rom Pinchasi, Torsten Ueckerdt and Ran Ziv. Points with Large Quadrant-Depth
  • Sunil Arya, David Mount and Jian Xia. Tight Lower Bounds for Halfspace Range Searching
  • Don Sheehy, Benoit Hudson, Gary Miller and Steve Oudot. Topological Inference via Meshing
  • Gur Harary and Ayellet Tal. 3D Euler Spirals for 3D Curve Completion
  • Peyman Afshani, Lars Arge and Kasper Dalgaard Larsen. Orthogonal Range Reporting: Query lower bounds, optimal structures in 3-d, and higher-dimensional improvements
  • Abdul Basit, Nabil Mustafa, Saurabh Ray and Sarfraz Raza. Improving the First Selection Lemma in $\mathbb{R}^3$
  • Bernard Chazelle. A Geometric Approach to Collective Motion
  • Bernard Chazelle. The Geometry of Flocking
  • Pankaj Agarwal, Boris Aronov, Marc van Kreveld, Maarten Löffler and Rodrigo Silveira. Computing Similarity between Piecewise-Linear Functions
  • Jean-Daniel Boissonnat and Arijit Ghosh. Manifold Reconstruction using Tangential Delaunay Complexes
  • Pankaj K. Agarwal. An Improved Algorithm for Computing the Volume of the Union of Cubes
  • William Harvey, Yusu Wang and Rephael Wenger. A Randomized $O(m\log m)$ Time Algorithm for Computing Reeb Graphs of Arbitrary Simplicial Complexes
  • Omid Amini, Jean-Daniel Boissonnat and Pooran Memari. Geometric Tomography With Topological Guarantees
  • Lars Arge, Morten Revsbaek and Norbert Zeh. I/O-efficient computation of water flow across a terrain
  • Timothy M. Chan. Optimal Partition Trees
  • Afra Zomorodian. The Tidy Set: Minimal Simplicial Set for Computing Homology of Clique Complexes
  • Umut Acar, Andrew Cotter, Benoit Hudson and Duru Türkoğlu. Dynamic Well-Spaced Point Sets
  • David Mount and Eunhui Park. A Dynamic Data Structure for Approximate Range Searching
  • Janos Pach, Andrew Suk and Miroslav Treml. Tangencies between families of disjoint regions in the plane

Friday, February 12, 2010

Papers and SVN

Way back when, I had promised to do a brief post on the use of SVN (or other versioning systems) for paper writing. Writing this post reminds me of all the times I've sniggered at mathematicians unaware of (or just discovering) BibTeX: I suspect all my more 'systemsy' friends are going to snigger at me for this.

For those of you not familiar with the (cvs, svn, git, ...) family of software, these are versioning systems that (generally speaking) maintain a repository of your files, allow you to check files out, make local changes, and check them back in, simultaneously with others who might be editing other files in the same directory, or even the same file itself.

This is perfect come paper writing time. Rather than passing around tokens, or copies of tex files, (or worse, zip files containing images etc), you just check the relevant files into a repository and your collaborator(s) can check them out at leisure. SVN is particularly good at merging files and identifying conflicts, making it easy to fix things.

My setup for SVN works like this: Each research project has a directory containing four subdirectories. Two are easy to explain: one is a "trunk" directory where all the draft documents go, and another is an "unversioned" directory for storing all relevant papers (I keep these separate so that when you're checking out the trunk, you don't need to keep downloading the papers that get added in)

The other two come in handy for maintaining multiple versions of the current paper. The 'branches' directory is what I use when it comes close to submission deadline time, and the only changes that need to be made are format-specific, or relate to shrinking text etc. The 'tags' directory is a place to store frozen versions of a paper (i.e post-submission, post-final version, arxiv-version, journal version, etc etc)

It seems complicated, but it works quite well. The basic workflow near deadline time is simply "check out trunk; make changes, check in trunk; repeat...". A couple of things make the process even smoother:
  • Providing detailed log messages when checking in a version: this helps record exactly what has changed from version to version, which is useful when a collaborator needs to know what edits were made.
  • Configuring SVN to send email to the participants in a project whenever changes are committed. Apart from the subtle social engineering ("Oh no ! they're editing, I need to work on the paper as well now!"), it helps keep everyone in sync, so you know when updates have been made, and who made them. (A minimal hook sketch appears after this list.)
  • Having a separate directory containing all the relevant tex style files. Makes it easy to add styles, conference specific class files etc.
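For the email notifications, I believe the usual route is the mailer scripts that ship with Subversion; the snippet below is just a hand-rolled sketch of a post-commit hook, with placeholder addresses and an assumed local SMTP server, to show what the moving parts look like.

```python
#!/usr/bin/env python
# Sketch of an SVN post-commit hook that mails collaborators on every commit.
# Assumptions (not from the post): the script lives at hooks/post-commit in
# the repository, an SMTP server listens on localhost, and the addresses
# below are placeholders.
import smtplib
import subprocess
import sys
from email.mime.text import MIMEText

REPO, REV = sys.argv[1], sys.argv[2]   # Subversion passes these to the hook
RECIPIENTS = ["coauthor@example.edu", "me@example.edu"]  # hypothetical

def svnlook(cmd):
    """Ask svnlook about this revision (author, log message, changed files)."""
    out = subprocess.run(["svnlook", cmd, REPO, "-r", REV],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

body = "Author: {}\n\nLog message:\n{}\n\nChanged paths:\n{}\n".format(
    svnlook("author"), svnlook("log"), svnlook("changed"))

msg = MIMEText(body)
msg["Subject"] = "[svn] r{} committed".format(REV)
msg["From"] = "svn@example.edu"
msg["To"] = ", ".join(RECIPIENTS)

with smtplib.SMTP("localhost") as server:
    server.sendmail(msg["From"], RECIPIENTS, msg.as_string())
```

Subversion invokes the hook with the repository path and revision number, and svnlook pulls out the author, log message and changed paths for that revision.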
I can't imagine going back to the old ways now that I have SVN. It's made writing papers with others tremendously streamlined.

Caveat:
  • SVN isn't ideal for collaborations across institutions. Much of my current work is with local folks, so this isn't a big problem, but it can be. Versioning software like git works better for distributed sharing, from what I understand.

Thursday, February 11, 2010

Graphs excluding a fixed minor

Families of graphs that exclude a fixed minor H have all kinds of nice properties: for example
  • If G excludes a fixed minor H, then G has O(n) edges
  • If G excludes a fixed minor H, then G has $O(\sqrt{n})$ treewidth
  • If G excludes a fixed minor H, then G has a nice decomposition into a few graphs of small treewidth
  • If G excludes a fixed minor H, then G can be decomposed into a clique sum of graphs that are almost embeddable on surfaces of bounded genus.
(all of these and more in Jeff Erickson's excellent comp. topology notes).

In all these cases, the O() notation hides terms that depend on the excluded graph H. For example, if H is a clique on $k$ vertices, then G has at most $O(nk\sqrt{\log k})$ edges.

So the question is: given a graph G, what's the smallest graph H that G excludes ? This problem is almost certainly NP-hard, and probably at least somewhat hard to approximate, but some approximation of the smallest graph (measured by edges or vertices) might be useful.

I was asked this question during our algorithms seminar a few days ago, and didn't have an answer.

Monday, February 08, 2010

Good prototyping software

All the code for my recent paper was written in MATLAB. It was convenient, especially since a lot of the prior work was in MATLAB too. I actually know almost no MATLAB, preferring to do my rapid prototyping in C++ (yes, I'm crazy, I know).

Which brings me to the question that I know my lurking software hackers might have an answer to. If I have to invest in learning a new language for doing the empirical side of my work, what should it be ? Here are some desiderata:
  • Extensive package base: I don't want to reinvent wheels if I can avoid it. In this respect, the C++ STL is great, as is Boost, and Python has PADS, as well as many nifty packages. MATLAB is of course excellent.
  • Portability: I'm sure there's an exotic language out there that does EXACTLY what I need in 0.33 lines of code. But if I write code, I want to be able to put it out there for people to use, and I'd like to use it across multiple platforms. So Lua, not so great (at least in the research community) (sorry Otfried)
  • Good I/O modules: if I write code, I'll often want to send output to a graph, or plot some pictures etc. Some systems (MATLAB) are excellent for graphing data.
  • Performance: I don't want to sacrifice performance too much for ease of coding. I've always been afraid of things like Java for this reason. Of course, I'm told I'm dead wrong about this.
I deliberately haven't listed 'learning curve' as an option. If the language merits it, I'm willing to invest more time in switching, but obviously the benefits have to pay for the time spent. In terms of background, I'm most familiar with C++, and am a nodding acquaintance of python, and perl. I occasionally nod at MATLAB in the street when I see it, but usually cross the road to the other side if I see Java approaching. I used to be BFF with OpenGL, but then we broke up over finances (specifically the cost of GPUs).

Thoughts ?

Sunday, February 07, 2010

The challenge of doing good experimental work

I recently submitted (with Arvind Agarwal and Jeff Phillips) a paper on a unified view of multidimensional scaling. It's primarily empirical, in that the main technique is a heuristic that has many nice properties (including providing a single technique for optimizing a whole host of cost measures for MDS). For anyone interested though, there's also a nice JL-style theorem for dimensionality reduction from (hi-D) sphere to (lo-D) sphere, which gets the "right" bound for the number of dimensions.

But this post isn't really about the paper. It's more about the challenges of doing good empirical work when you're trained to think formally about problems. This post is influenced by Michael Mitzenmacher's exhortations (one, two) on the importance of implementations, and Mikkel Thorup's guest lament on the lack of appreciation for simple algorithmic results that have major consequences in practice.

So you're looking at practical ramifications of some nice theory result, or you're trying to apply formal algorithmic tools to some application problem. If you're lucky, the problem doesn't have an existing base of heuristics to compare against, and so even a straight-up implementation of your ideas is a reasonable contribution.

Of course, you're rarely this lucky, and there's usually some mish-mash of heuristics to compare against. Some of them might be head-scratchingly weird, in the "why on earth should this work" category, and some are possibly more principled. At any rate, you go and implement your idea, and you suddenly realize to your horror that worst-case bounds don't mean s*** when your principled method is ten times slower than the crazy heuristics. So now what ?

The central point that I want to make here is that while actually paying attention to implementation issues is of immense value if we want people to actually care about theoretical work, I don't think we get the right kind of training to do it well.

First off, we lack training in various heuristic design strategies. Now I don't actually mean the kinds of heuristics that one might come across in the Kleinberg-Tardos book (local search, annealing, and the like). I mean the much larger body of principle-driven heuristics that the optimization community is quite familiar with. Without even thinking too hard, it's easy to list heuristics like Newton's method, conjugate gradients, alternating optimization, matching pursuit, majorization, the Frank-Wolfe method, iteratively reweighted least-squares (and I could keep going on...)

Of course you might point out that I seem to know all about these heuristics. Well, not quite. The second problem is that even if one knows about these methods, that's not the same thing as having a good intuitive feel for when and why they work well. Case in point: one of the thorns in our side in this paper was a heuristic for MDS called SMACOF. It's a nice technique based on majorization, and although it's a heuristic, it works pretty darn well most of the time and takes a lot of effort to beat, even though there's no clear reason (at least to me) why it should work so well. The only real way to get a sense for how different heuristics work is to implement them all, or at least have the right set of MATLAB/C/C++ tools lying around. I notice that ML folks tend to do this a lot.

The third problem that often comes up is by far the most irritating one: the actual cost function you're optimizing doesn't even matter that much at all. Returning again to our paper, there are many ways to define the "error" when embedding a metric into another metric. The traditional theoryCS way looks at dilation/contraction: the worst-case ratio between distances in the original and target space. Most variations on MDS actually look at an average difference (take the difference between the distances, and average some function of this). As anyone who mucks around with metric spaces will tell you, the actual error function used can make a huge difference to the complexity of the problem, the ability to approximate, and so on and so forth.
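To make the contrast concrete, here is a toy numpy rendering of two of these error measures (my own illustration; these are not the exact objectives from our paper): the worst-case distortion favored in theoryCS, and a simple averaged squared difference of the kind MDS variants optimize.

```python
import numpy as np

def embedding_errors(D, Y):
    """D: n x n matrix of original distances (no zero off-diagonal entries),
    Y: n x d coordinates of the embedded points.
    Returns (worst-case distortion, average squared difference / raw stress)."""
    diff = Y[:, None, :] - Y[None, :, :]
    D_emb = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise distances after embedding

    iu = np.triu_indices(D.shape[0], k=1)       # each unordered pair once
    orig, emb = D[iu], D_emb[iu]

    ratios = emb / orig
    distortion = float(ratios.max() / ratios.min())  # worst expansion x worst contraction
    stress = float(np.mean((orig - emb) ** 2))       # one of many "average" measures

    return distortion, stress
```

Both take the same inputs, but they reward very different embeddings.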

But here's the thing we discovered: it's actually possible to run heuristics designed explicitly for one kind of error function that do just great for another kind of error function, and it takes a lot of work to construct examples that demonstrate the difference.

These points tie together in a more general theme. I was reading a post by Jake Abernethy at Inherent Uncertainty, and he makes a valuable point about the difference between algorithms/theory culture and ML culture (although ML could be replaced by other applied areas like db/data mining as well). His point is that in theoryCS, we are problem-centric: the goal is to prove results about problems, and taxonomize them well. Improve asymptotic run-time - great ! Get a better approximation ratio - excellent ! Reimplement the same algorithm with the same running time to get better behaviour in practice - hmmm. This is in contrast (as he puts it) to a lot of ML research, where the algorithmic technique comes first, and it's later on that some results are generated to go along with it.

This of course drives us nuts: NOT EVERY CLUSTERING PROBLEM SHOULD BE SOLVED WITH k-MEANS ! (phew - I feel better now). But if you reexamine this situation from the applied side, you realize its utility. I want to solve a problem - I pick up some off-the-shelf algorithm. Maybe it doesn't solve my problem exactly; maybe it solves a related problem. But it's either this, or some very complicated theoretical method that has no extant implementation, and is optimizing for a worst case that I might never encounter. What do I do then ?

This is not a rant about worst-case analysis. Far from it. It's not a rant about O() notation either. What I'm merely trying to say is that a focus on worst-case analysis, asymptotic improvements, and provable guarantees, while essential to the enterprise of doing theory, leaves us little room for the kind of experience needed to do effective practical implementation of our ideas.

Monday, February 01, 2010

The limerick I used to introduce Emmanuel Candes

I was assigned the task of introducing Emmanuel Candes for his invited talk at SODA. After getting tired of my incessant pacing up and down the hotel room at 3am mumbling about the text of the intro, my roommate took pity on me and composed a cute little limerick for the occasion.

Now whether it was madness, desperation or both, I don't know. But I decided to use that limerick to open the introduction, much to the consternation of my (unnamed) roommate, who hadn't intended his little throwaway to take on such prominence.

But use it I did, and it didn't fall entirely flat, for which I am very grateful. I won't out the composer unless he does it himself :), but here's the limerick:

There once was a professor from Caltech
who represented his signals with curvelets
But upon reflection
realized, he'd prefer projection
As they could better capture sparse sets.


Thank you all: I'll be here all night....

Could the IPad make computer science obsolete ?

OK fine. It's a provocative title. But hear me out.

Most non-cave-dwelling luddites have heard about the new Apple tablet (aka IPad). The simplest (and most misleading) way to describe it is as a gigantic ipod touch, with all the multitouch goodness of the ipod/iphone, as well as the numerous app store apps.

There's a vigorous debate going on over the merits and impact of the IPad, and while it's clear that it's not a work laptop/netbook replacement, it's probably the first internet appliance with some mojo.

The word 'appliance' is chosen deliberately. The IPad essentially behaves like a completely sealed off appliance - you can't hack it or customize it directly, and are only allowed the interface that's provided to you by Apple and the app store (also controlled by Apple). This is viewed (correctly, on many levels) as a feature, and not a bug. After all, most people don't care to know how their large and complicated computers really work, and all they want is a way to check email, surf the web, watch movies, etc etc.

But here's the thing. As long as the computer has been this complicated, hard to manage and yet critically important device, it's been easy to make the case for computer science as an important, lucrative discipline, and one worth getting into. Even in the past few years, with enrollments plummeting (they seem to be recovering now), there's been no argument about the importance of studying computer science (even if it comes across as boring to many).

And yet, how many people enroll in 'toaster science' ? More importantly, how many people are jumping at the chance to become automotive engineers ? As the computer becomes more and more of an appliance that we "just use", the direct connection between the person and the underlying computing engine goes away.

Obviously there'll always be a need for computer scientists. Those social networks aren't going to data mine themselves, and someone needs to design an auction for pricing IPad ads. But it's quite conceivable that computer science will shrink dramatically from its current size down to a much smaller discipline that generates the experts working in the backend of the big internet companies (Microsoft, I'm not optimistic about your chances of survival).

This cuts both ways: a smaller discipline with more specialized skills means that we can teach more complex material early on, and are likely to attract only the dedicated few. However, it'll be a "few", which means that for a while, until we reach a new stable equilibrium, there'll be way fewer jobs at lower salaries.

I make this observation with no great pleasure. I'm an academic and my job might disappear within 10 years for other reasons. But along with rejoicing in the mainstreaming of what appears to be fairly slick use of technology, I'm also worried about what it means for our field as a whole.

Tuesday, January 26, 2010

Author Feedback, or "Conference Review process considered harmful"

Author feedback is the latest attempt to put a band-aid on the bleeding carcass of the conference review process. We had author feedback at SoCG, and it's a common feature at many other conferences. The ostensible purpose of author feedback is to allow authors to clarify any misconceptions/confusions the reviewer might have so as to make the review process a bit more orderly (or less random?).

Usually, the process works like this: reviewers submit their reviews and have the option of requesting clarification on specific points from the authors. Authors get the questions, are required to submit a rebuttal/response by a certain date, and then deliberation continues. Variations on this include:
  • Length of the author response
  • When it's asked for (before discussions start, or after)
  • Whether it's called a 'rebuttal' or a 'response' or even just 'feedback' - I think the choice of word is significant
  • Whether the reviewers' current scoring for the paper is revealed or not.
While a good idea in principle, it can cause some headache for program committees, and often devolves into a game of cat and mouse: the reviewer carefully encrypts their questions so as not to tip their hand, the author tries to glean the reviewers' true intent from the questions, while trying to estimate which reviewer has the knife in, and so on and so forth.

What I want to rant about is the author feedback system for a conference I recently submitted to. The reviews came back long and vicious: as far as one reviewer is concerned, we should probably go and hide under a rock for the rest of our pathetic (and hopefully short) lives.

That doesn't bother me as much as it used to - I've grown a thick hide for these sorts of things ;). However, a combination of things has sent me into a fury:
  • The reviewer is actually wrong on most counts. This isn't a matter of disagreeing over motivation, relevance etc. It's just a basic "please read section 5, column 1, sentence 3" type problem.
  • The author feedback limit is 2048 characters (which is a rather tiny amount if you're counting at home)
There's a basic issue of fairness here. Why does a reviewer get to go off on a rant for pages, while we have to limit our response to essentially sentences of the form "Yes. No. Maybe" ? Especially when the reviewer is basically wrong on a number of points, it takes a while to document the inaccuracies. At the very least, we should get as many characters in our response as the reviewers got in theirs ! (point of note: the set of reviews was 11,225 characters long, and the specific reviewer I'm complaining about had a 2500-character review)

This paper is not getting in, no matter what we say: that much is clear. I've almost never heard of a paper successfully rebutting the reviews, and in all fairness the other reviewers have issues that are matters of opinion and can't be resolved easily. That is a little disappointing, but perfectly fine within the way the review process works. But I'm annoyed that there's no good way to express my dissatisfaction with the reviewing short of emailing the PC chair, and it's not clear to me that this does any good anyway.

Overall, I think that author feedback in the limit gets us to journal reviews, which is a good thing (and my colleague actually suggested that conference reviewing should have more rounds of author feedback and less time for actual paper reviewing). But the way it's done right now, it's hard to see it as anything other than 'reviewing theater', to borrow a Bruce Schneier term. It looks nice, and might make authors and PCs feel good, but has little value overall.

Update: in case it was implied, this conference is NOT SoCG :)

Monday, January 18, 2010

SODA Day 1

Priceless: seeing the look on a speaker's face when they're asked for the exact constant in their 'constant factor approximation'

Yes, it's SODA time. And having an iphone means I can leave my laptop in my room, which means blogging waits till the day is over.

It was a good first day. I chaired the session on (as I put it), "geometry, graphs, and the geometry of graphs", and really enjoyed all the papers in the session (David has a summary of the Lee-Sidiropoulos talk). The phrase of the session was 'Poincare inequality'.

Alex Andoni talked about work with T. S. Jayram and Mihai Pătraşcu on lower bounds for computing the edit distance (and related distances like the Ulam distance). The program of attack was via the use of information complexity - a technique first developed for use in streaming lower bounds, but which has general applicability to communication complexity problems. Here's how the argument goes, roughly speaking:

The direct-sum family of results says that the communication complexity of a function f expressible as an AND of n functions g is at least n times the information complexity of g. The overall plan is therefore to break down the strings being compared into pieces, and lower bound the information complexity of each piece.
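In symbols (my paraphrase of the statement above, notation mine): if $f$ decomposes as $f(x, y) = \bigwedge_{i=1}^{n} g(x_i, y_i)$, then

$$\mathrm{CC}(f) \;\ge\; n \cdot \mathrm{IC}(g),$$

so a lower bound on the information complexity of a single piece scales up to a communication lower bound for the whole function.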

Now let g be a threshold distance function that correctly reports if the distance is "too small" or "too large", but not in between. It turns out that the information complexity of g can be lower bounded by a constant relating to the Poincare inequality. So what is this inequality ?

In a sense, it captures the difficulty of distinguishing short distances from long. Specifically, look at the average distance over all pairs of points sampled from the "short" pairs, and do the same for the "long" pairs. If the two resulting numbers are within some constant of each other, then that's the constant used above, and it intuitively tells us that we can't tell the two kinds of pairs apart, distributionally speaking.
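One common way to write down such a Poincare-type inequality (my rough rendering of the intuition above, not the exact statement from the paper): for distributions $\mu_{\mathrm{close}}$ and $\mu_{\mathrm{far}}$ over short and long pairs, and any map $f$ into a target space with distance $d'$,

$$\mathop{\mathbb{E}}_{(x,y) \sim \mu_{\mathrm{far}}} \left[ d'(f(x), f(y)) \right] \;\le\; C \cdot \mathop{\mathbb{E}}_{(x,y) \sim \mu_{\mathrm{close}}} \left[ d'(f(x), f(y)) \right].$$

If such an inequality holds with a small constant $C$, then on average no map can separate the far pairs from the close pairs, which is exactly the "can't tell them apart distributionally" statement above.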

It's not easy in general to prove Poincare inequalities, although these appear to be at the heart of non-embeddability results. What the authors go on to do is show that if the distance metric being used is "complex", i.e., has a reasonably large communication complexity, then this can be used to show a Poincare-type inequality.

Neat stuff.

Saturday, January 16, 2010

Author rebuttal for SoCG 2010

There appears to be an author rebuttal phase for SoCG 2010. This is causing some confusion, since many of the reviews are blank. I suspect, based on my experience with DB reviewing, that this means the reviewers didn't want clarification.

I got one nontrivial "review" back, and am assuming that some response is called for. The text is phrased not so much as a request for clarification but as an actual review. It's a bit easier when there's a specific question to be answered.

I'm glad the committee is doing this though, even with the extra effort involved. It's a good way of clarifying things that could otherwise have consequences beyond their importance.

Friday, January 08, 2010

Guest Post: Question on posting to the arxiv

ed. note: this post is by Jeff Phillips. For another recent post on arxiv publishing issues, see Hal Daume on the arxiv, NLP and ML.


It seems that over the last few months, the number of papers posted to the arXiv has been noticeably increasing, especially in the categories of Computational Geometry and Data Structures and Algorithms.

I have posted several (but not all) of my papers on the arXiv. I still do not have a consistent set of rules under which I post the papers. Here are a couple of circumstances under which I have posted papers to the arXiv.

A: Along with Proceedings Version:
When the conference version does not have space for full proofs, post the full version to the arXiv in conjunction with the proceedings version. This serves as a placeholder for the full version until the journal version appears. Additionally, the arXiv paper can be updated when the final journal version appears, if it has changed.

Sometimes, I link to the arXiv version in the proceedings version. This makes it easy for a reader of the proceedings to find the full proofs.

If more conferences move to the SODA model where proceedings versions can be much longer (~20 pages), then this situation may not often be necessary.


B: Along with Submitted Version:
When you want to advertise a piece of work, but it has only been submitted, post a version to arXiv. This is useful if you are giving talks on the work, and want a documented time stamp so you can't get scooped, or say, you are applying for jobs and want to make your work very available and public.

This is closer to the math philosophy, where many (most?) people submit a version of a paper to the arXiv as soon as they submit it to a journal. I think it would be great if CS adopted this policy, as it would make it a lot easier to track results. I have a friend who, as a math graduate student, would start every day by perusing the dozen or so new arXiv posts in his area and choosing one paper to read. He told me that almost every paper he read as a grad student was on the arXiv. Wouldn't a world like that be extremely convenient?

However, I have had an issue following this rule. Last year I submitted a paper to a conference and concurrently submitted a longer version to the arXiv. The paper was, unfortunately, not accepted to the conference. My coauthor and I extended the results to the point where it made sense to split the paper. Half was then submitted and accepted to another conference, and full proofs were made available through a tech report at my coauthor's institution, as he was required to do. The second half, which has also been extended, is now under submission.

I might like to post the (full) second half to the arXiv, but do not want to duplicate the part from the previous post. I am not sure it makes sense to merge the papers at this point either. And I would also like to note on the arXiv page that that version has been extended, and that part of it now appears as a tech report.

What is the proper arXiv etiquette for this situation?

Thursday, January 07, 2010

Mixture Models: Classification vs clustering

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)
We're all family.

k-means is arguably the most popular clustering algorithm (and one of the top 10 data mining algorithms to boot). It has a close relative though (some might say a generalization) in the expectation-maximization algorithm (or EM).

The EM algorithm represents yet another paradigm shift in how we think about clustering, to the point where it's arguable whether we are clustering anything at all. What the EM algorithm really does is density estimation: it assumes the data being clustered is generated by picking one of k distributions according to some probability distribution (the mixture) and then sampling from this distribution.

In other words, the definition of a cluster is not in terms of relationships between pairs of data points any more: it's in terms of how a collection of points behave together as a group, as representative samples from a distribution.

It's not particularly hard to see that the k-means algorithm is really a special case of EM on Gaussians (if I might be permitted an indulgence, a "tropical" limiting case of EM). In general though, the EM algorithm is an alternating optimization that finds the maximum likelihood parameters for distributions that capture the data. In the "E" step, it fixes the current parameters (the current hypothesis for the distributions) and computes the expected (log) likelihood of the data by first estimating the "cluster membership probabilities" of assigning a point to a "cluster" or distribution, and then using this to estimate the overall likelihood function. The "M" step then finds the set of parameters that maximizes the likelihood function. In k-means language, the first step figures out which clusters to assign points to, and the second step assigns new centers to each cluster.
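Here is a minimal numpy sketch of those two steps for the simplest Gaussian case (spherical components with a fixed, shared variance - an assumption I'm making so that the k-means connection is easy to see; it's an illustration, not a production implementation).

```python
import numpy as np

def em_spherical_gaussians(X, k, n_iter=100, sigma2=1.0, seed=0):
    """EM for a mixture of k spherical Gaussians with fixed variance sigma2.
    E step: compute soft cluster-membership probabilities (responsibilities).
    M step: re-estimate mixture weights and means from those responsibilities."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    means = X[rng.choice(n, size=k, replace=False)]   # initialize at k data points
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E step: r[i, j] proportional to weight_j * exp(-||x_i - mu_j||^2 / (2 sigma2)).
        sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
        log_r = np.log(weights) - sq_dist / (2.0 * sigma2)
        log_r -= log_r.max(axis=1, keepdims=True)      # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M step: maximize the expected log-likelihood over weights and means.
        nk = r.sum(axis=0)
        weights = nk / n
        means = (r.T @ X) / nk[:, None]

    return weights, means, r
```

Shrinking sigma2 toward zero makes the responsibilities collapse to hard 0/1 assignments to the nearest mean, which recovers the k-means special case mentioned above.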

The EM algorithm is very powerful. It's generally known that if the distributions under consideration are from an exponential family, then this method is easy to implement, because the E and M steps reduce to simple operations that have closed forms. Via the usual Bregman connection between divergence functions and distributions, it's also possible to give a purely geometric interpretation to the EM method akin to k-means, but with a different distance structure.

What's even more fascinating is the information-geometric perspective. There's a wonderful Riemannian world in which distributions are points on a manifold with coordinate systems given by their (natural) parameters, and where various information distances can be related to affine connections on said manifold. Too much to go into here, but the key insight is this: the E and M steps of the EM algorithm can be seen as projection operations in primal and dual manifolds, so the EM algorithm really looks like a primal-dual procedure. Way cool !

So is this not clustering ? For a detailed argument along these lines, you should read Hal Daume's analysis (with a nifty example). My take on this is that while density estimation is indeed different from the problem of clustering, I think of mixture-model-based clustering as just a much more "volume-based" approach to clustering: you expect not that the points associate with each other per se, but that the ensemble of points itself satisfies some global properties.

I'll develop this idea further when I talk about information-theoretic clustering: to me, density-based clustering suggests a new abstraction for thinking about what a clustering ought to mean in the absence of even metric structure, and allows us to make pleasant detours into discrepancy, quasi-randomness and Kolmogorov complexity. Stay tuned....

p.s apologies for the long hiatus.

Wednesday, December 30, 2009

Reading papers.

It's been a bad few months for blogging, while I've been doing (shudder) "real work". Whenever people ask me how I make time for blogging, I always say that it takes less time than one thinks. This is generally true, but the last few months have been a blizzard of paper and proposal deadlines, and I just never got the time to sit down and pen coherent thoughts. I've also noticed that my tweeting is bleeding thoughts away from the blog: random throw-away comments that might have been assembled into a blog post end up merely getting tweeted.

For the first time in a long time, I took a "true" vacation, where not even a laptop came with me (ok, I took my iPhone, but that's not the same :)). It was only a week, but I appreciated the complete downtime, especially coming off the frenzy of deadlines. It's been hard to get back into the swing of things, but it was a really good way to recharge.

But you don't want to hear about any of that. What you really want to hear about is..... my lament on how to organize paper reading.

When I was a grad student, life seemed simple. STOC/FOCS/SODA (and then SoCG) would come around, we'd peruse the paper list, chatter on about them, and figure out if someone had scooped us or not. Now, things seem more complicated. Especially since I'm the sole algorithms person at the U, I feel the need to catch up on the latest (and not-so-latest) hot topics in theoryCS, at least to keep abreast of things and share the latest news with my students. I also work in a number of more 'applied' areas, which means that there's a SIGMOD/VLDB/PODS/ICDE timeline to keep track of, and more recently a NIPS/ICML/COLT timeline, not to mention even more applied areas like ICCV/CVPR/MICCAI (more on that later).

There's a large and complicated taxonomy of paper reading: some of the main items are:
  • Papers I'm reading for technical material when working on a problem - this is the easiest kind to manage, because you have to read it RIGHT NOW.
  • Papers that seem related to the problem I'm working on, and probably need to be cited, but are not in the critical path. This isn't too hard either - things like Mendeley allow me to save (and tag) papers for later use with just a few clicks. It's not a perfect system (what if the paper only exists on someone's home page ?), but it mostly works.
Then come the more complicated categories:
  • Papers related to a problem that I'm not quite working on right now, but seem relevant. I can sock them away, but I have to remember to read them when I return to the other problem
  • Papers that everyone's talking about at the latest-greatest conference, but might have nothing to do with my specific suite of problems (like for example the Moser LLL proof).
  • Papers that might help me catch up on an area that is very hot, but which I wasn't following from the beginning (cough cough AGT cough cough)
  • Papers that were just announced in the latest conference/arxiv/eccc issue, that sound worthy of perusal.
There are many technological solutions for squirreling stuff away: I even use Dave Bacon's arxiview app for the iPhone (and btw, I've found the only use for 2-column format - reading papers on the iPhone). I've also experimented with private wordpress installations to allow me to bookmark interesting papers.

But the real problem, which I have yet to crack, is how to systematically plow through the mountains of reading that I must do to stay abreast of the fields I'm interested in. I've tried "read one paper a day", or "read papers over the weekend", and things like that, but nothing ever seems to stick, and I'm curious about what techniques people have used that actually work. I'm not excluding technological solutions, but I think the problem goes beyond that.

So what say you all ?

Tuesday, November 24, 2009

Locating SODA outside the US

David Johnson forwards a note from the SODA Steering Committee about the possibility of having SODA outside the US. Summary: it can be done, with the kind of bid information that accompanies bids to other conferences with independent local arrangement. What follows is the note:

To those who might be interested in having a future SODA conference take place outside North America:

As you may know, local arrangements for SODA have to date always been handled by SIAM. The conference Department at SIAM has wide experience in organizing conferences in North America, but typically relies on help from local organizers when a conference they sponsor takes place elsewhere.

This presented difficulties when, at last January's SODA, the Business Meeting voted a preference for holding the 2011 SODA in Paris. SIAM and the SODA steering committee were unable to find any French organization that was prepared to handle the meeting in Paris. One organization might have been willing to host the meeting in the Parisian suburbs, but we judged that most of the voters for Paris would not have voted for "suburbs of Paris".

(No hotel was available in the second vote-getter, the Virgin Islands, and so SODA will be back at San Francisco, the 3rd-place choice, in 2011, January 22-25.)

In light of the experience in finding a location for the 2011 SODA, we are going to change the ground rules a bit for the 2012 site selection. Proposals of sites outside of North America will still be welcome, but to be considered, they must include the details of what group will handle local arrangements, where precisely the conference will be held, and what the expected hotel costs will be. In other words, the kind of information typically provided by groups proposing to host STOC or FOCS. (Proposals from groups that can collect registrations and take financial responsibility are preferred because of the added costs of SIAM providing these services in a country outside North America.)

If you are interested in making such a proposal at the SODA 2009 Business Meeting, please contact Kirsten Wilden (wilden@siam.org) for more details about what kinds of information SIAM will need about your proposal.

There has also been some thought that costs for SODA in North America could be reduced if local arrangements were placed in the hands of local volunteers, as they are for FOCS and STOC. If anyone wishes to propose to host SODA 2012 in this way, their detailed proposals will also be considered at the Business Meeting. (Again, contact Kirsten Wilden at SIAM in advance.)

And of course, proposals for sites that the SIAM conference staff can handle on their own can be made as usual.

Hope to see you all at SODA 2009 in Austin, January 16-19.

David Johnson, SODA Steering Committee Chair.

Sunday, November 08, 2009

Anathem, and mathematical stereotypes

Neal Stephenson is (or should be !) a familiar figure in the sci-fi/speculative fiction landscape: his Cryptonomicon is a retelling of the story of Turing, along with much modern-day drama involving advanced crypto and security. His new book, Anathem, came out with much fanfare, and is a vast tale set in a world where mathematics is pursued in monastery-like institutions with strong religious overtones.

I'm reading Anathem right now, and am only at the beginning (so no spoilers please), but there's already a beautifully rendered discourse on stereotypes of the mathematician (or scientist in general). Inside the 'concent' (the monastery), these are termed 'Iconographies': formal templates for understanding how the "saecular" world perceives the mathematicians. I was reminded of this when I was writing the post on Soviet-style mathematics and noticed the stereotypes at the heart of the referenced WSJ article.

So here are some of the iconographies discussed early on (you get one point for each one that you've encountered in real life):
  • Tennestrian: seemingly clownish eccentric figures that have a darker side, luring impressionables and innocents into folly (a play on the story of Socrates)
  • Doxan: brilliant, unemotional and cold, but as a consequence subordinate to passionate leaders with true human feeling. (can you say 'Spock'?)
  • Yorran: Criminally insane mad scientist shut up in a lab with plans to conquer the world.
  • Rhetorian: Sinister conspiracy plotters, out to take over the world by planting minions out in the secular world to attain positions of power.
  • Muncostran: Eccentric, lovable, dishevelled theoreticians, absent-minded and good-natured (the subtext being: ignorable)
  • Pendarthan: High-strung nervous meddling know-it-alls, unable to understand the realities of life, always subordinate to more 'masculine' (his words, not mine) secular folk.
  • Klevan: Awesomely wise elder statesman who can solve all the problems of the world (apart from Einstein, I don't know of anyone who achieves such stature in our world)
  • Baudan: Cynical frauds living in luxury at the expense of the common man (this sounds like the viewpoint of letter-writers to the Salt Lake Tribune)
  • Penthabrian: keepers of mystical secrets handed down from above, with 'theory' as a smokescreen used to fool the lay folk (if only...)
  • Moshianic: A combination of Klevan and Penthabrian - viewed as the most dangerous iconography because of the high expectations placed on the theorist's shoulders.

It's a neat trick to identify the ways in which the outside world perceives the 'theors', as they are called, and in doing so understand where the outsider is coming from, and what kind of threat they pose. I suspect I'm going to start classifying people in "the real world" the same way when I describe what I do.

Saturday, November 07, 2009

Soviet-style mathematics

Via Anand Kulkarni (aka polybot) comes an interesting article in the WSJ by Masha Gessen on Grigori Perelman, Soviet-era mathematics and the question of 'big math'. The premise of the article (Masha Gessen has a book out on Perelman and the Poincaré conjecture) is that special environments are needed to prove big results, and that the Soviet-era mathematical enclaves fostered this environment both because of, and in spite of, the Soviet political system.

It is indeed true that amazing work came out of the isolated confines of Soviet mathematical institutes, often parallel to or well before similar work in the Western world. There's a joke that goes around theoryCS circles that for every theorem proved before the 80s in the west, there's an equivalent result proved 10 years earlier by a Russian mathematician. We need look no further than the Cook-Levin theorem, the Koebe-Andreev-Thurston theorem (on circle packings), Kolmogorov-Chaitin-Solomonoff complexity (and according to some, the Cauchy-BUNYAKOVSKY-Schwarz inequality, though this is disputed).

But in the article is a more thought-provoking claim:
The flow is probably unstoppable by now: A promising graduate student in Moscow or St. Petersburg, unable to find a suitable academic adviser at home, is most likely to follow the trail to the U.S.

But the math culture they find in America, while less back-stabbing than that of the Soviet math establishment, is far from the meritocratic ideal that Russia's unofficial math world had taught them to expect. American math culture has intellectual rigor but also suffers from allegations of favoritism, small-time competitiveness, occasional plagiarism scandals, as well as the usual tenure battles, funding pressures and administrative chores that characterize American academic life. This culture offers the kinds of opportunities for professional communication that a Soviet mathematician could hardly have dreamed of, but it doesn't foster the sort of luxurious, timeless creative work that was typical of the Soviet math counterculture.

For example, the American model may not be able to produce a breakthrough like the proof of the Poincaré Conjecture, carried out by the St. Petersburg mathematician Grigory Perelman.

This is a reflection of one of the enduring myths of mathematical research - "a mathematician would be happy in jail if they had paper and pen" - with a bit of the 'solitary (and slightly crazy) genius' thrown in. I can see the allure in the idea: mathematics requires great concentration, and removal of distractions would surely make it easier to focus on a big problem.

But is this really impossible to achieve in the Western model of research ? After all, even Perelman's work built heavily on a program first outlined by Richard Hamilton from Columbia. Andrew Wiles proved Fermat's Last Theorem while at Princeton. Ketan Mulmuley has been banging away at P vs NP while shuttling between Chicago and IIT Bombay (yes, I know it's not a perfect comparison because the problem hasn't been resolved yet). Stephen Cook proved that SAT is NP-complete while at Toronto. And so on and so forth.

Possibly one argument in favor of the 'isolation: good' theory is that Perelman didn't need to prove himself for 6-7 years, maintain a steady stream of funding, and teach lots of classes in order to "earn" the right to study such a hard problem. It's hard to imagine a researcher in the US being able to do this before they get some kind of job security (tenure, or otherwise).

Monday, November 02, 2009

Innovation in Computer Science

As the polylogblogdogslogglog blog points out, the ICS results are out. 39 papers were accepted in all - at some point I knew the number of submissions, but I've forgotten since.

The ICS folks didn't make life easy for themselves by explicitly stating that they wanted "conceptual contributions". But looking over the list of papers, a few things come to mind:
  • It's a great list of papers. Nothing to complain about really, and any of these could have been a credible paper at FOCS/STOC
  • The Arora et al. paper on designing derivatives using NP-hard problems has already received so much chatter that one might argue the conference mandate has already been satisfied. Similarly for the quantum money followup.
  • Even if the whole 'conceptual contributions' thing doesn't pan out, I see no harm in having a third conference inserted between FOCS and STOC - the more the merrier.
  • I guess innovation = "game theory + crypto + quantum + misc" :)
A side note. 8 of the 12 PC members have papers at the conference. This is a departure from the theory "norm". While I don't necessarily think it's a problem (everyone except theoryCS does this, and this is also a new conference), it's worth discussing. Is this something we'd like to have happen in other theory conferences as well ? Recall that the first year of the SODA 2-page experiment also had PC-submitted papers (and that was mainly to ensure submission volume, from what I recall).

Update: Shiva Kintali has PDFs for the accepted papers.

Sunday, November 01, 2009

What is computational topology ?

Jeff's post on his computational topology class (and the wonderful class outline) prompted me to write about something that I've tried to explain to people before.

Computational topology is arguably the hottest thing in SoCG-land right now, and has been so for a number of years (for curious folk, the "other" hot topic is high dimensional approximate geometry). But if you ask different people, you'll get different answers to the question "what is computational topology ?". I've even had to explain to local students why my computational geometry class is different from the computational topology class being offered. So here goes:

CT I: Using topological ideas to understand the 'shape' of data
This is the version of CT that has taken up the largest fraction of CT mindshare in the SoCG land. Dating back to work by Edelsbrunner and Mucke on alpha-shapes, the field now encompasses a range of ideas for extracting topological structure from data. New topological constructs that have come from this area include various notions of persistence, as well as related work on reconstruction, data smoothing, and visualization.

Afra Zomorodian's thesis (now a book) is a nice introduction to these concepts for a CS audience. Herbert Edelsbrunner is coming out with a book on this topic very soon (Jan 16, 2010! Mark your amazons!), and Valerio Pascucci teaches a class on computational topology at the U.

CT II: Asking computational questions about fundamental topological objects and concepts.

This is closest to the 'Computational X' flavor of work, where a computational perspective is brought to the study of X. There are many interesting problems here, and they have a nice discrete flavor that makes them 'friendly' for theoryCS folk. For example, computing the Betti numbers of a simplicial complex efficiently, or finding homotopy equivalent paths on a surface, or lots of work on graphs on surfaces.
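
As a heavily simplified illustration of the first of these problems (a toy sketch of mine, not an "efficient" algorithm in the sense the literature cares about), the Betti numbers of a small simplicial complex can be read off from the ranks of its boundary matrices, since beta_k = dim C_k - rank ∂_k - rank ∂_{k+1}:

    import numpy as np

    def boundary_matrix(k_simplices, faces):
        """Signed boundary matrix taking k-simplices to their (k-1)-faces.
        Simplices are tuples of vertex ids listed in increasing order."""
        idx = {f: i for i, f in enumerate(faces)}
        D = np.zeros((len(faces), len(k_simplices)))
        for j, s in enumerate(k_simplices):
            for i in range(len(s)):
                D[idx[s[:i] + s[i+1:]], j] = (-1) ** i
        return D

    # Hollow triangle: three vertices, three edges, no 2-simplex.
    verts = [(0,), (1,), (2,)]
    edges = [(0, 1), (0, 2), (1, 2)]
    r1 = np.linalg.matrix_rank(boundary_matrix(edges, verts))
    beta0 = len(verts) - r1          # rank of d_0 is 0: one connected component
    beta1 = (len(edges) - r1) - 0    # no 2-simplices, so rank of d_2 is 0: one loop
    print(beta0, beta1)              # -> 1 1

Real computations work over a field (or with Smith normal form over the integers), usually along an entire filtration for persistence, and on vastly larger complexes - but the rank computation above is the kernel of the story.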

I don't think there's a single book on this topic. JeffE has put together a fantastic course outline (with reading material), and there's also a great book on graphs on surfaces by Mohar and Thomassen. It's worth noting that some of the deeper tools in the Robertson-Seymour graph minor results take us into graphs-on-surfaces-of-large-genus land.

CT III: Using topological ideas in the heart of complexity theory.
Almost no one I know uses 'computational topology' in this sense, and there isn't a coherent and connected body of work to talk about as such. But there are some fascinating results at the core of complexity theory that rely on topological constructions. There's the algebraic complexity work of Yao/Steele/Ben-Or and others, showing that lower bounds on the algebraic complexity of certain problems can be related to the sum of Betti numbers of associated semi-algebraic sets. There's the Kahn-Saks-Sturtevant (and Chakrabarti-Khot-Shi) work on evasiveness of graph properties, and then the work (that I don't quite understand) on topology in distributed computing (which got Herlihy, Saks, Shavit and Zaharoglou the Gödel Prize).

This is one of the reasons I think a course in topology (of some flavor, whether it be combinatorial, algebraic or point-set) should be required mathematical background for all theoryCS students.

Wednesday, October 28, 2009

Clustering interlude: Time series

In the absence of an actual post on clustering, I encourage all of you to go and read Sorelle's (or should I say ***SORELLE***'s) post on time series clustering.

Monday, October 19, 2009

"Wave"ing...

(while I wait for actual inspiration to strike)

I just acquired an invite to Google Wave (thanks to John Moeller), and have been finding it quite intriguing (note: you can't "get" Google Wave unless you're invited, and no, I don't have any invites to give out).

Google Wave is a mysterious combination of email, chat, wikis and browser extensions that's hard to describe - you can read descriptions of it all over the web, but unless you are able to get in and start playing, it's like trying to buy wine based on text descriptions of the taste.

So far I've used it to:
  • discuss an outline for next semester's seminar with my students (Topics in Graph Algorithms, for those curious)
  • Start working with a collaborator on an outline for a tutorial we want to do
  • set up administrative struts for our research group svn server (I've started using svn for writing papers - it's an amazing experience - more on that some other time)
By far the coolest feature of Wave for me is that you can include a LaTeX extension in any "wave" or conversation, and it automatically renders anything between $$...$$ marks as LaTeX, which makes me hopeful that it will be useful for more substantive discussions (since I often wish I had such capabilities in Skype/Google chat).

Although the preview is painfully slow, and is crippled in various ways, the potential is clearly there, and as more people get on it, it will only start getting more effective. I'm looking forward to it.

Monday, October 12, 2009

On the virtue of NOT working on a problem

The semester has hit with a vengeance, and while I've been busy (among other things) with this and this, my clustering series has gone on temporary hiatus, hopefully to return shortly.

In all the pages and pages of advice given to grad students, postdocs, and starting faculty, I think one item tends to get left by the wayside, or at least is not explicitly stated.
You always underestimate the time spent managing a project from start to finish.
What I mean is this: problems (at least in theoryCS) are easy to state, and fun to work on. Sometimes they take a while to crack, and sometimes they give up their secrets easily. But the time you spend on any given project is much more than the actual time spent thinking about it. There's
  • Writing up the first few drafts
  • Iterating to get a polished submission version
  • (...this step repeats until the paper is accepted)
  • Preparing the final (often very cramped) version
  • Making slides for the talk/talks you'll be giving
  • Preparing a full/journal/arxiv version, which often involves simplifying, rewriting, reproving, adding new references, etc etc.
  • Submitting to a journal, and waiting endlessly for updates on its status.
  • Addressing reviewer concerns, and resubmitting
  • And finally, getting it into print.
The few-months thrill of actually thinking about the problem and solving it ends up being a multi-year odyssey filled with many fallow periods punctuated by bursts of activity.

It's not so much the time involved - papers tend to time-multiplex quite well so you're usually in different phases of the above sequence for different papers.

It's more a matter of motivation. I don't think I'm the only person who feels this, but once I have some nice results, and especially if there isn't follow-on work to be done, I get bored with a paper. Having to deal with it for months and months afterwards is then as excruciating as killing off zombies that keep coming back (not to mention what happens if it keeps getting rejected).

So be careful when you choose a project: make sure it can last through at least a few papers, or you'll spend a lot of time cursing yourself for the time you're sinking into it.
