## Sunday, September 30, 2007

### "Related Work" vs "Prior Work"

(ed: As deadlines draw near, I get increasingly caught up with the minutiae of paper-writing. Be warned)

"Related work" or "Prior work" ? I've seen both section titles used when describing all the work that came before the magnificient specimen of research that I'm reading right now. Personally, I prefer 'Prior work': it seems somehow closer to the spirit of 'I stand on the shoulders of giants', even if that was really just a dig at Hooke. 'Related Work' to me has the tone of a grudging 'yeah yeah, these guys did something that might vaguely resemble what we do here, and I guess I should include it, but don't imagine that their work in any way influences my brilliant thoughts, which are sui generis'.

An issue that appears to be related to this is the placement of the Related/Prior Work section. I personally prefer to have prior work discussions just after the introduction, unless discussing prior work requires the use of notation that I introduce in a definitions section. In many papers (and some of mine as well), the prior work gets shunted off to the no-man's land before the conclusions. I wonder if there's some correlation between the choice of the title, 'Related Work' and the banishing of the section to the very end, a kind of double dismissal of the value of the cited works.

## Wednesday, September 26, 2007

### The "Voronoi trick"

Only a few days after talking about work on redistricting based on computing a geometric clustering, I came across (via Andrew Gelman) yet another paper on automatic redistricting by Roland Fryer and Richard Holden.

It's a very interesting paper, and at the risk of sounding patronizing, I have to say that I'm impressed at their use of power diagrams and the Voronoi trick, something that many theoryCS folks may not have encountered. Once again, the problem they're looking at is redistricting: can you design voting districts to satisfy certain fairness criteria ?

One specific question that they ask is: can you evaluate the quality of a state's redistricting plan ? Their answer is in the form of an approximation ratio. They fix a measure of quality for a redistricting, and take the ratio of the actual plan cost to the cost of the optimal plan. Their cost, in this case, is a well known clustering measure: the cost of a cluster (in this case, a district) is the all-pairs sum of distances squared, and the cost of a clustering (in this case, a redistricting) is the sum of cluster costs. One extra twist is that they require all clusters to be equally sized upto rounding errors. That is, a 2-clustering of 9 people must have one cluster with 5 and one cluster with 4 people.

The problem of computing an optimal clustering is NP-hard if k, the number of clusters, is part of the input. However, if not, then there's an elegant method for solving these kinds of problems which I'll call the "Voronoi trick".

The Voronoi Trick:

A clustering of n items into k clusters is a partitioning problem, and as such, there are roughly k^n ways of assigning items to clusters (k^n comes from treating all clusters as distinct; in general, the amount might be reduced by as much as k!). But for geometric problems, we fully expect that not all partitions are likely to be optimal, because of proximity constraints and the triangle inequality. Is there some way of restricting the set of possible partitions ?

Consider the following set of possible partitions of a set of n points in a Euclidean space. We put down k centers, and assign each point of the input to its nearest center. Now not all placement of k centers lead to distinct partitions: I can always perturb one center slightly without changing the combinatorial structure of the partition. We'll call such a partition a Voronoi partition, from the connection to Voronoi diagrams.

Now it's clear that not all partitions of the input are realizable as Voronoi partitions. For example, if we consider k=2 points in the plane, then there are only O(n^2) distinct partitions, rather than the 2^n ways of partitioning a set of items into two groups. In general, for a set of n points in d dimensions, there are roughly n^{kd} distinct Voronoi partitions, which is polynomial in n, rather than being exponential.

Now here comes the trick: for many clustering problems, including the one mentioned above, we can show that the optimal clustering must be realizable as a Voronoi partition. This kind of result has appeared in many different forms: Inaba et al showed that for two different kinds of minimum variance clustering (one where the cost of a cluster is the sum of squared distances from the center, and the other where the cost is the pairwise sum of squared distances of points in the cluster), the optimal clustering is a Voronoi partition. Other forms of this trick were used by Aurenhammer et al in their paper on least-squares partitioning where they showed that we could compute the optimal RMS matching between two point sets using Voronoi partitions.

One aspect of this trick that I elided: often, the Voronoi partitioning doesn't come from using a straight distance function, but from using a weighted distance function. One way this is done is to associate a weight w(s) with each site, and define the Voronoi bisector as the set of points satisfying d(x,s) - w(s) = d(x,s') - w(s'), rather than the usual d(x,s) = d(x,s'). If you recall the view of a Voronoi diagram as the lower envelope of a collection of cones in one higher dimension, with apexes at the site locations, then a weighted Voronoi diagram corresponds to lifting the cones by different (even negative) heights, corresponding to the weight function w(s). This diagram is also called a power diagram.

The Voronoi trick and clustering:
Apart from the fact that the Voronoi trick allows us to reduce the space of possible partitions to consider, it also allows us to isolate the influence of points on clusters. This idea was first exploited in the above-mentioned paper by Inaba et al, and has since been used in approximate clustering work by Badoiu, Har-Peled and Indyk, as well as the recent linear-time k-means algorithms by Kumar, Sabharwal and Sen. The argument goes something like this:

Since we know that each center is "influenced" by points in its Voronoi cell, we can focus on the problem of finding a single center out of the k. This is typically solved using a (simpler) 1-median or 1-means algorithm, run on a random sample of points that is likely to contain a large number of points from a single cluster. Once this is done, we can "peel" off this center, and repeat. The ability to do random sampling and peeling comes from the idea of irreducibility: namely, that either a k-clustering can be approximated using a k-1-clustering, or if not, it means that the k^th cluster is "reasonably well-separated" from the rest, and can be isolated using the above ideas.

Returning to redistricting:
Coming back to the redistricting paper, what they do is essentially show that their all-pairs measure of the cost of a cluster reduces to the 'cost to the centroid' measure, since all clusters are required to have the same size. In this case, they can make use of the power diagram to search efficiently for solutions. If I may nitpick, I wonder why they didn't look into the many approximation schemes available for these problems (but this is really a nitpick).

Postscript:
There are other tricks that are also worth of being called a 'Voronoi trick'. One such idea dates back to a paper by Dobkin, Drysdale and Guibas from 1983 ("D.P. Dobkin, R.L. Drysdale, IIi, and L.J. Guibas. Finding smallest polygons. In Advances in Computing Research, Voi. 1, JAI Press, 1983, 181-214."), and works as follows:

Suppose you want to find the k points from a point set of size n that optimize some function: for example, you want the k points with the smallest enclosing ball, or the k points with the smallest convex hull area, and so on. Further, let's say you have some expensive algorithm that can solve this problem in time O(n^c). For many of these functions, it can be shown that the optimal set of k points must lie within a single cell of the O(k)^th order Voronoi diagram of the set of points. In case you hadn't encountered this notion before, the k^th order Voronoi diagram is what you get if instead of taking the lower envelope of your cones, you take the k^th level of the arrangement. Intuitively, it consists of cells in which for any point, its k nearest neighbours are a fixed set of points.

The k^th order Voronoi diagram of points in the plane has complexity O(kn). Using the above fact, we can enumerate over all cells, and run the expensive algorithm over the O(k) points in each cell. This gives a running time of O(k^c * kn = k^{c+1} n), rather than n^c. This idea works upto a point: see this paper by fellow bloggers David and Jeff for improvements that bypass the Voronoi construction in favor of iterated nearest neighbors.

## Friday, September 21, 2007

### Manifold learning and Geometry

In June, Partha Niyogi gave the invited lecture at SoCG. The subject was manifold learning:
Given a collection of (unlabelled) data inhabiting some high dimensional space, can you determine whether they actually lie on some lower dimensional manifold in this space ?
This talk was a great window into an area of machine learning with strong geometric connections, an area where geometers could profitably play and make useful contributions.

Now NIPS (one of the major machine learning conferences) is returning the favor, with a workshop devoted to the topic of Topology Learning:

There is a growing interest in Machine Learning, in applying geometrical and topological tools to high-dimensional data analysis and processing.

Considering a finite set of points in a high-dimensional space, the approaches developed in the field of Topology Learning intend to learn, explore and exploit the topology of the shapes (topological invariants such as the connectedness, the intrinsic dimension or the Betti numbers), manifolds or not, from which these points are supposed to be drawn.

Applications likely to benefit from these topological characteristics have been identified in the field of Exploratory Data Analysis, Pattern Recognition, Process Control, Semi-Supervised Learning, Manifold Learning and Clustering.

However it appears that the integration in the Machine Learning and Statistics frameworks of the problems we are faced with in Topology Learning, is still in its infancy. So we wish this workshop to ignite cross-fertilization between Machine Learning, Computational Geometry and Topology, likely to benefit to all of them by leading to new approaches, deeper understanding, and stronger theoretical results about the problems carried by Topology Learning.

The list of invited speakers bridges geometry, topology and learning. It looks like a great forum for continuing the machine learning-computational geometry cross-talk that Partha kicked off at SoCG.

## Thursday, September 20, 2007

### Modelling via the algorithmic lens

Much of my time is spent on the interface between algorithms research and other areas, in the dreaded trenches of 'algorithmic modelling'. Algorithmic modelling can be a tremendously useful tool, if wielded skillfully, and in the few cases where one hits the jackpot, you have an instant argument for the value of theoryCS outside computer science. Most importantly, it provides a kind of rationale for the intrinsic value of the algorithmic method: that thinking algorithmically leads to new ways of thinking about computational problems, and hooks problems in other domains into the engine of algorithmic improvements, so that better algorithm design leads quickly to better solutions to "real-world" problems.

Such work however can be an endless source of frustration. Most commonly, there are cultural difference between areas, and terminology barriers that need to be overcome. But most of all, there's a sense of apathy towards any effort at producing formal guarantees for computational methods (either for running times, or quality). In general, methods based on formal algorithmic analysis still prove themselves in the experimental setting, rather than possessing any intrinsic value by virtue of being correct, having running time and quality guarantees etc.

This in itself can be defended: if the main bottlenecks towards providing an effective practical solution to a problem are not computational (which they often are though), then it's not clear that a practitioner needs to care about esoteric notions like formal guarantees (which can often be quite weak). What is harder to surmount is the 'I've dug my hole and I like it' phenomenon.

For example, the RMS (root-mean-square) distance is commonly used to compare two 3D protein models. But I've never heard any fundamental argument as to why this measure is superior to other possible candidates for matching shape. What's worse, the RMS measure is harder to minimize (under transformations) than other equally plausible measures of shape similarity. However, as the status quo, it's hard to do anything to change this, short of voluminous studies that argue for the relative merits of a different measure based on comparisons on shapes.

Lest it seem that I'm merely whining, Barbie-style, because 'Modelling is hard', let me summarize my main point:

The computational efficacy and guarantees associated with a particular modelling technique are not in and of themselves sufficient to convince a practitioner that this approach is valid, even if in all other respects, the method is identical to other, more entrenched methods.

This makes me wonder about the best way to do algorithmic modelling, if one chooses to do it at all.

## Sunday, September 16, 2007

### Sunday afternoon MathTubing

Via John Baez, learning category theory on YouTube ! Now what we need is a YouTube video for Scott Aaronson's soap film demonstrations.

The videos are listed under the category, 'How-to and DIY'. Indeed ! Make your own monads, in 9 minutes or less !

## Saturday, September 15, 2007

### The hardness of fair redistricting

Through a series of pointers, I ended up at Brian Olson's page on redistricting, where he proposes an algorithm to do "fair redistricting" of districts so as to avoid problems with gerrymandering. The measure proposed appears to be a discrete k-median variant: namely, cluster "blocks" into k districts such that the average distance to the center of the district (defined as the centroid of the collection of blocks) is minimized. His code is online, and appears to use a GA-type heuristic. This is essentially a planar k-median problem, and is solvable approximately by a variety of methods.

In reading some of the (critical) comments on some of the original posts, (most of the form (a) this has been proposed ever since the early 60s (b) automation is not the hardest problem in redistricting), I was led to an award-winning political science dissertation by Micah Altman, in which (among other things) he shows that most of the reasonable redistricting formulations (although not the one above) are NP-hard (they tend to be variants of set packing). It was interesting to read a lengthy discourse on computational complexity and NP-hardness in the middle of such a dissertation.

## Monday, September 10, 2007

### An interesting blog

Derrick Coetzee writes a blog at MSDN titled, 'Developing for Developers'. There's lots of stuff related to code development, but also articles on general computing and theory. Some recent posts:
None of this is necessarily new to people reading this blog, but it's heartening to see an audience for this material in the developer community. And of course, anyone who says,
In reality, computing has nothing to do with computers. Computing is fundamentally about solving problems using strictly defined processes under resource constraints. The theory of computing, by design, has little regard for what device you use to carry those processes out.
will always have a special spot in my blogroll :)

## Sunday, September 09, 2007

### The TheoryCS blogosphere flourishes !

Jeez. I leave my computer for two hours, and four different bloggers announce the SODA accepts. If for some extremely obscure reason you haven't yet seen the postings, here's the list.

## Saturday, September 08, 2007

### Carnival of Mathematics XVI

is up, courtesy Kurt van Etten @ Learning Computation. The theme is 'Good, Bad and Ugly', and for some reason, computer science is "Ugly" :). Given that two of the entries involve mathematical feuds, maybe this is fitting.

### Physician, heal thyself ?

There's been a lot of grief over the reduction in price for iPhones only a few months after they were released. Wired magazine interviewed people who don't regret paying the higher price for the virtue of being an early adopter/arbiter-of-cool, and this comment caught my eye (emphasis mine):
"If they told me at the outset the iPhone would be $200 cheaper the next day, I would have thought about it for a second - and still bought it," said Andrew Brin, a 47-year-old addiction therapist in Los Angeles. "It was$600 and that was the price I was willing to pay for it."
I don't know: I think there are some other people who should get their money back.

## Wednesday, September 05, 2007

### Heuristics for NP-hard problems

A while back, Michael was discussing the teaching of heuristics in his undergraduate algorithms class, and had this request:
I wish the standard textbooks would devote a solid chapter on heuristic techniques; in real life, most students are probably more likely to have to implement a heuristic for some NP-hard problem than to have to implement a minimum spanning tree algorithm.
If you remove the word, 'standard', he may just have got his wish. From the press blurb for the Handbook of Approximation Algorithms and Metaheuristics:
Delineating the tremendous growth in this area, the Handbook of Approximation Algorithms and Metaheuristics covers fundamental, theoretical topics as well as advanced, practical applications. It is the first book to comprehensively study both approximation algorithms and metaheuristics. Starting with basic approaches, the handbook presents the methodologies to design and analyze efficient approximation algorithms for a large class of problems, and to establish inapproximability results for another class of problems. It also discusses local search, neural networks, and metaheuristics, as well as multiobjective problems, sensitivity analysis, and stability.
Browsing the table of contents, the reader encounters an entire section on heuristics, with chapters on local search, stochastic local search, neural networks, genetic algorithms, tabu search, simulated annealing, ant colony optimization and the like. The book is published by CRC, which stands for "Organization-That-Singlehandedly-Destroys-Rain-Forests-While-Making-Your-Biceps-Extremely-Strong".

I generally have mixed feelings about CRC books: they are good reference texts, but tend to skim over content way too easily, and CRC has this unhealhy obsession with handbooks for everything. But for teaching heuristics, this might be an invaluable reference.

I should mention that the vast majority of the handbook deals with "proper" approximation algorithms, as well as applications to specific areas. The problem is that approximation algorithms is a fast moving field (though maybe not as fast as in the middle 90s), and so a handbook, being a static snapshot of the time, will get dated quickly.

## Tuesday, September 04, 2007

### Teacher's Day

I've been out of India for far too long, or so it seems. I had to be reminded by this article that Sep 5, Wednesday, is Teacher's Day in India. This day was chosen for its co-incidence with the birthday of India's second president, S. Radhakrishnan, a statesman and politician famous for his erudition and learning. According to Wikipedia, only 21 countries have an official celebration of Teacher's Day. Much to my surprise, the US is among them, with the day being part of a week called Teacher Appreciation Week.

Teacher's day in my school was always a big event; teachers did not teach that day, and came to school mainly to be entertained and feted in an assembly with variety shows/skits/gentle mockery. Senior students would mind the junior classes. when I was younger, this meant a day of partying in school. When I became one of the seniors, this meant I was able to come to school in "civilian clothing" (most schools in Delhi had strict uniform policies) to boss over the younger students. Good times...

Now I'm a teacher myself (after a fashion), and find it both easy (and fashionable) to grumble about the lack of respect shown by students in the US, as compared to students in India. The truth of course is far more complicated than that. Schools in Delhi appear to have acquired many of the "bad traits" of the worst high schools here; my mother has taught in Delhi schools for nearly 20 years and has seen them acquire more "western" habits (which is a shorthand for more boy-girl interaction, less "studiousness", less fear/reverence for the teachers, take your pick). And ultimately, as a university professor, I'm not even on the frontlines of education here in the US, and have no business complaining about the largely excellent students I teach.

In any case, here's a cheer for the teachers I had; the ones who tolerated my brashness, general arrogance, and constant questions, and helped me reach the teaching pedestal I find myself at today. And here's hoping that one day there'll be some student who might think as fondly of me as I think of all my teachers long past.

## Monday, September 03, 2007

### NP hardness of Euclidean k-median clustering

Suppose you're given a metric space (X, d) and a parameter k, and your goal is to find k "clusters" such that the sum of cluster costs is minimized. Here, the cost of a cluster is the sum (over all points in the cluster) of their distance to the cluster "center" (a designated point).

This is the well-known k-median problem (which differs from the also popular k-means problem by the use of distances, rather than squares of distances). In a general metric space, the k-median problem is known to be NP-hard, as well as being hard to approximate to within arbitrary constant factor. The current best known approximation ratio for the k-median is 4, due to Charikar and Guha 3 + eps, via a local search heuristic due to Arya, Garg, Khandekar, Meyerson, Munagala and Pandit (thanks, Chandra).

If the underlying metric is the Euclidean metric, then the problem complexity changes: in fact a PTAS for the Euclidean k-median can be obtained, due to the results of Arora, Raghavan and Rao, and then Kolliopoulos and Rao (who provide an almost-linear time algorithm). But as far as I am aware, there is still no NP-hardness proof for the Euclidean k-median problem, and I'd be interested in knowing if I am wrong here.

Note that the related problem of Euclidean k-means is known to be NP-hard from an observation by Drineas, Frieze, Kannan, Vempala and Vinay.

Update: As pointed out in the comments, the NP-hardness of k-median in the plane was shown in 1984 by Megiddo and Supowit. Once again, the wisdom of the internet beats independent research ;)

### Tenure committee is to Colombian drug cartel as ...

Via HWTW, a hilarious advice piece to new profs by Phil Ford at Inside Higher Ed, adapted from a Notorious B.I.G rap, "The Ten Crack Commandments". The article is great, but the comments are even funnier.

### "Data Mining" = voodoo science ?

On the Statistical Modeling blog, Aleks Jakulin has a rant on the virtues of data mining:

I view data analysis as summarization: use the machine to work with large quantities of data that would otherwise be hard to deal with by hand. I am also curious about what would the data suggest, and open to suggestions. Automated model selection can be used to list a few hypotheses that stick out of the crowd: I was not using model selection to select anything, but merely to be able to quantify how much a hypothesis sticks out from the morass of the null.

The response from several social scientists has been rather unappreciative along the following lines: "Where is your hypothesis? What you're doing isn't science! You're doing DATA MINING !"
I had almost the same reaction a while back when I was visiting JPL: the climatologists there were horrified at the idea of trolling for patterns in climate data, and to the person, asked me the dreaded 'But what is the science question?" question. Of course, given the general hot-potato-ness of climatology right now, one might sympathize with their skittishness.

Data mining is a tricky area to work in, and I've discussed this problem earlier. It's a veritable treasure-chest of rich algorithmic problems, especially in high dimensional geometry, and especially over large data sets. However, it's often difficult to get a sense of forward progress, especially since the underlying analysis questions often seem like elaborate fishing expeditions.

In that context, the distinction Aleks makes between confirmatory data analysis (check if the data validates or invalidates a hypothesis) and exploratory data analysis (play with the data to create a non-uniform distribution on plausible hypotheses) is quite helpful. It also emphasizes the interactive and very visual nature of data mining; interactive tools and visualizations are as important as the underlying analysis tools as well.

Update: Chris Wiggins points me to some of the earlier references to 'data mining'. One of the most vituperative is a paper by Michael Lovell in 1983 in The Review of Economics And Statistics. This paper drips with scorn for 'data miners', but makes a point that is at the very least worthy of consideration: namely that because of the large dimensionality of the space of hypotheses that a data mining application typically explores (here couched in terms of explanatory variables for a regression), patterns with apparently high p-values might not actually be that significant (or stated another way, in high dimensional spaces, there are many seemingly rare patterns that aren't that rare).