Wednesday, June 06, 2007

SoCG 2007: Approximate Clustering

I was listening to a couple of talks that improve known bounds for various kinds of approximate clustering in high dimensions, and I got to thinking.

One of the revolutions in geometry over the last 10 years has been the development of nontrivial tools for dealing with approximations in high dimensions. This is of course necessitated by the curse of dimensionality, and the hardness of most high-D data analysis problems (most exact solutions are exponential in the dimension). So rather than computing the optimal solution to some problem on n points in d dimensions, we compute a (1+e)-approximation to the optimal solution.

One problem this creates is that every algorithm is now described by a complicated expression involving three parameters (n, d, e). Some algorithms are exponential in 1/e, but polynomial in d. Some are poly in 1/e, and exponential in d. The exponent could have a base of n, or 2, or even d, or 1/e.

In short, it's an unholy mess of strictly incomparable methods that tradeoff different parameters against each other.

This makes life for me, the "user" of approximation technology, rather difficult. What I'd really like to understand are the gadgets that transform one kind of bound to another (and there are many such gadgets: discretization, enumeration, random sampling, etc). But it's hard to gather these from the papers directly: these gadgets (the really useful tools) are buried deep inside lemmas, and inside people's heads.

What I'd like to see is some kind of "taxonomization" or classification of the different "tricks of the trade" in high-dimensional geometric approximation, with some sense of which techniques apply when, and why. In fact, I like this idea so much I might even try to run a seminar on it..

Disqus for The Geomblog