## Wednesday, June 06, 2007

### SoCG 2007: Approximate Clustering

I was listening to a couple of talks that improve known bounds for various kinds of approximate clustering in high dimensions, and I got to thinking.

One of the revolutions in geometry over the last 10 years has been the development of nontrivial tools for dealing with approximations in high dimensions. This is of course necessitated by the curse of dimensionality, and the hardness of most high-D data analysis problems (most exact solutions are exponential in the dimension). So rather than computing the optimal solution to some problem on n points in d dimensions, we compute a (1+e)-approximation to the optimal solution.

One problem this creates is that every algorithm is now described by a complicated expression involving three parameters (n, d, e). Some algorithms are exponential in 1/e, but polynomial in d. Some are poly in 1/e, and exponential in d. The exponent could have a base of n, or 2, or even d, or 1/e.

In short, it's an unholy mess of strictly incomparable methods that tradeoff different parameters against each other.

This makes life for me, the "user" of approximation technology, rather difficult. What I'd really like to understand are the gadgets that transform one kind of bound to another (and there are many such gadgets: discretization, enumeration, random sampling, etc). But it's hard to gather these from the papers directly: these gadgets (the really useful tools) are buried deep inside lemmas, and inside people's heads.

What I'd like to see is some kind of "taxonomization" or classification of the different "tricks of the trade" in high-dimensional geometric approximation, with some sense of which techniques apply when, and why. In fact, I like this idea so much I might even try to run a seminar on it..

1. Look at the notes from the first half of the class Pankaj taught last semester:
http://www.cs.duke.edu/education/courses/spring07/cps296.2/lectures.html

Also, Sariel's book-to-be should have some nice tricks as well:
http://valis.cs.uiuc.edu/~sariel/teach/notes/aprx/lec/

this two places should be a great place to start for someone who already has a basic understanding of algorithms and geometry.

Jeff

2. What I'd like to see is some kind of "taxonomization" or classification of the different "tricks of the trade" in high-dimensional geometric approximation

Sounds like a job for parameterized complexity.

3. Good questions, Suresh.

Like Gareth I am curious what complexity/lower bounds theory has to say about this mess of parameters, specifically, in problems where NP-completeness is not the appropriate concept.

For instance, exact nearest-neighbor search can be done in time polynomial in the database, but is there work which suggests that it will necessarily suffer more from high dimensionality than the clever algorithms for approximate nearest-neighbor?

4. Hi,

There are some provable separations between the exact and approximate nearest neighbor. It is actually easier to talk about the *near* neighbor problem, where you are given a parameter r, and for any query q, the goal is to check if there is any data point within distance r from q (the approximate version allows wrong answer if the distance to the nearest neighbor of q is in the range [r, (1+eps)r].)

What is known:

* (1+eps)-approx near neighbor in d-dimensional spaces (Hamming, etc) can be solved using only one query to a data structure of size n^O(1/eps^2)

* exact near neighbor in d-dimensional Hamming space requires roughly Omega(d) queries to any data structure of polynomial size.

For references, see:
"Nearest Neighbors in High-dimensional Spaces" at
http://theory.lcs.mit.edu/~indyk/39.ps

Piotr

5. To add to what Piotr pointed out; although this is not exactly the same thing, Luca Trevisan showed that for geometric TSP, the doubly exponential dependence on dimension for approximation algorithms cannot be removed unless NP has subexponential algorithms.

There are other results that show that if the dimension is part of the input, various clustering problems are NP-hard to solve exactly, and in some cases have approximation hardness results. But there is still no nuanced understanding of the relation between n, d, and e. One thing to keep in mind is that these aren't independent parameters, in the sense that any lower bound will have to trade one parameter off against the other.

6. Oh yes, another separation: it is known that you can solve c-approximate near neighbor with subquadratic space and dn^{1/c^2} query time, while there is a dn^Omega(1/c^2) query time lower bound. However, the lower bound only works for a restricted class of "hashing-based algorithms". See Andoni et al, FOCS'06, and Motwani et al, SOCG'06.

Piotr