Monday, April 07, 2014

Directed isoperimetry and Bregman divergences.

A new cell probe lower bound for Bregman near neighbor search via directed hypercontractivity. 

Amirali Abdullah and I have been pondering Bregman divergences for a while now. You can read all about them in my previous post on the topic. While they generalize Euclidean geometry in some nice ways, they are also quite nasty. The most important nastiness that they exhibit is asymmetry:
\[ D_\phi(x, y) \ne D_\phi(y, x)\]
What's worse is that this asymmetry can grow without bound. In particular, we can quantify the degree of asymmetry by the parameter $\mu$:
\[ \mu = \max_{x,y \in \Delta} \frac{D_\phi(x,y)}{D_\phi(y,x)} \]
where $\Delta$ is the domain of interest.

There's been a ton of work on clustering Bregman divergences, and our work on low-dimensional approximate nearest neighbor search. In almost all these results, $\mu$ shows up in the various resources (space and/or time), and it seems hard to get rid of it without making other assumptions.

After our low-D ANN result, we started trying to work on the high-dimensional version, and our thoughts turned to locality-sensitive hashing. After struggling for a while to get something that might be useful, we started pondering lower bounds. In particular, it seemed clear that the worse $\mu$ was, the harder it became to design an algorithm. So could we actually come up with a lower bound that depended on $\mu$ ?

Our first result was via a reduction from the Hamming cube (and $\ell_1$ near neighbors). While these gave us the desired lower bound, it wasn't $\mu$-sensitive. So we went looking for something stronger.

And that's where isoperimetry made an entrance. There's been a long line of results that prove lower bounds for LSH-like data structures for near neighbor search. Roughly speaking, they work in the following manner:

  1. Construct a gap distribution: a distribution over inputs and queries such that a query point is very close to its nearest neighbor and very far from any other point in the space. In the $\ell_1$ case, what you want is something like $\epsilon d$ distance to the nearest neighbor, and $\Omega(d)$ distance to the second nearest neighbor. 
  2. Imagine your data structure to be a function over the Hamming cube (assigning blocks of points to cells) and look at what happens when you perturb it (formally, by defining the function value to be its average over all nearby points)
  3. Use isoperimetry (or hypercontractivity) to argue that the function "expands" a lot. What this implies is that points get scattered everywhere, and so any particular cell you probe doesn't have enough information to determine where the true nearest neighbor actually is. This last step can be proved in different ways, either using Fourier methods, or via the use of Poincaré-type inequalities.

Our initial hope was to use this framework to prove our bound, while incorporating terms relating to the asymmetry of Bregman divergences more directly.

This turned out to be hard. The asymmetry destroys many of the nice algebraic properties associated with the noise operator and related inner products. There's even a deeper problem: if you think of an asymmetric noise operator as pushing things "forward" on a cube with one probability and backwards with another, then it's not hard to imagine a scenario where applying it actually ends up concentrating the mass of the function in a smaller region (which would break hypercontractivity).

We needed two things: a way to get around this not-so-small issue that hypercontractivity isn't true in a directed setting, and a way to analyze the asymmetric noise operator. It turned out that we could circumvent the first problem: we could restrict the class of hash functions we considered in a way that didn't significantly change their entropy (only a constant). Alternatively, we could view hypercontractivity as being true "on average".

The second problem was a bit harder. The solution we came up with was to try and link the directed noise operator to a symmetric operator on a biased space. The intuition here was that the directed operator was creating a nonuniform distribution, so maybe starting off with one might give us what we need.

This actually worked ! There are versions of the Bonami-Beckner inequality in biased spaces that look almost just like their counterparts in uniform space (you merely set the probability of a bit being 1 to $p \ne 1/2$, and the Fourier coefficients are defined in terms of $p$).

There was lots more to address even after the isoperimetry result, but you'll have to read the paper for that :).

I should note that in many ways, this was a "Simons baby": Amir visited during one of the workshops for the program on real analysis, and we had fruitful conversations with Elchanan Mossel and Nathan Keller in early stages of this work.                                                                                                               

No comments:

Post a Comment

Disqus for The Geomblog