## Tuesday, November 15, 2011

### An intuitive argument for the JL lemma ?

Hal Daumé (as is his wont) asked me a simple question with no easy answer. His question was
Is there a proof of the Johnson-Lindenstrauss Lemma that can be explained to an undergraduate ?
While there are different "simple" proofs of the JL Lemma (the Gupta-Dasgupta proof is one such), it's not clear whether these are "undergraduate" level. So instead of answering his original question, I decided to change it to
Is there a proof of the JL Lemma that isn't "geometric" ?
It's a little odd to even ask the question, considering the intrinsic geometric nature of the lemma. But there's a reasonably straightforward way of seeing how the bound emerges without needing to worry too much about random rotations, matrices of Gaussians or the Brunn-Minkowski theorem.

Warning: what follows is a heuristic argument that helps suggest why the bound is in the form that it is: it should not be confused for an actual proof.

In its original form, the JL Lemma says that any set of $n$ points in $R^d$ can be embedded in $R^k$ with $k = O(\log n/\epsilon^2)$ such that all distances are preserved to within a $1+\epsilon$ factor. But the real result at the core of this is that there is a linear mapping taking a unit vector in $R^d$ to a vector of norm in the range $1\pm \epsilon$ in $R^k$, where $k = 1/\epsilon^2$ (the rest follows by scaling and an application of the union bound).

Trick #1: Take a set of values $a_1, \ldots, a_n$ and set $Y = \sum_i a_i r_i$, where $r_i$ is chosen (iid) to be +1 or -1 with equal probability. Then $E[Y^2] = \sum a_i^2$.This can be verified by an easy calculation.

So now consider the vector $v$. Let's assume that $v$'s "mass" is roughly equally distributed among its coordinates. Take a random sample of $d/k$ of the coordinates of $v$ and apply the above trick to the values. Under the above assumption, the resulting $Y^2$ will have roughly $1/k$ of the total (squared) mass of $v$. Scale up by $k$.

This is one estimator of the norm of $v$. It is unbiased and it has a bounded maximum value because of the assumption. This means that we can apply a Chernoff bound over a set of $k$ such estimators. Roughly speaking, the probability of deviation from the mean is $\exp(-\epsilon^2 k)$, giving the desired value of $k$.

But how do we enforce the assumption ? By applying a random Fourier transform (or actually, a random Hadamard transform). this "spreads" the mass of the vector out among the coordinates (technically by ensuring an upper bound on the $\ell_\infty$ norm).

That's basically it. Almost all the papers that follow the Ailon-Chazelle work proceed in this manner, with increasing amounts of cleverness to reuse samples, or only run the Fourier transform locally, or even derandomize the process. What distinguishes this presentation of the result from the earlier approaches (which basically boil down to: populate a matrix with entries drawn from a distribution having subGaussian tails) is that it separates the "spreading" step (called the preconditioner) from the latter, more elementary step (the sampling of coordinates). It turns out that in practice the preconditioner can often be omitted without incurring too much error, yielding an extremely efficient (and sparse) linear transform.

1. At least for the Achlioptas construction with random signs, there is an "undergrad" proof that bounds the log(1/delta)th moment via some counting. It's more "grungy" than "intuitive" though.

2. is it something you can summarize here ?

3. The basic idea is to write |Sx|_2^2 = |x|_2^2 + Z, where S is your sign matrix and Z is the error term. Then Z = (1/k)*sum_{r=1}^k Z_r where Z_r = sum_{i!=j}S_{r,i}*S_{r,j}*x_i*x_j. Then Pr[|Z| > eps] < eps^{-t}*E[Z^t] for t even. Take t = log(1/delta) and bound E[Z^t] by (eps/2)^t and you're done. To bound E[Z^t], expand it out into monomials and observe things boil down to bounding E[Z_r^p] for p <= t. This can be done by expanding out Z_r^p, associating a graph to every monomial, grouping monomials with isomorphic graphs, then doing a bunch of computations.

It's basically the approach Daniel and I used to analyze sparse JL (not the version with codes, but with random hashing). The proof is (only slightly) simpler in this case because there's no hashing. Nothing in the proof requires more than undergrad knowledge (basically, stirling's formula), but overall it is not a short computation.

4. (Note you can also prove things like the Chernoff bound by this sort of approach: rather than use the moment generating function, just bound a high-order moment from first principles. Personally this is how I usually end proving it since I tend to forget which clever Taylor-approximation tricks to use.)