## Friday, September 06, 2013

### More camping with high dimensional boots...

I discovered today that you don't have to wait for talk videos to be posted on the Simons website. All videos are live streamed via ustream, and they have their own channel for the boot camp talks, where you can watch the videos immediately after they're streamed.

Martin Wainwright gave a 3 hour presentation on high dimensional statistics. Michael Jordan's talk earlier was good preparation for this, just to get familiar with basic concepts like the risk of an estimator and minimax theory.

Wainwright's talks were much denser, and it would be hard do an intelligent summary of the entire presentation. The central theme of his talk was this:

In classical statistics, we can evaluate the quality of an estimator as $n \rightarrow \infty$ using standard asymptotic methods, and for the most part they are well understood (convergence, rates of convergence, sampling distributions, and so on). But in all these results, it's assumed that the data dimensionality stays fixed. Suppose it doesn't though ?

In particular, suppose you have a situation where $d/n \rightarrow \alpha > 0$ (this notion was first introduced by Kolmogorov). For example, suppose $\alpha = 0.5$. What happens to the behavior of your estimation methods ?  He worked out a few simple examples with experiments to show that in such cases, classical asymptotic bounds fail to capture what's really going on. For example, suppose you wanted to estimate the mean of a distribution and you used the sample mean, relying on the central limit theorem. Then it turns out that in the $d/n \rightarrow \alpha$ regime, the convergence is slowed down by the parameter $\alpha$.

Another example of this problem is estimating the covariance of a matrix. Assume you have a sample of iid Gaussian random variables drawn from $N(0, I_{d \times d})$, and you want to use the sample covariance to estimate the population covariance (in this case, the identity matrix). You can look at the distribution of eigenvalues of the resulting matrix (which you expect to be sharply concentrated around 1) and in fact you get a much more spread out distribution (this is known as the Marcenko-Pastur distribution). You can show that that the maximum singular value of the matrix is no bigger than $1 + \sqrt{d/n}$ with high probability. But this error term $\sqrt{d/n}$ does not go away.

If the data is indeed high dimensional, is there low-dimensional/sparse structure one can exploit to do inference more efficiently ? This gets us into the realm of sparse approximations and compressed sensing, and he spends some time explaining why sparse recovery via the LASSO is actually possible, and describes a condition called the "restricted null space property" that characterizes when exact recovery can be done (this property is implied by the RIP, but is strictly weaker).

In the second part of the talk he talked more generally about so-called regularized M-estimators, and how one might prove minimax bounds for parameter estimation. Again, while the specifics are quite technical, he brought up one point in passing that I think is worth highlighting.

When doing parameter estimation, the "curvature" of the space plays an important role, and has done so since the Cramer-Rao bound, the idea of Fisher information and Efron's differential-geometric perspective. The idea is that if the optimal parameter lies in a highly curved region of loss-parameter space, then estimation is easy, because any deviation from the true parameter incurs a huge loss. Conversely, a region of space where the loss function doesn't change a lot is difficult for parameter estimation, because the parameter can change significantly.

Once again, this is in sharp contrast to how we view the landscape of approximations for a hard problem. If we have a cost function that varies gently over the space, then this actually makes approximating the function a little easier. But a cost function that has sharp spikes is a little trickier to approximate, because a small movement away from the optimal solution changes the cost dramatically.

There are some problems with this intuition. After all, a sharp potential well is good for gradient descent methods. But the difference here is that in estimation, you only care about the loss function as a tool to get at the true goal: the desired parameter. However, in much of algorithm design, the loss function IS the thing you're optimizing, and you don't necessarily care about the object that achieves that optimal loss. This goes back to my earlier point about the data versus the problem. If optimizing the loss is the end goal, then a certain set of tools come into play. But if the goal is to find the "true" answer, and the loss function is merely a means to this end, then our focus on problem definition isn't necessarily helpful.