## Monday, September 16, 2013

### SIGACT CG Column: Moving heaven and earth

My new column for SIGACT News is up (non pay-walled version here). In it, I discuss two different ways of defining distances between distributions.

The earthmover distance is a familar measure to TCS folk, and the kernel distance (or the maximum mean distortion, or the distance covariance) is a familiar character in machine learning and statistics.

It turns out that while statistically, these are not as different as one might think, they are different when it comes to computing them, and understanding their geometry.

The column describes both measures and their properties, and does a compare-contrast between them. More generally, I try to describe how one should think about the process of constructing a distance function between two distributions.

## Monday, September 09, 2013

### Life at Simons

There have been many posts on the technical talks happening at Simons (see +Moritz Hardt's latest on gradient descent if you haven't yet). Not as many yet on the less-formed discussions that are happening in and around the building, but maybe it's too soon for that.

In the meantime, I thought it might be fun to describe "a day at Calvin Hall".

I usually bike in after 9 am. Biking in Berkeley is not for the (literally) weak-kneed, and I'm discovering how weak my knees actually are. I haven't yet found a good site to plot elevation gains and losses for an arbitrary route, but this one is pretty decent.

The Institute lends out bikes if you don't have your own. It's a great way to get around if you're staying near campus, because parking is difficult and expensive.

Of course, the next step is:

It's a great machine. Only downside is that it shuts down at 5pm when the building does. But there are plenty of cafes around for later.

Then of course the decision: which talks should I feel guilty about missing today ? There are A LOT OF TALKS going on at the various workshops/boot camps. I feel embarrassed complaining about this, because we have such a stellar crowd of speakers and such fascinating talks. But there are almost too many talks: I have to shut down the feelings of guilt in order to find big chunks of time to think (or write this blog post <ahem>).

Thankfully, there's the live stream. I've already mentioned the Simons livestream on G+, but it's worth mentioning again. If you're late to the start of any talk you can watch it on the high quality live feed. If you can't watch it live you can watch it as soon as it's done on the Ustream site for the event. And a few days after that the talks are archived on the Simons site directly. There's also this cute feature where the live stream plays the most recent talk during coffee breaks. I know people outside the Institute are watching talks live because someone pinged me on twitter to ask the speaker to repeat questions they get from the audience.

So I've either attended talks or I've spent the morning brainstorming on some of the projects I have going. It's now time for lunch: thankfully, we're walking distance from both Northside and Southside Berkeley, which means a huge variety of eating places all within 5-10 minutes walking. All we need to do is avoid the huge class let-out at 12:20 or so. On Tuesdays we have a lunch seminar (organized by +Moritz Hardt) as part of the big-data program.

Once we're back, it's time for this:

Did I mention that the coffee machine is awesome ?

Usually, discussions start picking up in the afternoon, and you'll often find me here:

3:30 is time for the daily tea, at which we get all kinds of nice cookies, and the occasional cake/ice cream (on thursdays). This is much like the Dagstuhl tea time, but without the fancy tortes (hint hint!). Of course, no daily tea is complete with the addition of this:

By five, people are starting to trickle out slowly. On Tuesdays we have a happy hour at a local bar ($3 pints!). And it's time for me to check the sad state of my knees for the ride back, which is hard because it's uphill both ways ! ## Friday, September 06, 2013 ### More camping with high dimensional boots... I discovered today that you don't have to wait for talk videos to be posted on the Simons website. All videos are live streamed via ustream, and they have their own channel for the boot camp talks, where you can watch the videos immediately after they're streamed. Martin Wainwright gave a 3 hour presentation on high dimensional statistics. Michael Jordan's talk earlier was good preparation for this, just to get familiar with basic concepts like the risk of an estimator and minimax theory. Wainwright's talks were much denser, and it would be hard do an intelligent summary of the entire presentation. The central theme of his talk was this: In classical statistics, we can evaluate the quality of an estimator as$n \rightarrow \infty$using standard asymptotic methods, and for the most part they are well understood (convergence, rates of convergence, sampling distributions, and so on). But in all these results, it's assumed that the data dimensionality stays fixed. Suppose it doesn't though ? In particular, suppose you have a situation where$d/n \rightarrow \alpha > 0$(this notion was first introduced by Kolmogorov). For example, suppose$\alpha = 0.5$. What happens to the behavior of your estimation methods ? He worked out a few simple examples with experiments to show that in such cases, classical asymptotic bounds fail to capture what's really going on. For example, suppose you wanted to estimate the mean of a distribution and you used the sample mean, relying on the central limit theorem. Then it turns out that in the$d/n \rightarrow \alpha$regime, the convergence is slowed down by the parameter$\alpha$. Another example of this problem is estimating the covariance of a matrix. Assume you have a sample of iid Gaussian random variables drawn from$N(0, I_{d \times d})$, and you want to use the sample covariance to estimate the population covariance (in this case, the identity matrix). You can look at the distribution of eigenvalues of the resulting matrix (which you expect to be sharply concentrated around 1) and in fact you get a much more spread out distribution (this is known as the Marcenko-Pastur distribution). You can show that that the maximum singular value of the matrix is no bigger than$1 + \sqrt{d/n}$with high probability. But this error term$\sqrt{d/n}$does not go away. If the data is indeed high dimensional, is there low-dimensional/sparse structure one can exploit to do inference more efficiently ? This gets us into the realm of sparse approximations and compressed sensing, and he spends some time explaining why sparse recovery via the LASSO is actually possible, and describes a condition called the "restricted null space property" that characterizes when exact recovery can be done (this property is implied by the RIP, but is strictly weaker). In the second part of the talk he talked more generally about so-called regularized M-estimators, and how one might prove minimax bounds for parameter estimation. Again, while the specifics are quite technical, he brought up one point in passing that I think is worth highlighting. When doing parameter estimation, the "curvature" of the space plays an important role, and has done so since the Cramer-Rao bound, the idea of Fisher information and Efron's differential-geometric perspective. The idea is that if the optimal parameter lies in a highly curved region of loss-parameter space, then estimation is easy, because any deviation from the true parameter incurs a huge loss. Conversely, a region of space where the loss function doesn't change a lot is difficult for parameter estimation, because the parameter can change significantly. Once again, this is in sharp contrast to how we view the landscape of approximations for a hard problem. If we have a cost function that varies gently over the space, then this actually makes approximating the function a little easier. But a cost function that has sharp spikes is a little trickier to approximate, because a small movement away from the optimal solution changes the cost dramatically. There are some problems with this intuition. After all, a sharp potential well is good for gradient descent methods. But the difference here is that in estimation, you only care about the loss function as a tool to get at the true goal: the desired parameter. However, in much of algorithm design, the loss function IS the thing you're optimizing, and you don't necessarily care about the object that achieves that optimal loss. This goes back to my earlier point about the data versus the problem. If optimizing the loss is the end goal, then a certain set of tools come into play. But if the goal is to find the "true" answer, and the loss function is merely a means to this end, then our focus on problem definition isn't necessarily helpful. ## Wednesday, September 04, 2013 ### Statistics, geometry and computer science. Geometry is the glue between statistics and computer science. -- Michael Jordan That's why everyone gets stuck in it. -- Graham Cormode Today Yesterday was the first day of the "Big Data boot camp" at the Simons Institute. The idea behind these boot camps is to get participants in a program "on the same page". In a program like ours, with statisticians. ML folks, algorithms people and optimizers all mingling together, getting "on the same page" with "the same language" is critical. For a more general overview of the first day activities, see Muthu's post. Michael Jordan opened the proceedings with a talk titled "Big Data: The computation/statistics interface". From a CS perspective, it was an excellent talk that hit just the right level of detail for me to grasp some basic terminology in statistics, as well as understand some of the questions people are pondering right now. We think of big data as a PROBLEM. More data, more scaling issues, O(n) becomes infeasible, and so on. In particular, we think of large data as a workload that taxes our resources: time, space, communication, and so on. Michael presented a counterpoint. From a statistical perspective, more data is better. Estimators converge, error reduces, and the law of large numbers kicks in. Rather than treating data as something that we have to manage efficiently, can we think of data (quality) as a resource itself that can be managed efficiently ? In other words, can we tradeoff estimator accuracy against other resources ? He proceeded to give a wonderful exposition of the basics of statistical decision theory, including the frequentist and Bayesian views of estimator risk. Along the way, he described a geometric interpretation of the James-Stein estimator which is too neat to skip. The James-Stein estimator is really a very counterintuitive object. Consider a set of samples$X_i \sim N(\theta_i, 1), i \le n$, where the$\theta_i$are unknown parameters. The goal is to sample$X_i$to determine the$\theta_i$, by minimizing some loss function$\sum (\theta_i - \hat{\theta_i})^2$. The$\theta_i$are assumed unrelated to each other. The "obvious" estimator is$\hat{\theta_i} = X_i$. It's the MLE after all. But it turns out that this estimator is strictly dominated (in terms of expected loss aka risk) by the so-called shrinkage estimator: $\hat{\theta_i} = (1 - \frac{c}{S^2}) X_i$ where$S = \sum_j X_j^2$. What's surprising about this is that the estimator for a single$\theta_i$takes into account samples from other$X_i$, even though the$\theta_i$are assumed to be unrelated. It turns out that there's an elegant geometric interpretation of this estimator, first attributed to Galton by Stephen Stigler. Consider the pairs$(X_i, \theta_i)$as points in the plane. We don't quite know where these points lie because we don't know what$\theta_i$is. Because the$X_i$are normally distributed around the$\theta_i$, each point is really a horizontal interval of interest. Now the standard estimator$\hat{\theta_i} = X_i$arises from trying to solve a regression problem of$X$versus$\theta$, or more precisely solving the regression$\hat{X} = E(X\mid \theta)$. But really, what we're trying to do is solve the regression$\hat{\theta} = E(\theta \mid X)$. In other words, regression of$y$against$x$is different from the regression of$x$against$y$, as long as we have more than three points. And this is precisely what the JS estimator says. Returning to the question he started the talk with, can we show a formal tradeoff between data estimation accuracy and sampling cost ? In a recent paper with Venkat Chandrasekharan from Caltech, they show a very nice (and counterintuitive) result: that using a cruder relaxation of an optimization problem can actually lead to more efficient estimation as long as you have sufficient data. Note that this goes against the standard idea that a crude relaxation is "worse" in an approximation sense. The idea is as follows. Consider the very simple problem of denoising where the observations$\mathbf{y}$are generated from the input$\mathbf{x}$by a noise perturbation: $\mathbf{y} = \mathbf{x} + \sigma \mathbf{z}$ where$\mathbf{z}$is normally distributed. Let us assume that$\mathbf{x}$is drawn from some set$S$(for example,$S$is the set of low-rank matrices, or the set of permutation matrices). The simplest estimator for$x$is an$\ell_2$projection: compute the sample mean$\overline{\mathbf{y}}$and then find its projection onto$S$. But this involves a minimization over$S$, which might be intractable. We can relax$S$to some$S'$, where$S \subset S'$and minimization becomes easier. But this would worsen the approximation ratio of the optimization depending on the relaxation. And here's where the insight comes in. Suppose you instead look at the statistical risk of this estimator. In other words, look at the expected difference between the true$\mathbf{x}^*$and the estimated$\hat{\mathbf{x}}$. The main result they show in this paper (paraphrased) is expected risk is upper bounded by (C/n) * geometric complexity of$S, \mathbf{x}^*$where$n$is the number of samples. Suppose we fix the desired risk. Then an increase in$n$can be used to "pay for" increased "geometric complexity". And here's where the final idea comes in. The "geometric complexity" used here is the Gaussian-squared complexity, which is defined as $g(D) = E[ \sup_{x \in D} \langle x, z\rangle^2 ]$ where$z$is normally distributed. In particular, the set$D$used in the above expression is the set of directions from$\mathbf{x}^*$to points in$S$. Suppose$S$is very "pointed" at$\mathbf{x}^*$. Then the Gaussian-squared complexity is small and the number of samples needed is small. But if instead we use a relaxation$S'$that is "blunt". The Gaussian complexity goes up, but if we have more samples, that keeps the risk fixed. If the optimization for$S'$is significantly easier than the corresponding problem for$S$, then we still win, even though$n\$ has increased, and even though the classical "approximation ratio" might have become worse.

In the final part of his talk, Michael talked about his recent work on "bags of little bootstraps", which Muthu also covered in his post. This post is already too long, so I'll defer this to another time.