## Wednesday, September 04, 2013

### Statistics, geometry and computer science.

Geometry is the glue between statistics and computer science. -- Michael Jordan
That's why everyone gets stuck in it.
-- Graham Cormode

Today Yesterday was the first day of the "Big Data boot camp" at the Simons Institute. The idea behind these boot camps is to get participants in a program "on the same page". In a program like ours, with statisticians. ML folks, algorithms people and optimizers all mingling together, getting "on the same page" with "the same language" is critical. For a more general overview of the first day activities, see Muthu's post.

Michael Jordan opened the proceedings with a talk titled "Big Data: The computation/statistics interface". From a CS perspective, it was an excellent talk that hit just the right level of detail for me to grasp some basic terminology in statistics, as well as understand some of the questions people are pondering right now.

We think of big data as a PROBLEM. More data, more scaling issues, O(n) becomes infeasible, and so on. In particular, we think of large data as a workload that taxes our resources: time, space, communication, and so on.

Michael presented a counterpoint. From a statistical perspective, more data is better. Estimators converge, error reduces, and the law of large numbers kicks in. Rather than treating data as something that we have to manage efficiently, can we think of data (quality) as a resource itself that can be managed efficiently ? In other words, can we tradeoff estimator accuracy against other resources ?

He proceeded to give a wonderful exposition of the basics of statistical decision theory, including the frequentist and Bayesian views of estimator risk. Along the way, he described a geometric interpretation of the James-Stein estimator which is too neat to skip.

The James-Stein estimator is really a very counterintuitive object. Consider a set of samples $X_i \sim N(\theta_i, 1), i \le n$, where the $\theta_i$ are unknown parameters. The goal is to sample $X_i$ to determine the $\theta_i$, by minimizing some loss function $\sum (\theta_i - \hat{\theta_i})^2$. The $\theta_i$ are assumed unrelated to each other.

The "obvious" estimator is $\hat{\theta_i} = X_i$. It's the MLE after all. But it turns out that this estimator is strictly dominated (in terms of expected loss aka risk) by the so-called shrinkage estimator:
$\hat{\theta_i} = (1 - \frac{c}{S^2}) X_i$
where $S = \sum_j X_j^2$.

What's surprising about this is that the estimator for a single $\theta_i$ takes into account samples from other $X_i$, even though the $\theta_i$ are assumed to be unrelated.

It turns out that there's an elegant geometric interpretation of this estimator, first attributed to Galton by Stephen Stigler. Consider the pairs $(X_i, \theta_i)$ as points in the plane. We don't quite know where these points lie because we don't know what $\theta_i$ is. Because the $X_i$ are normally distributed around the $\theta_i$, each point is really a horizontal interval of interest.

Now the standard estimator $\hat{\theta_i} = X_i$ arises from trying to solve a regression problem of $X$ versus $\theta$, or more precisely solving the regression $\hat{X} = E(X\mid \theta)$. But really, what we're trying to do is solve the regression $\hat{\theta} = E(\theta \mid X)$. In other words, regression of $y$ against $x$ is different from the regression of $x$ against $y$, as long as we have more than three points. And this is precisely what the JS estimator says.

Returning to the question he started the talk with, can we show a formal tradeoff between data estimation accuracy and sampling cost ? In a recent paper with Venkat Chandrasekharan from Caltech, they show a very nice (and counterintuitive) result: that using a cruder relaxation of an optimization problem can actually lead to more efficient estimation as long as you have sufficient data. Note that this goes against the standard idea that a crude relaxation is "worse" in an approximation sense.

The idea is as follows. Consider the very simple problem of denoising where the observations $\mathbf{y}$ are generated from the input $\mathbf{x}$ by a noise perturbation:
$\mathbf{y} = \mathbf{x} + \sigma \mathbf{z}$
where $\mathbf{z}$ is normally distributed. Let us assume that $\mathbf{x}$ is drawn from some set $S$ (for example, $S$ is the set of low-rank matrices, or the set of permutation matrices).

The simplest estimator for $x$ is an $\ell_2$ projection: compute the sample mean $\overline{\mathbf{y}}$ and then find its projection onto $S$. But this involves a minimization over $S$, which might be intractable.

We can relax $S$ to some $S'$, where $S \subset S'$ and minimization becomes easier. But this would worsen the approximation ratio of the optimization depending on the relaxation. And here's where the insight comes in.

Suppose you instead look at the statistical risk of this estimator. In other words, look at the expected difference between the true $\mathbf{x}^*$ and the estimated $\hat{\mathbf{x}}$. The main result they show in this paper (paraphrased) is
expected risk is upper bounded by (C/n) * geometric complexity of $S, \mathbf{x}^*$
where $n$ is the number of samples.

Suppose we fix the desired risk. Then an increase in $n$ can be used to "pay for" increased "geometric complexity". And here's where the final idea comes in. The "geometric complexity" used here is the Gaussian-squared complexity, which is defined as
$g(D) = E[ \sup_{x \in D} \langle x, z\rangle^2 ]$
where $z$ is normally distributed.

In particular, the set $D$ used in the above expression is the set of directions from $\mathbf{x}^*$ to points in $S$. Suppose $S$ is very "pointed" at $\mathbf{x}^*$. Then the Gaussian-squared complexity is small and the number of samples needed is small. But if instead we use a relaxation $S'$ that is "blunt". The Gaussian complexity goes up, but if we have more samples, that keeps the risk fixed. If the optimization for $S'$ is significantly easier than the corresponding problem for $S$, then we still win, even though $n$ has increased, and even though the classical "approximation ratio" might have become worse.

In the final part of his talk, Michael talked about his recent work on "bags of little bootstraps", which Muthu also covered in his post. This post is already too long, so I'll defer this to another time.