Thursday, November 14, 2013

The many stages of writing a paper, and how to close the deal.

[tl;dr] Producing a piece of research for publication has many stages, and each stage has different needs, requiring different ways of operating. Learning these stages is a key developmental step for a graduate student. In what follows, I describe this process.

Caveat: this is obviously customized to more theoretical-style work, although I'll talk about how experimental evaluation fits in, in the context of my work. So YMMV.

From my conversations with students (mine and others), I think this accurately describes how students think a paper gets written:

  1. Advisor produces problem miraculously from thin air. 
  2. Come up with solution.
  3. Write down solution.
  4. Advisor makes annoying and mystifying edit requests on irrelevant introductory stuff, while throwing out long complicated proofs (or experiments) student has spent many hours sweating over.
  5. Make final edits and submit paper.
Most students figure out how to do 2), and eventually learn how to do 3) (which is itself a topic for another post). 5) is probably the first thing students learn how to do: fix typos, edit LaTeX, and generally do yak-shaving.

But step 4) is perhaps the most mysterious part of the writing process for a new researcher, and the least structured. I call it "closing the deal" and it's really about going from a bag of results to an actual submittable paper. 

Let me elaborate.

1) Coming up with a problem.
    Of course coming up with a problem is the essence of the research process ("it's about the questions, not the answers", he shrieks). This takes experience and vision, and can often be changed by things you do in stage 4. I'll say no more about it here.

2) Solving a problem.
    This is the stage that everyone knows about. That's what we do, after all - solve problems ! This is where we drink lots of coffee, live Eye of the Tiger montages, get inspiration in our sleep, and so on. 

   It often happens that you don't exactly solve the problem you set out to attack, but you make many dents in it, solving special cases and variants. It's important to be flexible here, instead of banging your head against a wall head-on. At any rate, you either exit this stage of the project completely stuck, with a complete solution, or with a collection of results, ideas and conjectures surrounding the problem (the most typical case).

3) Writing it all down.
   Again, I could spend hours talking about this, and many people better than I have. It's a skill to learn in and of itself, and depends tremendously on the community you're in.

4) Closing, or getting to a submission.
    But this is the part that's often the most critical, and the least understood. I call it "closing the deal": getting from 80% to 100% of a submission, and it requires a different kind of skill. The overarching message is this:
A paper tells a story, and you have to shape your results - their ordering, presentation, and even what you keep and what you leave out - in order to tell a consistent and clear story. 
(before people start howling, I'm not talking about leaving out results that contradict the story; that would be dishonest. I'm talking about selecting which story to tell from among the many that might present themselves)

So you have a bag of results centering around a problem you're trying to solve. If the story that emerges is: "here's a problem that's been open for 20 years and we solved it", then your story is relatively easy to tell. All you have to do is explain how, and using what tools. 

But in general, life isn't that easy. Your results probably give some insights into the hard core of the problem: what parts are trivial, what directions might be blocked off, and so on. 

Now you need to find/discover the story of your paper. You can't do this too early in the research process: you need to explore the landscape of the problem and prove some results first. But you shouldn't wait too long either: this stage can take time, especially if the story changes.

And the story will change. One way of thinking about what you need for a conference submission is a relatively tight, compelling and interesting story. While the loose ends and unexplored directions are probably the things most interesting to you and your research, they are best left to a conclusions section rather than the main body. What the body should contain is a well-thought-out march through what you have discovered and what it says about the problem you're solving. In doing so, you will find yourself making decisions about what to keep, what to leave out, and how to order what you keep. 

And so, speculations need to be made into concrete claims or triaged. Experiments need to be run till they tell a definite story. Introductions need to be made coherent with the rest of the paper. There's also an element of bolt-tightening: making the bounds as tight as possible, stating claims as generally as makes sense for the overarching story (if your story is about points in the plane, then stating some claims in a general metric space might not always make sense). 

And all of this has to be done to serve the overarching story that will make the most compelling paper possible. The story can change as new results come in, or expand, or sometimes even die (though this last is rare). But there is a constant drumbeat of "am I getting closer to a submission with a nice story with each step?"

Telling a good story is important. For someone to appreciate your paper, cite it, or even talk about it (whether it's accepted, or on the arxiv) they have to be willing to read it and retain its results. And they'll be able to do that if it tells a clear story, which is not just a union of results. 

Tuesday, November 05, 2013

SODA 2014 travel awards and other news

Via Cliff Stein and Kirsten Wilden:

The SIAM Symposium on Discrete Algorithms (SODA14) will be held January 5-7, 2014 at the Hilton Portland & Executive Tower in Portland, Oregon, USA. Algorithm Engineering and Experiments (ALENEX14) and Analytic Algorithmics and Combinatorics (ANALCO14) will be co-located with the conference on January 5 and 6, respectively.
 The deadline for students to apply for travel support is approaching!   The NSF, Google, IBM Research, and Microsoft Corporation have provided generous travel support for students interested in attending SODA14 or its associated meetings, ALENEX14 and ANALCO14. Application materials and a letter of recommendation from your advisor are due by November 15. Please visit http://www.siam.org/meetings/da14/tsupport.php for detailed funding information and application procedures.

Pre-registration for all three meetings is also available at http://www.siam.org/meetings/da14/reginfo.php.

Additional conference information is available at the following websites:
SODA14: http://www.siam.org/meetings/da14/
ANALCO14: http://www.siam.org/meetings/analco14/
ALENEX14: http://www.siam.org/meetings/alenex14/

Sunday, October 27, 2013

FOCS Reception sing-along

Tomorrow night at 7:30 (before the FOCS business meeting at 9), the Positive Eigenvalues will be playing a live set for the entertainment of  FOCSers. 

Who are the Positive Eigenvalues, you might ask ? Well, you'll have to come and hear us to find out :). Two of the songs we're playing are compositions by our resident philosopher/graphic novelist/keyboardist/composer/complexity theorist, Christos Papadimitriou.

The first is a riff off the Dylan song "Man gave names to all the animals", with the same tune but different lyrics. If you're so inspired, you're encouraged to sing along to the chorus (which repeats after each verse). It goes like this:
Chorus: 
Theorists gave names to all the classes
in the beginning, in the beginning
Theorists gave names to all the classes
in the beginning long time ago 

Verse 1:
They saw a woman slim and tall
forging steadily to her goal
she went by fast but that's no crime
uh I think I'll call her PTIME

Verse 2:
They saw a kid who soldiered on
he got no rest from dusk to dawn
he kept sweating in the same place
uh I think I'll call him PSPACE

Verse 3:
They saw a blind man comb his shack
to find his walkstick in the black
but once he found it he could see
uh I think I'll call him NP

Verse 4:
They saw a boy who walked and walked all night
never veered to the left or right
was kind of slow but not that bad
uh I think I'll call him PPAD

Verse 5:
There was a beast and on its back
it carried Heisenberg and Dirac
it was as weird as a beast can be
uh...
For more on the tailed off line at the end, see the wikipedia entry.

The second song is an original composition by Christos in rock ballad/power rock mode. The lyrics speak poignantly about the life of a theoretician:
In Theory
Verse
Think of a carpenter of sorts
he has the flu he has the blues
He burns his oil every night
all night frustrated and confused

Bridge
And worst of all he can't complain
In theory his job is plain  (twice)

Chorus
Build something sturdy and beautiful and useful
He's always short in one goal
He juggles true with elegant and useful
and he keeps dropping one ball

Verse 2
His buddies like him alright
except they love to criticize
They go and take his stuff apart
with clever words with clever eyes

Verse 3
Some nights a girl comes to his shop
to sip his tea learn something raw
his foggy mind his tangled soul
want her to stay but let her go

Thursday, October 17, 2013

Searching randomly for needles.

Suppose you're given a large set of objects $X$, and you know that some subset $I$ is "interesting". A particular example that hits close to (my regular) home is in bug-testing: $X$ is the set of possible inputs to a program, and $I$ is the set of inputs that generate bugs (this is the downside of talking to +John Regehr too much). We'll assume that if you're given a candidate object, you can check easily if it's interesting or not (for example, by running the program).

You'd like to find the interesting items, so you consult an expert (in our running example, maybe it's a method to generate inputs that test certain kinds of bugs). The expert produces items that it thinks are interesting. But experts have biases: maybe your expert only cares about certain kinds of interesting items.

So you ask multiple experts, in the hope that their biases are different, and that together they can cover the space of objects. But you don't know anything about their biases except what you can learn from repeatedly asking them for candidate objects.

What's your strategy for asking questions so that at any stage (say after asking $N$ questions), you've found the optimal number of interesting items ?

This was the topic of a talk by +Sebastien Bubeck at the AMPLab in Berkeley on Tuesday, based on this paper.  A key idea in the algorithm design is to make use of estimators of "things I haven't seen yet", or the famous Good-Turing estimator (which your humble blogger wrote about many æons ago). Here's how it works.

Formally, let us assume that each "expert" $i$ is a distribution $P_i$ over $X$. At each step, the algorithm will determine some $i$, and ask it for a random draw from $P_i$. Suppose we knew the fraction of items that $i$ had not seen yet. Then a simple greedy strategy would be to pick the $i$ that had the largest value of this "not-yet-seen" quantity. That's all well and good, as long as we know the fraction of items not yet seen.

Here's where the G-T estimator comes in. What it says is that if we're given samples from a distribution, and count the number of items in the sample that occurred exactly once, then this "frequency of frequency", divided by the number of samples, is a good estimate for the mass of items not yet seen. Moreover, it can be shown that this estimator has good concentration properties around the true answer.

So that's what the algorithm does. It maintains estimates (for each expert) of the mass not yet seen, and in each round picks the expert that maximizes this term, corrected by an adjustment coming from the tail of the concentration bound.
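
To make this concrete, here's a rough Python sketch of the greedy idea. This is not the algorithm from the paper: the concentration-based correction term is replaced by a crude optimism bonus (the `confidence_width` knob is an arbitrary stand-in), and the toy "experts" are made up purely for illustration.

```python
import numpy as np

def good_turing_missing_mass(samples):
    """Good-Turing estimate of the mass not yet seen:
    (# items seen exactly once) / (# samples drawn)."""
    if not samples:
        return 1.0  # nothing seen yet: assume everything is still missing
    _, counts = np.unique(samples, return_counts=True)
    singletons = np.sum(counts == 1)
    return singletons / len(samples)

def discover(experts, is_interesting, budget, confidence_width=0.05):
    """Greedy discovery: at each step, query the expert whose estimated
    unseen mass (plus a crude optimism bonus) is largest.
    `experts` is a list of zero-argument callables, each returning one
    random draw from that expert's distribution over objects."""
    history = [[] for _ in experts]          # samples drawn from each expert
    found = set()                            # interesting items found so far
    for _ in range(budget):
        scores = [good_turing_missing_mass(h) +
                  confidence_width / np.sqrt(len(h) + 1) for h in history]
        i = int(np.argmax(scores))
        x = experts[i]()
        history[i].append(x)
        if is_interesting(x):
            found.add(x)
    return found

# toy example: two biased "experts" over the integers 0..99,
# where the interesting items are the multiples of 7
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    expert_a = lambda: int(rng.integers(0, 50))    # only covers 0..49
    expert_b = lambda: int(rng.integers(40, 100))  # only covers 40..99
    found = discover([expert_a, expert_b], lambda x: x % 7 == 0, budget=500)
    print(sorted(found))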

The algorithm is really simple, and elegant. The question is, how well does it do ? And now things get a little harder. The above ideal greedy strategy is optimal as long as the supports of the experts are disjoint. Under the same assumption (and some other technical ones), it can be shown that the expected difference between the number of items found by the algorithm and the number found by the ideal greedy algorithm is $O(\sqrt{Kt \log t})$ where $K$ is the number of experts and $t$ is the current time step.

It's not clear how to extend these bounds to the case when supports can overlap (or rather, it's not clear to me :)), but the authors show that the algorithm works quite well in practice even when this assumption is broken, which is encouraging.

Wednesday, October 16, 2013

Progressively complex representations for graphs

After NIPS last year, I had posted a note on the evolution of data models and how we think about the "shape" of data. In brief, we have increasingly complex ways of thinking about the (geometric) shape of data that carefully balance the need for expressivity and the ability to determine the parameters of the representation efficiently.

There's a basic fact of data analysis that you come up against once you start playing with data. Either you endow the space with a geometry, or you endow it with a graph structure (and of course, nowadays you might throw in a spherical cow or two). And there are only a few ways to mix the representations (spectral methods being one such approach).

But as far as I know, there are no standard ways to create a layered representation of graphs to model increasing complexity and expressivity in a similarly balanced manner. There are of course an infinite number of ways to parametrize graphs by quantities that capture increasing complexity (treewidth, connectivity, expansion, induced metric structure, and so on). But I don't know of any established and standard ways to model families of graphs with increasing complexity that capture data patterns of interest.

One of the things that I've been tasked with (along with Peter Richtarik and Han Liu) is to draft out a report on the activities at the big data program at Simons this semester. We're looking also at challenges in the theory of big data, and in my mind coming up with good models for graph structures that capture the intricacies of real data is one of them.

Wednesday, October 02, 2013

Call for tutorials at SIAM Data Mining

I'm the tutorials chair for the 2014 SIAM conference on data mining (SDM) (for readers not aware of the data mining landscape, SDM is a SODA-style data mining conference run by SIAM - i.e., it's a CS-style venue with peer-reviewed papers, rather than a math-style conference like SIAM Discrete Math). SDM is one of the major venues for data mining research, especially the more statistically focused kind. It will be held in Philadelphia, April 24-26, 2014.

SDM runs 4-5 tutorials each year on various aspects of data mining (ranging from theory/algorithms to specific application areas). I'm personally very interested in encouraging submissions from people in the theory community working with data who might want to share their expertise about new methods/techniques from theoryCS land with the larger data analysis community.

The deadline is Oct 13, and all you need is a few pages describing the tutorial content, target audience, and some sample material (more details here). If you have an idea and are not sure whether it will fit, feel free to email me directly as well.

Monday, September 16, 2013

SIGACT CG Column: Moving heaven and earth

My new column for SIGACT News is up (non pay-walled version here). In it, I discuss two different ways of defining distances between distributions.

The earthmover distance is a familiar measure to TCS folk, and the kernel distance (or the maximum mean discrepancy, or the distance covariance) is a familiar character in machine learning and statistics.

It turns out that while statistically, these are not as different as one might think, they are different when it comes to computing them, and understanding their geometry.

The column describes both measures and their properties, and does a compare-contrast between them. More generally, I try to describe how one should think about the process of constructing a distance function between two distributions.
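
For readers who haven't met the kernel distance before, here's a small sketch (not from the column) of the plug-in estimate between two samples using a Gaussian kernel. The bandwidth here is an arbitrary choice; a real application would pick it more carefully.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_distance(X, Y, sigma=1.0):
    """Plug-in estimate of the kernel distance between the empirical
    distributions of the samples X and Y (square root of the squared MMD)."""
    Kxx = gaussian_kernel(X, X, sigma).mean()
    Kyy = gaussian_kernel(Y, Y, sigma).mean()
    Kxy = gaussian_kernel(X, Y, sigma).mean()
    return np.sqrt(max(Kxx + Kyy - 2 * Kxy, 0.0))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1, size=(500, 2))   # sample from N(0, I)
    Y = rng.normal(0.5, 1, size=(500, 2))   # sample from a shifted Gaussian
    print(kernel_distance(X[:250], X[250:]))  # same distribution: small
    print(kernel_distance(X, Y))              # different distributions: larger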

Monday, September 09, 2013

Life at Simons

There have been many posts on the technical talks happening at Simons (see +Moritz Hardt's latest on gradient descent if you haven't yet). Not as many yet on the less-formed discussions that are happening in and around the building, but maybe it's too soon for that.

In the meantime, I thought it might be fun to describe "a day at Calvin Hall".

I usually bike in after 9 am. Biking in Berkeley is not for the (literally) weak-kneed, and I'm discovering how weak my knees actually are. I haven't yet found a good site to plot elevation gains and losses for an arbitrary route, but this one is pretty decent.

The Institute lends out bikes if you don't have your own. It's a great way to get around if you're staying near campus, because parking is difficult and expensive.

Of course, the next step is:


It's a great machine. Only downside is that it shuts down at 5pm when the building does. But there are plenty of cafes around for later.

Then of course the decision: which talks should I feel guilty about missing today ? There are A LOT OF TALKS going on at the various workshops/boot camps. I feel embarrassed complaining about this, because we have such a stellar crowd of speakers and such fascinating talks. But there are almost too many talks: I have to shut down the feelings of guilt in order to find big chunks of time to think (or write this blog post <ahem>).

Thankfully, there's the live stream. I've already mentioned the Simons livestream on G+, but it's worth mentioning again. If you're late to the start of any talk you can watch it on the high quality live feed. If you can't watch it live you can watch it as soon as it's done on the Ustream site for the event. And a few days after that the talks are archived on the Simons site directly. There's also this cute feature where the live stream plays the most recent talk during coffee breaks. I know people outside the Institute are watching talks live because someone pinged me on twitter to ask the speaker to repeat questions they get from the audience.

So I've either attended talks or I've spent the morning brainstorming on some of the projects I have going. It's now time for lunch: thankfully, we're walking distance from both Northside and Southside Berkeley, which means a huge variety of eating places all within 5-10 minutes walking. All we need to do is avoid the huge class let-out at 12:20 or so. On Tuesdays we have a lunch seminar (organized by +Moritz Hardt) as part of the big-data program.

Once we're back, it's time for this:

Did I mention that the coffee machine is awesome ?

Usually, discussions start picking up in the afternoon, and you'll often find me here:

3:30 is time for the daily tea, at which we get all kinds of nice cookies, and the occasional cake/ice cream (on Thursdays). This is much like the Dagstuhl tea time, but without the fancy tortes (hint hint!). Of course, no daily tea is complete without the addition of this:

By five, people are starting to trickle out slowly. On Tuesdays we have a happy hour at a local bar ($3 pints!). And it's time for me to check the sad state of my knees for the ride back, which is hard because it's uphill both ways !

Friday, September 06, 2013

More camping with high dimensional boots...

I discovered today that you don't have to wait for talk videos to be posted on the Simons website. All videos are live streamed via ustream, and they have their own channel for the boot camp talks, where you can watch the videos immediately after they're streamed.

Martin Wainwright gave a 3 hour presentation on high dimensional statistics. Michael Jordan's talk earlier was good preparation for this, just to get familiar with basic concepts like the risk of an estimator and minimax theory.

Wainwright's talks were much denser, and it would be hard to do an intelligent summary of the entire presentation. The central theme of his talk was this:

In classical statistics, we can evaluate the quality of an estimator as $n \rightarrow \infty$ using standard asymptotic methods, and for the most part they are well understood (convergence, rates of convergence, sampling distributions, and so on). But in all these results, it's assumed that the data dimensionality stays fixed. Suppose it doesn't though ?

In particular, suppose you have a situation where $d/n \rightarrow \alpha > 0$ (this notion was first introduced by Kolmogorov). For example, suppose $\alpha = 0.5$. What happens to the behavior of your estimation methods ?  He worked out a few simple examples with experiments to show that in such cases, classical asymptotic bounds fail to capture what's really going on. For example, suppose you wanted to estimate the mean of a distribution and you used the sample mean, relying on the central limit theorem. Then it turns out that in the $d/n \rightarrow \alpha$ regime, the convergence is slowed down by the parameter $\alpha$.

Another example of this problem is estimating a covariance matrix. Assume you have a sample of iid Gaussian random vectors drawn from $N(0, I_{d \times d})$, and you want to use the sample covariance to estimate the population covariance (in this case, the identity matrix). You can look at the distribution of eigenvalues of the resulting matrix (which you expect to be sharply concentrated around 1) and in fact you get a much more spread out distribution (this is known as the Marcenko-Pastur distribution). You can show that the maximum singular value of the matrix is no bigger than $1 + \sqrt{d/n}$ with high probability. But this error term $\sqrt{d/n}$ does not go away.
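
Here's a quick numerical illustration of that spread (a sketch of my own, not from the talk). With $d/n = 0.5$, the eigenvalues of the sample covariance smear out over an interval of constant width; the largest one sits near $(1+\sqrt{d/n})^2$, the square of the singular-value bound above (which applies to the normalized data matrix).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 1000                      # d/n = alpha = 0.5
X = rng.standard_normal((n, d))        # rows are iid draws from N(0, I_d)

S = (X.T @ X) / n                      # sample covariance; population cov = I
eigvals = np.linalg.eigvalsh(S)

alpha = d / n
print("smallest eigenvalue:", eigvals.min())  # near (1 - sqrt(alpha))^2 ~ 0.09
print("largest eigenvalue: ", eigvals.max())  # near (1 + sqrt(alpha))^2 ~ 2.91
# the spectrum spreads over roughly [(1-sqrt(alpha))^2, (1+sqrt(alpha))^2]
# (the Marcenko-Pastur law), even though every population eigenvalue is 1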

If the data is indeed high dimensional, is there low-dimensional/sparse structure one can exploit to do inference more efficiently ? This gets us into the realm of sparse approximations and compressed sensing, and he spends some time explaining why sparse recovery via the LASSO is actually possible, and describes a condition called the "restricted null space property" that characterizes when exact recovery can be done (this property is implied by the RIP, but is strictly weaker).
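
As a concrete (if toy) instance of the sparse recovery setup, here's a sketch (not from the talk) that recovers a sparse vector from a few random linear measurements by running plain iterative soft thresholding (ISTA) on the LASSO objective. The problem sizes, regularization weight and iteration count are all ad hoc choices.

```python
import numpy as np

def ista(A, y, lam, iters=500):
    """Minimize 0.5*||Ax - y||^2 + lam*||x||_1 by iterative soft thresholding."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x - y)              # gradient of the smooth part
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(0)
n, d, k = 100, 400, 10                     # 100 measurements, dimension 400, 10 nonzeros
x_true = np.zeros(d)
x_true[rng.choice(d, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((n, d)) / np.sqrt(n)   # random Gaussian measurement matrix
y = A @ x_true

x_hat = ista(A, y, lam=0.01)
print("recovery error:", np.linalg.norm(x_hat - x_true))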

In the second part of the talk he talked more generally about so-called regularized M-estimators, and how one might prove minimax bounds for parameter estimation. Again, while the specifics are quite technical, he brought up one point in passing that I think is worth highlighting.

When doing parameter estimation, the "curvature" of the space plays an important role, and has done so since the Cramer-Rao bound, the idea of Fisher information and Efron's differential-geometric perspective. The idea is that if the optimal parameter lies in a highly curved region of loss-parameter space, then estimation is easy, because any deviation from the true parameter incurs a huge loss. Conversely, a region of space where the loss function doesn't change a lot is difficult for parameter estimation, because the parameter can change significantly while barely changing the loss.

Once again, this is in sharp contrast to how we view the landscape of approximations for a hard problem. If we have a cost function that varies gently over the space, then this actually makes approximating the function a little easier. But a cost function that has sharp spikes is a little trickier to approximate, because a small movement away from the optimal solution changes the cost dramatically.

There are some problems with this intuition. After all, a sharp potential well is good for gradient descent methods. But the difference here is that in estimation, you only care about the loss function as a tool to get at the true goal: the desired parameter. However, in much of algorithm design, the loss function IS the thing you're optimizing, and you don't necessarily care about the object that achieves that optimal loss. This goes back to my earlier point about the data versus the problem. If optimizing the loss is the end goal, then a certain set of tools come into play. But if the goal is to find the "true" answer, and the loss function is merely a means to this end, then our focus on problem definition isn't necessarily helpful.




Wednesday, September 04, 2013

Statistics, geometry and computer science.

Geometry is the glue between statistics and computer science. -- Michael Jordan
That's why everyone gets stuck in it. -- Graham Cormode

Yesterday was the first day of the "Big Data boot camp" at the Simons Institute. The idea behind these boot camps is to get participants in a program "on the same page". In a program like ours, with statisticians, ML folks, algorithms people and optimizers all mingling together, getting "on the same page" with "the same language" is critical. For a more general overview of the first day activities, see Muthu's post.

Michael Jordan opened the proceedings with a talk titled "Big Data: The computation/statistics interface". From a CS perspective, it was an excellent talk that hit just the right level of detail for me to grasp some basic terminology in statistics, as well as understand some of the questions people are pondering right now.

We think of big data as a PROBLEM. More data, more scaling issues, O(n) becomes infeasible, and so on. In particular, we think of large data as a workload that taxes our resources: time, space, communication, and so on.

Michael presented a counterpoint. From a statistical perspective, more data is better. Estimators converge, error reduces, and the law of large numbers kicks in. Rather than treating data as something that we have to manage efficiently, can we think of data (quality) as a resource itself that can be managed efficiently ? In other words, can we trade off estimator accuracy against other resources ?

He proceeded to give a wonderful exposition of the basics of statistical decision theory, including the frequentist and Bayesian views of estimator risk. Along the way, he described a geometric interpretation of the James-Stein estimator which is too neat to skip.

The James-Stein estimator is really a very counterintuitive object. Consider a set of samples $X_i \sim N(\theta_i, 1), i \le n$, where the $\theta_i$ are unknown parameters. The goal is to use the samples $X_i$ to estimate the $\theta_i$, with the loss measured by $\sum_i (\theta_i - \hat{\theta_i})^2$. The $\theta_i$ are assumed unrelated to each other.

The "obvious" estimator is $\hat{\theta_i} = X_i$. It's the MLE after all. But it turns out that this estimator is strictly dominated (in terms of expected loss aka risk) by the so-called shrinkage estimator:
\[ \hat{\theta_i} = (1 - \frac{c}{S^2}) X_i \]
where $S^2 = \sum_j X_j^2$.

What's surprising about this is that the estimator for a single $\theta_i$ takes into account samples from other $X_i$, even though the $\theta_i$ are assumed to be unrelated.
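
A short simulation makes the domination visible. This is my own sketch, not something from the talk; it uses the classical choice $c = d - 2$ and an arbitrary fixed vector of true means.

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 20, 20000
theta = rng.uniform(-2, 2, size=d)             # fixed, "unrelated" true means

X = theta + rng.standard_normal((trials, d))   # one N(theta_i, 1) sample per coordinate

mle = X                                        # the "obvious" estimator: theta_hat = X
S2 = np.sum(X**2, axis=1, keepdims=True)       # S^2 = sum_j X_j^2, per trial
js = (1 - (d - 2) / S2) * X                    # James-Stein shrinkage with c = d - 2

risk_mle = np.mean(np.sum((mle - theta) ** 2, axis=1))
risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))
print("MLE risk:        ", risk_mle)   # close to d = 20
print("James-Stein risk:", risk_js)    # strictly smaller, for any theta, once d >= 3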

It turns out that there's an elegant geometric interpretation of this estimator, first attributed to Galton by Stephen Stigler. Consider the pairs $(X_i, \theta_i)$ as points in the plane. We don't quite know where these points lie because we don't know what $\theta_i$ is. Because the $X_i$ are normally distributed around the $\theta_i$, each point is really a horizontal interval of interest.

Now the standard estimator $\hat{\theta_i} = X_i$ arises from trying to solve a regression problem of $X$ versus $\theta$, or more precisely solving the regression $\hat{X} = E(X\mid \theta)$. But really, what we're trying to do is solve the regression $\hat{\theta} = E(\theta \mid X)$. In other words, regression of $y$ against $x$ is different from the regression of $x$ against $y$, as long as we have more than three points. And this is precisely what the JS estimator says.

Returning to the question he started the talk with, can we show a formal tradeoff between data estimation accuracy and sampling cost ? In a recent paper with Venkat Chandrasekaran from Caltech, they show a very nice (and counterintuitive) result: that using a cruder relaxation of an optimization problem can actually lead to more efficient estimation as long as you have sufficient data. Note that this goes against the standard idea that a crude relaxation is "worse" in an approximation sense.

The idea is as follows. Consider the very simple problem of denoising where the observations $\mathbf{y}$ are generated from the input $\mathbf{x}$ by a noise perturbation:
\[ \mathbf{y} = \mathbf{x} + \sigma \mathbf{z} \]
where $\mathbf{z}$ is normally distributed. Let us assume that $\mathbf{x}$ is drawn from some set $S$ (for example, $S$ is the set of low-rank matrices, or the set of permutation matrices).

The simplest estimator for $\mathbf{x}$ is an $\ell_2$ projection: compute the sample mean $\overline{\mathbf{y}}$ and then find its projection onto $S$. But this involves a minimization over $S$, which might be intractable.
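
For the low-rank example, the $\ell_2$ projection has a closed form via the SVD, so the projection estimator is easy to write down. Here's a sketch under the assumption that $S$ is the set of matrices of rank at most $r$; the dimensions and noise level are arbitrary.

```python
import numpy as np

def denoise_low_rank(Y_samples, r):
    """Projection estimator: average the noisy observations, then project
    the average onto S = {matrices of rank <= r} by truncating the SVD."""
    Y_bar = np.mean(Y_samples, axis=0)            # sample mean of the observations
    U, s, Vt = np.linalg.svd(Y_bar, full_matrices=False)
    s[r:] = 0.0                                   # keep only the top r singular values
    return U @ np.diag(s) @ Vt

rng = np.random.default_rng(0)
d, r, n, sigma = 50, 3, 40, 1.0
X = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))   # true rank-r signal
Y_samples = X + sigma * rng.standard_normal((n, d, d))          # n noisy observations

X_hat = denoise_low_rank(Y_samples, r)
print("error of raw average:  ", np.linalg.norm(np.mean(Y_samples, axis=0) - X))
print("error after projection:", np.linalg.norm(X_hat - X))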

We can relax $S$ to some $S'$, where $S \subset S'$ and minimization becomes easier. But this would worsen the approximation ratio of the optimization depending on the relaxation. And here's where the insight comes in.

Suppose you instead look at the statistical risk of this estimator. In other words, look at the expected difference between the true $\mathbf{x}^*$ and the estimated $\hat{\mathbf{x}}$. The main result they show in this paper (paraphrased) is
\[ \text{expected risk} \;\le\; \frac{C}{n} \cdot \left(\text{geometric complexity of } S \text{ at } \mathbf{x}^*\right) \]
where $n$ is the number of samples.

Suppose we fix the desired risk. Then an increase in $n$ can be used to "pay for" increased "geometric complexity". And here's where the final idea comes in. The "geometric complexity" used here is the Gaussian-squared complexity, which is defined as
\[ g(D) = E[ \sup_{x \in D} \langle x, z\rangle^2 ] \]
where $z$ is normally distributed.

In particular, the set $D$ used in the above expression is the set of directions from $\mathbf{x}^*$ to points in $S$. Suppose $S$ is very "pointed" at $\mathbf{x}^*$. Then the Gaussian-squared complexity is small and the number of samples needed is small. But if instead we use a relaxation $S'$ that is "blunt", the Gaussian complexity goes up; with more samples, though, the risk stays fixed. If the optimization for $S'$ is significantly easier than the corresponding problem for $S$, then we still win, even though $n$ has increased, and even though the classical "approximation ratio" might have become worse.
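
Here's a toy Monte Carlo illustration of that intuition (mine, not from the paper): a tight bundle of directions ("pointed") has small Gaussian-squared complexity, while directions spread over the sphere ("blunt") have a much larger one. The specific sets below are arbitrary stand-ins.

```python
import numpy as np

def gaussian_sq_complexity(D, n_samples=5000, seed=0):
    """Monte Carlo estimate of g(D) = E[ sup_{x in D} <x, z>^2 ], z ~ N(0, I).
    D is an array whose rows are the (unit) directions in the set."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_samples, D.shape[1]))
    return np.mean(np.max((Z @ D.T) ** 2, axis=1))

d = 50
rng = np.random.default_rng(1)

# "pointed": a narrow bundle of directions around a single vector
base = np.zeros(d); base[0] = 1.0
pointed = base + 0.05 * rng.standard_normal((200, d))
pointed /= np.linalg.norm(pointed, axis=1, keepdims=True)

# "blunt": directions spread uniformly over the sphere
blunt = rng.standard_normal((200, d))
blunt /= np.linalg.norm(blunt, axis=1, keepdims=True)

print("g(pointed):", gaussian_sq_complexity(pointed))  # small: directions nearly coincide
print("g(blunt):  ", gaussian_sq_complexity(blunt))    # noticeably larger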

In the final part of his talk, Michael talked about his recent work on "bags of little bootstraps", which Muthu also covered in his post. This post is already too long, so I'll defer this to another time.

Tuesday, August 27, 2013

And now for something different...

People who know me through my blog probably assume I have no life at all outside blogging and big data.

oh wait...

But more seriously, my family is here with me during this sa-battle-cal and +Karen Ho is ably keeping up the non-work blogginess. If you're interested in our adventures outside big theories of data, check out her blog Simply Batty: you might even learn what "academic headcheese" is.

On "a theory of big data"

+Moritz Hardt kicked off our Simons big data program with an excellent rumination on the role of theory in "big data". Some followup thoughts:

Moritz makes the point that theory, for better or for worse (mostly for better), made the choice to give problems primacy over data. I harp on this a lot in my algorithms class, and also talked about this in the context of computational thinking: the idea of 'naming a problem' is a key intellectual contribution of theoryCS. 

But to look at what might happen if we allowed more flexibility in problem definition, we don't have to look too far. Machine learning (a discipline that faces data head on) is characterized by the study of problems that are well defined in the broad sense, but have a lot of wiggle room in the specifics (and I'm now hoping +Sebastien Bubeck will respond and critique what follows).

Consider classification. In a broad sense, it's very well defined: given a collection of points labeled with (1,-1), find a decision rule that can be used to separate the +s from the -s. But in the specifics: what's the decision rule ? what's the penalty for making a mistake ? how much labeled data do we have ? does it cost us to get labels ? and so on.

For each possible answer to these questions, you can construct a well defined problem. And you could focus on solutions to that very well defined problem. But that's not particularly important. Maybe I use hinge loss to capture errors in my classification, or maybe I use some other error function. I don't care as much as I care about a formulation that allows me to solve different flavors of the problem using  a single paradigm: possibly some form of gradient descent. 
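
A cartoon of what "a single paradigm" means here, as a sketch rather than anything canonical: one generic (sub)gradient descent loop for linear classification, where switching from hinge loss to logistic loss means swapping a single gradient function and nothing else.

```python
import numpy as np

def hinge_grad(w, X, y):
    """Subgradient of the mean hinge loss max(0, 1 - y * <w, x>)."""
    margins = y * (X @ w)
    active = (margins < 1).astype(float)
    return -(X * (active * y)[:, None]).mean(axis=0)

def logistic_grad(w, X, y):
    """Gradient of the mean logistic loss log(1 + exp(-y * <w, x>))."""
    margins = y * (X @ w)
    return -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0)

def train(X, y, loss_grad, steps=2000, lr=0.5):
    """One generic loop; the loss only enters through its (sub)gradient."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * loss_grad(w, X, y)
    return w

rng = np.random.default_rng(0)
n, d = 400, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = np.sign(X @ w_true + 0.1 * rng.standard_normal(n))   # labels in {-1, +1}

for name, g in [("hinge", hinge_grad), ("logistic", logistic_grad)]:
    w = train(X, y, g)
    print(name, "training accuracy:", np.mean(np.sign(X @ w) == y))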

This theme shows up again and again. In clustering. In regression. In the different flavors of learning (active, semisupervised, transfer, multitask, ...). A good solution to a problem focuses not on the specific definition, but on a general framework that captures different variations on the problem and reduces them to solving some optimization that can then be engineered. This is also why optimization (and understanding heuristics for optimization) is such a focus in machine learning (Bubeck's lecture notes are a great resource on this, btw).

There is of course a downside. The downside is that you (could) miss out on connections between problems: the reductions that are the lifeblood of work in theoryCS. In fact, John Langford has looked into this issue specifically, with his machine learning reductions project.

So returning to the original question, what should a theory of big data look like ? A big part of theoryCS research is the skillful management of resources (space, time, random bits, communication, what have you..). But an even more important part is the abstraction of computation as a first-order phenomenon, a "theory of the feasible", as it were. 

Consider the example of privacy. Privacy is a concern that arises from access to data, and is a core component of any "big data" discussion. What differential privacy achieved was to frame the problem computationally: both as a definition of what it is feasible to protect, and as a set of mechanisms that guarantee that protection (while delineating what cannot be protected).

Similarly, I'm interested in other problems arising out of our need to interact with data that can be framed abstractly in "the language of the feasible". There's work on how to value published data, and how to price it. There's work on how to verify computations, and how to do computations securely. There's work on how to understand and interpret data analysis. And then of course there's the large body of work on how to manage and compute efficiently with data under brand new models of computation. 

The "language of the feasible" is our not-so-secret weapon in defining abstractions: it's more than just resource allocation and management, and it's what gives theoryCS power to speak to other domains. 

Friday, August 23, 2013

Simons Institute opening, with pictures.

Today was the official kickoff for the Simons Institute, as well as the start of the two special programs in big data and real analysis. For the last few days I've been busy getting my paperwork organized over at Calvin Hall, which is a beautifully redone circular building named after the Nobel Laureate and Berkeley professor (more on the building here).

The inside is very conducive to collaborative work. The offices are organized around the periphery of the building, and are functional, but not overly so. The idea is that people will congregate in the large interior open spaces that are covered with whiteboards and comfy chairs.

This is what the interior looks like: (apologies for the picture quality)


The second floor open lounge


Comfy chairs in front of a whiteboard


even more chairs and more whiteboards


Let us not forget the very fancy coffee machine (which makes nice espresso)



and of course you need your Simons Institute mug to drink it in.


This is the main auditorium for workshops, on the first floor.

The next plan is to pave the outer walls of the building at ground level with chalkboards, so people can think and chat outside as well as inside. Ah, Berkeley weather. I'm told the boards are already there, and just need to be installed, so I hope to have pictures soon. 

There are 93 visitors between the two programs (53/39 by my rough count) which includes 22 Fellows. Offices are either two or three-person, and there are some large cubicle spaces with 7+ inhabitants. The visitors are all on the second and third floors of the building, which both have large open areas (the pictures above are from the second floor). 

Wednesday, August 14, 2013

The Sa-battle-cal starts..

I'm off on sabbatical, and we (two cars, two adults, two children and one cat) just started the long drive to Berkeley from SLC. This has turned out to be more exciting than I anticipated...

Our original plan was to drive 4 hours each day, making up the 12 hour drive to Berkeley in three days. With two drivers for two cars, this seemed like the best option to prevent us from getting over-tired.

Right at Wendover (Las Vegas for poor people!), about halfway on our first leg, my car broke down. Thankfully, I was able to coast it to a mechanic's shop just off the highway as we entered town. I had no clue what the problem was, and the mechanic wouldn't be able to take a look at it for a few hours (this is the week of the Bonneville races on the salt flats).

So I called my mechanic back in Salt Lake, and described the problem to him. He diagnosed it on the spot as a faulty ignition coil, which is apparently an easy part to procure and replace.... if you're near a dealership.

Which I was not...

He also needed me to figure out which coil was failing, which needed a computer scanner, which needed a mechanic.

So....

Here's what needed to happen, in short order. It was now 5pm. We (my wife and I) needed to

  • get the mechanic to at least scan the car to get the appropriate error codes
  • Call the car parts store (closing at 6) to see if they could procure the needed parts
  • Find a hotel in Wendover (did I mention this was the week of the Bonneville Races, and almost everything in town was booked?)
  • Change our reservations downstream because we were stuck in Wendover. 
Thankfully, through a series of small miracles and the generosity of many strangers and non-strangers, we managed to get all of this done. My mechanic even offered to walk me through doing the installation myself once I did the appropriate YouTube self-study (MOOCs FTW !!)

Long story short, it's now day II of our trek. We doubled up the driving to make up for lost time, and we're back on schedule (minus one set of car keys that I managed to lose in all the hurry). 

So far, this sabbatical is certainly not relaxing. 

p.s Shout out to Utah Imports of SLC, and S&R auto repair in Wendover. 

p.p.s I-80 through Nevada is one of the most mind-numbingly boring drives imaginable. 

Monday, August 05, 2013

Hi ho, hi ho, it's off to sabbatical we go...

It is now 7 days and counting before I head off on sabbatical, not to return till August 2014. I'll be heading to the Simons Institute to think big thoughts about the theory of data (or was it big data about theory, or maybe just the theory of big data). After that, I'll be enjoying the fine life at Google for a semester, followed by a visit to MADALGO in Aarhus (because big data in Europe just tastes better, you know).

I've been using this summer to nurse my psychic wounds from six years of the grind, and gently slide into sabbatical mode. The rush of joy that comes every time I delete a departmental email without reading beyond the subject tells me I'm ready :).

So far, advice I've received on how to conduct myself during a sabbatical includes:

  • Don't plan too much
  • Work very hard, but on something other than your current research
  • Have an adventure, and if work happens, don't blame yourself. 
  • Say NO repeatedly (this also applies to life post-tenure, apparently). I maybe took this advice a little too literally and managed to decline a review request in about 0.5 seconds, which surprised (and possibly annoyed) the editor who had made the request. 
  • Do something different (or do something that you've been meaning to do for years but never got a chance to). 
What else ? 

Friday, July 05, 2013

FSTTCS in December

At this point, you're probably wondering: exactly how much more coffee do I need to infuse into my system to get my 1/3/10  papers submitted to SODA before the deadline ? Do your stomach lining (and your adenosine receptors) a favor and consider submitting to FSTTCS: the abstracts deadline is Jul 8, and the submission deadline is July 15. You get to go to Guwahati in December and you might even get to stay here:



Wednesday, June 26, 2013

SODA 2014 abstract submission deadline approaching

+Chandra Chekuri informs us that the abstract submission deadline for SODA 2014 is approaching rapidly. Abstracts MUST be in by July 3 or else you cannot submit a full paper (by Jul 8).

For more information, visit the SODA site: http://siam.org/meetings/da14/submissions.php

Tuesday, June 04, 2013

Computational thinking in other disciplines.

While we wait for the (literal) flames to die down at the STOC business meeting (who'd have thought a two-tier PC would be so controversial), a thought on "computational thinking". 

I've been on PhD committees for students in other departments far removed from computer science. I'm there partly as an external person, and partly as someone who "understands computation" and can guide the student in a project that involves some nontrivial algorithmic thinking.

It's very striking to see how these students approach the notion of computing. They have strong engineering/math backgrounds, but they tend to view computational problems as black boxes for which the first idea that comes to mind is the correct answer. This is pernicious enough that when I try to nudge them to define the problem more generally, they have no idea what to do.

This is a common theme in disciplines that make use of computation. Everyone has their favorite biology story along these lines, but this article in Science magazine is a good example of the problem of scientists using code. In brief, they tend to use code as a blackbox without really understanding what the code does or why, or how to validate the results.

In my algorithms class, I emphasize the importance of naming problems. Especially when one is trying to model a fuzzy real-world problem, it's very important to recognize named problems that might be hidden inside. But this very fundamental idea - that the problem comes first, and that by naming it we can recognize and understand it - is foreign to people not trained to think computationally. Rather, they think in terms of solutions, which is dangerous until they really know what they want to solve.

We talk a lot about computational thinking and the intellectual content of computer science. For anyone without CS training who wishes to make use of computational tools, this is one of the most important lessons to learn: how to name a problem.




Wednesday, May 29, 2013

MOOCification and summer programs

I was walking across campus today and saw the usual crowds of high school and middle school students doing various summer camps (the U. does a GREAT - yes it's called GREAT - summer camp in robotics/graphics, for example).

In our MOOCified future where all classes are taught by "course assistants" who report to "quality assurance staff", will there still be a way for excited middle and high school students to come to a university campus and get an interactive hands-on education at a camp ? I don't doubt that Udacity (or Coursera, or edX) will be happy to spin off a junior varsity division for the high schoolers (and if you do, please note that I said it first). And that will be one more demographic that the university loses as our educational mission gets sliced and diced.

Sunday, May 19, 2013

21st century proverbs

Having nothing better to do, I decided to create a twitter meme: updating proverbs for the 21st century. Use the hashtag #21stcentproverbs and join in the fun below.

 

Coding, Complexity and Sparsity (SPARC) 2013.

Atri Rudra reminds us that the 3rd incarnation of SPARC is soon to be upon us (disclaimer: I'll be speaking at the event):
Efficient and effective transmission, storage, and retrieval of information on a large-scale are among the core technical problems in the modern digital revolution. The massive volume of data necessitates the quest for mathematical and algorithmic methods for efficiently describing, summarizing, synthesizing, and, increasingly more critical, deciding when and how to discard data before storing or transmitting it. Such methods have been developed in two areas: coding theory, and sparse approximation (SA) (and its variants called compressive sensing (CS) and streaming algorithms). 
Coding theory and computational complexity are both well established fields that enjoy fruitful interactions with one another. On the other hand, while significant progress on the SA/CS problem has been made, much of that progress is concentrated on the feasibility of the problems, including a number of algorithmic innovations that leverage coding theory techniques, but a systematic computational complexity treatment of these problems is sorely lacking. The workshop organizers aim to develop a general computational theory of SA and CS (as well as related areas such as group testing) and its relationship to coding theory. This goal can be achieved only by bringing together researchers from a variety of areas. 
The Coding, Complexity and Sparsity workshop (SPARC 13) will be held in Ann Arbor, MI on Aug 5-7. 
These will be hour-long lectures designed to give students an introduction to coding theory, complexity theory/pseudo-randomness, and compressive sensing/streaming algorithms. We will have a poster session during the workshop and everyone is welcome to bring a poster but graduate students and postdocs are especially encouraged to give a poster presentation. 
This is the third incarnation of the workshop and the previous two workshops were also held in Ann Arbor in August of 2011 and 2012. 
Confirmed speakers:
* Jin Yi Cai (University of Wisconsin, Madison)
* Shafi Goldwasser (MIT)
* Piotr Indyk (MIT)
* Swastik Kopparty (Rutgers University)
* Dick Lipton (Georgia Tech)
* Andrew McGregor (University of Massachusetts, Amherst)
* Raghu Meka (IAS)
* Jelani Nelson (Harvard)
* Eric Price (MIT)
* Christopher Ré (University of Wisconsin, Madison)
* Shubhangi Saraf (Rutgers University)
* Suresh Venkatasubramanian (University of Utah)
* David Woodruff (IBM)
* Mary Wootters (Michigan)
* Shuheng Zhou (Michigan)

We have some funding for graduate students and postdocs with preference given to those who will be presenting posters. For registration and other details, please look at the workshop webpage: http://eecs.umich.edu/eecs/SPARC2013/

Friday, May 10, 2013

On GPU algorithms

Lance talks about GPU algorithms in his latest post:
The theory community hasn't seem to catch on yet. There should be some nice theoretical model that captures the vector and other operations of a GPGPU and then we should search for algorithms that make the best use of the hardware. The theory world writes algorithms for non-existent quantum computers but not for the machines that we currently use.
Well.

As someone who was doing GPU algorithms back when this meant flipping state bits on the fixed function pipeline (this is the GPU analog of "I wrote video games in assembly"), maybe a little perspective is in order.

Aside: I actually gave a talk on GPU algorithms at Chicago when Lance was there back in 2004. Clearly I wasn't interesting enough to get him to take note :).

Back in the early 2000s, I got quite interested in GPU algorithms. This came about from a summer project that Nabil Mustafa was doing with Shankar Krishnan and myself on real-time map simplification. Shankar had this cool idea to try and accelerate the Douglas-Peucker algorithm by implementing it on a GPU, which at the time meant trying to retrofit the algorithm to a very strange OpenGL rendering pipeline.

One thing led to another, and it got us thinking more abstractly about what kinds of computations could be done on the GPU. This was pre-CUDA, which meant that essentially we could abstract the GPU machine model as a massively parallel SIMD architecture in which each "processor" ran a simple straight-line program (no loops, limited conditionals, and small memory - much like a streaming algorithm). The parallelism lent itself nicely to geometric problems, and we wrote papers on map simplification, depth contours, and general geometric optimization. We also designed an algorithm for a CSG rendering problem that had buried inside it a simple streaming model for proving lower bounds: in fact we were able to show an exponential gap (in the number of streaming passes needed) between deterministic and randomized median finding in this model (the conference wasn't interested in the lower bounds though :)).

In a bout of perfect timing, I stopped working on GPUs a year or so before CUDA. At the time, my reasons were simple. I was tired of the computational model changing on me every six months, and didn't have the resources to keep buying the latest and greatest graphics cards (although AT&T was very generous in getting me a state of the art card originally). It was also annoying that in order to really exploit the power of the hardware, I needed secret sauce that was only revealed if you had NVIDIA inside connections (or went to the NVIDIA U events).

Then CUDA came along and everyone went nuts. If you go to hgpu.org (originally gpgpu.org) you'll see the number of papers being published every year on GPU-accelerated methods. CUDA changed (quite radically) the way in which we thought about GPU algorithms: the memory model became more sophisticated, and the SIMD component was more powerful, although it was still a good idea to avoid excessive looping and conditionals.

Eventually I got dragged back into the GPU world. I collaborated on a paper on graph coloring which was very instructive in demonstrating the differences between a GPU and a straight-up parallel machine. I also did a series of lectures on GPU algorithms at a 2012 MADALGO summer school on new models of computing. Incidentally, folks at Utah spend a lot of time working on issues relating to GPUs: Mary Hall's group has some nifty autotuning tools for mapping algorithms to the GPU, and Ganesh Gopalakrishnan has been thinking about memory consistency issues for GPUs.

Is there a good "GPU model" that theoreticians can study ? I think this is the wrong question. I think the GPU is one point (multi-core CPUs are another) in a space of tradeoffs between local computation, memory latency, and communication. I've been saying for a while now that communication appears to be the key resource to understand theoretically when dealing with our new networked/distributed/multiprocessor world. GPU algorithms are interesting to study as one example of how communication affects a computation. But the "right" way to deal with this involves focusing on communication, and not the GPU model per se. The latter has many idiosyncrasies and what I'd consider distractions that take away from a clean formal understanding.

I should add that I'm not even the only person from theory-land who's been thinking about GPU-derived models. There's a great body of work on multicore models (see Phil Gibbons' lecture notes at the same summer school), and I recall Michael Mitzenmacher having recent work on GPU-accelerated hashing. Uzi Vishkin had a STOC workshop a few years ago on multicore models as well. (Turing Award winner) Valiant's bridging model clearly needs to get a lot more attention as well so that people don't think no one's paying attention to the models.

For more on this, also see my colleague Jeff Phillips' lecture notes on GPUs from his class on new models of computation.

Saturday, March 23, 2013

Free, Freemium, and Paid

There was a time when I'd bridle at the idea of having to pay for software or services. But I browse the iTunes app store now, and see people pleading to have the chance to pay for an app that they like, so that the authors won't stop updating it. The whole kerfuffle with Google Reader, Keep and Evernote is another example of how people have begun to prefer to pay for products, rather than rely on something free.

It feels like the end of an era where open source and free software (not the same thing, but often referred to in the same breath) were the default. Maybe we've come to the realization that nothing is really ever free, and that it's more realistic to get the costs out in the open rather than "being the product".

Friday, March 15, 2013

The SIGACT CG column

As you might have just discovered, I'm the second half of the two-headed monster that's taken over the SIGACT Geometry Column after Joe O'Rourke stepped down (Adrian Dumitrescu, who hopefully does not mind being referred to as the head of a monster, is the other half). My first column is up, and it talks about some recent fascinating developments in nonnegative matrix factorization.

My hope with the pieces I write is to cover areas of geometry that may have not had sufficient representation in the past (especially things closer to problems I'm interested in). My next column is due in August, and apart from doing a wrapup on SoCG, other things that come to mind include Laplacians and graph geometry, reproducing kernels, or even Bregman geometry.

But everything is now mobile, crowdsourced and social networked, so I'm looking for your suggestions on interesting topics to cover, new emerging areas, results that I'm not tracking, and so on. So post away here or on G+.

Monday, February 25, 2013

Data analysis, interpretation and explanations

There was a recent data-related kerfuffle between the New York Times and the makers of the Tesla electric car. If you haven't read the articles, (and the NYT public editor's post on this has good links), the crude summary is this:

  • NYT reporter takes Tesla on long test drive, and reports problems with running out of charge.
  • Tesla Motors CEO Elon Musk writes a long takedown of the reporter's review, complete with graphs generated from data the car recorded during the drive.
  • NYT comes back with their own interpretation of data, rebutting Musk's claims.
  • Others attempt to reproduce the reporter's experience and fail, but arguably in different weather conditions that might or might not make a difference.
In an insightful meta-analysis of the dustup, Taylor Owen of the Tow Center for Digital Journalism discusses the implications for journalism in a data-driven world. He also references an article by David Brooks that makes the point:
People are really good at telling stories that weave together multiple causes and multiple contexts. Data analysis is pretty bad at narrative and emergent thinking...
I've felt for a while now that it's time to design mechanisms for providing context and narrative to data analysis*. Some of the research my student Parasaran does is on metaclustering: essentially the meta-analysis of different perspectives (clusterings) of data to draw out a larger meaning. We've also just submitted a paper on how to derive local markers of prediction validity in clustering (rather than just relying on global measures of quality). And my seminar this semester is about the larger problems of explanations and accounting in data mining.

I think as computer scientists, we have a lot to offer in the realm of data mining - not just in the design of tools for prediction, but in the design of tools to facilitate better understanding.


* None of this is surprising to experimental scientists, who will routinely attack a problem from multiple directions in order to acquire a larger understanding of a phenomenon rather than just the ability to predict. 


Friday, February 08, 2013

TCS+ hangout on the TSP polytope.

+Thomas Vidick  and the folks at the +TCS+ community have started a new experiment in G+ hangout talks. The first talk in this series was by Ronald de Wolf on the STOC 2012 paper that showed that there was no polynomial-sized lifted representation of the TSP polytope.

Overall, it was a very pleasant experience. I was able to reserve one of the 10 slots (a Hangout limit) for myself and my students at the University of Utah to attend and interact with the speaker, and from Thomas's post-review it seems that many more had signed on for "view-only" access to the stream.

There were very few technical hiccups, and +Oded Regev did a great job making sure that people were muted when not talking, and had a chance to ask questions in an orderly way. The chat sidechannel was a great help.

The talk itself was the best part: de Wolf did an excellent job conveying the main ideas of the proof without getting bogged down in details, and it felt as comfortable as listening to a talk live at a conference. Given the number of people listening in, this was already approaching medium-sized-workshop levels.

I'm looking forward to more of these events, and I'm glad that the +TCS+  folks are doing this. I also hope they can try more experiments with Google Hangout. For example, two ideas come to mind:

  • A reddit-style AMA ("Ask Me Anything"). One way to make this work is for the speaker to give a short presentation (maybe 5-10 minutes) and then open up the floor for questions. To keep things manageable, people could write in questions on chat, and the moderator could filter them and ask the questions live. With sufficient preparation, some questions could be submitted ahead of time.
  • A live panel discussion with a moderator and a few participants, again with audience questions filtered by the moderator.

Monday, January 21, 2013

Accountability in Data Mining: A new seminar

Glencora Borradaile makes a number of interesting points about the implications of Aaron Swartz's life and work for us academic computer scientists. 
As computer science academics we are in a very powerful position. We are trusted with shaping the next generation that will make very important decisions that will have far-reaching social implications. Decisions like those over Facebook’s privacy defaults, motivating technology that enables autonomous private vehicles at the expense of the public interest, defining ownership of electronic media. We make those decisions ourselves in our research; what we research, how we allow our research to be used.
 Whether or not you agree with her particular policy preferences, the point remains that the technologies we develop can have lasting consequences for the "shape" of the world to come. And if we don't speak up (either through our work, or through our own advocacy), others will take the technologies we develop and find their own ways of using or misusing them.

All of this leads up to my current interests in data mining. I've been thinking about the problems of accountability in data mining for a while now: initially mostly in private, but now more and more in public (along with +Graham Cormode and +Andrew McGregor) as I see the importance of the issue.

What is accountability in data mining ? It means many things really, but for me, for now, it means this:

The fruits of data mining pervade every aspect of our life. We are recommended books and movies; given differential pricing for insurance; screened for potential terror threats; diagnosed with various diseases; and targeted for political advertising. The ability to sift through massive data sets with sophisticated algorithms has resulted in applications with impressive predictive power. 
Since the internals of a learning process can be complex and opaque, it is important to verify that the results satisfy the properties claimed by the algorithm. The importance of this goes beyond merely checking the algorithm itself. Validation mechanisms also provide a way for users to demand accountability from authorities who might use the results of mining to affect users in some way (by changing their insurance rates, or even putting them on a terrorism watchlist). As the results of data mining affect more and more of our lives, it becomes ever more crucial that users be able to validate decisions made on their behalf and that affect them. 
The above is an introduction to a seminar I'm running this semester on this topic. I'm a little nervous about it, because the topic is vast and unstructured, and almost anything I see nowadays on data mining appears to be "in scope". I encourage you to check out the outline, and comment on topics you think might be missing, or on other things worth covering. Given that it's a 1-credit seminar that meets once a week, I obviously can't cover everything I'd like, but I'd like to flesh out the readings with related work that people can peruse later.

It's entirely possible that we don't care about our privacy any more (as Facebook seems to think*). But we will care about the consequences of the use of our data. My perspective is that a better understanding of what is technically possible (and not possible) to demand and get accountability will be critical to informing larger policy discussions on this topic.

* In an earlier version of this post I wrongly attributed this sentiment to an individual. Apologies.

Wednesday, January 16, 2013

A sampling gem: sampling from $\ell_p$ balls.

A while back, I had written about uniform sampling from the $d$-simplex. In brief, sample exponentially distributed random variables, and normalize them. Note that the simplex is one piece of the $d$-dimensional $\ell_1$ unit sphere.

I had also mentioned a similar trick to sample from the ($\ell_2$) unit sphere: sample Gaussian random variables, and normalize them.

As I discovered today, these are really just special cases of a much more general result, one that also provides a general way to sample from the $d$-dimensional unit ball (as well as the sphere) under any $\ell_p$ norm.

The result is due to Barthe, Guedon, Mendelson and Naor, and is breathtakingly elegant. Here it is:

To sample from the unit ball under the $\ell_p$ norm in $d$ dimensions, randomly sample $d$ coordinates $x_1, \ldots, x_d$ i.i.d. from a density proportional to $\exp(-|x|^p)$. Also sample $Y$ from the exponential distribution with parameter $\lambda = 1$. Then the desired sample is
$$\frac{(x_1, x_2, \ldots, x_d)}{\Bigl(\sum_i |x_i|^p + Y\Bigr)^{1/p}}$$
Notice that merely dropping the $Y$ recovers the procedure for sampling from the unit sphere, and includes both of the earlier sampling procedures as special cases. It's far more efficient than rejection sampling, or even than Metropolis-style sampling from the theory of sampling from convex bodies. Also, if you only want to sample from an $\ell_2$ ball, you can do it without this hammer by using the sphere-sampling technique together with a nonuniform sample of the length, but this seems so much more elegant.
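
To make the recipe concrete, here's a quick Python sketch (the helper names are mine). The one extra ingredient is the standard fact that if $G \sim \mathrm{Gamma}(1/p, 1)$, then a random sign times $G^{1/p}$ has density proportional to $\exp(-|x|^p)$, so no rejection is needed for the coordinates either.

import numpy as np

def sample_lp_ball(d, p, rng=None):
    # Uniform sample from the unit l_p ball in R^d, following the recipe above.
    rng = rng if rng is not None else np.random.default_rng()
    g = rng.gamma(shape=1.0 / p, scale=1.0, size=d)        # Gamma(1/p, 1)
    x = rng.choice([-1.0, 1.0], size=d) * g ** (1.0 / p)   # density ~ exp(-|x|^p)
    y = rng.exponential(scale=1.0)                          # Exp(1)
    return x / (np.sum(np.abs(x) ** p) + y) ** (1.0 / p)

def sample_lp_sphere(d, p, rng=None):
    # Dropping Y yields a sample from the unit l_p sphere instead.
    rng = rng if rng is not None else np.random.default_rng()
    g = rng.gamma(shape=1.0 / p, scale=1.0, size=d)
    x = rng.choice([-1.0, 1.0], size=d) * g ** (1.0 / p)
    return x / np.sum(np.abs(x) ** p) ** (1.0 / p)

Setting $p = 2$ recovers the Gaussian-and-normalize trick for the Euclidean sphere, and $p = 1$ with the random signs dropped (i.e., restricting to the positive orthant) recovers the exponential trick for the simplex.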

Thursday, January 10, 2013

LEGO and math teaching

I was discussing Paul Tough's book with my wife yesterday at breakfast, and we somehow got onto my pet peeve: the mechanical way in which math is taught in schools, and how math gets reduced to arithmetic and counting. Of course the definitive word on this is Paul Lockhart's A Mathematician's Lament.

I looked over the floor of the living room, strewn with random bits of deadly adult-foot-stabbing Lego, and had a minor epiphany. As anyone with small children knows, LEGO sells a lot more Hollywood tie-in kits now compared to the "bags of bricks" that it used to sell. You can buy the Millennium Falcon, an X-wing fighter, a Spider-Man action scene, or some kit related to any cartoon on Nickelodeon.

Which is a pain. Lots of specialized parts, key pieces that, if misplaced, cause massive bouts of tears and anguished searching, and so on.

But how do kids play with them ? They put a kit together carefully from the instructions the first time. This lasts about a day or so. Then they get bored, take it apart, mix up all the pieces with the OTHER kits they have, and spend many happy days building all kinds of weird contraptions.

Here's the thing about the kits. They show you how to build fairly complicated devices and doohickeys. The kids pick up on this, and they mix and match the doohickeys in their own new constructions. Essentially they figure out the functions of the different parts, and realize how to build brand new things with them.

But suppose they were told that all they could do was repeatedly assemble the kits (and maybe even different kits) over and over again. It's unlikely they'd learn anything more than just how to assemble kits really quickly. They'd also get terribly bored. In fact, the popular toys in our house are multipurpose objects: any single-purpose toy gets discarded rather quickly.

To cut a long story short, it feels to me that a lot of math (and theoryCS) education at the school and college level involves looking at various "kits", and seeing how they get put together. You get rewarded for putting kits together correctly, but not for playing with the kits and creating bizarre new objects. But finding a way to let students (in an unstructured way) play with mathematical concepts is the key to having them really understand the concepts and their different facets.

Tuesday, January 08, 2013

ICERM and Simons postdocs

Two upcoming TCS postdoc deadlines:

ICERM Program on Network Science and Graph Algorithms

This is a program out of Brown organized by Kelner, Klein, Mathieu, Shmoys and Upfal. It sounds quite fascinating if you're doing anything with graph data and spectral analysis. The deadline for postdoc applications is Jan 14.

Simons special program on the theory of big data. 

As I've mentioned before, the Simons Institute is running a semester-long program on the theory of big data. The deadline for applying for postdocs is soon (middle of January).

These two programs are coordinating, and if you're so inclined you might be able to spend one semester at Berkeley and another in Providence. Please let the organizers know if this is something you're interested in.

Monday, January 07, 2013

SODA 2013 4/n: Business

If you haven't been following my live-tweets at the SODA business meeting, here's a summary of the unusually quiet meeting:
  • Attendance at the conference was 311, which is quite low, but is a local high for New Orleans (third time in the last 10 years). 
  • 135 papers were accepted out of a net 459 submissions. This is an acceptance rate of nearly 30%, which should strike fear into the hearts of tenure-track assistant professors everywhere. 
  • PC members had to review "only" 42 papers each. Yes, I know this is insane. And no, we're not going to do anything about it. 
  • Shiri Chechik received one of the two best student paper awards for her paper "New Additive Spanners". Bernhard Haeupler received the other for "Simple, fast and deterministic gossip and rumor spreading".
  • I've already mentioned the two best papers, on graph minors and dynamic connectivity.
  • SODA 2014 is in Portland, land of beer, books and birds. Chandra Chekuri is chair. 
  • After Salt Lake City observers had to withdraw because of serious voting irregularities, the trumped up unfair winner of the SODA 2015 sweepstakes was San Diego. But since we never go with our first choice location (Honolulu, anyone?), San Francisco was listed as a second choice, with DC as a third choice. 
  • David Johnson is handing the baton over to Cliff Stein, after running SODA since before I knew it wasn't a fizzy drink. 

SODA 2013: Part 3/n

I just attended Valerie King's talk on her paper with Bruce Kapron and Ben Mountjoy that does dynamic connectivity in worst-case polylog time (randomized). This paper received a Best Paper award (along with the Grohe et al paper on graph minors I mentioned yesterday).

The problem is a classic: given a stream of updates (edge insertions/deletions) to a graph, maintain a data structure that can answer "Are $u$ and $v$ connected?" efficiently. There's been a slew of work on the problem: if you want worst-case bounds for updates, the best until now was $O(\sqrt{n})$ per update, by Eppstein, Galil, Italiano and Nissenzweig (from 1992). There are polylogarithmic amortized bounds, but they can incur worst-case update times of $\Theta(n)$.

In this paper, after 20 years, they knock down the worst-case update times to polylogarithmic: the algorithms are randomized (with 1-sided error: it might occasionally declare two nodes disconnected when they were connected).

The idea itself is very elegant, and it connects to techniques in streaming algorithms in a way that I think is neat. WARNING: IANAE in this area, so my rendition might be slightly naïve, and is drawn from my understanding of Valerie King's excellent presentation.

The basic problem is this: I can maintain connectivity by maintaining a collection of trees (a spanning forest). Adding an edge can be done without too much difficulty (indeed, if all you want to do is insert, you're back to union-find). It's the combination of insertions and deletions that makes things hard: if you delete an edge and disconnect a tree, you need to quickly determine whether some other non-tree edge can reconnect the pieces, or whether the two sides are now really disconnected.
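
(As an aside on the insertion-only baseline: a bare-bones union-find, with the usual path compression and union by size, is all you need there. This is the standard textbook structure, not anything from the paper; it's sketched here mostly for contrast with what deletions demand.)

class UnionFind:
    # Insertion-only connectivity: union on edge insert, find for queries.
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, u):
        while self.parent[u] != u:
            self.parent[u] = self.parent[self.parent[u]]  # path compression
            u = self.parent[u]
        return u

    def union(self, u, v):  # called when edge (u, v) is inserted
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return
        if self.size[ru] < self.size[rv]:
            ru, rv = rv, ru
        self.parent[rv] = ru
        self.size[ru] += self.size[rv]

    def connected(self, u, v):
        return self.find(u) == self.find(v)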

More generally, consider the cutset maintenance problem. You have updates as before, but now a query is: given a set $S$, find a cut edge between $S$ and $V - S$ if one exists. It's not hard to see that this is the core routine you need for maintaining the trees.

The first elegant idea is this: suppose you represent each edge by a bit string consisting of the encoding of one endpoint followed by an encoding of the other. For example, the edge 2-3 might be written as 010011. Now for each vertex, compute the XOR of all the edges adjacent to it and call this its signature.

If you XOR the signatures of all the vertices in a set $S$, and if $S$ has exactly one cut edge, then what you get is the encoding of that edge ! This is because each edge with both endpoints inside $S$ occurs twice and therefore cancels out.

So this works if your cut set is of size 1. But suppose it's not ? Now you maintain $O(\log n)$ sets for each vertex. Each edge adjacent to the vertex is thrown into the $i^{\text{th}}$ set with probability $1/2^i$. With some appropriate duplication, you can show that if the cut set has size $k$, then w.h.p. the $(\log k)^{\text{th}}$ set contains exactly one cut edge, and that's the edge you can retrieve when you need a replacement.
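
Here's a toy Python rendition of these two ideas (all names and details are mine; in particular, the real algorithm needs consistent randomness across updates, duplicated copies of the level structure, and integration with the tree data structure, none of which appear here). Each vertex keeps, per level, the XOR of the encodings of its sampled incident edges; XORing over $S$ cancels the internal edges, and with luck some level is left holding exactly one crossing edge.

import random

N_BITS = 20                  # assume vertex ids fit in 20 bits
MASK = (1 << N_BITS) - 1
LEVELS = 8                   # would be O(log n) in general

def encode(u, v):
    # bit string of one endpoint followed by the other, e.g. 2-3 -> 010|011
    u, v = min(u, v), max(u, v)
    return (u << N_BITS) | v

def decode(c):
    return c >> N_BITS, c & MASK

def edge_level(u, v, seed=0):
    # an edge lands in level i with probability 1/2^i (levels are nested here);
    # deriving the randomness from the edge itself keeps both endpoints consistent
    rnd = random.Random(encode(u, v) ^ seed)
    i = 0
    while i < LEVELS - 1 and rnd.random() < 0.5:
        i += 1
    return i

def build_signatures(n, edges):
    # sig[i][v] = XOR of encodings of the edges incident to v that reach level i
    sig = [[0] * n for _ in range(LEVELS)]
    for (u, v) in edges:
        c, top = encode(u, v), edge_level(u, v)
        for i in range(top + 1):
            sig[i][u] ^= c
            sig[i][v] ^= c
    return sig

def find_cut_edge(S, edges, sig):
    # XOR the signatures over S: edges with both endpoints in S cancel out.
    # If some level has exactly one crossing edge, its encoding survives; we
    # check the candidate against the edge set because other levels may yield
    # garbage. The guarantee is only "with high probability".
    edge_set = {(min(u, v), max(u, v)) for (u, v) in edges}
    for level in sig:
        acc = 0
        for v in S:
            acc ^= level[v]
        if acc and decode(acc) in edge_set:
            return decode(acc)
    return None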

There's a lot more technical work to do to get the whole algorithm working, and I won't go into that. But those of you who are familiar with streaming will recognize something here.

In essence, they've created a sketch for the set of edges adjacent to the vertex, and they have a way to represent the set compactly and retrieve individual elements from it. The trick with exponentially decaying levels is standard in streaming count estimation, and the GF(2) trick for storing the sketch is very similar to the trick used by Ahn, Guha and McGregor in their SODA 2012 paper on graph sketching.

And that's what I thought was the best part: that ideas that really have power in streaming can be used to help solve a longstanding open problem in graph algorithms. I should point out that the connection is after the fact: this paper was developed independently of the streaming work. But one does have to wonder what other connections we might be able to establish between sketching tricks on graphs and basic graph algorithms.

SODA 2013: Part 2/n

My previous post is here.

Day 2 of SODA, and the tenth time I've been asked "are you chairing all the sessions" ? No, it's just that many of my PC colleagues didn't (or couldn't) show up :), so those of us who did are doing more of the lifting. As a reward, we got a nice dinner in the French Quarter, and I tasted boudin for the first time (and maybe the last). 

An interesting talk this morning by Dror Aiger on reporting near neighbors. They were able to show a not-super-exponential relation between the number of points at unit $\ell_\infty$ distance from a query and the number of points at unit $\ell_2$ distance. This was wrapped into a fast algorithm for reporting Euclidean near neighbors in high dimensions, with some interesting (if preliminary) experimental backing in comparison with ANN, FLANN and LSH. 

Jan Vondrák gave the invited talk on submodular optimization. I mentioned Satoru Fujishige's talk at NIPS, and this was an excellent complement. Fujishige's talk focused on the geometric intuition behind submodular functions (especially the associated polymatroid). Vondrák's talk focused on the algorithmic implications of submodularity, and he gave very convincing arguments for why it can be viewed as discrete convexity OR discrete concavity, or even neither. He pointed out how the Lovász extension is useful for minimization and the multilinear extension is more useful for maximization, and gave a number of "recipes" for designing algorithms that optimize submodular functions. I hope the slides go online at some point: they were very clear and well-balanced. 
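
As a concrete aside (this is standard textbook material, not anything specific to the talk): the Lovász extension of a set function can be evaluated by sorting the coordinates and taking a telescoping sum over prefix sets, and for submodular $f$ the resulting function is convex, which is what makes it the right tool for minimization.

def lovasz_extension(f, x):
    # Evaluate the Lovász extension of a set function f at a point x in [0,1]^n:
    # sort coordinates in decreasing order and telescope f over the prefix sets.
    order = sorted(range(len(x)), key=lambda i: -x[i])
    value, prev, prefix = 0.0, f(frozenset()), set()
    for i in order:
        prefix.add(i)
        cur = f(frozenset(prefix))
        value += x[i] * (cur - prev)
        prev = cur
    return value

# Example: f(S) = min(|S|, 2) is a (submodular) matroid rank function, and its
# Lovász extension at x = (0.5, 0.2, 0.9) is 0.9*(1-0) + 0.5*(2-1) + 0.2*(2-2) = 1.4.
print(lovasz_extension(lambda S: min(len(S), 2), [0.5, 0.2, 0.9]))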

There was some discussion over whether next year's SODA should adopt the two-tier PC that STOC is currently experimenting with. The jury's still out on that, and since the STOC PC is not done with their work, we don't yet have formal feedback. I will admit to being a little frustrated with the level of conservativeness on display here: it's not as if EVERY OTHER COMMUNITY IN CS doesn't do this and doesn't have best practices that we can learn from, and given our reviewing loads, it's really crazy that we aren't desperately trying things to alleviate the problem. 

Sunday, January 06, 2013

SODA 2013, Part I/n

On twitter, it's common to post longer thoughts in 140-character bursts. If you don't know how many parts the thought will take, you label them 1/n, 2/n, and so on, so as not to commit to an arbitrary count while still implying that there is a finite one. 

So we're back in the Big Easy for SODA. This time the conference has merged ALENEX and ANALCO with the main program, so at least for today and tomorrow, we have four parallel tracks. Having just come back from NIPS, SODA feels like a small cozy cocktail party in comparison (ed: not that I know anything about cocktail parties: I have small children)

The highlight was Bob Sedgewick's tribute to Philippe Flajolet. I've talked about their work on analytic combinatorics before, but it was nice to hear some of it again. Flajolet's contributions to the field are immense and valuable. He did big data before it was cool: his $\ell_0$ estimation work with Nigel Martin is now a seminal paper in the world of streaming, and his HyperLogLog paper on counting distinct elements in a stream (with Fusy, Gandouet and Meunier) is the best-in-class on this problem. Bob quoted one of Flajolet's students as saying, "Read any of his papers. You will learn something".
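
(To give a flavor of that line of work, here is a crude sketch in the spirit of Flajolet-Martin; the hashing choices and constants are mine, and this is not HyperLogLog itself. Hash each item, remember the largest number of trailing zeros seen, and turn $2^R$, averaged over independent copies and bias-corrected, into a distinct-count estimate.)

import hashlib

def rho(h):
    # number of trailing zeros in the 64-bit hash value h
    return 64 if h == 0 else (h & -h).bit_length() - 1

def fm_estimate(stream, copies=32):
    # Flajolet-Martin-flavored distinct counting: each of `copies` salted hash
    # functions remembers the largest trailing-zero count it has observed.
    maxima = [0] * copies
    for item in stream:
        for j in range(copies):
            digest = hashlib.blake2b(f"{j}:{item}".encode(), digest_size=8).digest()
            maxima[j] = max(maxima[j], rho(int.from_bytes(digest, "big")))
    mean_r = sum(maxima) / copies
    return 2 ** mean_r / 0.77351   # 0.77351 is the classical FM bias correction

# fm_estimate(["a", "b", "a", "c"]) should come out in the vicinity of 3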

I chaired a geometry session in the morning, and one of the papers there had caught my eye a while back. The Fréchet distance problem (commonly known as the dog-walking problem) is the problem of computing a monotone matching between two curves that minimizes the maximum distance between matched points (you can think of this as a man on one curve walking a dog on another, and asking for the shortest leash that suffices when both man and dog can only walk forward, at arbitrary speeds).

There's a relatively straightforward dynamic program that can compute the Fréchet distance between two polygonal chains in time $O(nm\log (nm))$ (if $n$ and $m$ are the number of links in the two chains). But it's been a big open problem to figure out whether this can be done in subquadratic time. The paper (by Agarwal, Ben Avraham, Kaplan and Sharir) shows that for the discrete version of the problem (which Rinat Ben Avraham, in her talk, called the frog hopping problem: the frogs hop from vertex to vertex) you can get a slightly subquadratic time algorithm. The main idea involves adapting "string-like" methods for the problem, which is hard because the "alphabet" is infinitely large.
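
For reference, here is the classical quadratic dynamic program for the discrete ("frog hopping") version: the baseline the paper improves on, and emphatically not their subquadratic algorithm. I'm assuming the input curves are given as sequences of 2D coordinate pairs.

import math

def discrete_frechet(P, Q):
    # Classical O(nm) dynamic program for the discrete Fréchet distance between
    # point sequences P and Q: ca[i][j] is the distance restricted to prefixes.
    n, m = len(P), len(Q)
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(P[i], Q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[n - 1][m - 1]

# e.g. discrete_frechet([(0,0),(1,0),(2,0)], [(0,1),(1,1),(2,1)]) == 1.0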

It will be interesting to see if this spurs more progress. There's already a newer paper by Buchin et al that gets a slightly improved (but still super-quadratic) algorithm for the continuous Fréchet distance (i.e the original problem). If for no other reason, please click the link to see their awesome title.

Martin Grohe gave an award talk for his work with Kawarabayashi and Reed on an improved algorithm for graph minor decompositions. The problem is as follows: given a graph $G$ and a graph $H$ supposedly excluded by $G$, either compute a decomposition of $G$ promised by the graph minor theorem, or produce an $H$-minor.

The improvement reduces the dependency on the size of $G$ to quadratic (from cubic) and makes use of the wonderful and mysterious Courcelle's theorem. The dependence on the size of $H$ is equally mysterious (at least Martin declined to tell us), but is nowhere near as wonderful.

Please post papers you thought were interesting (and why) in the comments, and I'll add them as updates.
