Thursday, April 17, 2014

STOC 2014 announcement.

Howard Karloff writes in to remind everyone that the STOC 2014 early registration deadline is coming up soon (Apr 30 !). Please make sure to register early and often (ok maybe not the last part). There will be tutorials ! workshops ! posters ! papers ! and an off-off-Broadway production of Let It Go, a tragicomic musical about Dick Lipton's doomed effort to stop working on proving P = NP.

At least a constant fraction of the above statements are true.

And if you are still unconvinced, here's a picture of Columbia University, where the workshops and tutorials will take place:


Monday, April 07, 2014

Directed isoperimetry and Bregman divergences.

A new cell probe lower bound for Bregman near neighbor search via directed hypercontractivity. 

Amirali Abdullah and I have been pondering Bregman divergences for a while now. You can read all about them in my previous post on the topic. While they generalize Euclidean geometry in some nice ways, they are also quite nasty. The most important nastiness that they exhibit is asymmetry:
\[ D_\phi(x, y) \ne D_\phi(y, x)\]
What's worse is that this asymmetry can grow without bound. In particular, we can quantify the degree of asymmetry by the parameter $\mu$:
\[ \mu = \max_{x,y \in \Delta} \frac{D_\phi(x,y)}{D_\phi(y,x)} \]
where $\Delta$ is the domain of interest.
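
To make this concrete, here's a quick sketch (mine, purely illustrative, not from the paper) using the KL divergence, which is the Bregman divergence generated by negative entropy $\phi(x) = \sum_i x_i \log x_i$. Pushing $x$ toward a corner of the simplex makes the ratio grow without bound:

```python
import numpy as np

def kl_divergence(x, y):
    """D_phi(x, y) = sum_i x_i log(x_i / y_i): the Bregman divergence
    generated by negative entropy phi(x) = sum_i x_i log x_i."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(x * np.log(x / y)))

y = np.array([1/3, 1/3, 1/3])                  # uniform point on the simplex
for eps in (1e-1, 1e-3, 1e-6):
    x = np.array([1 - eps, eps / 2, eps / 2])  # x slides toward a corner
    d_xy, d_yx = kl_divergence(x, y), kl_divergence(y, x)
    print("eps=%g  D(x,y)=%.3f  D(y,x)=%.3f  ratio=%.1f"
          % (eps, d_xy, d_yx, d_yx / d_xy))
```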

There's been a ton of work on clustering with Bregman divergences, as well as our own work on low-dimensional approximate nearest neighbor search. In almost all these results, $\mu$ shows up in the various resources (space and/or time), and it seems hard to get rid of it without making other assumptions.

After our low-D ANN result, we started trying to work on the high-dimensional version, and our thoughts turned to locality-sensitive hashing. After struggling for a while to get something that might be useful, we started pondering lower bounds. In particular, it seemed clear that the worse $\mu$ was, the harder it became to design an algorithm. So could we actually come up with a lower bound that depended on $\mu$ ?

Our first result was via a reduction from the Hamming cube (and $\ell_1$ near neighbors). While this gave us a lower bound, it wasn't $\mu$-sensitive. So we went looking for something stronger.

And that's where isoperimetry made an entrance. There's been a long line of results that prove lower bounds for LSH-like data structures for near neighbor search. Roughly speaking, they work in the following manner:

  1. Construct a gap distribution: a distribution over inputs and queries such that a query point is very close to its nearest neighbor and very far from any other point in the space. In the $\ell_1$ case, what you want is something like $\epsilon d$ distance to the nearest neighbor, and $\Omega(d)$ distance to the second nearest neighbor. 
  2. Imagine your data structure to be a function over the Hamming cube (assigning blocks of points to cells) and look at what happens when you perturb it (formally, by defining the function value to be its average over all nearby points)
  3. Use isoperimetry (or hypercontractivity) to argue that the function "expands" a lot. What this implies is that points get scattered everywhere, and so any particular cell you probe doesn't have enough information to determine where the true nearest neighbor actually is. This last step can be proved in different ways, either using Fourier methods, or via the use of Poincaré-type inequalities.

Our initial hope was to use this framework to prove our bound, while incorporating terms relating to the asymmetry of Bregman divergences more directly.

This turned out to be hard. The asymmetry destroys many of the nice algebraic properties associated with the noise operator and related inner products. There's even a deeper problem: if you think of an asymmetric noise operator as pushing things "forward" on a cube with one probability and backwards with another, then it's not hard to imagine a scenario where applying it actually ends up concentrating the mass of the function in a smaller region (which would break hypercontractivity).

We needed two things: a way to get around this not-so-small issue that hypercontractivity isn't true in a directed setting, and a way to analyze the asymmetric noise operator. It turned out that we could circumvent the first problem: we could restrict the class of hash functions we considered in a way that changed their entropy by only a constant. Alternatively, we could view hypercontractivity as being true "on average".

The second problem was a bit harder. The solution we came up with was to try and link the directed noise operator to a symmetric operator on a biased space. The intuition here was that the directed operator was creating a nonuniform distribution, so maybe starting off with one might give us what we need.

This actually worked ! There are versions of the Bonami-Beckner inequality in biased spaces that look almost exactly like their counterparts over the uniform measure (you merely set the probability of a bit being 1 to $p \ne 1/2$, and define the Fourier coefficients in terms of $p$).
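
For the curious, the $p$-biased basis is completely standard textbook material (nothing specific to our paper): for $x \in \{0,1\}^n$ with each bit independently equal to 1 with probability $p$, one sets
\[ \phi(x_i) = \frac{x_i - p}{\sqrt{p(1-p)}}, \qquad \phi_S(x) = \prod_{i \in S} \phi(x_i), \]
and then every $f$ expands as
\[ f(x) = \sum_{S \subseteq [n]} \hat{f}(S)\, \phi_S(x), \qquad \hat{f}(S) = E_{x \sim \pi_p}[f(x)\, \phi_S(x)], \]
where the $\phi_S$ form an orthonormal basis with respect to the $p$-biased measure $\pi_p$, just as the parity functions do in the uniform case.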

There was lots more to address even after the isoperimetry result, but you'll have to read the paper for that :).

I should note that in many ways, this was a "Simons baby": Amir visited during one of the workshops for the program on real analysis, and we had fruitful conversations with Elchanan Mossel and Nathan Keller in early stages of this work.                                                                                                               


Thursday, March 20, 2014

Data Mining, machine learning and statistics.

How does one tell data mining, machine learning and statistics apart ?

If you spend enough time wandering the increasingly crowded landscape of Big Data-istan, you'll come across the warring tribes of Datamine, MachLearn and Stat, whose constant bickering will make you think fondly of the People's Front of Judea:


Cosma Shalizi has what I think is a useful delineation of the three tribes that isn't prejudicial to any of them ("Stats is just inefficient learning !", "MachLearn is just the reinvention of statistics!" "DataMine is a series of hacks!"). It goes something like this:

  • Data mining is the art of finding patterns in data.
  • Statistics is the mathematical science associated with drawing reliable inferences from noisy data
  • Machine learning is [the branch of computer science] that develops technology for automated inference (his original characterization was as a branch of engineering).
I like this characterization because it emphasizes the difference in focus: data mining is driven by applications, machine learning by algorithms, and statistics by mathematical foundations.

This is not to say that the foci don't overlap: there's a lot of algorithm design in data mining and plenty of mathematics in ML. And of course applied stats work is an art as much as a science. 

But the primary driving force is captured well. 

Wednesday, March 19, 2014

Reproducibility in data mining/learning

Is reproducibility in data mining/learning different from reproducibility in systems research ?

+John Regehr and +Shriram Krishnamurthi have discussions (John, Shriram) arising from a new paper on reproducibility in computer systems research from people at the U. of Arizona. You should read their posts/G+ discussions for the specifics.

But it got me thinking: is there anything intrinsically different about reproducibility in my neck of the woods ? Many concerns are shared:

  • The need for a standard computing environment to build and test code: MATLAB and scikit-learn make that a lot easier for us: we don't need to worry too much about details of the computing environment/hardware just to get the code to run. 
  • Discrepancies between what is stated in the paper and what the code actually does: that's a problem with any experimental work: John in fact mentions one such issue in one of his recent papers.
  • Code rot: otherwise known as 'graduation' :). I think if more people used public repos like github from the beginning, some of these problems might go away. I would be remiss if I didn't also plug +Robert Ricci's new system AptLab for helping distribute research.
But here's one that may not be shared: data preparation.

It's no secret that data preparation takes most of the time in a data mining pipeline. This includes:
  • data cleaning (removing errors, reformatting data)
  • feature engineering (deciding how to truncate, modify or retain certain features)
  • training (cross-validation levels, what data is used for model building and so on)
As with any of the above, a proper specification can ensure true reproducibility, but there are lots of things we might do to data without necessarily even realizing it (if there are bugs in the transformation pipeline) or without thinking we need to mention it (truncating ranges, defining null values away). 
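
One partial mitigation (a sketch of my own, with made-up choices, using scikit-learn since it came up above) is to make every preparation decision live in code that ships with the paper, so that imputation, rescaling and the like are at least recorded even if they aren't justified:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# All cleaning/feature decisions are explicit pipeline steps, so they travel
# with the code instead of living in hand-edited CSVs. (The specific choices
# here are illustrative, not recommendations.)
prep_and_model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # how nulls get defined away
    ("scale", StandardScaler()),                   # how features get rescaled
    ("clf", LogisticRegression(max_iter=1000)),
])

# toy data with some missing entries, standing in for a real dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan

scores = cross_val_score(prep_and_model, X, y, cv=5)
print("accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```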

Feature engineering can also be a problem, especially with models that use many features (for example, deep learning systems have lots of "knobs" to tune, and they can be quite sensitive to the knobs).

So one thing I often look for when reviewing such papers is sensitivity: how well can the authors demonstrate robustness with respect to the parameter/algorithm choices ? If they can, then I feel much more confident that the result is real and is not just an artifact of a random collection of knob settings combined with twirling around and around holding one's nose and scratching one's ear.
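
Concretely, the kind of check I have in mind is just a sweep over the knobs with the spread reported alongside the best cell. A toy sketch (hypothetical knobs, synthetic data):

```python
import itertools
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def sensitivity_report(X, y, Cs=(0.1, 1.0, 10.0), gammas=(0.01, 0.1, 1.0)):
    """Cross-validated accuracy over a grid of knob settings; report the
    spread, not just the best setting."""
    accs = []
    for C, gamma in itertools.product(Cs, gammas):
        accs.append(cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean())
    accs = np.array(accs)
    print("best %.3f   worst %.3f   spread %.3f"
          % (accs.max(), accs.min(), accs.max() - accs.min()))

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
sensitivity_report(X, y)
```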




Tuesday, March 11, 2014

A short not-review of the first episode of Cosmos

If you've been living on a rogue planet wandering through the far reaches of the Local Group, you may not have heard of the Cosmos reboot on FOX with Neil deGrasse Tyson.

We finally got around to seeing the first episode yesterday. I was nervous because I loved the original so much and hoped that my kids would like it as well (as all parents know, you can't show fear or the sharks will smell it :)).

Things I liked:

  • The homage to Carl Sagan was done beautifully. If you're going to reboot a franchise, you don't have to pretend that nothing existed before, you know. 
  • The mix of CG + real person and animation was a good way to tell the different stories. I was a little worried about the denouement of the Giordano Bruno tale because my six-year old was watching, but it was done without being too graphic. 
  • The graphics were non-cheesy: but then I wasn't ever worried about the quality of graphics in the original. I'm pretty sure that 20 years from now these graphics will look cheesy as well. The year-long calendar of the universe was a fantastic demonstration of the immensity of time. 
Things I could live without:

Granted, it's the first episode, so there's no point in being too harsh. But 
  • If I had any heartstrings ever, they've been pulled out of shape into long strands of spaghetti. The constant sweeping orchestral background made me feel like I was watching NBC's coverage of the Winter Olympics. I almost expected to see a personal profile of how the universe overcame the adversity of the Big Bang to become the star performer it is today. 
  • As with all modern American shows purporting to be about science, I found the balance between sweeping rhetoric and actual facts to be disturbingly skewed towards the soaring fluffy. Watch some David Attenborough, people ! or even, I don't know, COSMOS itself. But see below...
Overall:
I liked it. No question that I'm watching it again. And the eight-year old loved it ! Thought it was full of information. So maybe my assessment of graphics-to-information ratio is completely off. 

The funniest thing: when it was over, this happened:
Kids: "Let's see the next episode now!"
Me: We'll have to wait a week for it.
Kids: "What do you mean, we have to wait a week ?"
Netflix should be proud. Binge watching on demand is now the default mode. 

Tuesday, March 04, 2014

Two big data talks at Haverford College

I'm giving two talks at Haverford College just before SDM 2014 in Philadelphia. If you're in the area, stop by and say hello !


Thursday, February 06, 2014

On "reverse engineering" algorithms, contextual readings, and the invisible choices that add value to research.

tl;dr: Understanding the value in a piece of work (a paper, an algorithm, a system) is not just a matter of understanding the various parts and how they fit. It is also a matter of understanding why the problem is decomposed that way. This observation has implications for how we read, write and evaluate research. 

Nick Seaver is a cultural anthropologist at UC Irvine.  He just wrote an article at Medium on reverse engineering algorithms. In it he distinguishes the factual reconstruction of an algorithm ("how does it work") from a more probing examination of the way it was designed ("why is it broken down in a particular way").

He comes to a conclusion that is quite sound and not at all controversial:
I want to suggest here that, while reverse engineering might be a useful strategy for figuring out how an existing technology works, it is less useful for telling us how it came to work that way. Because reverse engineering starts from a finished technical object, it misses the accidents that happened along the way — the abandoned paths, the unusual stories behind features that made it to release, moments of interpretation, arbitrary choice, and failure. Decisions that seemed rather uncertain and subjective as they were being made come to appear necessary in retrospect. Engineering looks a lot different in reverse.
But along the way, he makes an insightful observation about the very idea of structuralism as applied to algorithm design: namely, the idea that by breaking down the parts of the algorithm and understanding the pieces and how they fit together we can "understand" the algorithm at some higher level.
When you break an object down into its parts and put it back together again, you have not simply copied it — you’ve made something new. A movie’s set of microtags, no matter how fine-grained, is not the same thing as the movie. It is, as Barthes writes, a “directed, interested simulacrum” of the movie, a re-creation made with particular goals in mind. If you had different goals — different ideas about what the significant parts of movies were, different imagined use-cases — you might decompose differently. There is more than one way to tear apart content. (emphasis added)
In other words, the value of a reductionist analysis of a piece of work is not just in understanding the parts and how they fit, but in understanding the choices that led to that particular decomposition.

I think there are important lessons here for anyone involved in the creation and evaluation of new work. In particular:

Reading: While this is mostly advice for students, it applies whenever you're reading unfamiliar material. The particular form of the paper -- how the proofs are structured, how the system is built, or how the algorithm components work -- is a choice made by the authors and should not be left unexamined or unquestioned. All too often a student will read a paper as a factual news report, rather than reading it as a specific interpretation of a problem/algorithm/theorem that could lend itself to multiple interpretations (just a few days ago I was discussing some JL results with a colleague and realized that we had totally distinct and valid interpretations of some recent work in the area).

Reviewing: It's very easy (and common) to read a paper, understand the proofs, and then be completely underwhelmed by it: the dreaded "too simple" feeling that leads papers to get rejected from unnamed theory conferences. This is especially true when you're not familiar with an area, and don't realize that it was the choices the author(s) made that made the paper look simple. And so your assessment has to factor that choice in, rather than taking it for granted.

Writing/Presenting: Of course all of this impacts how you as a creator choose to present your work. There is not one way to tell a story (in writing or in a talk). But you do make choices about how to tell it. And so it's important to make those choices visible (either by explaining why they are needed, or why different choices don't work, or how they get around certain roadblocks) so that your work can receive a fair evaluation.

This can be excruciating, especially when (like many good papers) your work admits multiple interpretations, and you have to gamble on the right one to please the fickle and time-stretched reviewer. But be mindful of these choices in your work.

p.s On a separate note, it's intriguing to me how so many tools from the study of literature (narrative, contextual analysis, critical reading) show up in other forms of writing far removed from the humanities. This is yet another reason why I'm saddened (even though I'm on the "right" team) by the relentless focus on STEM in opposition to the liberal arts. It's a particularly virulent form of divide-and-conquer (the British colonial technique, not the algorithmic paradigm).

p.p.s Has it really been three months since I wrote anything ? I must be working !!

Thursday, November 14, 2013

The many stages of writing a paper, and how to close the deal.

[tl;dr] Producing a piece of research for publication has many stages, and each stage has different needs, requiring different ways of operating. Learning these stages is a key developmental step for a graduate student. In what follows, I describe this process.

Caveat: this is obviously customized to more theoretical-style work, although I'll talk about how experimental evaluation fits in, in the context of my work. So YMMV.

From my conversations with students (mine and others), I think this accurately describes how students think a paper gets written:

  1. Advisor produces problem miraculously from thin air. 
  2. Come up with solution.
  3. Write down solution
  4. Advisor makes annoying and mystifying edit requests on irrelevant introductory stuff, while throwing out long complicated proofs (or experiments) student has spent many hours sweating over.
  5. Make final edits and submit paper.
Most students figure out how to do 2), and eventually learn how to do 3) (which is itself a topic for another post). 5) is probably the first thing students learn how to do: fix typos, edit LaTeX, and generally do yak-shaving.

But step 4) is perhaps the most mysterious part of the writing process for a new researcher, and the least structured. I call it "closing the deal" and it's really about going from a bag of results to an actual submittable paper. 

Let me elaborate.

1) Coming up with a problem.
    Of course coming up with a problem is the essence of the research process ("it's about the questions, not the answers", he shrieks). This takes experience and vision, and can often be changed by things you do in stage 4. I'll say no more about it here.

2) Solving a problem.
    This is the stage that everyone knows about. That's what we do, after all - solve problems ! This is where we drink lots of coffee, live Eye of the Tiger montages, get inspiration in our sleep, and so on. 

   It often happens that you don't exactly solve the problem you set out to attack, but you make many dents in it, solving special cases and variants. It's important to be flexible here, instead of banging your head against a wall head-on. At any rate, you either exit this stage of the project completely stuck, with a complete solution, or with a collection of results, ideas and conjectures surrounding the problem (the most typical case)

3) Writing it all down.
   Again, I could spend hours talking about this, and many people better than I have. It's a skill to learn in and of itself, and depends tremendously on the community you're in.

4) Closing, or getting to a submission.
    But this is the part that's often the most critical, and the least understood. I call it "closing the deal": getting from 80% to 100% of a submission, and it requires a different kind of skill. The overarching message is this:
A paper tells a story, and you have to shape your results - their ordering, presentation, and even what you keep and what you leave out - in order to tell a consistent and clear story. 
(before people start howling, I'm not talking about leaving out results that contradict the story; that would be dishonest. I'm talking about selecting which story to tell from among the many that might present themselves)

So you have a bag of results centering around a problem you're trying to solve. If the story that emerges is: "here's a problem that's been open for 20 years and we solved it", then your story is relatively easy to tell. All you have to do is explain how, and using what tools. 

But in general, life isn't that easy. Your results probably give some insights into the hard core of the problem: what parts are trivial, what directions might be blocked off, and so on. 

Now you need to find/discover the story of your paper. You can't do this too early in the research process: you need to explore the landscape of the problem and prove some results first. But you shouldn't wait too long either: this stage can take time, especially if the story changes.

And the story will change. One way of thinking about what you need for a conference submission is a relatively tight, compelling and interesting story. While the loose ends and unexplored directions are probably the things most interesting to you and your research, they are best left to a conclusions section rather than the main body. What the body should contain is a well-thought-out march through what you have discovered and what it says about the problem you're solving. In doing so, you will find yourself making decisions about what to keep, what to leave out, and how to order what you keep.

And so, speculations need to be made into concrete claims or triaged. Experiments need to be run till they tell a definite story. Introductions need to be made coherent with the rest of the paper. There's also an element of bolt-tightening: making the bounds as tight as possible, stating claims as generally as makes sense for the overarching story (if your story is about points in the plane, then stating some claims in a general metric space might not always make sense). 

And all of this has to be done to serve the overarching story that will make the most compelling paper possible. The story can change as new results come in, or expand, or sometimes even die (but this latter is rare). But there is this constant drumbeat of "am I getting closer to a submission with a nice story with each step ?"

Telling a good story is important. For someone to appreciate your paper, cite it, or even talk about it (whether it's accepted, or on the arxiv) they have to be willing to read it and retain its results. And they'll be able to do that if it tells a clear story, which is not just a union of results. 

Tuesday, November 05, 2013

SODA 2014 travel awards and other news

Via Cliff Stein and Kirsten Wilden:

The SIAM Symposium on Discrete Algorithms (SODA14) will be held January 5-7, 2014 at the Hilton Portland & Executive Tower in Portland, Oregon, USA. Algorithm Engineering and Experiments (ALENEX14) and Analytic Algorithmics and Combinatorics (ANALCO14) will be co-located with the conference on January 5 and 6, respectively.
 The deadline for students to apply for travel support is approaching!   The NSF, Google, IBM Research, and Microsoft Corporation have provided generous travel support for students interested in attending SODA14 or its associated meetings, ALENEX14 and ANALCO14. Application materials and a letter of recommendation from your advisor are due by November 15. Please visit http://www.siam.org/meetings/da14/tsupport.php for detailed funding information and application procedures.

Pre-registration for all three meetings is also available at http://www.siam.org/meetings/da14/reginfo.php.

Additional conference information is available at the following websites:
SODA14: http://www.siam.org/meetings/da14/
ANALCO14: http://www.siam.org/meetings/analco14/
ALENEX14: http://www.siam.org/meetings/alenex14/

Sunday, October 27, 2013

FOCS Reception sing-along

Tomorrow night at 7:30 (before the FOCS business meeting at 9), the Positive Eigenvalues will be playing a live set for the entertainment of FOCSers.

Who are the Positive Eigenvalues, you might ask ? Well, you'll have to come and hear us to find out :). Two of the songs we're playing are compositions by our resident philosopher/graphic novelist/keyboardist/composer/complexity theorist, Christos Papadimitriou.

The first is a riff off the Dylan song "Man gave names to all the animals", with the same tune but different lyrics. If you're so inspired, you're encouraged to sing along to the chorus (which repeats after each verse). It goes like this:
Chorus: 
Theorists gave names to all the classes
in the beginning, in the beginning
Theorists gave names to all the classes
in the beginning long time ago 

Verse 1:
They saw a woman slim and tall
forging steadily to her goal
she went by fast but that's no crime
uh I think I'll call her PTIME

Verse 2:
They saw a kid who soldiered on
he got no rest from dusk to dawn
he kept sweating in the same place
uh I think I'll call him PSPACE

Verse 3:
They saw a blind man comb his shack
to find his walkstick in the black
but once he found it he could see
uh I think I'll call him NP

Verse 4:
They saw a boy who walked and walked all night
never veered to the left or right
was kind of slow but not that bad
uh I think I'll call him PPAD

Verse 5:
There was a beast and on its back
it carried Heisenberg and Dirac
it was as weird as a beast can be
uh...
For more on the tailed-off line at the end, see the Wikipedia entry.

The second song is an original composition by Christos in rock ballad/power rock mode. The lyrics speak poignantly about the life of a theoretician:
In Theory
Verse
Think of a carpenter of sorts
he has the flu he has the blues
He burns his oil every night
all night frustrated and confused

Bridge
And worst of all he can't complain
In theory his job is plain  (twice)

Chorus
Build something sturdy and beautiful and useful
He's always short in one goal
He juggles true with elegant and useful
and he keeps dropping one ball

Verse 2
His buddies like him alright
except they love to criticize
They go and take his stuff apart
with clever words with clever eyes

Verse 3
Some nights a girl comes to his shop
to sip his tea learn something raw
his foggy mind his tangled soul
want her to stay but let her go

Thursday, October 17, 2013

Searching randomly for needles.

Suppose you're given a large set of objects $X$, and you know that some subset $I$ is "interesting". A particular example that hits close to (my regular) home is in bug-testing: $X$ is the set of possible inputs to a program, and $I$ is the set of inputs that generate bugs (this is the downside of talking to +John Regehr too much). We'll assume that if you're given a candidate object, you can check easily if it's interesting or not (for example, by running the program)

You'd like to find the interesting items, so you consult an expert (in our running example, maybe it's a method to generate inputs that test certain kinds of bugs). The expert produces items that it thinks are interesting. But experts have biases: maybe your expert only cares about certain kinds of interesting items.

So you ask multiple experts, in the hope that their biases are different, and that together they can cover the space of objects. But you don't know anything about their biases except what you can learn from repeatedly asking them for candidate objects.

What's your strategy for asking questions so that at any stage (say after asking $N$ questions), you've found the optimal number of interesting items ?

This was the topic of a talk by +Sebastien Bubeck at the AMPLab in Berkeley on Tuesday, based on this paper.  A key idea in the algorithm design is to make use of estimators of "things I haven't seen yet", or the famous Good-Turing estimator (which your humble blogger wrote about many æons ago). Here's how it works.

Formally, let us assume that each "expert" $i$ is a distribution $P_i$ over $X$. At each step, the algorithm will determine some $i$, and ask it for a random draw from $P_i$. Suppose we knew the fraction of items that $i$ had not seen yet. Then a simple greedy strategy would be to pick the $i$ that had the largest value of this "not-yet-seen" quantity. That's all well and good, as long as we know the fraction of items not yet seen.

Here's where the G-T estimator comes in. What it says is that if we're given samples from a distribution, and count the number of items in the sample that occurred exactly once, then this "frequency of frequency", divided by the number of samples, is a good estimate for the mass of items not yet seen. Moreover, it can be shown that this estimator has good concentration properties around the true answer.

So that's what the algorithm does. It maintains estimates (for each expert) of the mass not yet seen, and in each round picks the expert that maximizes this term, corrected by an adjustment coming from the tail of the concentration bound.
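
Here's a toy sketch of that strategy (my own stripped-down version: the correction term below is a generic confidence bonus, not the exact adjustment from the paper, and for simplicity every item counts as "interesting"):

```python
import math
import random
from collections import Counter

def discover(experts, rounds, c=1.0):
    """Greedy discovery: repeatedly query the expert whose Good-Turing
    estimate of unseen mass (plus a confidence bonus) is largest.

    experts: list of zero-argument callables; calling experts[i]() returns
             one random draw from that expert's distribution P_i.
    """
    counts = [Counter() for _ in experts]    # per-expert item counts
    draws = [0] * len(experts)
    found = set()

    for t in range(1, rounds + 1):
        def score(i):
            if draws[i] == 0:
                return float("inf")          # try every expert at least once
            ones = sum(1 for v in counts[i].values() if v == 1)
            gt = ones / draws[i]             # Good-Turing: mass not yet seen
            return gt + c * math.sqrt(math.log(t) / draws[i])

        i = max(range(len(experts)), key=score)
        item = experts[i]()
        counts[i][item] += 1
        draws[i] += 1
        found.add(item)
    return found

# toy usage: one expert with a big support, one that saturates quickly
random.seed(0)
wide = lambda: random.randrange(0, 500)
narrow = lambda: random.randrange(500, 510)
print(len(discover([wide, narrow], rounds=300)))
```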

The algorithm is really simple, and elegant. The question is, how well does it do ? And now things get a little harder. The above ideal greedy strategy is optimal as long as the supports of the experts are disjoint. Under the same assumption (and some other technical ones), it can be shown that the expected difference between the number of items found by the algorithm and the number found by the ideal greedy algorithm is $O(\sqrt{Kt \log t})$ where $K$ is the number of experts and $t$ is the current time step.

It's not clear how to extend these bounds to the case when supports can overlap (or rather, it's not clear to me :)), but the authors show that the algorithm works quite well in practice even when this assumption is broken, which is encouraging.

Wednesday, October 16, 2013

Progressively complex representations for graphs

After NIPS last year, I had posted a note on the evolution of data models and how we think about the "shape" of data. In brief, we have increasingly complex ways of thinking about the (geometric) shape of data that carefully balance the need for expressivity and the ability to determine the parameters of the representation efficiently.

There's a basic fact of data analysis that you come up against once you start playing with data. Either you endow the space with a geometry, or you endow it with a graph structure (and of course, nowadays you might throw in a spherical cow or two). And there are only a few ways to mix the representations (spectral methods being one such approach).

But as far as I know, there are no standard ways to create a layered representation of graphs to model increasing complexity and expressivity in a similarly balanced manner. There are of course an infinite number of ways to parametrize graphs by quantities that capture increasing complexity (treewidth, connectivity, expansion, induced metric structure, and so on). But I don't know of any established and standard ways to model families of graphs with increasing complexity that capture data patterns of interest.

One of the things that I've been tasked with (along with Peter Richtarik and Han Liu) is to draft out a report on the activities at the big data program at Simons this semester. We're looking also at challenges in the theory of big data, and in my mind coming up with good models for graph structures that capture the intricacies of real data is one of them.

Wednesday, October 02, 2013

Call for tutorials at SIAM Data Mining

I'm the tutorials chair for the 2014 SIAM conference on data mining (SDM) (for readers not aware of the data mining landscape, SDM is a SODA-style data mining conference run by SIAM - i.e. it's a CS-style venue with peer-reviewed papers, rather than a math-style conference like SIAM Discrete Math). SDM is one of the major venues for data mining research, especially the more statistically focused kind. It will be held in Philadelphia between Apr 24-26, 2014.

SDM runs 4-5 tutorials each year on various aspects of data mining (ranging from theory/algorithms to specific application areas). I'm personally very interested in encouraging submissions from people in the theory community working with data who might want to share their expertise about new methods/techniques from theoryCS land with the larger data analysis community.

The deadline is Oct 13, and all you need is a few pages describing the tutorial content, target audience, and some sample material (more details here). If you have an idea and are not sure whether it will fit, feel free to email me directly as well.

Monday, September 16, 2013

SIGACT CG Column: Moving heaven and earth

My new column for SIGACT News is up (non pay-walled version here). In it, I discuss two different ways of defining distances between distributions.

The earthmover distance is a familiar measure to TCS folk, and the kernel distance (or the maximum mean discrepancy, or the distance covariance) is a familiar character in machine learning and statistics.

It turns out that while these are not as different statistically as one might think, they are quite different when it comes to computing them and understanding their geometry.

The column describes both measures and their properties, and does a compare-contrast between them. More generally, I try to describe how one should think about the process of constructing a distance function between two distributions.

Monday, September 09, 2013

Life at Simons

There have been many posts on the technical talks happening at Simons (see +Moritz Hardt's latest on gradient descent if you haven't yet). Not as many yet on the less-formed discussions that are happening in and around the building, but maybe it's too soon for that.

In the meantime, I thought it might be fun to describe "a day at Calvin Hall".

I usually bike in after 9 am. Biking in Berkeley is not for the (literally) weak-kneed, and I'm discovering how weak my knees actually are. I haven't yet found a good site to plot elevation gains and losses for an arbitrary route, but this one is pretty decent.

The Institute lends out bikes if you don't have your own. It's a great way to get around if you're staying near campus, because parking is difficult and expensive.

Of course, the next step is:


It's a great machine. Only downside is that it shuts down at 5pm when the building does. But there are plenty of cafes around for later.

Then of course the decision: which talks should I feel guilty about missing today ? There are A LOT OF TALKS going on at the various workshops/boot camps. I feel embarrassed complaining about this, because we have such a stellar crowd of speakers and such fascinating talks. But there are almost too many talks: I have to shut down the feelings of guilt in order to find big chunks of time to think (or write this blog post <ahem>).

Thankfully, there's the live stream. I've already mentioned the Simons livestream on G+, but it's worth mentioning again. If you're late to the start of any talk you can watch it on the high quality live feed. If you can't watch it live you can watch it as soon as it's done on the Ustream site for the event. And a few days after that the talks are archived on the Simons site directly. There's also this cute feature where the live stream plays the most recent talk during coffee breaks. I know people outside the Institute are watching talks live because someone pinged me on twitter to ask the speaker to repeat questions they get from the audience.

So I've either attended talks or I've spent the morning brainstorming on some of the projects I have going. It's now time for lunch: thankfully, we're walking distance from both Northside and Southside Berkeley, which means a huge variety of eating places all within 5-10 minutes walking. All we need to do is avoid the huge class let-out at 12:20 or so. On Tuesdays we have a lunch seminar (organized by +Moritz Hardt) as part of the big-data program.

Once we're back, it's time for this:

Did I mention that the coffee machine is awesome ?

Usually, discussions start picking up in the afternoon, and you'll often find me here:

3:30 is time for the daily tea, at which we get all kinds of nice cookies, and the occasional cake/ice cream (on Thursdays). This is much like the Dagstuhl tea time, but without the fancy tortes (hint hint!). Of course, no daily tea is complete without the addition of this:

By five, people are starting to trickle out slowly. On Tuesdays we have a happy hour at a local bar ($3 pints!). And it's time for me to check the sad state of my knees for the ride back, which is hard because it's uphill both ways !

Friday, September 06, 2013

More camping with high dimensional boots...

I discovered today that you don't have to wait for talk videos to be posted on the Simons website. All videos are live streamed via ustream, and they have their own channel for the boot camp talks, where you can watch the videos immediately after they're streamed.

Martin Wainwright gave a 3 hour presentation on high dimensional statistics. Michael Jordan's talk earlier was good preparation for this, just to get familiar with basic concepts like the risk of an estimator and minimax theory.

Wainwright's talks were much denser, and it would be hard to do an intelligent summary of the entire presentation. The central theme of his talk was this:

In classical statistics, we can evaluate the quality of an estimator as $n \rightarrow \infty$ using standard asymptotic methods, and for the most part they are well understood (convergence, rates of convergence, sampling distributions, and so on). But in all these results, it's assumed that the data dimensionality stays fixed. Suppose it doesn't though ?

In particular, suppose you have a situation where $d/n \rightarrow \alpha > 0$ (this notion was first introduced by Kolmogorov). For example, suppose $\alpha = 0.5$. What happens to the behavior of your estimation methods ?  He worked out a few simple examples with experiments to show that in such cases, classical asymptotic bounds fail to capture what's really going on. For example, suppose you wanted to estimate the mean of a distribution and you used the sample mean, relying on the central limit theorem. Then it turns out that in the $d/n \rightarrow \alpha$ regime, the convergence is slowed down by the parameter $\alpha$.

Another example of this problem is estimating a covariance matrix. Assume you have a sample of iid Gaussian random vectors drawn from $N(0, I_{d \times d})$, and you want to use the sample covariance to estimate the population covariance (in this case, the identity matrix). You can look at the distribution of eigenvalues of the resulting matrix (which you expect to be sharply concentrated around 1) and in fact you get a much more spread out distribution (this is known as the Marchenko-Pastur distribution). You can show that the maximum singular value of the normalized data matrix is no bigger than $1 + \sqrt{d/n}$ with high probability. But this error term $\sqrt{d/n}$ does not go away.
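
A quick simulation (mine, not from the talk) makes the point: with $d/n = 0.5$, the eigenvalues of the sample covariance spread out over roughly $[(1-\sqrt{d/n})^2, (1+\sqrt{d/n})^2]$ instead of clustering at 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 1000                       # the d/n -> alpha = 0.5 regime
X = rng.standard_normal((n, d))         # rows are iid draws from N(0, I_d)
sample_cov = X.T @ X / n                # population covariance is the identity

eigs = np.linalg.eigvalsh(sample_cov)
edge = np.sqrt(d / n)
print("sample eigenvalues lie in [%.3f, %.3f]" % (eigs.min(), eigs.max()))
print("Marchenko-Pastur edges:   [%.3f, %.3f]" % ((1 - edge) ** 2, (1 + edge) ** 2))
```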

If the data is indeed high dimensional, is there low-dimensional/sparse structure one can exploit to do inference more efficiently ? This gets us into the realm of sparse approximations and compressed sensing, and he spends some time explaining why sparse recovery via the LASSO is actually possible, and describes a condition called the "restricted null space property" that characterizes when exact recovery can be done (this property is implied by the RIP, but is strictly weaker).

In the second part of the talk he talked more generally about so-called regularized M-estimators, and how one might prove minimax bounds for parameter estimation. Again, while the specifics are quite technical, he brought up one point in passing that I think is worth highlighting.

When doing parameter estimation, the "curvature" of the space plays an important role, and has done so since the Cramer-Rao bound, the idea of Fisher information and Efron's differential-geometric perspective. The idea is that if the optimal parameter lies in a highly curved region of loss-parameter space, then estimation is easy, because any deviation from the true parameter incurs a huge loss. Conversely, a region of space where the loss function doesn't change a lot is difficult for parameter estimation, because the parameter can change significantly without changing the loss much.

Once again, this is in sharp contrast to how we view the landscape of approximations for a hard problem. If we have a cost function that varies gently over the space, then this actually makes approximating the function a little easier. But a cost function that has sharp spikes is a little trickier to approximate, because a small movement away from the optimal solution changes the cost dramatically.

There are some problems with this intuition. After all, a sharp potential well is good for gradient descent methods. But the difference here is that in estimation, you only care about the loss function as a tool to get at the true goal: the desired parameter. However, in much of algorithm design, the loss function IS the thing you're optimizing, and you don't necessarily care about the object that achieves that optimal loss. This goes back to my earlier point about the data versus the problem. If optimizing the loss is the end goal, then a certain set of tools come into play. But if the goal is to find the "true" answer, and the loss function is merely a means to this end, then our focus on problem definition isn't necessarily helpful.




Wednesday, September 04, 2013

Statistics, geometry and computer science.

Geometry is the glue between statistics and computer science. -- Michael Jordan
That's why everyone gets stuck in it. 
-- Graham Cormode

Yesterday was the first day of the "Big Data boot camp" at the Simons Institute. The idea behind these boot camps is to get participants in a program "on the same page". In a program like ours, with statisticians, ML folks, algorithms people and optimizers all mingling together, getting "on the same page" with "the same language" is critical. For a more general overview of the first day activities, see Muthu's post.

Michael Jordan opened the proceedings with a talk titled "Big Data: The computation/statistics interface". From a CS perspective, it was an excellent talk that hit just the right level of detail for me to grasp some basic terminology in statistics, as well as understand some of the questions people are pondering right now.

We think of big data as a PROBLEM. More data, more scaling issues, O(n) becomes infeasible, and so on. In particular, we think of large data as a workload that taxes our resources: time, space, communication, and so on.

Michael presented a counterpoint. From a statistical perspective, more data is better. Estimators converge, error reduces, and the law of large numbers kicks in. Rather than treating data as something that we have to manage efficiently, can we think of data (quality) as a resource itself that can be managed efficiently ? In other words, can we trade off estimator accuracy against other resources ?

He proceeded to give a wonderful exposition of the basics of statistical decision theory, including the frequentist and Bayesian views of estimator risk. Along the way, he described a geometric interpretation of the James-Stein estimator which is too neat to skip.

The James-Stein estimator is really a very counterintuitive object. Consider a set of samples $X_i \sim N(\theta_i, 1), i \le n$, where the $\theta_i$ are unknown parameters. The goal is to estimate the $\theta_i$ from the $X_i$, with the quality of the estimate measured by the loss function $\sum_i (\theta_i - \hat{\theta_i})^2$. The $\theta_i$ are assumed unrelated to each other.

The "obvious" estimator is $\hat{\theta_i} = X_i$. It's the MLE after all. But it turns out that this estimator is strictly dominated (in terms of expected loss aka risk) by the so-called shrinkage estimator:
\[ \hat{\theta_i} = (1 - \frac{c}{S^2}) X_i \]
where $S^2 = \sum_j X_j^2$ and $c$ is a suitable constant (classically $c = n - 2$).

What's surprising about this is that the estimator for a single $\theta_i$ takes into account samples from other $X_i$, even though the $\theta_i$ are assumed to be unrelated.
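
If you don't believe it, a quick Monte Carlo (my own sketch, using the classical choice $c = n-2$) shows the shrinkage estimator beating the MLE on total squared error, even with completely unrelated $\theta_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 20, 5000
theta = rng.uniform(-2, 2, size=n)      # arbitrary, unrelated true parameters

mle_loss, js_loss = 0.0, 0.0
for _ in range(trials):
    X = theta + rng.standard_normal(n)          # X_i ~ N(theta_i, 1)
    S2 = np.sum(X ** 2)
    js = (1 - (n - 2) / S2) * X                 # James-Stein shrinkage
    mle_loss += np.sum((X - theta) ** 2)
    js_loss += np.sum((js - theta) ** 2)

print("avg loss  MLE: %.2f   James-Stein: %.2f"
      % (mle_loss / trials, js_loss / trials))
```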

It turns out that there's an elegant geometric interpretation of this estimator, first attributed to Galton by Stephen Stigler. Consider the pairs $(X_i, \theta_i)$ as points in the plane. We don't quite know where these points lie because we don't know what $\theta_i$ is. Because the $X_i$ are normally distributed around the $\theta_i$, each point is really a horizontal interval of interest.

Now the standard estimator $\hat{\theta_i} = X_i$ arises from trying to solve a regression problem of $X$ versus $\theta$, or more precisely solving the regression $\hat{X} = E(X\mid \theta)$. But really, what we're trying to do is solve the regression $\hat{\theta} = E(\theta \mid X)$. In other words, regression of $y$ against $x$ is different from the regression of $x$ against $y$, as long as we have more than three points. And this is precisely what the JS estimator says.

Returning to the question he started the talk with, can we show a formal tradeoff between data estimation accuracy and sampling cost ? In a recent paper with Venkat Chandrasekaran from Caltech, they show a very nice (and counterintuitive) result: that using a cruder relaxation of an optimization problem can actually lead to more efficient estimation as long as you have sufficient data. Note that this goes against the standard idea that a crude relaxation is "worse" in an approximation sense.

The idea is as follows. Consider the very simple problem of denoising where the observations $\mathbf{y}$ are generated from the input $\mathbf{x}$ by a noise perturbation:
\[ \mathbf{y} = \mathbf{x} + \sigma \mathbf{z} \]
where $\mathbf{z}$ is normally distributed. Let us assume that $\mathbf{x}$ is drawn from some set $S$ (for example, $S$ is the set of low-rank matrices, or the set of permutation matrices).

The simplest estimator for $\mathbf{x}$ is an $\ell_2$ projection: compute the sample mean $\overline{\mathbf{y}}$ and then find its projection onto $S$. But this involves a minimization over $S$, which might be intractable.

We can relax $S$ to some $S'$, where $S \subset S'$ and minimization becomes easier. But this would worsen the approximation ratio of the optimization depending on the relaxation. And here's where the insight comes in.

Suppose you instead look at the statistical risk of this estimator. In other words, look at the expected difference between the true $\mathbf{x}^*$ and the estimated $\hat{\mathbf{x}}$. The main result they show in this paper (paraphrased) is
 expected risk is upper bounded by (C/n) * geometric complexity of $S, \mathbf{x}^*$
where $n$ is the number of samples.

Suppose we fix the desired risk. Then an increase in $n$ can be used to "pay for" increased "geometric complexity". And here's where the final idea comes in. The "geometric complexity" used here is the Gaussian-squared complexity, which is defined as
\[ g(D) = E[ \sup_{x \in D} \langle x, z\rangle^2 ] \]
where $z$ is normally distributed.

In particular, the set $D$ used in the above expression is the set of directions from $\mathbf{x}^*$ to points in $S$. Suppose $S$ is very "pointed" at $\mathbf{x}^*$. Then the Gaussian-squared complexity is small and the number of samples needed is small. But if instead we use a relaxation $S'$ that is "blunt", the Gaussian complexity goes up, and having more samples keeps the risk fixed. If the optimization for $S'$ is significantly easier than the corresponding problem for $S$, then we still win, even though $n$ has increased, and even though the classical "approximation ratio" might have become worse.
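
Here's a toy version of that tradeoff (entirely mine, and far simpler than the settings in the paper; both projections below happen to be computationally easy, so it only illustrates the statistical side): projecting onto a small, "pointed" set needs far fewer samples for a given risk than projecting onto a crude relaxation, and extra samples make up the difference.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, sigma = 200, 5, 1.0
x_star = np.zeros(d)
x_star[:k] = 3.0                          # a k-sparse ground truth

def proj_sparse(v):
    """Exact l2 projection onto k-sparse vectors: keep the top-k entries."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def risk(n, project, trials=500):
    """Empirical E||project(sample mean of n observations) - x*||^2."""
    total = 0.0
    for _ in range(trials):
        ybar = x_star + sigma * rng.standard_normal(d) / np.sqrt(n)
        total += np.sum((project(ybar) - x_star) ** 2)
    return total / trials

no_constraint = lambda v: v               # the crudest relaxation: S' = R^d

for n in (10, 50, 250):
    print("n=%4d   sparse projection: %.3f   no constraint: %.3f"
          % (n, risk(n, proj_sparse), risk(n, no_constraint)))
```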

In the final part of his talk, Michael talked about his recent work on "bags of little bootstraps", which Muthu also covered in his post. This post is already too long, so I'll defer this to another time.

Tuesday, August 27, 2013

And now for something different...

People who know me through my blog probably assume I have no life at all outside blogging and big data.

oh wait...

But more seriously, my family is here with me during this sa-battle-cal and +Karen Ho is ably keeping up the non-work blogginess. If you're interested in our adventures outside big theories of data, check out her blog Simply Batty: you might even learn what "academic headcheese" is.

On "a theory of big data"

+Moritz Hardt kicked off our Simons big data program with an excellent rumination on the role of theory in "big data". Some followup thoughts:

Moritz makes the point that theory, for better or for worse (mostly for better), made the choice to give problems primacy over data. I harp on this a lot in my algorithms class, and also talked about this in the context of computational thinking: the idea of 'naming a problem' is a key intellectual contribution of theoryCS.

But to look at what might happen if we allowed more flexibility in problem definition, we don't have to look too far. Machine learning (a discipline that faces data head on) is characterized by the study of problems that are well defined in the broad sense, but have a lot of wiggle room in the specific (and I'm now hoping +Sebastien Bubeck will respond and critique what follows)

Consider classification. In a broad sense, it's very well defined: given a collection of points labeled with (1,-1), find a decision rule that can be used to separate the +s from the -s. But in the specifics: what's the decision rule ? what's the penalty for making a mistake ? how much labeled data do we have ? does it cost us to get labels ? and so on.

For each possible answer to these questions, you can construct a well defined problem. And you could focus on solutions to that very well defined problem. But that's not particularly important. Maybe I use hinge loss to capture errors in my classification, or maybe I use some other error function. I don't care as much as I care about a formulation that allows me to solve different flavors of the problem using a single paradigm: possibly some form of gradient descent.
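
A small sketch of what I mean (mine, not from Moritz's post): the same stochastic (sub)gradient loop trains a linear classifier whether you plug in hinge loss or logistic loss; the paradigm, not the specific loss, is what carries the weight.

```python
import numpy as np

def hinge_grad(w, x, y):
    # subgradient of the hinge loss max(0, 1 - y <w, x>)
    return -y * x if y * np.dot(w, x) < 1 else np.zeros_like(w)

def logistic_grad(w, x, y):
    # gradient of the logistic loss log(1 + exp(-y <w, x>))
    return -y * x / (1 + np.exp(y * np.dot(w, x)))

def sgd(X, y, loss_grad, steps=3000, eta=0.5, lam=1e-3, seed=0):
    """One loop, many losses: stochastic (sub)gradient descent with l2 penalty."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, steps + 1):
        i = rng.integers(len(y))
        w -= (eta / np.sqrt(t)) * (loss_grad(w, X[i], y[i]) + lam * w)
    return w

# toy data: labels in {-1, +1} from a random hyperplane
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 10))
y = np.sign(X @ rng.standard_normal(10))

for name, grad in (("hinge", hinge_grad), ("logistic", logistic_grad)):
    w = sgd(X, y, grad)
    print(name, "training accuracy: %.2f" % np.mean(np.sign(X @ w) == y))
```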

This theme shows up again and again. In clustering. In regression. In the different flavors of learning (active, semisupervised, transfer, multitask, ...). A good solution to a problem focuses not on the specific definition, but on a general framework that captures different variations on the problem and reduces them to solving some optimization that can then be engineered. This is also why optimization (and understanding heuristics for optimization) is such a focus in machine learning (Bubeck's lecture notes are a great resource on this, btw).

There is of course a downside. The downside is that you (could) miss out on connections between problems: the reductions that are the lifeblood of work in theoryCS. In fact, John Langford has looked into this issue specifically, with his machine learning reductions project.

So returning to the original question, what should a theory of big data look like ? A big part of theoryCS research is the skillful management of resources (space, time, random bits, communication, what have you..). But an even more important part is the abstraction of computation as a first-order phenomenon, a "theory of the feasible", as it were. 

Consider the example of privacy. Privacy is a concern that arises from access to data, and is a core component of any "big data" discussion. What differential privacy achieved was to frame the problem computationally: both as a definition of what it is feasible to protect, and as a set of mechanisms that guarantee protection and delineate what cannot be protected.

Similarly, I'm interested in other problems arising out of our need to interact with data that can be framed abstractly in "the language of the feasible". There's work on how to value published data, and how to price it. There's work on how to verify computations, and how to do computations securely. There's work on how to understand and interpret data analysis. And then of course there's the large body of work on how to manage and compute efficiently with data under brand new models of computation. 

The "language of the feasible" is our not-so-secret weapon in defining abstractions: it's more than just resource allocation and management, and it's what gives theoryCS power to speak to other domains. 

Friday, August 23, 2013

Simons Institute opening, with pictures.

Today was the official kickoff for the Simons Institute, as well as the start of the two special programs in big data and real analysis. For the last few days I've been busy getting my paperwork organized over at Calvin Hall, which is a beautifully redone circular building named after the Nobel Laureate and Berkeley professor (more on the building here).

The inside is very conducive to collaborative work. The offices are organized around the periphery of the building, and are functional, but not overly so. The idea is that people will congregate in the large interior open spaces that are covered with whiteboards and comfy chairs.

This is what the interior looks like: (apologies for the picture quality)


The second floor open lounge


Comfy chairs in front of a whiteboard


even more chairs and more whiteboards


Let us not forget the very fancy coffee machine (which makes nice espresso)



and of course you need your Simons Institute mug to drink it in.


This is the main auditorium for workshops, on the first floor.

The next plan is to pave the outer walls of the building at ground level with chalkboards, so people can think and chat outside as well as inside. Ah, Berkeley weather. I'm told the boards are already there, and just need to be installed, so I hope to have pictures soon. 

There are 93 visitors between the two programs (53/39 by my rough count) which includes 22 Fellows. Offices are either two or three-person, and there are some large cubicle spaces with 7+ inhabitants. The visitors are all on the second and third floors of the building, which both have large open areas (the pictures above are from the second floor). 

Wednesday, August 14, 2013

The Sa-battle-cal starts..

I'm off on sabbatical, and we (two cars, two adults, two children and one cat) just started the long drive to Berkeley from SLC. This has turned out to be more exciting than I anticipated...

Our original plan was to drive 4 hours each day, making up the 12 hour drive to Berkeley in three days. With two drivers for two cars, this seemed like the best option to prevent us from getting over-tired.

Right at Wendover (Las Vegas for poor people!), about halfway on our first leg, my car broke down. Thankfully, I was able to coast it to a mechanic's shop just off the highway as we entered town. I had no clue what the problem was, and the mechanic wouldn't be able to take a look at it for a few hours (this is the week of the Bonneville races on the salt flats).

So I called my mechanic back in Salt Lake, and described the problem to him. He diagnosed it on the spot as a faulty ignition coil, which is apparently an easy part to procure and replace.... if you're near a dealership.

Which I was not...

He also needed me to figure out which coil was failing, which needed a computer scanner, which needed a mechanic.

So....

Here's what needed to happen, in short order. It was now 5pm. We (my wife and I) needed to

  • get the mechanic to at least scan the car to get the appropriate error codes
  • Call the car parts store (closing at 6) to see if they could procure the needed parts
  • Find a hotel in Wendover (did I mention this was the week of the Bonneville Races, and almost everything in town was booked?)
  • Change our reservations downstream because we were stuck in Wendover. 
Thankfully, through a series of small miracles and the generosity of many strangers and non-strangers, we managed to get all of this done. My mechanic even offered to walk me through doing the installation myself once I did the appropriate YouTube self-study (MOOCs FTW !!)

Long story short, it's now day II of our trek. We doubled up the driving on day II to make up for lost time, and we're back on schedule (minus one set of car keys that I managed to lose in all the hurry). 

So far, this sabbatical is certainly not relaxing. 

p.s Shout out to Utah Imports of SLC, and S&R auto repair in Wendover. 

p.p.s I-80 through Nevada is one of the most mind-numbingly boring drives imaginable. 

Monday, August 05, 2013

Hi ho, hi ho, it's off to sabbatical we go...

It is now 7 days and counting before I head off on sabbatical, not to return till August 2014. I'll be heading to the Simons Institute to think big thoughts about the theory of data (or was it big data about theory, or maybe just the theory of big data). After that, I'll be enjoying the fine life at Google for a semester, followed by a visit to MADALGO in Aarhus (because big data in Europe just tastes better, you know).

I've been using this summer to nurse my psychic wounds from six years of the grind, and gently slide into sabbatical mode. The rush of joy that comes every time I delete a departmental email without reading beyond the subject line tells me I'm ready :).

So far, advice I've received on how to conduct myself during a sabbatical includes:

  • Don't plan too much
  • Work very hard, but on something other than your current research
  • Have an adventure, and if work happens, don't blame yourself. 
  • Say NO repeatedly (this also applies to life post-tenure, apparently). I maybe took this advice a little too literally and managed to decline a review request in about 0.5 seconds, which surprised (and possibly annoyed) the editor who had made the request. 
  • Do something different (or do something that you've been meaning to do for years but never got a chance to). 
What else ? 

Friday, July 05, 2013

FSTTCS in December

At this point, you're probably wondering: exactly how much more coffee do I need to infuse into my system to get my 1/3/10 papers submitted to SODA before the deadline ? Do your stomach lining (and your adenosine receptors) a favor and consider submitting to FSTTCS: the abstract deadline is July 8, and the submission deadline is July 15. You get to go to Guwahati in December and you might even get to stay here:



Wednesday, June 26, 2013

SODA 2014 abstract submission deadline approaching

+Chandra Chekuri informs us that the abstract submission deadline for SODA 2014 is approaching rapidly. Abstracts MUST be in by July 3, or else you cannot submit a full paper (by July 8).

For more information, visit the SODA site: http://siam.org/meetings/da14/submissions.php

Tuesday, June 04, 2013

Computational thinking in other disciplines.

While we wait for the (literal) flames to die down at the STOC business meeting (who'd have thought a two-tier PC would be so controversial), a thought on "computational thinking". 

I've been on PhD committees for students in other departments far removed from computer science. I'm there partly as an external person, and partly as someone who "understands computation" and can guide the student in a project that involves some nontrivial algorithmic thinking.

It's very striking to see how these students approach the notion of computing. They have strong engineering/math backgrounds, but they tend to view computational problems as black boxes for which the first idea that comes to mind is assumed to be the correct answer. This attitude is pernicious enough that when I try to nudge them to define the problem more generally, they have no idea what to do.

This is a common theme in disciplines that make use of computation. Everyone has their favorite biology story along these lines, but this article in Science magazine is a good example of the problems scientists run into when using code. In brief, they tend to use code as a black box, without really understanding what the code does or why, or how to validate the results.

In my algorithms class, I emphasize the importance of naming problems. Especially when one is trying to model a fuzzy real-world problem, it's very important to recognize named problems that might be hidden inside it. But this very fundamental idea - that the problem comes first, and that by naming it we can recognize and understand it - is foreign to people not trained to think computationally. Rather, they think in terms of solutions, which is dangerous until they really know what they want to solve.
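To make this concrete, here's a toy illustration (the scenario, names, and code below are invented for this post, not taken from any student's project): a vague request like "pick a few monitoring stations so that every neighborhood is watched" is, once you name it, just Set Cover, and the name immediately buys you both a hardness result and the classic greedy ln(n)-approximation.

def greedy_set_cover(universe, candidates):
    """Greedy ln(n)-approximation for Set Cover.
    universe: set of elements that must be covered.
    candidates: dict mapping a station name to the set of elements it covers.
    Returns a list of chosen stations."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the station that covers the most still-uncovered elements.
        best = max(candidates, key=lambda s: len(candidates[s] & uncovered))
        if not candidates[best] & uncovered:
            raise ValueError("no candidate covers the remaining elements")
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen

# Hypothetical data: which neighborhoods each candidate station can watch.
neighborhoods = {"A", "B", "C", "D", "E"}
stations = {
    "s1": {"A", "B"},
    "s2": {"B", "C", "D"},
    "s3": {"D", "E"},
    "s4": {"A", "E"},
}
print(greedy_set_cover(neighborhoods, stations))  # -> ['s2', 's4']

The point isn't the ten lines of code; it's that the moment the problem has a name, decades of known results about it become available.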

We talk a lot about computational thinking and the intellectual content of computer science. For anyone without CS training who wishes to make use of computational tools, this is one of the most important lessons to learn: how to name a problem.




Wednesday, May 29, 2013

MOOCification and summer programs

I was walking across campus today and saw the usual crowds of high school and middle school students doing various summer camps (the U. does a GREAT - yes it's called GREAT - summer camp in robotics/graphics, for example).

In our MOOCified future where all classes are taught by "course assistants" who report to "quality assurance staff", will there still be a way for excited middle and high school students to come to a university campus and get an interactive hands-on education at a camp ? I don't doubt that Udacity (or Coursera, or edX) will be happy to spin off a junior varsity division for the high schoolers (and if you do, please note that I said it first). And that will be one more demographic that the university loses as our educational mission gets sliced and diced.

Sunday, May 19, 2013

21st century proverbs

Having nothing better to do, I decided to create a twitter meme: updating proverbs for the 21st century. Use the hashtag #21stcentproverbs and join in the fun.

 

Coding, Complexity and Sparsity (SPARC) 2013.

Atri Rudra reminds us that the 3rd incarnation of SPARC is soon to be upon us (disclaimer: I'll be speaking at the event):
Efficient and effective transmission, storage, and retrieval of information on a large-scale are among the core technical problems in the modern digital revolution. The massive volume of data necessitates the quest for mathematical and algorithmic methods for efficiently describing, summarizing, synthesizing, and, increasingly more critical, deciding when and how to discard data before storing or transmitting it. Such methods have been developed in two areas: coding theory, and sparse approximation (SA) (and its variants called compressive sensing (CS) and streaming algorithms). 
Coding theory and computational complexity are both well established fields that enjoy fruitful interactions with one another. On the other hand, while significant progress on the SA/CS problem has been made, much of that progress is concentrated on the feasibility of the problems, including a number of algorithmic innovations that leverage coding theory techniques, but a systematic computational complexity treatment of these problems is sorely lacking. The workshop organizers aim to develop a general computational theory of SA and CS (as well as related areas such as group testing) and its relationship to coding theory. This goal can be achieved only by bringing together researchers from a variety of areas. 
The Coding, Complexity and Sparsity workshop (SPARC 13) will be held in Ann Arbor, MI on Aug 5-7. 
These will be hour-long lectures designed to give students an introduction to coding theory, complexity theory/pseudo-randomness, and compressive sensing/streaming algorithms. We will have a poster session during the workshop and everyone is welcome to bring a poster but graduate students and postdocs are especially encouraged to give a poster presentation. 
This is the third incarnation of the workshop and the previous two workshops were also held in Ann Arbor in August of 2011 and 2012. 
Confirmed speakers:
* Jin-Yi Cai (University of Wisconsin, Madison)
* Shafi Goldwasser (MIT)
* Piotr Indyk (MIT)
* Swastik Kopparty (Rutgers University)
* Dick Lipton (Georgia Tech)
* Andrew McGregor (University of Massachusetts, Amherst)
* Raghu Meka (IAS)
* Jelani Nelson (Harvard)
* Eric Price (MIT)
* Christopher Ré (University of Wisconsin, Madison)
* Shubhangi Saraf (Rutgers University)
* Suresh Venkatasubramanian (University of Utah)
* David Woodruff (IBM)
* Mary Wootters (Michigan)
* Shuheng Zhou (Michigan)

We have some funding for graduate students and postdocs with preference given to those who will be presenting posters. For registration and other details, please look at the workshop webpage: http://eecs.umich.edu/eecs/SPARC2013/

Friday, May 10, 2013

On GPU algorithms

Lance talks about GPU algorithms in his latest post:
The theory community hasn't seemed to catch on yet. There should be some nice theoretical model that captures the vector and other operations of a GPGPU and then we should search for algorithms that make the best use of the hardware. The theory world writes algorithms for non-existent quantum computers but not for the machines that we currently use.
Well.

As someone who was doing GPU algorithms back when this meant flipping state bits on the fixed function pipeline (this is the GPU analog of "I wrote video games in assembly"), maybe a little perspective is in order.

Aside: I actually gave a talk on GPU algorithms at Chicago when Lance was there back in 2004. Clearly I wasn't interesting enough to get him to take note :).

Back in the early 2000s, I got quite interested in GPU algorithms. This came about from a summer project that Nabil Mustafa was doing with Shankar Krishnan and me on real-time map simplification. Shankar had this cool idea to try to accelerate the Douglas-Peucker algorithm by implementing it on a GPU, which at the time meant trying to retrofit the algorithm to a very strange OpenGL rendering pipeline.

One thing led to another, and it got us thinking more abstractly about what kinds of computations could be done on the GPU. This was pre-CUDA, which meant that we could essentially abstract the GPU machine model as a massively parallel SIMD architecture in which each "processor" ran a simple straight-line program (no loops, limited conditionals, and small memory - much like a streaming algorithm). The parallelism lent itself nicely to geometric problems, and we wrote papers on map simplification, depth contours, and general geometric optimization. We also designed an algorithm for a CSG rendering problem that had buried inside it a simple streaming model for proving lower bounds: in fact, we were able to show an exponential gap (in the number of streaming passes needed) between deterministic and randomized median finding in this model (the conference wasn't interested in the lower bounds though :)).
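To make that abstraction concrete, here's a minimal sketch in Python/NumPy of the kind of computation the model allows (this is illustrative code written for this post; it is neither our original OpenGL implementation nor the lower-bound construction): every element is handled by the same branch-free straight-line program, and an algorithm is a sequence of full-data passes, each followed by a cheap global reduction.

import numpy as np

def simd_pass(values, pivot):
    """One pass in the pre-CUDA abstraction: the same straight-line program
    runs on every element in parallel (NumPy's vectorized ops stand in for
    the SIMD fragment processors; np.where is the branch-free conditional).
    Returns the per-element result and a global count."""
    kept = np.where(values <= pivot, values, np.inf)   # branch-free select
    below = int(np.sum(values <= pivot))               # global reduction
    return kept, below

# Toy multi-pass computation in this model: binary-search the value range
# for the median, using one full-data pass per pivot.
rng = np.random.default_rng(0)
data = rng.random(1_000_000)
lo, hi, rank = 0.0, 1.0, data.size // 2
for _ in range(30):                      # each iteration = one pass
    pivot = (lo + hi) / 2
    _, below = simd_pass(data, pivot)
    if below > rank:
        hi = pivot
    else:
        lo = pivot
print("approximate median:", (lo + hi) / 2)   # ~0.5 for uniform data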

In a bout of perfect timing, I stopped working on GPUs a year or so before CUDA arrived. At the time, my reasons were simple. I was tired of the computational model changing on me every six months, and I didn't have the resources to keep buying the latest and greatest graphics cards (although AT&T was very generous in getting me a state-of-the-art card originally). It was also annoying that in order to really exploit the power of the hardware, I needed secret sauce that was only revealed if you had NVIDIA inside connections (or went to the NVIDIA U events).

Then CUDA came along and everyone went nuts. If you go to hgpu.org (originally gpgpu.org) you'll see the number of papers being published every year on GPU-accelerated methods. CUDA changed (quite radically) the way in which we thought about GPU algorithms: the memory model became more sophisticated, and the SIMD component was more powerful, although it was still a good idea to avoid excessive looping and conditionals.
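For contrast, here's roughly what the same flavor of data parallelism looks like post-CUDA, sketched with the numba.cuda bindings for Python (my choice purely for illustration: it assumes the numba package and a CUDA-capable GPU, and the kernel is a generic example rather than code from any of the papers above). Threads get explicit indices, memory transfers are explicit, and the bounds check is the only conditional.

import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    # One thread per element: explicit indexing replaces the old
    # one-fragment-per-pixel trick; the bounds guard is the only branch.
    i = cuda.grid(1)
    if i < x.shape[0]:
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)

# The more sophisticated memory model: explicit host <-> device transfers.
d_x = cuda.to_device(x)
d_y = cuda.to_device(y)
d_out = cuda.device_array_like(x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
saxpy[blocks, threads_per_block](np.float32(2.0), d_x, d_y, d_out)

out = d_out.copy_to_host()
assert np.allclose(out, 2.0 * x + y)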

Eventually I got dragged back into the GPU world. I collaborated on a paper on graph coloring which was very instructive in demonstrating the differences between a GPU and a straight-up parallel machine. I also did a series of lectures on GPU algorithms at a 2012 MADALGO summer school on new models of computing. Incidentally, folks at Utah spend a lot of time working on issues relating to GPUs: Mary Hall's group has some nifty autotuning tools for mapping algorithms to the GPU, and Ganesh Gopalakrishnan has been thinking about memory consistency issues for GPUs.

Is there a good "GPU model" that theoreticians can study ? I think this is the wrong question. The GPU is one point (multi-core CPUs are another) in a space of tradeoffs between local computation, memory latency, and communication. I've been saying for a while now that communication appears to be the key resource to understand theoretically when dealing with our new networked/distributed/multiprocessor world. GPU algorithms are interesting to study as one example of how communication affects a computation. But the "right" way to deal with this involves focusing on communication, and not on the GPU model per se. The latter has many idiosyncrasies and what I'd consider distractions that take away from a clean formal understanding.

I should add that I'm not the only person from theory-land who's been thinking about GPU-derived models. There's a great body of work on multicore models (see Phil Gibbons' lecture notes from the same summer school), and I recall Michael Mitzenmacher having recent work on GPU-accelerated hashing. Uzi Vishkin ran a STOC workshop a few years ago on multicore models as well. And (Turing Award winner) Valiant's bridging model clearly deserves a lot more attention, lest people conclude that no one is paying attention to models at all.

For more on this, also see my colleague Jeff Phillips' lecture notes on GPUs from his class on new models of computation.

Saturday, March 23, 2013

Free, Freemium, and Paid

There was a time when I'd bridle at the idea of having to pay for software or services. But I browse the iTunes app store now, and see people pleading to have the chance to pay for an app that they like, so that the authors won't stop updating it. The whole kerfuffle with Google Reader, Keep and Evernote is another example of how people have begun to prefer to pay for products, rather than rely on something free.

It feels like the end of an era where open source and free software (not the same thing, but often referred to in the same breath) were the default. Maybe we've come to the realization that nothing is really ever free, and that it's more realistic to get the costs out in the open rather than "being the product".

Friday, March 15, 2013

The SIGACT CG column

As you might have just discovered, I'm the second half of the two-headed monster that's taken over the SIGACT Geometry Column after Joe O'Rourke stepped down (Adrian Dumitrescu, who hopefully does not mind being referred to as the head of a monster, is the other half). My first column is up, and it talks about some recent fascinating developments in nonnegative matrix factorization.

My hope with the pieces I write is to cover areas of geometry that may not have had sufficient representation in the past (especially things closer to problems I'm interested in). My next column is due in August, and apart from doing a wrap-up on SoCG, other topics that come to mind include Laplacians and graph geometry, reproducing kernels, or even Bregman geometry.

But everything is now mobile, crowdsourced, and socially networked, so I'm looking for your suggestions on interesting topics to cover, new emerging areas, results that I'm not tracking, and so on. So post away here or on G+.

Monday, February 25, 2013

Data analysis, interpretation and explanations

There was a recent data-related kerfuffle between the New York Times and the makers of the Tesla electric car. If you haven't read the articles (the NYT public editor's post on this has good links), the crude summary is this:

  • NYT reporter takes Tesla on long test drive, and reports problems with running out of charge.
  • Tesla Motors CEO Elon Musk writes a long takedown of the reporter's review, complete with graphs generated from data the car recorded during the drive.
  • NYT comes back with their own interpretation of data, rebutting Musk's claims.
  • Others attempt to reproduce the reporter's experience and fail, but arguably in different weather conditions that might or might not make a difference.
In an insightful meta-analysis of the dustup, Taylor Owen of the Tow Center for Digital Journalism discusses the implications for journalism in a data-driven world. He also references an article by David Brooks that makes the point:
People are really good at telling stories that weave together multiple causes and multiple contexts. Data analysis is pretty bad at narrative and emergent thinking...
I've felt for a while now that it's time to design mechanisms for providing context and narrative to data analysis*. Some of the research my student Parasaran does is on metaclustering: essentially the meta-analysis of different perspectives (clusterings) of data to draw out a larger meaning. We've also just submitted a paper on how to derive local markers of prediction validity in clustering (rather than just relying on global measures of quality). And my seminar this semester is about the larger problems of explanations and accounting in data mining.

I think as computer scientists, we have a lot to offer in the realm of data mining - not just in the design of tools for prediction, but in the design of tools to facilitate better understanding.


* None of this is surprising to experimental scientists, who will routinely attack a problem from multiple directions in order to acquire a larger understanding of a phenomenon rather than just the ability to predict. 


Friday, February 08, 2013

TCS+ hangout on the TSP polytope.

+Thomas Vidick and the folks in the +TCS+ community have started a new experiment in G+ hangout talks. The first talk in this series was by Ronald de Wolf, on the STOC 2012 paper that showed that there is no polynomial-sized lifted representation of the TSP polytope.

Overall, it was a very pleasant experience. I was able to reserve one of the 10 slots (a Hangout limit) for myself and my students at the University of Utah to attend and interact with the speaker, and from Thomas's review after the event it seems that many more had signed on for "view-only" access to the stream.

There were very few technical hiccups, and +Oded Regev did a great job making sure that people were muted when not talking, and had a chance to ask questions in an orderly way. The chat sidechannel was a great help.

The talk itself was the best part: de Wolf did an excellent job conveying the main ideas of the proof without getting bogged down in details, and it felt as comfortable as listening to a talk live at a conference. Given the number of people listening in, this was already approaching medium-sized-workshop levels.

I'm looking forward to more of these events, and I'm glad that the +TCS+  folks are doing this. I also hope they can try more experiments with Google Hangout. For example, two ideas come to mind:

  • A reddit-style AMA ("Ask Me Anything"). One way to make this work would be for the speaker to give a short presentation (maybe 5-10 minutes) and then open the floor for questions. To keep things manageable, people could write in questions on chat, and the moderator could filter them and ask the questions live. With sufficient preparation, some questions could even be submitted ahead of time.
  • A live panel discussion with a moderator and a few participants, which again could take questions from the audience, filtered by the moderator.
