The Geomblog: 10/01/2013

Sunday, October 27, 2013

FOCS Reception sing-along

Tomorrow night at 7:30 (before the FOCS business meeting at 9), the Positive Eigenvalues will be playing a live set for the entertainment of FOCSers.

Who are the Positive Eigenvalues, you might ask ? Well, you'll have to come and hear us to find out :). Two of the songs we're playing are compositions by our resident philosopher/graphic novelist/keyboardist/composer/complexity theorist, Christos Papadimitriou.

The first is a riff off the Dylan song "Man gave names to all the animals", with the same tune but different lyrics. If you're so inspired, you're encouraged to sing along to the chorus (which repeats after each verse). It goes like this:

Chorus:
Theorists gave names to all the classes
in the beginning, in the beginning
Theorists gave names to all the classes
in the beginning long time ago

Verse 1:
They saw a woman slim and tall
forging steadily to her goal
she went by fast but that's no crime
uh I think I'll call her PTIME

Verse 2:
They saw a kid who soldiered on
he got no rest from dusk to dawn
he kept sweating in the same place
uh I think I'll call him PSPACE

Verse 3:
They saw a blind man comb his shack
to find his walkstick in the black
but once he found it he could see
uh I think I'll call him NP

Verse 4:
They saw a boy who walked and walked all night
never veered to the left or right
was kind of slow but not that bad
uh I think I'll call him PPAD

Verse 5:
There was a beast and on its back
it carried Heisenberg and Dirac
it was as weird as a beast can be
uh...

For more on the tailed off line at the end, see the wikipedia entry.

The second song is an original composition by Christos in rock ballad/power rock mode. The lyrics speak poignantly about the life of a theoretician:

In Theory
Verse
Think of a carpenter of sorts
he has the flu he has the blues
He burns his oil every night
all night frustrated and confused

Bridge
And worst of all he can't complain
In theory his job is plain (twice)

Chorus
Build something sturdy and beautiful and useful
He's always short in one goal
He juggles true with elegant and useful
and he keeps dropping one ball

Verse 2
His buddies like him alright
except they love to criticize
They go and take his stuff apart
with clever words with clever eyes

Verse 3
Some nights a girl comes to his shop
to sip his tea learn something raw
his foggy mind his tangled soul
want her to stay but let her go

Thursday, October 17, 2013

Searching randomly for needles.

Suppose you're given a large set of objects $X$, and you know that some subset $I$ is "interesting". A particular example that hits close to (my regular) home is in bug-testing: $X$ is the set of possible inputs to a program, and $I$ is the set of inputs that generate bugs (this is the downside of talking to +John Regehr too much). We'll assume that if you're given a candidate object, you can check easily if it's interesting or not (for example, by running the program)

You'd like to find the interesting items, so you consult an expert (in our running example, maybe it's a method to generate inputs that test certain kinds of bugs). The expert produces items that it thinks are interesting. But experts have biases: maybe your expert only cares about certain kinds of interesting items.

So you ask multiple experts, in the hope that their biases are different, and that together they can cover the space of objects. But you don't know anything about their biases except what you can learn from repeatedly asking them for candidate objects.

What's your strategy for asking questions so that at any stage (say after asking $N$ questions), you've found the optimal number of interesting items ?

This was the topic of a talk by +Sebastien Bubeck at the AMPLab in Berkeley on Tuesday, based on this paper. A key idea in the algorithm design is to make use of estimators of "things I haven't seen yet", or the famous Good-Turing estimator (which your humble blogger wrote about many æons ago). Here's how it works.

Formally, let us assume that each "expert" $i$ is a distribution $P_i$ over $X$. At each step, the algorithm will determine some $i$, and ask it for a random draw from $P_i$. Suppose we knew the fraction of items that $i$ had not seen yet. Then a simple greedy strategy would be to pick the $i$ that had the largest value of this "not-yet-seen" quantity. That's all well and good, as long as we know the fraction of items not yet seen.

Here's where the G-T estimator comes in. What it says is that if we're given samples from a distribution, and count the number of items in the sample that occurred exactly once, then this "frequency of frequency", divided by the number of samples, is a good estimate for the mass of items not yet seen. Moreover, it can be shown that this estimator has good concentration properties around the true answer.

So that's what the algorithm does. It maintains estimates (for each expert) of the mass not yet seen, and in each round picks the expert that maximizes this term, corrected by an adjustment coming from the tail of the concentration bound.

The algorithm is really simple, and elegant. The question is, how well does it do ? And now things get a little harder. The above ideal greedy strategy is optimal as long as the supports of the experts are disjoint. Under the same assumption (and some other technical ones), it can be shown that the expected difference between the number of items found by the algorithm and the number found by the ideal greedy algorithm is $O(\sqrt{Kt \log t})$ where $K$ is the number of experts and $t$ is the current time step.

It's not clear how to extend these bounds to the case when supports can overlap (or rather, it's not clear to me :)), but the authors show that the algorithm works quite well in practice even when this assumption is broken, which is encouraging.

Wednesday, October 16, 2013

Progressively complex representations for graphs

After NIPS last year, I had posted a note on the evolution of data models and how we think about the "shape" of data. In brief, we have increasingly complex ways of thinking about the (geometric) shape of data that carefully balance the need for expressivity and the ability to determine the parameters of the representation efficiently.

There's a basic fact of data analysis that you come up against once you start playing with data. Either you endow the space with a geometry, or you endow it with a graph structure (and of course, nowadays you might throw in a spherical cow or two). And there are only a few ways to mix the representations (spectral methods being one such approach).

But as far as I know, there are no standard ways to create a layered representation of graphs to model increasing complexity and expressivity in a similarly balanced manner. There are of course an infinite number of ways to parametrize graphs by quantities that capture increasing complexity (treewidth, connectivity, expansion, induced metric structure, and so on). But I don't know of any established and standard ways to model families of graphs with increasing complexity that capture data patterns of interest.

One of the things that I've been tasked with (along with Peter Richtarik and Han Liu) is to draft out a report on the activities at the big data program at Simons this semester. We're looking also at challenges in the theory of big data, and in my mind coming up with good models for graph structures that capture the intricacies of real data is one of them.

Wednesday, October 02, 2013

Call for tutorials at SIAM Data Mining

I'm the tutorials chair for the 2014 SIAM conference on data mining (SDM) (for readers not aware of the data mining landscape, SDM is a SODA-style data mining conference run by SIAM - i.e it's a CS-style venue with peer-reviewed papers, rather than a math-style conference like SIAM discrete math). SDM is one of the major venues for data mining research, especially the more statistically focused kind. It will be held in Philadelphia between Apr 24-26, 2014.

SDM runs 4-5 tutorials each year on various aspects of data mining (ranging from theory/algorithms to specific application areas). I'm personally very interested in encouraging submissions from people in the theory community working with data who might want to share their expertise about new methods/techniques from theoryCS land with the larger data analysis community.

The deadline is Oct 13, and all you need is a few pages describing the tutorial content, target audience, and some sample material (more details here). If you have an idea and are not sure whether it will fit, feel free to email me directly as well.

The Geomblog

Pages