Wednesday, May 30, 2012

Workshop on coding theory, complexity and sparsity

Atri Rudra asks me to post an announcement for the second incarnation of the workshop on coding theory, complexity and sparsity. The first event went off quite well - I was sure that Dick Lipton had a post on it but I couldn't find it (Update: here it is, thanks to Atri). At any rate, do attend and get your students to attend: they have a great roster of speakers and travel support for students.
Efficient and effective transmission, storage, and retrieval of information on a large scale are among the core technical problems in the modern digital revolution. The massive volume of data necessitates the quest for mathematical and algorithmic methods for efficiently describing, summarizing, synthesizing, and, increasingly critically, deciding when and how to discard data before storing or transmitting it. Such methods have been developed in two areas: coding theory, and sparse approximation (SA) (and its variants called compressive sensing (CS) and streaming algorithms).
Coding theory and computational complexity are both well established fields that enjoy fruitful interactions with one another. On the other hand, while significant progress has been made on the SA/CS problem, much of that progress is concentrated on the feasibility of the problems, including a number of algorithmic innovations that leverage coding theory techniques; a systematic computational complexity treatment of these problems is sorely lacking.
The workshop organizers aim to develop a general computational theory of SA and CS (as well as related areas such as group testing) and its relationship to coding theory. This goal can be achieved only by bringing together researchers from a variety of areas. We will have several tutorial lectures that will be directed to graduate students and postdocs.
These will be hour-long lectures designed to give students an introduction to coding theory, complexity theory/pseudo-randomness, and compressive sensing/streaming algorithms.
We will have a poster session during the workshop, and everyone is welcome to bring a poster, but graduate students and postdocs are especially encouraged to give a poster presentation.
Confirmed speakers:
  • Eric Allender (Rutgers)
  • Mark Braverman (Princeton)
  • Mahdi Cheraghchi (Carnegie Mellon University)
  • Anna Gal (The University of Texas at Austin)
  • Piotr Indyk (MIT)
  • Swastik Kopparty (Rutgers)
  • Dick Lipton (Georgia Tech)
  • Andrew McGregor (University of Massachusetts, Amherst)
  • Raghu Meka (IAS)
  • Eric Price (MIT)
  • Ronitt Rubinfeld (MIT)
  • Shubhangi Saraf (IAS)
  • Chris Umans (Caltech)
  • David Woodruff (IBM)
We have some funding for graduate students and postdocs. For registration and other details, please look at the workshop webpage: https://sites.google.com/site/umccsworkshop2012/

Thursday, May 17, 2012

8F Computational Geometry

What is North Carolina known for?
  • Basketball?
  • Barbecue?
  • Great Universities?

    Well, this June it will be known for geometry, as SoCG will be held this year very near Durham, NC. Unfortunately, it will be at UNC and not Duke. (I kid, I know Jack will do a great job organizing.)

    If those things are not enough, Suresh and I (Jeff) are organizing an 8F-Computational Geometry workshop at SoCG. In case that's not clear, 8F stands for AITF, and AITF stands for Algorithms in the Field.

    The idea is to both showcase how computational geometry has had impact on "the field" and highlight some areas that we hope are ripe for future impact from the CG community. There are two days of talks: Tuesday, June 19, on ongoing impact, and Wednesday, June 20, on potential (more) impact. We already have a great line-up of speakers; on the first day we have talks by
  • Valerio Pascucci on Visualization,
  • Lars Arge on GIS,
  • David Mount on Data Retrieval, and
  • Nina Amenta on Meshing and Surface Reconstruction.
    On the second day we have talks by
  • Dan Halperin on Robotics and Automation,
  • Jie Gao on Mobile Networks,
  • Michael Mahoney on High Dimensional Data Analysis, and
  • Tom Fletcher on Medical Imaging.

    We encourage you all to attend, and to see what the Durham area has to offer. There is actually a full 4 days of activities at SoCG this year. The early registration deadline is next week (May 23) and it's only $100 for students!


    We should mention that this 8F-Computational Geometry workshop is part of a larger push (through sponsorship) by NSF. There was a more general 8F Workshop last year, and there are more in the works.
    Wednesday, May 16, 2012

    "In the long run..."

    "In the long run, we're all dead" is a famous quote attributed to John Maynard Keynes. The context was his arguments with economists of the time: he was trying to argue for government intervention in the markets to control inflation, rather than just letting it play out.

    It's an apt response also to occasional claims that asymptotics will eventually win out, especially with large data. Asymptotics will eventually win out, as long as everything else stays fixed.

    But that's the precise problem. Everything else doesn't stay fixed. Well before your $C n \log n$ algorithm beats the $c n^2$ algorithm, we run out of memory, or local cache, or something else, and the computational model changes on us.

    We come up with external memory models, and are informed that in fact even a factor of log N is too much. We design streaming models and Mapreduce and so on and so forth, and realize that all this communication is burning a hole in our atmosphere.

    Lesson to be learned (and re-learned): asymptotic analysis is a powerful tool, but it's only as good as the model of computation you're REALLY working with.

    Wednesday, May 09, 2012

    Multiplicative Weight Updates as Zombie Binary Search

    The multiplicative weight update method (MWU hereafter) is a neat algorithm design technique with applications in machine learning, geometry and optimization, among others. However, it's viewed (and discussed) as an advanced technique, with very technical examples requiring lots of background (see the first example in the Arora-Hazan-Kale survey).

    After pondering the use of MWU for a while now, it seems to me that this should be taught as a standard algorithms tool that naturally follows from divide and conquer and prune-and-search. In what follows, I'll sketch out how this might work.

    A quick caveat: the MWU, like many powerful tools, has multiple interpretations, and the survey hints at many of them (for example, MWU = derandomization of the Chernoff bound). I don't intend to imply that there's only one way to interpret the technique, but that the approach I'll describe is the most accessible one when learning the method for the first time.



    The divide and conquer unit in an algorithms class might cover sorting and FFTs, driven by the standard D&C recurrence relation. Prune and search is introduced as a variation, with median finding and binary search being the prototypical examples.

    Prune-and-search works like magic, if you're seeing it for the first time. Do some work, throw away the rest, and your $n \log n$ running time goes down to linear. (As an Indian of a certain age, this always reminds me of "thoda khao, thoda phenko" (eat a bit, throw a bit) from Jaane Bhi Do Yaaro).

    But what makes it tick is determining the constant factor that needs to be thrown away. The most direct way to do this is deterministically: on a linear time budget, find a point that splits the input into two roughly balanced parts.

    But that's expensive in practice. Indeed, algorithms that use median finding as a subroutine try to avoid using a deterministic procedure because it can be slow. When you get to more advanced methods like parametric search, it gets even worse.

    The neat trick to apply here is to randomize ! We know that we don't have to find an exact split point - merely one that will approximately balance the two sides of the recursion. For median finding, we can pick three points at random, and take their median. With a sufficiently high probability, this median will create a 1/3-2/3 split, and away we go !
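    As a concrete example, here's a minimal sketch of randomized prune-and-search for selection; the median-of-3-random-picks pivot rule is one illustrative choice among many:

```python
import random

def quickselect(arr, k):
    """Return the k-th smallest element (0-indexed) in expected linear time.
    The pivot is the median of 3 random picks: with good probability this
    yields a 1/3-2/3 split, which is all the recursion needs."""
    if len(arr) == 1:
        return arr[0]
    pivot = sorted(random.choices(arr, k=3))[1]
    lo = [x for x in arr if x < pivot]
    hi = [x for x in arr if x > pivot]
    if k < len(lo):
        return quickselect(lo, k)
    if k < len(arr) - len(hi):
        return pivot  # k falls among the elements equal to the pivot
    return quickselect(hi, k - (len(arr) - len(hi)))

print(quickselect([5, 2, 9, 1, 7, 3, 8], 3))  # 4th smallest: prints 5
```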

    This idea surfaces again and again, especially in the many randomized algorithms in computational geometry. As a design strategy, it's quite effective - design an algorithm that works if you can do balanced splits, and then find the split by choosing randomly. Invoke the Chernoff God and some satellite deities, and you're done.

    Which brings us to the MWU.

    Randomized splitting says that we're willing to lie a little about the split process, and things still work out. But whether you do randomized splitting or deterministic, the end result is still that you throw away some reasonable fraction of the input, NEVER TO SEE IT AGAIN.

    Suppose you can't even do that ?

    Think of noisy binary search (or 20 questions with a liar). Now, even your decision on what to prune is error-prone. You might be eliminating things that you need to consider later. So it's not clear that you can make any kind of search progress. But let's be reasonable and limit the adversary (or the noise) in some manner. Let's say that you'll only misclassify a small fraction of the input in each iteration (where small is "strictly less than half"). Let's also assume that I can tell you how important certain points are, so that you have to take that into consideration when defining your "small fraction".

    So how does MWU now work ?  I tell you a distribution over the input, and you give me back a rule that's reasonably good at (say) classifying points. I increase the importance of things you made mistakes on (so you try not to do it again), and repeat.

    Eventually, what happens is that the weight of things that you make mistakes on increases rapidly. But the overall weight of the input can't increase too much, because I only increase the weight of things you make mistakes on, and those carry only a small fraction of the total weight. Intuitively, what happens is that the first weight catches up with the second, at which point the algorithm is complete. You can show that if the updating schedule is chosen correctly, the process terminates in logarithmic steps.

    Essentially, MWU functions as a zombie binary search. You want to kill elements, but they keep coming back. Thankfully, each time they come back they're slightly weaker, so you have a better chance of hitting them next time. Eventually, head is severed from neck, and your zombies are dead (and your points are found).
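    To make the zombies concrete, here's a minimal sketch of noisy binary search in the MWU style (the 20% lying rate, the update factor `beta`, and the round budget are illustrative assumptions, not tuned constants):

```python
import random

def noisy_binary_search(n, ask, rounds=300, beta=0.5):
    """Search for a hidden target in range(n) when ask(x) answers
    "is the target <= x ?" but lies some of the time. Instead of
    discarding the half implicated by each answer (it might contain
    the target !), shrink its weight: killed points can come back,
    but each time a little weaker."""
    w = [1.0] * n
    for _ in range(rounds):
        # query the weighted median
        half, acc, m = sum(w) / 2, 0.0, 0
        for i in range(n):
            acc += w[i]
            if acc >= half:
                m = i
                break
        if ask(m):  # "target <= m": down-weight the upper half
            for i in range(m + 1, n):
                w[i] *= beta
        else:       # "target > m": down-weight the lower half
            for i in range(m + 1):
                w[i] *= beta
    return max(range(n), key=lambda i: w[i])

# toy run: a liar who fibs 20% of the time
target = 42
ask = lambda x: (target <= x) == (random.random() > 0.2)
print(noisy_binary_search(100, ask))  # 42, with high probability
```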

    My only wish at this point is that whenever you think of MWU you think of zombies. Now that's impact :)

    Wednesday, April 18, 2012

    Distributed Learning: A new model

    Communication is now the key to modelling distributed/multicore computations. Jim Demmel has been writing papers and giving talks on this theme for a while now, and as processors get faster, and the cloud becomes a standard computing platform, communication between nodes is turning out to be the major bottleneck.

    So suppose you want to learn in this setting ? Suppose you have data sitting on different nodes (you have a data center, or a heterogeneous sensor network, and so on) and you'd like to learn something on the union of the data sets. You can't afford to ship everything to a single server for processing: the data might be too large to store, and the time to ship might be prohibitive.

    So can you learn over the (implicit) union of all the data, with as little discussion among nodes as possible ? This was the topic of my Shonan talk, as well as two papers that I've been working on with my student Avishek Saha, in collaboration with Jeff Phillips and Hal Daume. The first one will be presented at AISTATS this week, and the second was just posted to the arxiv.

    We started out with the simplest of learning problems: classification. Suppose you have data sitting on two nodes (A and B), and you wish to learn a hypothesis over the union of A and B. What you'd like is a way for the nodes to communicate as little as possible with each other while still generating a hypothesis close to the optimal solution.

    It's not hard to see that you could compute an $\epsilon$-sample on A, and ship it over to B. By the usual properties of an $\epsilon$-sample, you guarantee that any classifier on B's data combined with the sample will also classify A correctly to within some $\epsilon$-error. It's also not too hard to show a lower bound that matches this upper bound. The amount of communication is nearly linear in $1/\epsilon$.
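    As a cartoon of this one-way baseline (a sketch: `learn` stands in for any training routine, and the sample size is a placeholder rather than the actual VC-dimension-based $\epsilon$-sample bound):

```python
import random

def one_way_protocol(A, B, eps, learn):
    """Node A ships a random sample (standing in for an eps-sample of
    its data) to node B; B learns a hypothesis on its own data plus
    the sample. The communication is the sample size: ~1/eps items,
    matching the 'nearly linear in 1/eps' bound mentioned above."""
    size = min(len(A), int(10 / eps))  # placeholder, not the real bound
    return learn(B + random.sample(A, size))
```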

    But can you do better ? In fact yes, if you let the nodes talk to each other, rather than only allowing one-way communication. One way of gaining intuition for this is that $A$ can generate classifiers, and send them over to $B$, and $B$ can tell $A$ to turn the classifier left or right. Effectively, $B$ acts as an oracle for binary search. The hard part is showing that this is actually a decimation (in that a constant fraction of points are eliminated from consideration as support points in each step), and once we do that, we can show an exponential improvement over one-way communication. There's a trivial way to extend this to more than 2 players, with a $k^2$ blow up in communication for $k$ players.

    This binary search intuition only works for points in 2D, because then the search space of classifiers is on the circle, which lends itself naturally to a binary search. In higher dimensions, we have to use what is essentially a generalization of binary search - the multiplicative weight update method. I'll have more to say about this in a later post, but you can think of the MWU as a "confused zombie" binary search, in that you only sort of know "which way to go" when doing the search, and even then points that you dismissed earlier might rise from the dead.

    It takes a little more work to bring the overhead for k-players down to a factor k. This comes by selecting one node as a coordinator, and implementing one of the distributed continuous sampling techniques to pass data to the coordinator.

    You can read the paper for more details on the method. One thing to note is that the MWU can be "imported" from other methods that use it, which means that we get distributed algorithms for many optimization problems for free. This is great because a number of ML problems essentially reduce to some kind of optimization.

    A second design template is multipass streaming: it's fairly easy to see that any multipass sublinear streaming algorithm can be placed in the k-player distributed setting, and so if you want a distributed algorithm, design a multipass streaming algorithm first.

    One weakness of our algorithms was that we didn't work in the "agnostic" case, where the optimal solution itself might not be a perfect classifier (or where the data isn't separable, to view it differently). This can be fixed: in an arxiv upload made simultaneously with ours, Blum, Balcan, Fine and Mansour solve this problem very neatly, in addition to proving a number of PAC-learning results in this model.

    It's nice to see different groups exploring this view of distributed learning. It shows that the model itself has legs. There are a number of problems that remain to be explored, and I'm hoping we can crack some of them. In all of this, the key is to get from a 'near linear in error' bound to a 'logarithmic in error' bound via replacing sampling by active sampling (or binary search).

    Monday, April 09, 2012

    Approximate Bregman near neighbors.

    (tl;dr: Our upcoming paper in SoCG 2012 shows that with a nontrivial amount of work, you can do approximate Bregman near neighbor queries in low dimensional spaces in logarithmic query time)

    One of the things that has haunted me over the years (yes, haunted!) is the geometry of Bregman divergences. A Bregman divergence is a very interesting creature. It's defined as the difference, at a point, between a convex function and its linear approximation taken around another point:
    $$ d_\phi(p,q) = \phi(p) - \phi(q) - \langle \nabla \phi(q), p - q\rangle $$
    Because the Bregman divergences are parametrized by the function $\phi$, they yield a family of divergences. The most familiar one is $\ell_2^2$, but the most important one is the Kullback-Leibler divergence. There are a ton of reasons why one should study Bregman divergences, and in a shameless plug I'll refer you to my slides (one, two, three) from a geometry summer school last year. Suffice it to say that it's possible to unify a number of different algorithms in machine learning under a single framework by realizing that they're all particular instances of doing something with a Bregman divergence: two notable examples of this are Adaboost and information-theoretic clustering.
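    To make this concrete, here's the definition in code, instantiated with the two choices of $\phi$ just mentioned (a sketch; the helper names are mine):

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    """d_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>"""
    return phi(p) - phi(q) - np.dot(grad_phi(q), p - q)

# phi(x) = ||x||^2 recovers squared Euclidean distance
sqeuc = lambda p, q: bregman(lambda x: x @ x, lambda x: 2 * x, p, q)

# phi(x) = sum_i x_i log x_i (negative entropy) recovers the KL
# divergence on the simplex (on general positive vectors the linear
# correction terms don't cancel)
neg_entropy = lambda x: float(np.sum(x * np.log(x)))
kl = lambda p, q: bregman(neg_entropy, lambda x: np.log(x) + 1, p, q)

p, q = np.array([0.5, 0.5]), np.array([0.9, 0.1])
print(sqeuc(p, q), kl(p, q), kl(q, p))  # note: KL is asymmetric
```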

    So what's the geometric perspective on Bregman divergences ? They generalize duality !

    There's a standard way to think about traditional point-hyperplane duality via a lifting map: take a point, lift it to the paraboloid one dimension up, and find the tangent plane. But suppose we replace the paraboloid by a generic convex function ? What we get is a general convex duality (technically a Fenchel-Legendre duality) defined by
    $$\phi^*(u) = \sup_v \langle u,v \rangle - \phi(v)$$
    The reason this doesn't come up in Euclidean space is that the mapping is self-dual if we set $\phi(x) = (1/2) \|x\|^2$, which explains the symmetry in $\ell_2^2$.

    It turns out that this observation is enough to import much of standard combinatorial geometry over to general Bregman spaces. We can compute convex hulls, Voronoi diagrams, Delaunay triangulations, etc., as long as we're careful to keep primal and dual straight. For example, the locus of points equidistant from two points under a Bregman divergence is either a primal straight line or a dual straight line (depending on how you measure things), and so on. This was all elegantly worked out by Nielsen, Nock and Boissonnat a while ago.

    Alas, this beautiful connection doesn't help us with approximation problems.

    One of the most interesting questions regarding Bregman divergences is solving the near-neighbor problem. This is interesting because the Kullback-Leibler distance is often used (for example) to compare images, and so there's been a lot of empirical work on this problem.

    But here's the problem. Let's consider the low dimensional ANN problem. In Euclidean space (or even in spaces of bounded doubling dimension), here's how you solve the problem. You build some kind of quad-tree-like data structure, using the triangle inequality to reason about which cells you need to explore, and using packing bounds to bound the number of cells explored. You also need a crude bound on the near neighbor to start with, and to do this, you use some variant of a ring-tree.

    The key ingredients: triangle inequality (twice over) and packing bounds.

    Bregman divergences don't even satisfy a directed triangle inequality in general. And to date, we didn't even know how to define packing bounds properly for these directed distances.

    In an upcoming paper at SoCG 2012 with my students Amirali Abdullah and John Moeller, we finally figured out how to get some "approximation" of the ANN machinery to work with Bregman divergences, and get logarithmic query times with small space.

    If I say so myself, there are some nice ideas in the paper. Firstly, in a somewhat surprising twist, a Bregman divergence satisfies a "reverse triangle inequality" on the line:

    \[ d_\phi(a,b) + d_\phi(b,c) \le d_\phi(a,c), \qquad a \le b \le c \in \mathbb{R} \]

    This is neat, because it gives packing bounds ! Intuitively, if the sum of lengths of subintervals along an interval is at most the length of the interval, then you can't have too many subintervals.

    The next trick is an observation that the square root of a Bregman divergence satisfies a kind of relaxed triangle inequality that we call $\mu$-defectiveness:

    \[  |d(x, y) - d(x,z)| \le \mu d(y, z)\]

    This allows us to import some of the ring-tree machinery to get a coarse approximation.

    And I should mention that the $\mu$ values involved here are quite small. If you're comparing distributions with the KL-divergence, then the value of $\mu$ is less than 2 even quite close to the boundary of the simplex.

    Even with all of this in place, the quad-tree argument breaks. This is because of $\mu$-defectiveness: we can't assume that cell sizes are "reasonably large" at some number of levels below the root of the tree, and so our packing bounds look terrible. It takes a lot more work to fix this problem: essentially we exploit second-order structure of the Bregman divergences to bound the cell sizes and get the desired packing bound.


    Afterthoughts:
    • Most of the complexity of the algorithm is in the analysis: the actual algorithm looks mostly like a Euclidean ANN procedure. While we haven't implemented it, I'm hopeful that when we do, we'll be able to enjoy the empirical behavior of our Euclidean cousins.
    • We're looking at high dimensional Bregman near neighbors, as well as some other approximate Bregman geometry questions. While our low-D ANN result comes from the  "I will break your limbs one by one till you submit" school of algorithms, the hope is that we can start to exploit more of the dual geometry as we learn more.

    Sunday, April 08, 2012

    Postdoc Announcement

    Piotr Indyk asked me to post this announcement for a postdoc position:

    Applications are invited for a Postdoctoral Research Assistant position for the MIT-Shell-Draper Lab research project

    "Applications of compressive sensing and sparse structure exploitation in hydrocarbon reservoir exploration and surveillance" 

    The goal of the project is to develop novel compressive sensing algorithms for geoscience problems in the area of hydrocarbon field exploration and surveillance. The appointment is initially for one year, with the possibility of renewal for up to 3 years. The appointment should start either during the summer (the preferred option) or the fall of 2012.

     Duties and Responsibilities: 
    • Provide expertise on and contribute to the development of compressive sensing and sparse recovery algorithms for geoscience applications
    • Help MIT faculty in coordination of research projects, including periodic reporting and documentation as required by the program timeline
    • Frequent travel to Shell facilities in Houston 

    Minimum Qualifications
    Ph.D. in Computer Science, Electrical Engineering, Mathematics or related disciplines

    Preferred Qualifications
    • Expertise in sparse recovery, compressive sensing, streaming/sketching, machine learning and statistical inference algorithms
    • Experience and/or interest in research projects of interdisciplinary nature
    • Programming experience (MATLAB) 

    Applications (including CV and three reference letters) should be submitted to https://postdoc.csail.mit.edu/searches/sparsityp-search/ ideally by April 15, 2012. However, applications will be considered until the position is filled.

    Thursday, March 29, 2012

    STOC 2012 news: posters, registration and student travel.

    This is a friendly reminder to submit your 1-2 paragraph abstracts for the STOC 2012 Poster session. The deadline is Mar 31 @ 5pm PDT (via email to stoc2012posters@gmail.com). Here's how to do it, and here's why you should do it !

    Remember, you DON'T need to submit an actual poster by Mar 31 - just the abstract and information on who will present.

    In related STOC news, STOC 2012 registration is now live and the deadline for applying for student travel funding is April 4th.

    See you in NYC !

    Sunday, March 18, 2012

    STOC 2012 Posters: Submit your abstract now !

    As I mentioned earlier, STOC 2012 is repeating the successful poster event from STOC 2011. For more details on the logistics of submitting, see my post. What I'd like to do is give you more reasons to submit !

    No one attends talks !
    Ok, that's a little bit of an exaggeration: but between "conference = journal in a hotel" and "chicken chicken chicken" it's fair to say that talks at conferences tend to draw only those who are particularly interested in the specifics of a work, and are less likely to draw casual attendees.

    A poster session on the other hand is like wandering through a nice bookstore: you can browse the topics as you see fit, and jump into anything that catches the eye. If you're a poster presenter, this is a great opportunity to convey a higher-level intuition for your work to an audience that might not be conversant in the specifics.

    It's all about the eyeballs !
    Everyone should make posters ! We are in an attention-based economy now, and your work gets known and disseminated only if you can get people's attention. A poster is an extremely effective way to communicate with your audience, especially if you use the visual medium effectively. Most conferences now have elaborate poster sessions, and they're a great way to meet people and hear about material you wouldn't otherwise have had time to assimilate.

    It's all about the networking !
    If you're a student, either presenting (or better yet, not presenting) at STOC, what better way to get conversations going with more senior researchers than huddling together with your fellow students, wondering if you can make enough eye contact to get an intro (yes, we do notice :)) ? I can almost guarantee that you'd have more meaningful interactions with researchers at a poster than at a talk where people have half an eye on their email.

    It's all about the impact !
    So you got a paper into STOC ! Congratulations - there's a nice CV bullet for you. But what's next ? You want people to read your paper, talk about it, argue about it, and build upon it. Don't you ? Again, a well designed poster can draw in attention much more effectively than a talk in our attention-deficit world, and more attention means more discussions, and more potential impact. 

    It's easy !!
    Now you surely don't believe me :). If you don't, consider your options. You can use LaTeX/beamer, and here's a fantastic resource to help. You can use Powerpoint or Inkscape, or if you're one of those fancy Mac people, you can use whatever fancy Mac tools I'm not cool enough to talk about.

    Best of all, you don't need to have the poster ready by the deadline of Mar 31. All you need is a short abstract. 

    So don't think ! Whether you have a paper or not, do consider submitting a poster - by Mar 31 - to stoc2012posters@gmail.com - and here are the details.

    Traffic...

    When I first started blogging, I obsessed over page views and visits and other metrics. Eventually I got over myself. But reading Igor's post about pageviews (congrats on the million !) got me wondering, and I decided to look at my sitemeter:

    Took me a while to get there, but it's nice to see regardless.

    Monday, February 27, 2012

    STOC 2012 Call For Posters

    This year, STOC 2012 will have a poster session, reprising the successful event from STOC 2011. The poster session will be held during the main conference, and should be thought of as an extended hallway discussion with a visual aid. It will be a separate session on its own, accompanied by refreshments. We welcome posters from attendees on research in any aspect of theoretical computer science, and believe presenting a poster should be especially appealing to:
    • Researchers with papers in other conferences that would be of interest to the STOC community.
    • STOC 2012 authors who want to have a visual aid for extended/related discussions of their work.
    • Students who wish to discuss their work in a broader context (for example, by presenting an overview of a line of research)
    Submission

    In order to present a poster, authors must email the following information to stoc2012posters@gmail.com by March 31, 2012, 5pm PT:
    • Title
    • A 1-2 paragraph abstract
    • Name of person presenting the poster
    • Whether the presenter is planning to register for STOC
    Note that poster submissions are not refereed. However, the committee reserves the right to turn down submissions deemed out of scope. If the number of poster submissions exceeds capacity, priority will be given to registered STOC attendees and secondarily according to the date of submission. Poster slots will be confirmed by Apr 10, 2012. If any slots remain, submissions received after the deadline may be considered at the discretion of the committee.

    Poster Preparation

    Posters should be designed to fit within a space that is 42in wide and 48in high, and may consist of a single printed poster or separate pieces. Mounted posterboards, as well as pushpins and tape for attaching your poster to the posterboard, will be provided at the conference.


    STOC 2012 Posters Committee
    Chandra Chekuri
    Sergei Vassilvitskii
    Suresh Venkatasubramanian

    Friday, January 20, 2012

    The Shonan Meeting (Part 3): Optimal Distributed Sampling

    Reservoir sampling is one of the beautiful gems of sampling: easy to explain, and almost magic in how it works. The setting is this:
    A stream of items passes by you, and your goal is to extract a sample of size $s$ that is uniform over all the elements you've seen so far.
    The technique works as follows: if you're currently examining the $i$-th element, select it with probability $s/i$, and if it's selected, pick an element uniformly from the current sample for it to replace. I've talked about this result before, including three different ways of proving it.
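    For reference, here's a minimal sketch of that classical scheme (Vitter's Algorithm R):

```python
import random

def reservoir_sample(stream, s):
    """Algorithm R: keep the i-th item with probability s/i,
    evicting a uniformly random member of the current sample."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            sample.append(item)
        elif random.random() < s / i:
            sample[random.randrange(s)] = item
    return sample

print(reservoir_sample(range(10**6), 5))  # 5 uniform picks, one pass
```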

    But what happens if you're in a continuous distributed setting ? Now each of $k$ players is reading a stream of items, and they all talk to a coordinator who wishes to maintain a random sample of the union of streams. Let's assume for now that $s \le k$.

    Each player can run the above protocol and send an item to the coordinator, and the coordinator can pick a random subset from these. But this won't work ! At least, not unless each player has read in exactly the same amount of data as each other player. This is because we need to weight the sample element sent by a player with the number of elements that player has read.

    It's not hard to see that each player sends roughly log n messages to the coordinator for a stream of length n. So maybe each player also annotates the element with the number of elements it has seen so far. This sort of works, but the counts could be off significantly, since a player that doesn't send a sample might have read many more elements since the previous time it sent an update.

    This can be fixed by having each player send an extra control message when its stream increases in size by a factor of 2, and that would not change the asymptotic complexity of the process, but we still don't get a truly uniform sample.

    The problem with this approach is that it's trying to get around knowing $n$, the size of the stream, which is expensive to communicate in a distributed setting. So can we revisit the original reservoir method in a 'communication-friendly' way ?

    Let's design a new strategy for reservoir sampling that works as follows.
    Maintain a current "threshold" t. When a new item arrives, assign it a random value r between 0 and 1. If r < t, keep the new item and set t = r, else discard it.
    By using the principle of deferred decisions, you can convince yourself that this does exactly the same thing as the previous strategy (because at step $i$, the probability that the current element is retained is the probability that its $r$ is the minimum over the set seen so far, which is $1/i$). The good thing is that this approach doesn't need to know how many elements have passed so far.
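    A minimal sketch of this threshold view, for a sample of size one (for size $s$ you'd keep the $s$ items with the smallest tags, say in a heap):

```python
import random

def min_tag_sample(stream):
    """Same distribution, different bookkeeping: tag each item with a
    uniform random value and keep the item with the smallest tag seen.
    No counter over the stream length is needed."""
    t, kept = 1.0, None
    for item in stream:
        r = random.random()
        if r < t:
            t, kept = r, item
    return kept
```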

    This approach can be extended almost immediately to the distributed setting. Each player now runs this protocol instead of the previous one, and every time the coordinator gets an update, it sends out a new global threshold (the minimum over all thresholds sent in) to all nodes. If you want to maintain a sample of size $s$, the coordinator keeps $s$ of the $k$ elements sent in, and the overall complexity is $O(ks \log n)$.

    But you can do even better.

    Now, each player maintains its own threshold. The coordinator doesn't broadcast the "correct" threshold until a player sends an element whose random value is above the global threshold. This tells the coordinator that the player had the wrong threshold, and it then updates that player (and only that player). A rough sketch follows.
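    Here's a rough single-process sketch of this lazy protocol (messages are simulated by method calls, and the data structures are my own illustration, not the paper's):

```python
import random

class Player:
    def __init__(self):
        self.threshold = 1.0  # local, possibly stale

class Coordinator:
    def __init__(self, s):
        self.s = s
        self.best = []  # the s (tag, item) pairs with smallest tags

    def global_threshold(self):
        return self.best[-1][0] if len(self.best) == self.s else 1.0

    def receive(self, player, r, item):
        self.best.append((r, item))
        self.best.sort()
        del self.best[self.s:]  # keep the s smallest tags
        if r > self.global_threshold():
            # stale message: correct this player (and only this player)
            player.threshold = self.global_threshold()

def observe(player, coord, item):
    r = random.random()           # the item's random tag
    if r < player.threshold:      # only then is a message sent
        coord.receive(player, r, item)

# toy run: 3 players, sample of size 2
coord = Coordinator(s=2)
players = [Player() for _ in range(3)]
for i in range(10000):
    observe(players[i % 3], coord, i)
print([item for _, item in coord.best])
```

    Since a player's local threshold is always at least the current global one, any element whose tag belongs in the final sample does get sent; the laziness only saves messages.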
    Analyzing this approach takes a little more work, but the resulting bound is much better:
    The (expected) amount of communication is $O(k \frac{\log (n/s)}{\log (k/s)})$
    What's even more impressive: this is optimal !

    This last algorithm, and the lower bound, were presented by Srikanta Tirthapura at the Shonan meeting, based on his DISC 2011 work with David Woodruff. Key elements of this result (including a broadcast-the-threshold variant of the upper bound) also appeared in a PODS 2010 paper by Muthu, Graham Cormode, Ke Yi and Qin Zhang. The optimal lower bound is new, and rather neat.

    Thursday, January 19, 2012

    SODA Review II: The business meeting

    Jeff Phillips posts a roundup of the SODA business meeting. Hawaii !!!!!!

    I thought I would also post a few notes on the SODA business meeting. I am sure I missed some details, but here are the main points.

    Everyone thought the organization of the conference was excellent (so far). The part in parentheses is a joke by Kazuo Iwama directed at his students - I guess that is Japanese humor, and encouragement.

    Despite the conference being outside North America for the first time, attendance was quite high - I think around 350 people - and the split was almost exactly 1/3 NA, 1/3 Europe, 1/3 Asia.

    Yuval Rabani talked about being PC chair for the conference. He said there were the most submissions ever, the most accepted papers ever, and the largest PC ever for SODA. Each PC member reviewed about 49 papers, and over 500 papers were sub-reviewed. We all thank Yuval for all of his hard work.

    We voted next on the location for 2014 (2013 is in New Orleans). The final vote was down to Honolulu, HI and Washington, DC, and Honolulu won about 60 to 49. David Johnson said they would try to book a hotel in Honolulu if he could get hotel prices below 200 USD/night. A quick look on kayak.com made it appear several large hotels could be booked next year around this time for about 160 USD/night or so. Otherwise it will be in DC, where the theory group at the University of Maryland (via David Mount) has stated it would help with local arrangements. They did a great job with SoCG a few years ago, but I heard many suggestions that it be held downtown rather than near the UMD campus. And there were also requests for good weather. We'll see what happens...

    Finally, there was a discussion about how SODA is organized/governed. This discussion got quite lively. Bob Sedgewick led the discussion by providing a short series of slides outlining a rough plan for a "confederated SODA." I have linked to his slides. This could mean several things, for instance:
    • Having ALENEX and ANALCO (and SODA) talks spread out over 4 days and intermixed even possibly in the same session (much like ESA).
    • The PCs would stay separate most likely (although merging them was discussed, but this had less support). 
    • For SODA the PC could be made more hierarchical, with, say, 6 main area chairs. Each area chair would supervise say 12 or so PC members. The general chair would coordinate and normalize all of the reviews, but otherwise the work would be more hierarchical and partitioned. PC members in each area would then have fewer papers to review, and could even submit to other subareas. 
    • There was also a suggestion that PC chairs / steering committee members have some SODA attendance requirements. (Currently the steering committee consists of David Johnson, 2 people appointed by SIAM, and the past two PC chairs, as I understand it. David Johnson said he would provide a link to the official SODA bylaws somewhere.) 
    Anyways, there was a lot of discussion that was boiled down to 3 votes (I will try to paraphrase, all vote totals approximate):
    • Should the steering committee consider spreading ALENEX, ANALCO, and SODA talks over 4 days? About 50 to 7 in favor. 
    • Should the steering committee consider/explore some variant of the Confederated SODA model? About 50 to 2 in favor.
    • Should the steering committee consider making the steering committee members elected? About 50 to 1 in favor. 
    There were about 100+ votes for location, so roughly half the crowd abstained on these votes. There were various arguments on either side of the positions, and other suggestions. Some people had strongly held and well-argued positions on these discussion points, so I don't want to try to paraphrase (and probably get something nuanced wrong), but I encourage people to post opinions and ideas in the comments.

    Wednesday, January 18, 2012

    SODA review I: Talks, talks and more talks.

    I asked Jeff Phillips (a regular contributor) if he'd do some conference posting for those of us unable to make it to SODA. Here's the first of his two missives.

    Suresh requested I write a conference report for SODA. I never know how to write these reports since I always feel like I must have left out some nice talk/paper and then I risk offending people. The fact is, there are 3 parallel sessions, and I can't pay close attention to talks for 3 days straight, especially after spending the previous week at the Shonan meeting that Suresh has been blogging about.

    Perhaps it is apt to contrast it with the Shonan meeting. At Shonan there were many talks (often informal, with much back and forth) on topics very well clustered around "Large-scale Distributed Computation". There were several talks earlier in the workshop that just laid out the main techniques that have become quite powerful within an area, and then there were new talks on recent breakthroughs. But although we mixed up the ordering of subtopics a bit, there was never that far of a context switch, and you could see larger views coalescing in people's minds throughout the week.

    At SODA, the spectrum is much more diverse - probably the most diverse conference on the theoretical end of computer science. The great thing is that I get to see colleagues across a much broader spectrum of areas. But the talks are often a bit more specific, and despite having usually fairly coherent sessions, the context switches are typically quite a bit larger and it seems harder to stay focused enough to really get at the heart at what is in each talk. Really getting the point requires both paying attention and being in the correct mind set to start with. Also, there are not too many talks in my areas of interest (i.e. geometry, big data algorithmics).



    So then what is there to report ? I've spent most of my time in the hallways, catching up on gossip (which either is personal, or I probably shouldn't blog about without tenure - or even with tenure), or discussing ongoing or new research problems with friends (again not yet ready for a blog). Of the talks I saw, I generally captured vague notions or concepts, usually stored away for when I think about a related problem and need to make a similar connection, or look up a technique in the paper. And although I was given a CD of the proceedings, my laptop has no CD drive. For the reasons discussed above, I rarely completely get how something works from a short conference talk. Here are some example snippets of what I took away from a few talks:

    Private Data Release Via Learning Thresholds | Moritz Hardt, Guy Rothblum, Rocco A. Servedio.
    Take-away : There is a deep connection between PAC learning and differential privacy. Some results from one can be applied to the other, but perhaps many others can be as well.
    Submatrix Maximum Queries in Monge Matrices and Monge Partial Matrices, and Their Applications | Haim Kaplan, Shay Mozes, Yahav Nussbaum and Micha Sharir
    Take-away: There is a cool "Monge" property that matrices can have which makes many subset query operations more efficient. It can be thought of as each row representing a pseudo-line. Looks useful for matrix problems where the geometric intuition about what the columns mean is relevant. 
    Analyzing Graph Structure Via Linear Measurements | Kook Jin Ahn, Sudipto Guha, Andrew McGregor
    Take-away : They presented a very cool linear sketch for graphs. This allows several graph problems to be solved in streaming (or similar) models, the way more abstract, if not geometric, data usually is. (ed: see my note on Andrew's talk at Shonan)

    Lsh-Preserving Functions and Their Applications | Flavio Chierichetti, Ravi Kumar
    Take-away: They present a nice characterization of similarities (based on combinatorial sets), showing which ones can and cannot be used within an LSH framework. Their techniques seemed to be a bit more general than these discrete similarities over sets, so if you need to use this for another similarity, it may be good to check out in more detail. 

    Data Reduction for Weighted and Outlier-Resistant Clustering | Dan Feldman, Leonard Schulman
    Take-away: They continue to develop the understanding of what can be done for coresets using sensitivity-based analysis. This helps outline not just which functions can be approximated with subsets as proxies, but also how the distribution of points affects these results. The previous talk by Xin Xiao (with Kasturi Varadarajan, on A Near-Linear Algorithm for Projective Clustering Integer Points) also used these concepts. 

    There were many other very nice results and talks that I also enjoyed, but the take-away was often even less interesting to blog about. Or sometimes they just made more progress towards closing a specific subarea. I am not sure how others use a conference, but if you are preparing your talk, you might consider trying to build a clear, concise take-away message into it, so that people like me with finite attention spans can remember something precise out of it, and are more likely to look carefully at the paper the next time we work on a related problem.

    Upcoming deadline: Multiclust 2012

    One of the posts queued up in the clustering series is a note on 'metaclustering', which starts with the idea that instead of looking for one good clustering, we should be trying to explore the space of clusterings obtained through different methods and pick out good solutions informed by the landscape of answers. While the concept has been around a while, the area has attracted much more interest in recent years, with two workshops on the topic at recent data mining conferences.

    I'm involved with the organization of the latest incarnation, to be held in conjunction with SDM 2012 in April. If you have
    unpublished original research papers (of up to 8 pages) that are not under review elsewhere, vision papers and descriptions of work-in-progress or case studies on benchmark data as short paper submissions of up to 4 pages
    then you should submit it ! The deadline is Jan 25.

    Tuesday, January 17, 2012

    The Shonan Meeting (Part 2): Talks review I

    I missed one whole day of the workshop because of classes, and also missed a half day because of an intense burst of slide-making. While I wouldn't apologize for missing talks at a conference, it feels worse to miss them at a small focused workshop. At any rate, the usual disclaimers apply: omissions are not due to my not liking a presentation, but because of having nothing even remotely intelligent to say about it.

    Jeff Phillips led off with his work on mergeable summaries. The idea is that you have a distributed collection of nodes, each with their own data. The goal is to compute some kind of summary from all the nodes, with the caveat that each node only transmits a fixed-size summary to other nodes (or the parent in an implied hierarchy). What's tricky about this is keeping the error down. It's easy to see for example that $\epsilon$-samples compose - you could take two $\epsilon$-samples and take an $\epsilon$-sample of their union, giving you a $2\epsilon$-sample over the union. But you want to keep the error fixed AND the size of the sample fixed. He showed a number of summary structures that can be maintained in this mergeable fashion, and there are a number of interesting questions that remain open, including how to do clustering in a mergeable way.
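    As one concrete example of a mergeable summary, here's a sketch of the Misra-Gries heavy-hitters summary and its merge, which keeps both the size ($k$ counters) and the error guarantee fixed (the code is my illustration, not taken from the paper):

```python
from collections import Counter

def mg_summary(stream, k):
    """Misra-Gries: keep at most k counters; each item's count is
    underestimated by at most n/(k+1)."""
    c = Counter()
    for x in stream:
        if x in c or len(c) < k:
            c[x] += 1
        else:  # decrement everyone; drop counters that hit zero
            for y in list(c):
                c[y] -= 1
                if c[y] == 0:
                    del c[y]
    return c

def mg_merge(c1, c2, k):
    """Merge two size-k summaries without the error compounding: add
    the counters, then subtract the (k+1)-th largest value from all
    and drop anything non-positive."""
    c = c1 + c2
    if len(c) > k:
        cut = sorted(c.values(), reverse=True)[k]
        c = Counter({x: v - cut for x, v in c.items() if v > cut})
    return c

a = mg_summary([1, 1, 2, 3, 1, 2], k=2)
b = mg_summary([1, 4, 4, 1, 4], k=2)
print(mg_merge(a, b, k=2))
```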

    In the light of what I talked about earlier, you could think of the 'mergeable' model as a restricted kind of distributed computation, where the topology is fixed, and messages are fixed size. The topology is a key aspect, because nodes don't encounter data more than once. This is good, because otherwise the lack of idempotence of some of the operators could be a problem: indeed, it would be interesting to see how to deal with non-idempotent summaries in a truly distributed fashion.

    Andrew McGregor talked about graph sketching problems (sorry, no abstract yet). One neat aspect of his work is that in order to build sketches for graph connectivity, he uses a vertex-edge representation that essentially looks like the cycle-basis vector in the 1-skeleton of a simplicial complex, and exploits the homology structure to compute the connected components (aka $\beta_0$). He also uses the bipartite double cover trick to reduce bipartiteness testing to connected component computation. It's kind of neat to see topological methods show up in a useful way in these settings, and his approach probably extends to other homological primitives.

    Donatella Firmani and Luigi Laura talked about different aspects of graph sketching and MapReduce, studying core problems like the MST and bi/triconnectivity. Donatella's talk in particular had a detailed experimental study of various MR implementations for these problems, and had interesting (but preliminary) observations about the tradeoff between the number of reducers and the amount of communication needed.

    This theme was explored further by Jeff Ullman in his talk on one-pass MR algorithms (the actual talk title was slightly different, since the unwritten rule at the workshop was to change the name of the title from the official listing). Again, his argument was that one should be combining both the communication cost and the overall computation cost. A particularly neat aspect of his work was showing (for the problem of finding a particular shaped subgraph in a given large graph) when there was an efficient one-pass MR algorithm, given the existence of a serial algorithm for the same problem. He called such algorithms convertible algorithms: one result type is that if there's an algorithm running in time $n^\alpha m^\beta$ for finding a particular subgraph of size $s$, and $s \le \alpha + 2\beta$, then there's an efficient MR algorithm for the problem (in the sense of total computation time being comparable to the serial algorithm).


    The Shonan Meeting (Part 1): In the beginning, there was a disk...

    What follows is a personal view of the evolution of large-data models. This is not necessarily chronological, or even reflective of reality, but it's a retroactive take on the field, inspired by listening to talks at the Shonan meeting.

    Arguably, the first formal algorithmic engagement with large data was the Aggarwal-Vitter external memory model from 1988. The idea was simple enough: accessing an arbitrary element of disk was orders of magnitude more expensive than accessing an element of main memory, so let's ignore main memory access and charge a single unit for accessing a block of disk.

    The external memory model was (and is still) a very effective model of disk access. It wasn't just a good guide to thinking about algorithm design, it also encouraged design strategies that were borne out well in practice. One could prove that natural-sounding buffering strategies were in fact optimal, and that prioritizing sequential scans as far as possible (even to the extent of preparing data for sequential scans) was more efficient. Nothing earth-shattering, but a model that guides (and conforms to) proper practice is always a good one.

    Two independent directions spawned off from the external memory model. One direction was to extend the hierarchy. Why stop at one main memory level when we have multilevel caches ? A simple extension to handle caches is tricky, because the access time differential between caches and main memory isn't sufficient to justify the idealized "0-1" model that the EM model used. But throwing in another twist - I don't actually know the correct block size for transfer of data between hierarchy levels - led us to cache-obliviousness.

    I can't say for sure whether the cache-oblivious model speaks to the practice of programming with caches as effectively as the EM model. Being aware of your cache can bring significant benefits in principle. But the design principles ("repeated divide and conquer, and emphasizing locality of access") are sound, and there's already at least one company (Tokutek, founded by Martin Farach-Colton, Michael Bender and Bradley Kuszmaul) that is capitalizing on the performance yielded by cache-oblivious data structures.

    The other direction was to weaken the power of the model. Since a sequential scan was so much more efficient than random access to disk, a natural question was to ask what you could do with just one scan. And thus was born the streaming model, which is by far the most successful model for large-data to date, with theoretical depth and immense practical value.

    What we've been seeing over the past few years is the evolution of the streaming model to capture ever more complex data processing scenarios and communication frameworks.

    It's quite useful to think of a stream algorithm as "communicating" a limited amount of information (the working memory) from the first half of the stream to the second. Indeed, this view is the basis for communication-complexity-based lower bounds for stream algorithms.

    But if we think of this as an algorithmic principle, we then get into the realm of a distributed computation, where one player possesses the "first half" of the data, the other player has "the second half", and the goal is for them to exchange a small number of bits with each other in order to compute something (while streaming is one-way communication, a multi-pass streaming algorithm is a two-way communication).

    Of course, there's nothing saying that we only have two players. This gets you to the $k$-player setup for a distributed computation, in which you wish to minimize the amount of communication exchanged as part of a computation. This model is of course not new at all ! It's exactly the distributed computing model pioneered in the 80s and 90s and has a natural home at PODC. What appears to be different in its new uses is that the questions being asked are not the old classics like leader election or byzantine agreement, but statistical estimation on large data sets. In other words, the reason to limit communication is because of the need to process a large data set, rather than the need to merely coordinate. It's a fine distinction, and I'm not sure I entirely believe it myself :)

    There are many questions about how to compute various objects in a distributed setting: of course the current motivation is to do with distributed data centers, sensor networks, and even different cores on a computer. Because of the focus on data analysis, there are sometimes surprising results that you can prove. For example, a recent SIGMOD paper by Zengfeng Huang, Lu Wang, Ke Yi, and Yunhao Liu shows that if you want to do quantile estimation, you only need communication that's sublinear in the number of players ! The trick here is that you don't need to have very careful error bounds on the estimates at each player before sending up the summary to a coordinator.

    It's also quite interesting to think about distributed learning problems, where the information being exchanged is specifically in order to build a good model for whatever task you're trying to learn. Some recent work that I have together with Jeff Phillips, Hal Daume and my student Avishek Saha explores the communication complexity of doing classification in such a setting.

    An even more interesting twist on the distributed setting is the so-called 'continuous streaming' setting. Here, you don't just have a one-shot communication problem. Each player receives a stream of data, and now the challenge is not just to communicate a few bits of information to solve a problem, but to update the information appropriately as new input comes in. Think of this as streaming with windows, or a dynamic version of the basic distributed setting.

    Here too, there are a number of interesting results, a beautiful new sampling trick that I'll talk about next, and some lower bounds.

    I haven't even got to MapReduce yet, and how it fits in: while you're waiting, you might want to revisit this post.

    Sunday, January 15, 2012

    The Shonan Meeting (Part 0): On Workshops

    Coming up with new ideas requires concentration and immersion. When you spend enough unbroken time thinking about a problem, you start forming connections between thoughts, and eventually you get a giant "connected component" that's an actual idea.

    Distractions, even technical ones, kill this process. And this is why even at a focused theory conference, I don't reach that level of "flow". While I'm bombarded from all directions by interesting theory, there's a lot of context switching. Look, a new TSP result ! Origami and folding - fun ! Who'd have thought of analyzing Jenga ! Did someone really prove that superlinear epsilon net lower bound ?

    This is why focused workshops are so effective. You get bombarded with information for sure, but each piece reinforces aspects of the overall theme if it's done well. Slowly, over the course of the event, a bigger picture starts emerging, connections start being made, and you can feel the buzz of new ideas.

    And this is why the trend of 'conferencizing' workshops, that Moshe Vardi lamented recently, is so pernicious. It's another example of perverse incentives ("conferences count more than workshops for academic review, and so let's redefine a workshop as a conference"). A good workshop (with submitted papers or otherwise) provides focus and intensity, and good things come of it. A workshop that's really just a miniconference has neither the intense intimacy of a true workshop nor the quality of a larger symposium.

    All of this is a very roundabout way of congratulating Muthu, Graham Cormode and Ke Yi (ed: Can we just declare that Muthu has reached exalted one-word status, like Madonna and Adele ? I can't imagine anyone in the theory community hearing the name 'Muthu' and not knowing who that is) for putting on a fantastic workshop on Large-Scale Distributed Computing at the Shonan Village Center (the Japanese Dagstuhl, if you will). There was reinforcement, intensity, the buzz of new ideas, and table tennis ! There was also the abomination of fish-flavored cheese sticks, of which nothing more will be said.

    In what follows, I'll have a series of posts from the event itself, with a personal overview of the evolution of the area, highlights from the talks, and a wrap up. Stay tuned...

    Tuesday, January 03, 2012

    Teaching review I: Programming

    Please note: I'm in Boston this week at the Joint Math Meetings (6100 participants and counting!) for a SIAM minisymposium on computational geometry. If you happen to be in the area for the conference, stop by our session on Thursday at 8:30am.

    I always assign programming questions in my graduate algorithms class. In the beginning, I used the ACM programming competition server, and also tried the Sphere Online system. This semester, I didn't do either, designing my own homeworks.

    Why do I ask students to code ? There are many reasons it's a good idea to get students to program in a graduate algorithms class. There are some additional reasons in my setting: a majority of the students in my class take it as a required course for the MS or Ph.D program. They don't plan on doing research in algorithms, many of them are looking for jobs at tech companies, and many others will at best be using algorithmic thinking in their own research.

    Given that, the programming assignments I gave this semester focused on the boundary between theory and practice (or, where the O() notation ends). In principle, the goal of each assignment on a specific topic was to convert the theory (dynamic programming, prune-and-search, hashing, etc.) into practice by experimenting with different design choices and seeing how they affected the overall running time.

    Designing assignments that couldn't be solved merely by downloading some code was particularly tricky. As Panos Ipeirotis recommended in his study of cheating in the classroom (original post deleted; here's the post-post), the key is to design questions whose answers require some exploration after you've implemented the algorithm. I was reasonably happy with my hashing assignment, in which students were asked to experiment with different hashing strategies (open addressing, chaining, cuckoo hashing), measure the hash table's behavior for different parameter choices, and draw their own conclusions. There was a fair degree of consistency in the reported results, and I got the feeling that the students (and I!) learned something lasting about the choices one makes in hashing (for example, simple cuckoo hashing degrades rapidly beyond a 70% load factor).
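
    As a flavor of what this kind of experiment looks like, here is a minimal sketch (in Python, and emphatically not the actual assignment) of plain two-table cuckoo hashing that inserts random keys until an insertion fails, and reports the load factor at that point. The table size, the hash family, and the eviction cap are all illustrative assumptions:

        import random

        SIZE = 1 << 14      # slots per table (illustrative choice)
        MAX_KICKS = 32      # give up (i.e., "would trigger a rehash") after this many evictions

        def make_hash():
            # A simple randomized hash function into [0, SIZE); any universal
            # family would serve the same purpose here.
            a, b = random.randrange(1, 1 << 31), random.randrange(1 << 31)
            return lambda x: ((a * x + b) % 2147483647) % SIZE

        h1, h2 = make_hash(), make_hash()
        t1, t2 = [None] * SIZE, [None] * SIZE

        def insert(key):
            # Standard cuckoo insertion: place the key, evicting any occupant
            # to its alternate table, until we succeed or give up.
            tables, hashes, side = (t1, t2), (h1, h2), 0
            for _ in range(MAX_KICKS):
                idx = hashes[side](key)
                if tables[side][idx] is None:
                    tables[side][idx] = key
                    return True
                tables[side][idx], key = key, tables[side][idx]  # evict, and carry the evictee
                side = 1 - side
            return False

        random.seed(0)
        inserted, capacity = 0, 2 * SIZE
        while insert(random.randrange(1 << 40)):
            inserted += 1
        print("first insertion failure at load factor %.2f" % (inserted / capacity))

    Varying MAX_KICKS, the number of hash functions, or the bucket size moves the failure point around substantially, which is precisely the kind of exploration such an assignment can ask for.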

    I did have a few disasters, and learned some important lessons.
    • If I expect to run student code on my test cases, I HAVE to provide a test harness into which they merely insert a subroutine (a minimal harness sketch appears after this list); specifying input and output formats is not enough. Students are not as familiar with a simple UNIX command-line interface as I expected, and they use pre-canned libraries in ways that make uniform testing almost impossible. Note that this is not language-independent, unless I'm willing to provide harnesses for the myriad different languages people use.
    • If you do want to accept code without using the 'subroutine-inside-harness' form, then you need some kind of testing environment where they can submit the code and get answers. This is essentially what ACM/TopCoder/SphereOnline do. 
    • If you're accepting code from students, use Moss or something like it. It's great for detecting suspiciously similar code. When I used it and detected duplicates, in every case the students admitted to working together.
    • It's easier to design assignments where the submitted product is not code but some analysis of performance (as in the hashing assignment above). Firstly, this is platform-independent, so I don't care whether people use Java, C, C++, C#, Python, Perl, Ruby, Scheme, Racket, Haskell, ... Secondly, it avoids the issue of "did you write the code yourself" - not entirely, but partly.
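
    For concreteness on the first point, here is a minimal sketch of the harness idea: the instructor owns all I/O, parsing, and timing, and students replace a single stub. The input format (a count followed by that many integers) and the function name solve() are made up for illustration, not taken from my actual assignments:

        import sys
        import time

        def solve(values):
            # Student-supplied subroutine: replace this stub with the real algorithm.
            return max(values)

        def main():
            # The harness fixes the I/O contract, so every submission is tested identically.
            data = sys.stdin.read().split()
            n = int(data[0])
            values = [int(x) for x in data[1:n + 1]]
            start = time.perf_counter()
            answer = solve(values)
            elapsed = time.perf_counter() - start
            print(answer)
            print("elapsed: %.3fs" % elapsed, file=sys.stderr)

        if __name__ == "__main__":
            main()

    A Python harness like this only solves the problem for Python, of course; per the note above, one would need an analogous harness for every language allowed.
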
    But I'll definitely do this again. Firstly, it's good for the students: from what I hear, having written actual code for dynamic programming makes it much easier for them to write code under pressure during their tech interviews, and I've heard that knowledge of dynamic programming and network flows is a key separator in tech hiring. Secondly, even for theory students, it helps illustrate where asymptotic analysis is effective and where it breaks down. I think this is invaluable in helping theoreticians think beyond galactic algorithms.

    Of course, no discussion of programming in algorithms classes is complete without a reference to Michael Mitzenmacher's exhortation.

    Monday, December 26, 2011

    On PC members submitting papers

    Update: Michael Mitzenmacher's posts (one, two, and three, and the resulting comments) on implementing CoI at STOC are well worth reading (thanks, Michael). The comments there make me despair that *any* change will ever be implemented before the next century, but given that we've been able to make some changes already (electronic proceedings, contributed workshops, and so on), I remain hopeful.


    For all but theory researchers, the reaction to the statement in the title is usually "don't they always?". In theoryCS, we pride ourselves on not having PC members submit papers to conferences. What ends up happening is:
    • You can't have too many PC members on a committee because otherwise there won't be enough submissions
    • The load on each PC member is much larger than reasonable (I'm managing 41 papers for STOC right now, and it's not uncommon to hit 60+ for SODA)
    There's an ancillary effect of the first point: theory folks have fewer 'PC memberships' on their CVs, which can cause problems in academic performance reviews. But this is a classic Goodhart's Law issue, so I won't worry about it.

    The main principle at play here is that we don't want potentially messy conflicts, or complex conflict-management issues, if PC members submit papers. However, it seems to me that the practice of how we review papers is far removed from this principle.

    Consider: I get an assignment of X papers to review if I'm on a conference PC. I then scramble around finding subreviewers for a good fraction of the papers I'm assigned (I used to do this less, but I eventually realized that a qualified subreviewer is FAR better than me in most subareas outside my own expertise, and is better for the paper).

    Note (and this is important) that my subreviewers have typically submitted papers to this conference (although I don't check), and I rely on them to declare any conflicts as per conference guidelines.

    Subreviewers also get requests from different PC members, and some subreviewers might themselves review 3-4 papers.

    Compare this to (say) a data mining conference: there are 30+ "area chairs" or "vice chairs" and over 200 PC members. PC members each review between 5 and 10 papers, and often don't even know who the other reviewers are (although they can see their reviews once they're done). The area/vice chairs manage 20-30 papers each, and their job is to study the reviews, encourage discussion as needed, and formulate the final consensus decision and 'meta-review'.


    If you set "theory subreviewer = PC member" and "theory PC member = vice chair", you get systems that aren't significantly different. The main differences are:
    • theory subreviewers typically don't get to see the other reviews of a paper, so they assign scores in a vacuum
    • theory PC members are expected to produce a review for a paper taking the subreviewer comments into account (as opposed to merely scrutinizing the reviews being provided)
    • managing reviewer comments for 30 papers is quite different from generating 30 reviews yourself (even with subreviewer help)
    • A downside of the two-tier PC system is that there isn't the same global view of the entire pool that a theory PC gets. But this is more a convention than a rule: there's nothing stopping a PC from opening up discussions to all vice chairs.
    • One advantage of area chairs is that all papers in a given area get at least one common (re)viewer; that's not necessarily the case in a theory PC without explicit coordination from the PC chair and the committee itself.
    But the main claimed difference (that people submitting papers don't get to review them) is false. Even worse, when submitters do review papers, it happens 'under the table', without the strict conflict management that comes with explicit PC membership.

    We're dealing with problems of scale in all aspects of the paper review and evaluation process. This particular one though could be fixed quite easily.
