Thursday, June 28, 2012
Geometry in the Field: Talk Slides
Jeff Phillips and I organized a workshop on Geometry in the Field at SoCG 2012. I haven't blogged about the workshop yet because Jeff and I are working on a more detailed report for the SIGACT Computational Geometry column; I'll post it here when it's done.
Many people have asked me when the talk slides would be available, and I'm happy to announce that all the slides are now up at the workshop site.
I strongly recommend looking them over if you have the time: these were excellent talks by great speakers, giving a masterful overview of their specific areas as well as challenges for the future.
Friday, June 22, 2012
CGWeek II
Some highlights from day 2 of CGWeek:
On Laplacians of Random Complexes, by Anna Gundert and Uli Wagner.
The Cheeger inequality is a well known inequality in spectral graph theory that connects the "combinatorial expansion" of a graph with "spectral expansion". Among other things, it's useful for clustering, because you can split a graph using Cheeger's inequality to find a "thin cut" and then repeat. There's been work recently on a "higher-order" Cheeger inequality, that allows you to cut a graph into $k$ pieces instead of two by connecting this to observations about the first $k$ eigenvectors of the Laplacian.
The above paper generalizes the notion of combinatorial expansion versus spectral expansion in a very different way. Think of a graph as the 1-skeleton of a simplicial complex (i.e., the set of faces of dimension at most 1). Is there a way to define the Laplacian of the complex itself ? And is there then an equivalent of Cheeger's inequality ?
It turns out that this can indeed be done. Recent work by Gromov and others has shown how to define the notion of "edge expansion" for a simplicial complex. Roughly speaking, you compute the edge expansion of a cochain (a function on chains) by endowing it with a norm and then looking at the norm of the edge operator with respect to the norm (sort of) of the cochain itself. What is interesting is that if you choose the underlying coefficient field to be $\mathbb{R}$ and the norm to be $\ell_2$, you get the spectral notion of expansion, and if you choose instead $\mathbb{Z}_2$ and the Hamming distance, you get the combinatorial one.
It's known that for 1-skeletons, and even for things like persistence, the underlying field used to define homology doesn't really matter. For this problem, however, it matters a lot. The authors show that there is no equivalent of Cheeger's inequality for simplicial complexes ! They also look at random complexes and analyze their properties (just as we can for random graphs).
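The simplicial version is hard to demo in a few lines, but the classical graph statement is easy to sanity-check numerically. Here's a minimal sketch (assuming numpy; the choice of graph and the names are mine) that brute-forces the conductance $h(G)$ of a small cycle and checks the sandwich $\lambda_2/2 \le h(G) \le \sqrt{2\lambda_2}$ for the normalized Laplacian:

    import numpy as np
    from itertools import combinations

    # build the cycle C_8 and its normalized Laplacian L = I - D^{-1/2} A D^{-1/2}
    n = 8
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
    deg = A.sum(axis=1)
    L = np.eye(n) - A / np.sqrt(np.outer(deg, deg))
    lam2 = np.sort(np.linalg.eigvalsh(L))[1]           # second-smallest eigenvalue

    def conductance(S):
        # weight of edges leaving S, divided by the smaller side's volume
        S = set(S)
        cut = sum(A[i, j] for i in S for j in range(n) if j not in S)
        volS = sum(deg[i] for i in S)
        return cut / min(volS, deg.sum() - volS)

    # brute-force h(G) over all cuts with |S| <= n/2
    h = min(conductance(S) for k in range(1, n // 2 + 1)
            for S in combinations(range(n), k))
    print(lam2 / 2 <= h <= np.sqrt(2 * lam2))          # True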
Add Isotropic Gaussian Kernels at Own Risk: More and More Resilient Modes in Higher Dimensions, by Edelsbrunner, Fasy and Rote
Suppose you have three Gaussians in the plane, and you look at the resulting normalized distribution. You expect to see three bumps (modes), and if the Gaussians merge together, you'd expect to see the modes come together in a supermode.
Can you ever get more than three modes ?
This is the question the above paper asks. It was conjectured that this cannot happen, and in fact in 2003 it was shown that it was possible to get 4 modes from three Gaussians (you can get a little bump in the middle as the three Gaussians pull apart). In this paper, they show that in fact you can get a super-linear number of "bumps" or critical points for $n$ Gaussians, and these modes are not transient - they "persist" in a certain sense.
This is quite surprising in and of itself. But it's also important. A plausible approach to clustering a mixture of Gaussians might look for density modes and assign cluster centers there. What this paper says is that you can't really do that, because of the "ghost" centers that can appear.
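If you want to hunt for a ghost mode yourself, here's a rough numerical sketch. The equilateral-triangle configuration is my guess at the simplest setting for a 2003-style counterexample, and the grid and bandwidth scan are coarse: the extra central bump, when it exists, is shallow and lives in a narrow band of $\sigma$, so expect to refine both.

    import numpy as np

    # equal-weight mixture of three isotropic Gaussians at the vertices
    # of a unit equilateral triangle
    centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
    xs = np.linspace(-0.8, 1.8, 261)
    X, Y = np.meshgrid(xs, xs)

    def density(sigma):
        f = np.zeros_like(X)
        for cx, cy in centers:
            f += np.exp(-((X - cx) ** 2 + (Y - cy) ** 2) / (2 * sigma ** 2))
        return f

    def count_grid_modes(f):
        # interior grid points strictly larger than all 8 neighbors
        c = f[1:-1, 1:-1]
        mask = np.ones_like(c, dtype=bool)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if dx or dy:
                    mask &= c > f[1 + dx:-1 + dx or None, 1 + dy:-1 + dy or None]
        return int(mask.sum())

    # scan bandwidths: 3 modes when sigma is small, 1 when it's large,
    # and (for some narrow window in between) possibly a fourth at the centroid
    for sigma in np.arange(0.30, 0.70, 0.02):
        print(round(sigma, 2), count_grid_modes(density(sigma)))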
Other quick hits:
- Chris Bishop ran a very interesting workshop on Analysis and Geometry. While I couldn't attend all the talks, the first one, by Raanan Schul, gave an overview of the analytic TSP, in which you're given a (possibly infinite) set of points in the plane and want to know if there's a finite-length curve that passes through all of them. The techniques used here relate to multiscale analysis of data and things like local PCA.
- One of the cool innovations in CGWeek was the fast-forward session: each afternoon before the workshops started, speakers were allowed to give a 1-slide overview of what would happen in their events. The slides were all placed in a single presentation ahead of time, and it was clocked so that people couldn't run on. It was great - I didn't have to do careful planning, and got a much better idea of what was coming up. We should do this for all talks at the conference !
Sunday, June 17, 2012
CGWeek Day I
CGWeek day 1 has my head spinning. Attending a mix of conference talks and workshops is something I'm used to in the DB/data mining world, but not in theory conferences. It's refreshing, exhilarating, and exhausting, all at the same time.
There were a number of great talks today at the conference and the workshops -- too many to do justice to. Here are some random thoughts:
- This year, there's a best student presentation award. My student Amirali Abdullah is one of the competitors, so I don't want to say too much about it, but I'll say that everyone upped their game - the presentations by students were great ! Maybe we should have best presentation awards in general to improve everyone's presentation skills !
- Amir's talk went pretty well (if I say so myself). It was his first conference talk, but I thought it was well above the average (maybe this says something about the average :)). The talk was about approximate Bregman near neighbors.
- Very intriguing paper by Adamaszek and Stacho on the connection between homology structure of flag complexes and matchings that induce maximal independent sets.
- The workshop on geometric learning was fantastic. Lots to think about on multiscale representations, and distances to measures.
- Jeff Erickson has a four part liveplussing (one, II, trois, fear) of the business meeting
- Most importantly: Rio in 2013 !! Kyoto in 2014 (boo) !! OSU no way (I kid, I kid).
- Even more importantly: Herbert proposes changing the name of the conference to "Symposium on Computational Geometry and Topology". Major concern - what about the sausage ? Audience votes confusedly after Herbert promises a Klein bottle of beer for everyone.
A Reflection On Big Data (Guest post)
Ed: Graham Cormode kindly agreed to send a missive from the recently concluded big data workshop at Duke
There seems to be no shortage of Big Data events at the moment, reflecting some of the enthusiasm around this topic. Last week there were meetings at NIST and at Duke, and there are upcoming events at Stanford and SAMSI, to name but a few. (Ed: also the new Simons Foundation program on big data)
I attended the Duke event, and was hoping to blog about the meeting, but suffered a big data problem of my own: with 25 speakers, plus panels and breakout sessions, how to pull out any meaningful summary? So instead, I'll just pick on one issue that was the subject of much discussion: what exactly is new about big data? In particular, there has been much interest in what we called 'massive' data over the last years (as captured by MADALGO, the center for massive data algorithmics and the accompanying MASSIVE workshop). Is Big Data just a rebranding of Massive Data? Which is bigger, Big Data, Massive Data, or Very Large Data ?
What came out from our discussions is that we believe that Big Data is qualitatively different from the challenges that have come before. The major change is the increasing heterogeneity of data that is being produced. Previously, we might have considered data that was represented in a simple form, as a very high-dimensional vector, over which we want to capture key properties. While Big Data does include such problems, it also includes many other types of data: database tables, text data, graph data, social network data, GPS trajectories, and more. Big Data problems can consist of a mixture of examples of such data, not all of which are individually "big", but making sense of which represents a big problem. Addressing these challenges requires not only algorithmics and systems, but machine learning, statistics and data mining insights.
This is by no means a new observation: the axes of Volume, Velocity and Variety have been widely discussed before. But it did remind me that we should not get too hung up on how big our big data is, since size is not the sole defining characteristic of handling big data (although it is a necessary characteristic).
Many other familiar topics came up in discussions, such as: how can big data sets be made more accessible to researchers? how can privacy concerns with big data be addressed? what are the right models for big data, for both the data and the computation? what are the key techniques for working with big data? do we even know enough to teach a course on big data?
What struck me about these discussions is that not only were we far from reaching a consensus on any point, but also no one seemed to think they had any strong candidate solutions. Consequently, I left assured that Big Data is much more than a rebranding of existing work. It represents a significant research agenda that will challenge us for years to come.
Friday, June 08, 2012
Geometry in the Field: Abstracts are up !
As Jeff mentioned, we're organizing a workshop at SoCG called "8F-Computational Geometry" that looks at the impact computational geometry has had "in the field", both past and future. It was quite difficult to narrow things down to 8 speakers, and even harder to list only four areas where CG has already had impact.
The abstracts for the talks are now up at the workshop page. We hope to see you there !
Wednesday, June 06, 2012
Mihai Patrascu, R.I.P.
I just got word from Mikkel Thorup that Mihai Patrascu passed away. I had heard that he was ill, and only realized the seriousness of his condition when I saw him at STOC 2012. He came in a wheelchair to be honored with the Presburger Award, and the entire room gave him a standing ovation.
I can only hope that he's out there proving lower bounds in the sky, and arguing fiercely with the clouds while he does so.
Update: There is now a memorial page for Mihai at mipmemorial.blogspot.com. Please visit there to post your comments, and have a drink for Mihai !
Friday, June 01, 2012
Minimum language complexity needed to construct exponential time algorithm
(while this could have been a question on cstheory, I think it's a little too vague - but it's perfect for a blog post!)
I was at dinner with some faculty colleagues (assistant professor beer night - best invention known to mankind!) and we were in the traditional "all our students are terrible" phase of the evening *. The discussion turned to "what programming concepts do students understand after an intro-CS class", which led to the following claim:
"you can't write an exponential time algorithm without knowing recursion"While this seemed plausible, I was skeptical. Thankfully my PL colleague (and super blogger) Matt Might was at the table, and he pointed out that the lambda calculus formally has no recursive constructs and yet is Turing-complete, via the magic of the Y-combinator (which essentially computes fixed points directly).
Of course one might argue that a fixed point operator does give you recursion (indeed the Y-combinator is often referred to as anonymous recursion). So my question really is:
Is there a well defined notion of "programming languages without recursion or recursion-like substances" and if so, what is the most expensive algorithm that can be expressed in such a language ?
p.s. BDDs seem like an obvious candidate...
* (Other regularly discussed phases include "I can't believe <latest cryptic thing> senior faculty member said", "How many frequent flyer miles do you have" and "chickens!".) Beer night is awesome !
Extracurricular events at STOC
I didn't do any STOC blogging this time, but I do want to say something about the extracurricular events.
For a while now, people (including me) have been clamoring for an expansion of the traditional three-days-of-talks format at theory conferences, and now STOC and FOCS (and SoCG soon!) are doing workshops, tutorials and poster sessions on a regular basis. Chandra Chekuri, Sergei Vassilvitskii and I organized the poster session at STOC this year, and Avrim Blum and Muthu (Muthu!) were in charge of workshops.
Workshops
How did they go ? The workshops, held the day before the conference, looked extremely successful. Michael Kearns' morning tutorial on computational finance was standing room only in a large auditorium - I don't know the exact numbers, but there were easily more than 150 people in the room. The workshops went pretty well too: I spent much of my time at the distributed streaming workshop, and that was also standing room only, with probably at least 60 people in the room at all times.
I'm really glad that we had these events, and even happier that FOCS 2012 plans to continue them. Speaking of which, Avrim is looking for workshop proposals for FOCS 2012, and specifically asked me to suggest that geometry folks propose a workshop. The deadline is June 20, and all you need is a 2-page proposal and 2-3 people who'd organize it.
Posters
I'm of course biased about the poster session, but I do think it went off quite well, fueled by Howard Karloff's brilliant idea to provide ice cream sandwiches as refreshments during the event. There were 30 posters in all, and a ton of discussion. It was really nice to walk around looking at the posters and seeing the level of engagement. I polled a number of presenters and attendees, and they all seemed to enjoy the session.
The only complaint was that the poster session was too short, and that we should have left the posters up for people to browse during breaks. I think this is an excellent idea that the next posters committee (at FOCS 2012) should take into account.
The business meeting
We had an incredibly long business meeting - even our most rambunctious SODA meetings haven't gone on this long (almost 3 hours). For those not yet in the know, STOC 2013 is going to a two-tier format. Joan Feigenbaum is the chair, and the "must-not-be-called-senior" PC will have 9 people on it. Their primary role will be managing the review process, with the "certainly-not-junior" PC consisting of 70-80 people who will review no more than 10 papers each, AND will be allowed to submit papers.
This is a major change for a theory conference, albeit one that was brought up for discussion at SODA 2012. I'm particularly curious to see how the whole "letting PC members submit papers" goes. Before everyone starts freaking out, it's worth pointing out that all this is effectively doing is bringing the unofficial reviewers into the fold, thereby giving them a chance to do reviews in context of the entire pool, rather than making isolated decisions. Since the unofficial reviewers were mostly drawn from the author pool, this is not too different from how it's always been. I'm hoping that the reduced load per reviewer will bring in better reviews and more consistency in decisions.
All in all, a time of experimentation - workshops, posters, tutorials, two-tier PCs - the theory community moves slowly, but when it does move, things happen quickly :)
Wednesday, May 30, 2012
Workshop on coding theory, complexity and sparsity
Atri Rudra asks me to post an announcement for the second incarnation of the workshop on coding theory, complexity and sparsity. The first event went off quite well - I was sure that Dick Lipton had a post on it but I couldn't find it (Update: here it is, thanks to Atri). At any rate, do attend and get your students to attend: they have a great roster of speakers and travel support for students.
Efficient and effective transmission, storage, and retrieval of information on a large scale are among the core technical problems in the modern digital revolution. The massive volume of data necessitates the quest for mathematical and algorithmic methods for efficiently describing, summarizing, synthesizing, and, increasingly more critically, deciding when and how to discard data before storing or transmitting it. Such methods have been developed in two areas: coding theory, and sparse approximation (SA) (and its variants, called compressive sensing (CS) and streaming algorithms).
Coding theory and computational complexity are both well-established fields that enjoy fruitful interactions with one another. On the other hand, while significant progress on the SA/CS problem has been made, much of that progress is concentrated on the feasibility of the problems, including a number of algorithmic innovations that leverage coding theory techniques; a systematic computational complexity treatment of these problems is sorely lacking.
The workshop organizers aim to develop a general computational theory of SA and CS (as well as related areas such as group testing) and its relationship to coding theory. This goal can be achieved only by bringing together researchers from a variety of areas. We will have several tutorial lectures that will be directed to graduate students and postdocs.
These will be hour-long lectures designed to give students an introduction to coding theory, complexity theory/pseudo-randomness, and compressive sensing/streaming algorithms.
We will have a poster session during the workshop and everyone is welcome to bring a poster but graduate students and postdocs are especially encouraged to give a poster presentation.
Confirmed speakers:
- Eric Allender, Rutgers
- Mark Braverman, Princeton
- Mahdi Cheraghchi, Carnegie Mellon University
- Anna Gal, The University of Texas at Austin
- Piotr Indyk, MIT
- Swastik Kopparty, Rutgers
- Dick Lipton, Georgia Tech
- Andrew McGregor, University of Massachusetts, Amherst
- Raghu Meka, IAS
- Eric Price, MIT
- Ronitt Rubinfeld, MIT
- Shubhangi Saraf, IAS
- Chris Umans, Caltech
- David Woodruff, IBM

We have some funding for graduate students and postdocs. For registration and other details, please look at the workshop webpage: https://sites.google.com/site/umccsworkshop2012/
Thursday, May 17, 2012
8F Computational Geometry
What is North Carolina known for?
Basketball?
Barbecue?
Great Universities?
Well, this June it will be known for geometry, as SoCG will be held this year very near Durham, NC. Unfortunately, it will be at UNC and not Duke. (I kid, I know Jack will do a great job organizing.)
If those things are not enough, Suresh and I (Jeff) are organizing an 8F-Computational Geometry workshop at SoCG. In case that's not clear, 8F stands for AITF, and AITF stands for Algorithms in the Field.
The idea is both to showcase how computational geometry has had impact on "the field" and to highlight some areas that we hope are ripe for future impact from the CG community. There are two days of talks: Tuesday, June 19 on ongoing impact, and Wednesday, June 20 on potential (more) impact. We already have a great line-up of speakers. On the first day we have talks by
- Valerio Pascucci on Visualization,
- Lars Arge on GIS,
- David Mount on Data Retrieval, and
- Nina Amenta on Meshing and Surface Reconstruction.
On the second day we have talks by
- Dan Halperin on Robotics and Automation,
- Jie Gao on Mobile Networks,
- Michael Mahoney on High Dimensional Data Analysis, and
- Tom Fletcher on Medical Imaging.
We encourage you all to attend, and to see what the Durham area has to offer. There are actually a full four days of activities at SoCG this year. The early registration deadline is next week (May 23), and it's only $100 for students!
We should mention that this 8F-Computational Geometry workshop is part of a larger push (through sponsorship) by NSF. There was a more general 8F Workshop last year, and there are more in the works.
Wednesday, May 16, 2012
"In the long run..."
"In the long run, we're all dead" is a famous quote attributed to John Maynard Keynes. The context was his arguments with economists of the time: he was trying to argue for government intervention in the markets to control inflation, rather than just letting it play out.
It's an apt response also to occasional claims that asymptotics will eventually win out, especially with large data. Asymptotics will eventually win out, as long as everything else stays fixed.
But that's the precise problem. Everything else doesn't stay fixed. Well before your $C n \log n$ algorithm beats the $c n^2$ algorithm, we run out of memory, or local cache, or something else, and the computational model changes on us.
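To put (made-up) numbers on it, here's a quick back-of-the-envelope sketch, with hypothetical constants $C$ and $c$, of how large $n$ has to get before the asymptotically better algorithm actually wins:

    import math

    # hypothetical constants that favor the "worse" quadratic algorithm
    C, c = 50.0, 0.5

    # smallest n where C*n*log2(n) finally beats c*n^2
    n = 2
    while C * n * math.log2(n) >= c * n * n:
        n += 1
    print(n)   # ~1000 here; with other constants (or cache effects,
               # or the memory wall) the story changes completely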
We come up with external memory models, and are informed that in fact even a factor of $\log N$ is too much. We design streaming models and MapReduce and so on and so forth, and realize that all this communication is burning a hole in our atmosphere.
Lesson to be learned (and re-learned): asymptotic analysis is a powerful tool, but it's only as good as the model of computation you're REALLY working with.
Wednesday, May 09, 2012
Multiplicative Weight Updates as Zombie Binary Search
The multiplicative weight update method (MWU hereafter) is a neat algorithm design technique with applications in machine learning, geometry and optimization, among others. However, it's viewed (and discussed) as an advanced technique, with very technical examples requiring lots of background (see the first example in the Arora-Hazan-Kale survey).
After pondering the use of MWU for a while now, it seems to me that this should be taught as a standard algorithms tool that naturally follows from divide and conquer and prune-and-search. In what follows, I'll sketch out how this might work.
A quick caveat: the MWU, like many powerful tools, has multiple interpretations, and the survey hints at many of them (for example, MWU = derandomization of Chernoff bound). I don't intend to imply that there's only one way to interpret the technique, but that the approach I'll describe is the most accessible one when learning the method for the first time.
The divide and conquer unit in an algorithms class might cover sorting and FFTs, driven by the standard D&C recurrence relation. Prune and search is introduced as a variation, with median finding and binary search being the prototypical examples.
Prune-and-search works like magic, if you're seeing it for the first time. Do some work, throw away the rest, and your $n \log n$ running time goes down to linear. (As an Indian of a certain age, this always reminds me of "thoda khao, thoda phenko" (eat a bit, throw a bit) from Jaane Bhi Do Yaaro).
But what makes it tick is determining the constant factor that needs to be thrown away. The most direct way to do this is deterministically: on a linear time budget, find a point that splits the input into two roughly balanced parts.
But that's expensive in practice. Indeed, algorithms that use median finding as a subroutine try to avoid using a deterministic procedure because it can be slow. When you get to more advanced methods like parametric search, it gets even worse.
The neat trick to apply here is to randomize ! We know that we don't have to find an exact split point - merely one that will approximately balance the two sides of the recursion. For median finding, we can pick three points at random, and take their median. With a sufficiently high probability, this median will create a 1/3-2/3 split, and away we go !
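As a concrete instance, here's a minimal sketch of randomized prune-and-search selection (the function name and base-case threshold are mine): the median of three random elements is the pivot, and one side of the split is thrown away.

    import random

    def select(pts, k):
        """Return the k-th smallest element (0-indexed) of pts."""
        if len(pts) <= 3:
            return sorted(pts)[k]
        # median of three random elements: a balanced split often enough
        pivot = sorted(random.sample(pts, 3))[1]
        lo = [x for x in pts if x < pivot]
        hi = [x for x in pts if x > pivot]
        if k < len(lo):
            return select(lo, k)                          # prune everything >= pivot
        if k >= len(pts) - len(hi):
            return select(hi, k - (len(pts) - len(hi)))   # prune everything <= pivot
        return pivot                                      # k lands on the pivot value

    random.seed(42)
    data = random.sample(range(10**6), 10001)
    print(select(data, 5000) == sorted(data)[5000])       # True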
This idea surfaces again and again, especially in the many randomized algorithms in computational geometry. As a design strategy, it's quite effective - design an algorithm that works if you can do balanced splits, and then find the split by choosing randomly. Invoke the Chernoff God and some satellite deities, and you're done.
Which brings us to the MWU.
Randomized splitting says that we're willing to lie a little about the split process, and things still work out. But whether you do randomized splitting or deterministic, the end result is still that you throw away some reasonable fraction of the input, NEVER TO SEE IT AGAIN.
Suppose you can't even do that ?
Think of noisy binary search (or 20 questions with a liar). Now, even your decision on what to prune is error-prone. You might be eliminating things that you need to consider later. So it's not clear that you can make any kind of search progress. But let's be reasonable and limit the adversary (or the noise) in some manner. Let's say that you'll only misclassify a small fraction of the input in each iteration (where small is "strictly less than half"). Let's also assume that I can tell you how important certain points are, so that you have to take that into consideration when defining your "small fraction".
So how does MWU now work ? I tell you a distribution over the input, and you give me back a rule that's reasonably good at (say) classifying points. I increase the importance of things you made mistakes on (so you try not to do it again), and repeat.
Eventually, what happens is that the weight of things that you make mistakes on increases rapidly. But the overall weight of the input can't increase too much, because I only increase the weight of things you make mistakes on, which is not a lot. Intuitively, what happens is that the first weight catches up with the second, at which point the algorithm is complete. You can show that if the updating schedule is chosen correctly, the process terminates in a logarithmic number of steps.
Essentially, MWU functions as a zombie binary search. You want to kill elements, but they keep coming back. Thankfully, each time they come back they're slightly weaker, so you have a better chance of hitting them next time. Eventually, head is severed from neck, and your zombies are dead (and your points are found).
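Here's a minimal sketch of the zombie binary search (all parameters and names are mine, chosen for illustration rather than tuned): weights over candidate positions, a comparison query at the weighted median, and multiplicative down-weighting of every candidate inconsistent with the (possibly lying) answer. Wrong candidates are never eliminated, only weakened.

    import random

    def zombie_search(n, target, rounds=200, lie_prob=0.3, beta=0.5):
        """Find target in {0,...,n-1} when each comparison lies with prob lie_prob."""
        w = [1.0] * n
        for _ in range(rounds):
            # query position: the weighted median of the surviving zombies
            total, acc, q = sum(w), 0.0, 0
            for i in range(n):
                acc += w[i]
                if acc >= total / 2:
                    q = i
                    break
            truth = target <= q
            answer = truth if random.random() > lie_prob else not truth
            # down-weight (but don't kill) candidates inconsistent with the answer
            for i in range(n):
                if (i <= q) != answer:
                    w[i] *= beta
        return w.index(max(w))

    random.seed(0)
    print(zombie_search(1024, target=137))   # 137, with high probability

The number of rounds here is deliberately generous; the point is only that the target's weight shrinks slowly (it is down-weighted only when the answer lies) while everything else shrinks fast.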
My only wish at this point is that whenever you think of MWU you think of zombies. Now that's impact :)
Wednesday, April 18, 2012
Distributed Learning: A new model
Communication is now the key to modelling distributed/multicore computations. Jim Demmel has been writing papers and giving talks on this theme for a while now, and as processors get faster, and the cloud becomes a standard computing platform, communication between nodes is turning out to be the major bottleneck.
So suppose you want to learn in this setting ? Suppose you have data sitting on different nodes (you have a data center, or a heterogeneous sensor network, and so on) and you'd like to learn something on the union of the data sets. You can't afford to ship everything to a single server for processing: the data might be too large to store, and the time to ship might be prohibitive.
So can you learn over the (implicit) union of all the data, with as little discussion among nodes as possible ? This was the topic of my Shonan talk, as well as two papers that I've been working on with my student Avishek Saha, in collaboration with Jeff Phillips and Hal Daume. The first one will be presented at AISTATS this week, and the second was just posted to the arxiv.
We started out with the simplest of learning problems: classification. Suppose you have data sitting on two nodes (A and B), and you wish to learn a hypothesis over the union of A and B. What you'd like is a way for the nodes to communicate as little as possible with each other while still generating a hypothesis close to the optimal solution.
It's not hard to see that you could compute an $\epsilon$-sample on A, and ship it over to B. By the usual properties of an $\epsilon$-sample, you guarantee that any classifier on B's data combined with the sample will also classify A correctly to within some $\epsilon$-error. It's also not too hard to show a lower bound that matches this upper bound. The amount of communication is nearly linear in $1/\epsilon$.
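Here's a toy version of the one-way protocol (everything in it is illustrative: a plain random sample stands in for a proper $\epsilon$-sample, a perceptron stands in for your favorite classifier, and the data is synthetic and separable):

    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(center, n):
        # linearly separable labels: the sign of the first coordinate
        X = rng.normal(center, 1.0, size=(n, 2))
        return X, np.sign(X[:, 0])

    def perceptron(X, y, passes=50):
        w = np.zeros(X.shape[1])
        for _ in range(passes):
            for xi, yi in zip(X, y):
                if yi * (xi @ w) <= 0:
                    w += yi * xi
        return w

    XA, yA = make_data([-2.0, 0.0], 2000)   # node A's data
    XB, yB = make_data([+2.0, 0.0], 2000)   # node B's data

    # one-way protocol: A ships a small sample to B; B trains on its own
    # data plus the sample, and the hypothesis should be good on the union
    idx = rng.choice(len(XA), size=100, replace=False)
    w = perceptron(np.vstack([XB, XA[idx]]), np.concatenate([yB, yA[idx]]))

    X, y = np.vstack([XA, XB]), np.concatenate([yA, yB])
    print((np.sign(X @ w) != y).mean())     # small, and shrinks as the sample grows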
But can you do better ? In fact yes, if you let the nodes talk to each other, rather than only allowing one-way communication. One way of gaining intuition for this is that $A$ can generate classifiers, and send them over to $B$, and $B$ can tell $A$ to turn the classifier left or right. Effectively, $B$ acts as an oracle for binary search. The hard part is showing that this is actually a decimation (in that a constant fraction of points are eliminated from consideration as support points in each step), and once we do that, we can show an exponential improvement over one-way communication. There's a trivial way to extend this to more than 2 players, with a $k^2$ blow up in communication for $k$ players.
This binary search intuition only works for points in 2D, because then the search space of classifiers is on the circle, which lends itself naturally to a binary search. In higher dimensions, we have to use what is essentially a generalization of binary search - the multiplicative weight update method. I'll have more to say about this in a later post, but you can think of the MWU as a "confused zombie" binary search, in that you only sort of know "which way to go" when doing the search, and even then points that you dismissed earlier might rise from the dead.
It takes a little more work to bring the overhead for k-players down to a factor k. This comes by selecting one node as a coordinator, and implementing one of the distributed continuous sampling techniques to pass data to the coordinator.
You can read the paper for more details on the method. One thing to note is that the MWU can be "imported" from other methods that use it, which means that we get distributed algorithms for many optimization problems for free. This is great because a number of ML problems essentially reduce to some kind of optimization.
A second design template is multipass streaming: it's fairly easy to see that any multipass sublinear streaming algorithm can be placed in the k-player distributed setting, and so if you want a distributed algorithm, design a multipass streaming algorithm first.
One weakness of our algorithms was that we didn't work in the "agnostic" case, where the optimal solution itself might not be a perfect classifier (or where the data isn't separable, to view it differently). This can be fixed: in an arxiv upload made simultaneously with ours, Blum, Balcan, Fine and Mansour solve this problem very neatly, in addition to proving a number of PAC-learning results in this model.
It's nice to see different groups exploring this view of distributed learning. It shows that the model itself has legs. There are a number of problems that remain to be explored, and I'm hoping we can crack some of them. In all of this, the key is to get from a 'near linear in error' bound to a 'logarithmic in error' bound via replacing sampling by active sampling (or binary search).
Monday, April 09, 2012
Approximate Bregman near neighbors.
(tl;dr: Our upcoming paper in SoCG 2012 shows that with a nontrivial amount of work, you can do approximate Bregman near neighbor queries in low dimensional spaces in logarithmic query time)
One of the things that has haunted me over the years (yes, haunted!) is the geometry of Bregman divergences. A Bregman divergence is a very interesting creature. It's defined as the difference between a convex function and its linear approximation starting at another point:
$$ d_\phi(p,q) = \phi(p) - \phi(q) - \langle \nabla \phi(q), p - q\rangle $$
Because Bregman divergences are parametrized by the function $\phi$, they yield a family of divergences. The most familiar one is $\ell_2^2$, but the most important one is the Kullback-Leibler divergence. There are a ton of reasons why one should study Bregman divergences, and in a shameless plug I'll refer you to my slides (one, two, three) from a geometry summer school last year. Suffice it to say that it's possible to unify a number of different algorithms in machine learning under a single framework by realizing that they're all particular instances of doing something with a Bregman divergence: two notable examples are Adaboost and information-theoretic clustering.
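For concreteness, here's a minimal sketch (the names are mine) that just evaluates the definition above for two choices of $\phi$: the squared norm gives half the squared Euclidean distance, and the negative entropy gives the KL divergence on the simplex.

    import numpy as np

    def bregman(phi, grad_phi, p, q):
        # d_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>
        return phi(p) - phi(q) - np.dot(grad_phi(q), p - q)

    # phi(x) = (1/2)||x||^2  ==>  d_phi(p, q) = (1/2)||p - q||^2
    sq, sq_grad = lambda x: 0.5 * np.dot(x, x), lambda x: x

    # phi(x) = sum_i x_i log x_i  ==>  d_phi(p, q) = KL(p || q) on the simplex
    ne = lambda x: np.sum(x * np.log(x))
    ne_grad = lambda x: np.log(x) + 1.0

    p = np.array([0.2, 0.3, 0.5])
    q = np.array([0.4, 0.4, 0.2])
    print(bregman(sq, sq_grad, p, q), 0.5 * np.dot(p - q, p - q))   # equal
    print(bregman(ne, ne_grad, p, q), np.sum(p * np.log(p / q)))    # equal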
So what's the geometric perspective on Bregman divergences ? They generalize duality !
There's a standard way to think about traditional point-hyperplane duality via a lifting map: take a point, lift it to the paraboloid one dimension up, and find the tangent plane. But suppose we replace the paraboloid by a generic convex function ? What we get is a general convex duality (technically, a Fenchel-Legendre duality) defined by
$$\phi^*(u) = \sup_v \langle u,v \rangle - \phi(v)$$
The reason this doesn't come up in Euclidean space is that the mapping is self-dual if we set $\phi(x) = (1/2) \|x\|^2$, which explains the symmetry in $\ell_2^2$.
It turns out that this observation is enough to import much of standard combinatorial geometry over to general Bregman spaces. We can compute convex hulls, Voronoi diagrams, Delaunay triangulations and so on, as long as we're careful to keep primal and dual straight. For example, the locus of points equidistant from two points under a Bregman divergence is either a primal straight line or a dual straight line (depending on how you measure things), and so on. This was all elegantly worked out by Nielsen, Nock and Boissonnat a while ago.
Alas, this beautiful connection doesn't help us with approximation problems.
One of the most interesting questions regarding Bregman divergences is solving the near-neighbor problem. This is interesting because the Kullback-Leibler distance is often used (for example) to compare images, and so there's been a lot of empirical work on this problem.
But here's the problem. Let's consider the low dimensional ANN problem. In Euclidean space (or even in spaces of bounded doubling dimension), here's how you solve the problem. You build some kind of quad-tree-like data structure, using the triangle inequality to reason about which cells you need to explore, and using packing bounds to bound the number of cells explored. You also need a crude bound on the near neighbor to start with, and to do this, you use some variant of a ring-tree.
The key ingredients: triangle inequality (twice over) and packing bounds.
Bregman divergences don't even satisfy a directed triangle inequality in general. And to date, we didn't even know how to define packing bounds properly for these directed distances.
In an upcoming paper at SoCG 2012 with my students Amirali Abdullah and John Moeller, we finally figured out how to get some "approximation" of the ANN machinery to work with Bregman divergences, and get logarithmic query times with small space.
If I say so myself, there are some nice ideas in the paper. Firstly, in a somewhat surprising twist, a Bregman divergence satisfies a "reverse triangle inequality" for ordered points on the line:
\[ d_\phi(a,b) + d_\phi(b,c) \le d_\phi(a, c), \qquad a \le b \le c \in \mathbb{R} \]
This is neat, because it gives packing bounds ! Intuitively, if the sum of lengths of subintervals along an interval is at most the length of the interval, then you can't have too many subintervals.
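A quick way to convince yourself: expanding the definition, the gap $d_\phi(a,c) - d_\phi(a,b) - d_\phi(b,c)$ collapses to $(b-a)(\phi'(c) - \phi'(b))$, which is nonnegative for convex $\phi$ and $a \le b \le c$. Here's a minimal numeric check for $\phi(x) = x \log x$:

    import numpy as np

    phi = lambda x: x * np.log(x)
    dphi = lambda x: np.log(x) + 1.0            # phi'
    d = lambda a, b: phi(a) - phi(b) - dphi(b) * (a - b)

    rng = np.random.default_rng(0)
    for _ in range(100000):
        a, b, c = np.sort(rng.uniform(0.1, 10.0, size=3))
        assert d(a, b) + d(b, c) <= d(a, c) + 1e-9
    print("reverse triangle inequality held on all sampled ordered triples")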
The next trick is an observation that the square root of a Bregman divergence satisfies a kind of relaxed triangle inequality that we call $\mu$-defectiveness:
\[ |d(x, y) - d(x,z)| \le \mu d(y, z)\]
This allows us to import some of the ring-tree machinery to get a coarse approximation.
And I should mention that the $\mu$ values involved here are quite small. If you're comparing distributions with the KL-divergence, then the value of $\mu$ is less than 2 even quite close to the boundary of the simplex.
Even with all of this in place, the quad-tree argument breaks. This is because of $\mu$-defectiveness: we can't assume that cell sizes are "reasonably large" at some number of levels below the root of the tree, and so our packing bounds look terrible. It takes a lot more work to fix this problem: essentially we exploit second-order structure of the Bregman divergences to bound the cell sizes and get the desired packing bound.
Afterthoughts:
One of the things that has haunted me over the years (yes, haunted!) is the geometry of Bregman divergences. A Bregman divergence is a very interesting creature. It's defined as the difference between a convex function and its linear approximation starting at another point:
$$ d_\phi(p,q) = \phi(p) - \phi(q) - \langle \nabla \phi(q), p - q\rangle $$
Because the Bregman divergences are parametrized by the function $\phi$, they yield a family of divergences. The most familiar one is $\ell_2^2$, but the most important one is the Kullback-Leibler divergence. There are a ton of reasons why one should study Bregman divergences, and in a shameless plug I'll refer to you my slides (one, two, three) from a geometry summer school last year. Suffice it to say that it's possible to unify a number of different algorithms in machine learning under a single framework by realizing that they're all particular instances of doing something with a Bregman divergence: two notable examples of this are Adaboost and information-theoretic clustering.
So what's the geometric perspective on Bregman divergences ? They generalize duality !
There's a standard way to think about traditional point-hyperplane duality via a lifting map: take a point, lift it to the paraboloid one dimension up, and find the tangent plane. But suppose we replace the paraboloid by a generic convex function ? what we get is a general convex duality (technically a Fenchel-Legendre duality) defined by
$$\phi^*(u) = \sup_v \langle u,v \rangle - \phi(v)$$
The reason this doesn't come up in Euclidean space is that the mapping is self-dual if we set $\phi(x) = (1/2) \|x\|^2$, which explains the symmetry in $\ell_2^2$.
It turns out that this observation is enough to import much of standard combinatorial geometry over to general Bregman spaces. We can compute convex hulls, Voronoi diagrams, Delaunay triangulations and so on, as long as we're careful to keep primal and dual straight. For example, the locus of points equidistant from two points under a Bregman divergence is either a primal straight line or a dual straight line (depending on how you measure things), and so on. This was all elegantly worked out by Nielsen, Nock and Boissonnat a while ago.
Alas, this beautiful connection doesn't help us with approximation problems.
One of the most interesting questions regarding Bregman divergences is solving the near-neighbor problem. This is interesting because the Kullback-Leibler distance is often used (for example) to compare images, and so there's been a lot of empirical work on this problem.
But here's the problem. Let's consider the low-dimensional ANN problem. In Euclidean space (or even in spaces of bounded doubling dimension), here's how you solve the problem. You build some kind of quad-tree-like data structure, using the triangle inequality to reason about which cells you need to explore, and using packing bounds to bound the number of cells explored. You also need a crude bound on the near neighbor to start with, and to do this, you use some variant of a ring-tree.
The key ingredients: triangle inequality (twice over) and packing bounds.
Bregman divergences don't even satisfy a directed triangle inequality in general. And until now, we didn't even know how to define packing bounds properly for these directed distances.
In an upcoming paper at SoCG 2012 with my students Amirali Abdullah and John Moeller, we finally figured out how to get some "approximation" of the ANN machinery to work with Bregman divergences, and get logarithmic query times with small space.
If I say so myself, there are some nice ideas in the paper. Firstly, in a somewhat surprising twist, a Bregman divergence satisfies a "reverse triangle inequality" on the line:
\[ d_\phi(a,b) + d_\phi(b,c) \le d_\phi(a,c), \quad a \le b \le c, \quad a, b, c \in \mathbb{R}\]
This is neat, because it gives packing bounds ! Intuitively, if the sum of lengths of subintervals along an interval is at most the length of the interval, then you can't have too many subintervals.
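Here's a throwaway numerical check of the inequality for the one-dimensional KL generator $\phi(x) = x \log x$ (the choice of generator and interval are mine):

```python
import numpy as np

def d(a, b):
    # Bregman divergence on the line for phi(x) = x log x
    return a * np.log(a) - b * np.log(b) - (np.log(b) + 1) * (a - b)

rng = np.random.default_rng(0)
for _ in range(10000):
    a, b, c = np.sort(rng.uniform(0.01, 1.0, size=3))  # ordered a <= b <= c
    assert d(a, b) + d(b, c) <= d(a, c) + 1e-12
print("reverse triangle inequality held on all ordered triples")
```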
The next trick is an observation that the square root of a Bregman divergence satisfies a kind of relaxed triangle inequality that we call $\mu$-defectiveness:
\[ |d(x, y) - d(x,z)| \le \mu d(y, z)\]
This allows us to import some of the ring-tree machinery to get a coarse approximation.
And I should mention that the $\mu$ values involved here are quite small. If you're comparing distributions with the KL-divergence, then the value of $\mu$ is less than 2 even quite close to the boundary of the simplex.
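If you want to see this for yourself, here's a rough empirical estimate of $\mu$ for $\sqrt{KL}$ (the dimension, the floor $\epsilon$ keeping points away from the simplex boundary, and the trial count are all arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    return np.sum(p * np.log(p / q))

def sample_simplex(dim, eps):
    # a random point on the probability simplex, all coordinates >= eps
    return (1 - dim * eps) * rng.dirichlet(np.ones(dim)) + eps

# estimate mu = max |d(x,y) - d(x,z)| / d(y,z) for d = sqrt(KL)
mu = 0.0
for _ in range(20000):
    x, y, z = (sample_simplex(5, 0.01) for _ in range(3))
    den = np.sqrt(kl(y, z))
    if den > 1e-9:
        mu = max(mu, abs(np.sqrt(kl(x, y)) - np.sqrt(kl(x, z))) / den)
print("empirical mu:", mu)
```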
Even with all of this in place, the quad-tree argument breaks. This is because of $\mu$-defectiveness: we can't assume that cell sizes are "reasonably large" at some number of levels below the root of the tree, and so our packing bounds look terrible. It takes a lot more work to fix this problem: essentially we exploit second-order structure of the Bregman divergences to bound the cell sizes and get the desired packing bound.
Afterthoughts:
- Most of the complexity of the algorithm is in the analysis: the actual algorithm looks mostly like a Euclidean ANN procedure. While we haven't implemented it, I'm hopeful that when we do, we'll be able to enjoy the empirical behavior of our Euclidean cousins.
- We're looking at high dimensional Bregman near neighbors, as well as some other approximate Bregman geometry questions. While our low-D ANN result comes from the "I will break your limbs one by one till you submit" school of algorithms, the hope is that we can start to exploit more of the dual geometry as we learn more.
Sunday, April 08, 2012
Postdoc Announcement
Piotr Indyk asked me to post this announcement for a postdoc position:
Applications are invited for a Postdoctoral Research Assistant position for the MIT-Shell-Draper Lab research project
"Applications of compressive sensing and sparse structure exploitation in hydrocarbon reservoir exploration and surveillance"
The goal of the project is to develop novel compressive sensing algorithms for geoscience problems in the area of hydrocarbon field exploration and surveillance. The appointment is initially for one year, with the possibility of renewal for up to 3 years. The appointment should start either during the summer (the preferred option) or the fall of 2012.
Duties and Responsibilities:
- Provide expertise on and contribute to the development of compressive sensing and sparse recovery algorithms for geoscience applications
- Help MIT faculty in coordination of research projects, including periodic reporting and documentation as required by the program timeline
- Frequent travel to Shell facilities in Houston
Minimum Qualifications:
Ph.D. in Computer Science, Electrical Engineering, Mathematics or related disciplines
Preferred Qualifications:
- Expertise in sparse recovery, compressive sensing, streaming/sketching, machine learning and statistical inference algorithms
- Experience and/or interest in research projects of interdisciplinary nature
- Programming experience (MATLAB)
Applications (including CV and three reference letters) should be submitted to https://postdoc.csail.mit.edu/searches/sparsityp-search/ ideally by April 15, 2012. However, applications will be considered until the position is filled.
Thursday, March 29, 2012
STOC 2012 news: posters, registration and student travel.
This is a friendly reminder to submit your 1-2 paragraph abstracts for the STOC 2012 Poster session. The deadline is Mar 31 @ 5pm PDT (via email to stoc2012posters@gmail.com). Here's how to do it, and here's why you should do it !
Remember, you DON'T need to submit an actual poster by Mar 31 - just the abstract and information on who will present.
In related STOC news, STOC 2012 registration is now live and the deadline for applying for student travel funding is April 4th.
See you in NYC !
Sunday, March 18, 2012
STOC 2012 Posters: Submit your abstract now !
As I mentioned earlier, STOC 2012 is repeating the successful poster event from STOC 2011. For more details on the logistics of submitting, see my post. What I'd like to do is give you more reasons to submit !
No one attends talks !
Ok, that's a little bit of an exaggeration: but between "conference = journal in a hotel" and "chicken chicken chicken" it's fair to say that talks at conferences tend to draw only those who are particularly interested in the specifics of a work, and are less likely to draw casual attendees.
A poster session on the other hand is like wandering through a nice bookstore: you can browse the topics as you see fit, and jump into anything that catches the eye. If you're a poster presenter, this is a great opportunity to convey a higher-level intuition for your work to an audience that might not be conversant in the specifics.
It's all about the eyeballs !
Everyone should make posters ! We are in an attention-based economy now, and your work gets known and disseminated only if you can get people's attention. A poster is an extremely effective way to communicate with your audience, especially if you use the visual medium effectively. Most conferences now have elaborate poster sessions, and they're a great way to meet people and hear about material you wouldn't otherwise have had time to assimilate.
It's all about the networking !
If you're a student, either presenting (or better yet, not presenting) at STOC, what better way than a poster to get conversations going with more senior researchers, instead of huddling together with your fellow students, wondering if you can make enough eye contact to get an intro (yes, we do notice :)) ? I can almost guarantee that you'd have more meaningful interactions with researchers at a poster than at a talk where people have half an eye on their email.
It's all about the impact !
So you got a paper into STOC ! Congratulations - there's a nice CV bullet for you. But what's next ? You want people to read your paper, talk about it, argue about it, and build upon it. Don't you ? Again, a well-designed poster can draw attention much more effectively than a talk in our attention-deficit world, and more attention means more discussions, and more potential impact.
It's easy !!
Now you surely don't believe me :). If you don't, consider your options. You can use LaTeX/beamer, and here's a fantastic resource to help. You can use Powerpoint or Inkscape, or if you're one of those fancy Mac people, you can use whatever fancy Mac tools I'm not cool enough to talk about.
Best of all, you don't need to have the poster ready by the deadline of Mar 31. All you need is a short abstract.
So don't think ! Whether you have a paper or not, do consider submitting a poster - by Mar 31 - to stoc2012posters@gmail.com - and here are the details.
Traffic...
When I first started blogging, I obsessed over page views and visits and other metrics. Eventually I got over myself. But reading Igor's post about pageviews (congrats on the million !) got me wondering, and I decided to look at my sitemeter:
Took me a while to get there, but it's nice to see regardless.
Monday, February 27, 2012
STOC 2012 Call For Posters
This year, STOC 2012 will have a poster session, reprising the successful event from STOC 2011. The poster session will be held during the main conference, and should be thought of as an extended hallway discussion with a visual aid. It will be a separate session on its own, accompanied by refreshments. We welcome posters from attendees on research in any aspect of theoretical computer science, and believe presenting posters should be especially appealing for:
- Researchers with papers in other conferences that would be of interest to the STOC community.
- STOC 2012 authors who want to have a visual aid for extended/related discussions of their work.
- Students who wish to discuss their work in a broader context (for example, by presenting an overview of a line of research)
Submission
In order to present a poster, authors must email the following information to stoc2012posters@gmail.com by March 31, 2012, 5pm PT:
- Title
- A 1-2 paragraph abstract
- Name of person presenting the poster
- Whether the presenter is planning to register for STOC
Poster Preparation
Posters should be designed to fit within a space that is 42in wide and 48in high, and may consist of a single printed poster or separate pieces. Mounted posterboards, as well as pushpins and tape for attaching your poster to the posterboard, will be provided at the conference.
STOC 2012 Posters Committee
Chandra Chekuri
Sergei Vassilvitskii
Suresh Venkatasubramanian
Friday, January 20, 2012
The Shonan Meeting (Part 3): Optimal Distributed Sampling
Reservoir sampling is one of the beautiful gems of sampling: easy to explain, and almost magical in how it works. The setting is this:
A stream of items passes by you, and your goal is to extract a sample of size $s$ that is uniform over all the elements you've seen so far. The technique works as follows: keep the first $s$ elements; when examining the $i$-th element (for $i > s$), select it with probability $s/i$, and if it's selected, have it replace an element picked uniformly at random from the current sample. I've talked about this result before, including three different ways of proving it.
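In code, the whole technique is a handful of lines (a sketch; the function name and interface are mine):

```python
import random

def reservoir_sample(stream, s, seed=None):
    """One-pass uniform sample of size s from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            sample.append(item)          # fill the reservoir first
        elif rng.random() < s / i:
            # the i-th item enters with probability s/i and evicts
            # a uniformly random occupant of the reservoir
            sample[rng.randrange(s)] = item
    return sample

print(reservoir_sample(range(1000), 5, seed=42))
```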
But what happens if you're in a continuous distributed setting ? Now each of $k$ players is reading a stream of items, and they all talk to a coordinator who wishes to maintain a random sample of the union of streams. Let's assume for now that $s \le k$.
Each player can run the above protocol and send an item to the coordinator, and the coordinator can pick a random subset from these. But this won't work ! At least, not unless each player has read in exactly the same amount of data as each other player. This is because we need to weight the sample element sent by a player with the number of elements that player has read.
It's not hard to see that each player sends roughly log n messages to the coordinator for a stream of length n. So maybe each player also annotates the element with the number of elements it has seen so far. This sort of works, but the counts could be off significantly, since a stream that doesn't send a sample might have read many more elements since the previous time it sent an update.
This can be fixed by having each player send an extra control message when its stream increases in size by a factor of 2, and that would not change the asymptotic complexity of the process, but we still don't get a truly uniform sample.
The problem with this approach is that it's trying to get around knowing $n$, the size of the stream, which is expensive to communicate in a distributed setting. So can we revisit the original reservoir method in a 'communication friendly way' ?
Let's design a new strategy for reservoir sampling that works as follows.
Maintain a current "threshold" $t$. When a new item arrives, assign it a random value $r$ between 0 and 1. If $r < t$, keep the new item and set $t = r$; else discard it. By using the principle of deferred decisions, you can convince yourself that this does exactly the same thing as the previous strategy (because at step $i$, the probability of the current element being retained is the probability that its $r$ is the minimum over the set seen so far, which is $1/i$). The good thing is that this approach doesn't need to know how many elements have passed so far.
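For a sample of size one, the threshold version is even shorter (again a sketch of my own):

```python
import random

def threshold_sample(stream, seed=None):
    """Keep the item whose random tag is the minimum seen so far."""
    rng = random.Random(seed)
    t, current = 1.0, None
    for item in stream:
        r = rng.random()
        if r < t:                 # new minimum tag: adopt this item
            t, current = r, item
    return current

print(threshold_sample(range(1000), seed=7))
```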
This approach can be extended almost immediately to the distributed setting. Each player now runs this protocol instead of the previous one, and every time the coordinator gets an update, it sends out a new global threshold (the minimum over all thresholds sent in) to all nodes. If you want to maintain a sample of size $s$, the coordinator keeps $s$ of the $k$ elements sent in, and the overall complexity is $O(ks \log n)$.
But you can do even better.
Now, each player maintains its own threshold. The coordinator doesn't broadcast the "correct" threshold until a player sends an element whose random value is above the global threshold. This tells the coordinator that the player had the wrong threshold, and it then updates that player (and only that player). Analyzing this approach takes a little more work, but the resulting bound is much better:
The (expected) amount of communication is $O(k \frac{\log (n/s)}{\log (k/s)})$. What's even more impressive: this is optimal !
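Here's a toy simulation of the lazy-threshold idea for a sample of size one, counting messages (this is my own simplification of the protocol; the actual algorithm and its analysis are in the paper cited below):

```python
import random

def simulate(k, n, seed=None):
    """k players, n items total; the coordinator refreshes a player's
    threshold only when that player sends a stale (too-large) value."""
    rng = random.Random(seed)
    local = [1.0] * k                          # possibly-stale local thresholds
    global_t, best, messages = 1.0, None, 0
    for _ in range(n):
        p, r = rng.randrange(k), rng.random()  # item with tag r arrives at p
        if r < local[p]:                       # p thinks this could be a minimum
            messages += 1                      # player p -> coordinator
            if r < global_t:
                global_t, best = r, (p, r)     # a genuine new global minimum
                local[p] = r                   # p can update itself: it knows r
            else:
                local[p] = global_t            # stale! coordinator replies to p
                messages += 1                  # coordinator -> player p
    return best, messages                      # best holds the min-tag item

print(simulate(k=10, n=100_000, seed=3))
```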
This last algorithm and the lower bound were presented by Srikanta Tirthapura at the Shonan meeting, based on his DISC 2011 work with David Woodruff. Key elements of this result (including a broadcast-the-threshold variant of the upper bound) also appeared in a PODS 2010 paper by Muthu, Graham Cormode, Kevin Yi and Qin Zhang. The optimal lower bound is new, and rather neat.