Monday, March 30, 2015

Revisiting the Misra-Gries estimator

If you've ever taken a class on streaming algorithms, or want to ace some tech company puzzle interview question, you've probably heard of this little gem:
Given n items, determine whether a majority element (one occurring strictly more than n/2 times) exists. You are allowed one pass over the data (which means you can read the elements in sequence, and that's it), and exactly TWO units of working storage.
This is a special case of the general Misra-Gries estimator for finding frequent items in a stream with small space. The MG estimator allows you to find (approximately) all items that occur more than $\epsilon n$ times using $O(1/\epsilon)$ space. Setting $\epsilon = 1/2$ yields the desired result.

This result has been proved and reproved a number of times, and there's a fairly general proof based on the fact that what you're really doing is approximating the true frequency to an additive error of $\epsilon n$ (and so any item that occurs more often than that will be retained).

What's more interesting is how you might go about proving the basic statement from first principles if you don't know the proof. This came up at dinner, when I was talking to people who hadn't heard of the problem before and were able to reconstruct a proof on the fly.

So let's return to the basic statement, and let's be a little more precise. The precise guarantee required is as follows:

  • If a strict majority element exists, you MUST return it
  • If not, you can return any element you want (even the most infrequent one)
Because of the second statement above, we only have to worry about streams in which a strict majority element does exist. 

Observation 1: if you haven't seen a majority element in any prefix of the stream, you can throw that prefix away. 

Why: we know there's a majority element somewhere. If it didn't show up so far, it has to be at least much of a majority in what's remaining. 

Observation 2: This prefix could be as short as two elements long. 

In other words, we see an element. Call it x. If we now see $y \ne x$, x is no longer a majority element so far, and we can chuck it, starting a new stream at $y$.

But if we do see $x$ again, then it could be a majority element. How do we keep track of whether it ever stops being a majority element ? 

We could keep track of the number of items seen so far. But that's an extra counter. Why not just pair instances of $x$ with instances of other elements seen so far by subtraction. If any time we hit zero, we can invoke observation 1 and start again. 

If not, then we will end with a nonzero count for $x$, which must be the correct element. 

And this gives us the exact algorithm we need. 

Thursday, March 19, 2015

The coming funding crunch

It's no secret we're in the middle of another boom in CS enrollment. Departments everywhere are struggling to keep up with the increased number of students wanting to get into our programs.

But there's a chain reaction with potentially nasty consequences coming our way. And while some of this will be obvious to academic types, it might not be obvious to the community at large, or even graduate students considering going into academia. 

  • As departments everywhere try to adjust to the new load of students, they have two options: hire a lot more teaching faculty to teach classes, or start negotiating with their universities for more tenure-track positions. The latter is clearly preferable (if you don't believe me, look at the plight of adjuncts in the humanities). But....
  • Universities can live with hiring more faculty, because the increased tuition from more students helps justify it, and more faculty means more research grants, and more hafta for the administration. But...
  • More faculty means more applications for research awards, to the various organizations that dole out money (NSF, NIH, DARPA, ONR, DoE, ...). But...
Have you seen science budgets lately ? They're basically flat. 

We've seen this Ponzi scheme before in biology, and the consequence of that is that the average age of a first time PI has crossed 40, coupled with increased time spent doing postdocs. It's now the norm rather than the exception in CS to see faculty candidates with at least one postdoc. 

And there's no easy way to de-escalate. All the possible dams we can build are bad, or difficult to execute on: 
  • Don't hire more faculty. Then we're stuck with ever increasing class sizes and lower quality of student education.
  • Hire more contingent faculty. This might be a short term solution, but it's a horrible way to treat new Ph.Ds, and frankly given the options out there in industry, you wouldn't get as many takers (at least if people think rationally about this)
  • De-link academic success from funding, or encourage the teaching mission more. This is a complete non-starter at most schools that rely on overhead. And for R1 universities, research dollars are not just a bottomline factor, but are a prestige element. 
And don't even get me started on how this is going to affect our already stretched-near-breaking-point conference review process. 

Sunday, February 22, 2015

STOC 2015 Call for Workshops and Tutorials

I've had two week from hell with proposal deadlines, paper deadlines, faculty candidates and ever-more demanding hungry cries of the baby chickadees  meetings with my students.

Thankfully, all that is now in the past and I can sally forth to the next deadline. Which is, you ask ?

STOC 2015 will have a workshops and tutorials day on Sunday June 14, the day before the conference starts. +Sanjeev Khanna and +Chandra Chekuri are running the event, and they want your proposals.
We invite groups of interested researchers to submit workshop or tutorial proposals. The goal of a workshop is to provide an informal forum for researchers to discuss important research questions, directions, and challenges. Connections between theoretical computer science and other areas, topics that are not well represented at STOC, and open problems are encouraged as workshop topics. Organizers are completely free to choose their workshop formats (invited speakers, panel discussions, etc.). The program for June 14th may also involve tutorials, each consisting of 1-2 survey talks on a particular area, and we welcome tutorial proposals as well.
Deadline for submissions is Mar 15, 2015. For more details on the format etc, visit the workshops/tutorials page

Monday, February 16, 2015

Research customs for the 21st century.

 I've begun to notice a number of customs that seem unique to our modern process of doing collaborative research in the 21st century. Most of them are technology-driven, and most of them involving annoying debates about the multitude of choices available for collaborating.

  • Research Meetings: whether to do them over Skype, or Google Hangouts, or or, or even with phones on a conference call. 
  • Audio/Video/Chat: if over skype/G+, whether to do audio, or video, or chat. And then the obligatory pre-videconference primping to look presentable (or the reliance on bandwidth failure to NOT go on video)
  • Coordination: careful calculations among multiple time zones and daylight savings times to plan meetings. The use of Doodle, Google calendar, or even random emails passed around haphazardly to plan said meetings.
  • Writing I: the protracted negotiations over whether to use git, svn, rcs, or cvs, or Dropbox, or random emails passed around haphazardly (and yes I've done all of this, sometimes at the same time)
  • Writing II: if using an actual modern VCS, then the equally protracted negotiations over who's hosting it, who needs account access, and where public keys need to be placed, and why CS researchers in the 21st century STILL need tutorials on ssh. 
  • Writing III: Dealing with comments on a writeup: as fixmes, as todonotes, as issues in the git repository hosting the document, or as random emails passed around haphazardly. 

And of course all the collaborator "digital" personalities that emerge irrationally and violently. Alice hates using bibtex, and Bob will only use personally crafted bibtex files. Charlie loves his own special fonts and will have raging debates over $\varepsilon$ vs $\epsilon$. Dan has never used version control and doesn't see the point. Erin handcoded her own version control system in a cutting-edge fragment of Haskell and refuses to use anything else. Frank hates online meetings, and Mallory only thinks online during a meeting. Oscar insists that Postscript is Turing-complete and is therefore sufficient for all drawing needs. Peggy insists that git rebase is Turing-complete and is therefore sufficient to fix all commit disasters... eventually.

Note to all my collaborators who I'm currently writing papers with: you are awesome and have NOTHING AT ALL to do with this list.

Wednesday, February 11, 2015

One Body Problems

It's hiring season, and we've all heard about two-body problems.

But you may not have heard of the one-body problem:
Is the place a single person takes a job in likely to have enough other single people around to facilitate searching for a partner ? 
I was 'spoken for' before I even finished my Ph.D, and never had to deal with this (rather, I had to work with the more traditional (ha!) two-body problem).

Friday, February 06, 2015

Friday chest-thumping.

  • My paper with Amirali Abdullah (or should I say Amir's paper with me) got into STOC ! One aspect not discussed in the linked blog post is the connection to Partial Match (a notoriously hard problem in the data structures literature).  In brief, our result allows for a "smooth-ish" interpolation between approximate near neighbor lower bounds in $\ell_1$ and general partial match lower bounds, reinforcing the intuition that Partial Match is an "extremely asymmetric" nearest neighbor problem. 
  • Another paper with 'the Streaming Kings' (said to the sound of a flamenco strum) of Cormode, Chakrabarti, McGregor and Thaler got into CCC (and this is my first paper at Complexity). I'll have more to say about this work in a separate post: in brief, we looked at streaming interactive proofs (of the kind first developed by Cormode, Thaler and Yi) where you have a prover and a streaming verifier and wish to verify a computation with constant rounds of communication. There's a 3-round nearest neighbor protocol in this paper, as well as a number of subtle results about the nature of multi-round communication protocols in the streaming setting. 
"Well Suresh, you have two new papers, what are you going to do next" ?

Wednesday, February 04, 2015

No one is in control...

The New Scientist has a cover this week on the way in which algorithms have taken over our lives. They interviewed some of the participants at the FATML workshop I participated in (thanks to Solon Barocas for sending the reporter our way), and Sorelle Friedler got some nice quotes in regarding the work she, Carlos Scheidegger and I have been doing.

The article is currently paywalled, but a friend of mine sent me the text. The relevant paragraph, lightly edited (the "she" here is Sorelle):

By understanding the biases inherent in the underlying data, she hopes to eliminate bias in the algorithm. Her system looks for correlations between arbitrary properties – like height or address – and demographic groupings like race or gender. If the correlation is expected to lead to unwanted bias, then it would make sense to normalise the data. It is essentially affirmative action for algorithms, she says.
I'm not sure I'd describe our work as affirmative action :), but that's a longer argument.

Disqus for The Geomblog