Monday, March 30, 2015

Revisiting the Misra-Gries estimator

If you've ever taken a class on streaming algorithms, or want to ace some tech company puzzle interview question, you've probably heard of this little gem:
Given n items, determine whether a majority element (one occurring strictly more than n/2 times) exists. You are allowed one pass over the data (which means you can read the elements in sequence, and that's it), and exactly TWO units of working storage.
This is a special case of the general Misra-Gries estimator for finding frequent items in a stream with small space. The MG estimator allows you to find (approximately) all items that occur more than $\epsilon n$ times using $O(1/\epsilon)$ space. Setting $\epsilon = 1/2$ yields the desired result.

This result has been proved and reproved a number of times, and there's a fairly general proof based on the fact that what you're really doing is approximating the true frequency to an additive error of $\epsilon n$ (and so any item that occurs more often than that will be retained).

What's more interesting is how you might go about proving the basic statement from first principles if you don't know the proof. This came up at dinner, when I was talking to people who hadn't heard of the problem before and were able to reconstruct a proof on the fly.

So let's return to the basic statement, and let's be a little more precise. The precise guarantee required is as follows:

  • If a strict majority element exists, you MUST return it
  • If not, you can return any element you want (even the most infrequent one)
Because of the second statement above, we only have to worry about streams in which a strict majority element does exist. 

Observation 1: if you haven't seen a majority element in any prefix of the stream, you can throw that prefix away. 

Why: we know there's a majority element somewhere. If it didn't show up so far, it has to be at least much of a majority in what's remaining. 

Observation 2: This prefix could be as short as two elements long. 

In other words, we see an element. Call it x. If we now see $y \ne x$, x is no longer a majority element so far, and we can chuck it, starting a new stream at $y$.

But if we do see $x$ again, then it could be a majority element. How do we keep track of whether it ever stops being a majority element ? 

We could keep track of the number of items seen so far. But that's an extra counter. Why not just pair instances of $x$ with instances of other elements seen so far by subtraction. If any time we hit zero, we can invoke observation 1 and start again. 

If not, then we will end with a nonzero count for $x$, which must be the correct element. 

And this gives us the exact algorithm we need. 

Disqus for The Geomblog