Thursday, January 10, 2008

Happy Birthday, Don Knuth !

Today is Don Knuth's birthday, and by way of tribute, there are posts from David, Luca and Scott that commemorate him. Among the many obsessions of Knuth is one that went into the new editions of TACP: typesetting the names of cited authors in their native language. This meant a laborious process of converting names into their original scripts and then doing a Unicode translation. I was at Stanford for some of this, and did my (tiny) share of translations in Hindi (and even once in Tamil!).

But what I wanted to highlight is a very innocuous result from Vol 2 of TACP that's both well known, elementary to prove, and provides a key tool for much of the work on streaming that came much later.
Can we pick a random element from a set of items whose cardinality we do not know ?
In this setting, we can examine an item and decide whether to retain it or discard it, but at the end of the process, we must hold in our hands an element chosen uniformly at random.

This is the kind of problem you might encounter on a Google/Microsoft interview: it's simple to state, and so is the answer, but it requires at least some "lateral" thinking. Without further ado, here's the answer (so stop right here if you want to puzzle it out yourself)

Pick an item and store it. Pick the next one, and replace the first one with it with probability 1/2. Pick the third one, and do a replace with probability 1/3, and so on. At the end, the item you've stored has a probability of 1/n of being any particular element.
This is of course a streaming algorithm. You can imagine the items arriving one by one, and using a local counter and a single word of storage, you can execute the algorithm. First of all, this can be generalized to maintaining a random sample of size k: in general, the technique is called reservoir sampling, and has spawned many generalizations.

There are at least three different ways of arguing the correctness of this procedure - another reason why I find it so elegant.

1. Suppose the sampling procedure is correct for the first k items. Now when item k+1 appears, it becomes the new sample with probability 1/(k+1). Every prior item currently has a probability of 1/k of being picked, and therefore remains the chosen sample item with probability k/(k+1) * 1/k = 1/(k+1). Hence, by induction, the procedure works

Like most induction proofs, this argument leaves something wanting. Here's a less rigorous, but in my mind more appealing argument:

2. Suppose we pick the kth item with probability p_k. Let an adversary suddenly halt the stream at this step. If p_k is not equal to 1/k, then the last item is not picked with a uniform probability over the current stream. So in some sense, the algorithm has no choice, since it doesn't know when the stream ends.

3. A more muscular proof of the "let's roll up our sleeves and crank things out" variety, goes as follows: Let p_i be the probability that the i^th item is picked for the sample, replacing the current sample element. The probability that i survives is

Since these numbers must be equal for all i, we can equate any two of these, in particular the terms for i and i+1, which yields the expression

Some simple telescoping later, once we observe that p_1 = 1, gives the desired probabilities.

CS researchers and the practitioners of computer science tend to view each other warily from a distance, each side convinced that the other has absolutely no idea of what "real CS" is about. Don Knuth straddled both worlds effortlessly, gaining respect from 15 year old hackers and 50 year old researchers alike. And that's the most tremendous feat of all.

Update: More tributes, by Bill Gasarch and the Quantum Pontiff.


  1. That's nice. But I was hoping for a tribute to Knuth's contributions to Computational Geometry. Like the randomized incremental algorithm for Delaunay Triangulations or something :(.

  2. Maybe I'm missing something, but in the solution to the original problem you have to scan through the entire set of objects to get your random element. Why not just skip all the "replace with probability x" stuff and just run through all the objects, find out how many there are (say N) and get a random one by generating a random number in the range 1 to N?

  3. I guess the solution makes more sense in a streaming context, where you can look at each item exactly once. In your solution, once you generate the random number, you'd have to pick an item, which would require reading it a second time (the first time was when you were counting the stream length). This is not allowed


Disqus for The Geomblog