Today's AT&T Labs math seminar was by Mikkel Thorup, who talked about a new paper with Nick Duffield and Carsten Lund on priority sampling.
Suppose you are given items 1, 2, 3, ..., n with weights w_i. The goal is to be able to compute subset sum queries: for a given set of indices I (the query), compute the sum of the weights w_i, i \in I. In situations of interest, n is huge, so you can't really store all the weights. What you'd like to be able to do is sample from the universe, obtaining a representative set of indices S with appropriate weights, so that when a query comes in, you add up the weights of elements in the intersection of I and S, and return that as your answer.
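To make the query side concrete, here is a minimal sketch of how a query would be answered once such a sample is in hand. This is my own illustrative code, not the authors'; the function name and the dictionary representation are made up:

    def estimate_subset_sum(query_indices, sample_weights):
        # `sample_weights` maps each sampled index to its adjusted weight;
        # unsampled indices are treated as having weight 0.
        return sum(sample_weights.get(i, 0.0) for i in query_indices)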
Now you could sample the items uniformly at random, but you'd miss the (typically few) elements with high weight. You could do weighted sampling, but then you'd keep hitting the same high-weight elements, and that's no good either. What you really need is some kind of weighted sampling without replacement (which you can simulate by throwing out duplicates, but then you waste time trying to find a reasonable sample).
Their approach is an elegant idea called priority sampling. For each item i, generate a random number a_i between 0 and 1, and assign item i the priority w_i/a_i. With high probability, all priorities are distinct. Take the top k elements (with respect to priority), and let the priority of the (k+1)-st element be the threshold t. Now give each sampled element a new weight w'_i = max(w_i, t). All unsampled elements get a new weight of 0.
The main bit of magic is the following fact:
E[w'_i] = w_i
In other words, the estimator is unbiased, from which it immediately follows that any subset sum estimate is also unbiased. An even more surprising fact (surprising because the threshold t appears to depend on all k sampled elements) is that for i ≠ j,
w'_i and w'_j are uncorrelated (their covariance is zero),
which then means that the variance of any subset sum is merely the sum of the individual variances. A conjecture of the authors, recently resolved in the affirmative by Mario Szegedy, is that this sampling scheme (using k+1 samples) provides the minimum total variance unbiased estimator over all such schemes that sample k elements from the input.
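Both facts are easy to sanity-check by simulation. The following quick Monte Carlo check is my own, built on the priority_sample sketch above (the function name and the toy weights are made up):

    def check_facts(weights, k, trials=100_000):
        # Running sums for estimating E[w'_i] and E[w'_0 * w'_1].
        totals = {i: 0.0 for i in weights}
        cross = 0.0
        for _ in range(trials):
            s = priority_sample(weights, k)
            for i, w_adj in s.items():
                totals[i] += w_adj            # unsampled items contribute 0
            cross += s.get(0, 0.0) * s.get(1, 0.0)
        means = {i: totals[i] / trials for i in totals}
        cov_01 = cross / trials - means[0] * means[1]
        return means, cov_01

    # One heavy item among light ones; with k = 2 the means should approach the
    # true weights (10, 1, 1, 1) and the empirical covariance should be near 0.
    print(check_facts({0: 10.0, 1: 1.0, 2: 1.0, 3: 1.0}, k=2))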
A formal expression for the variance of this sampling scheme is a bit complicated to write down in general, but for simple cases it can be written explicitly. The authors have experimental results showing how the estimates provided by priority sampling converge to the true values.
The sampling trick is very neat, and it appears magical even after you see the proof. There is a fascinating connection to a kind of sampling called threshold sampling, where each element is picked with a specific probability (and thus the sample size is not fixed), and a fixed threshold is used to decide whether an element is retained in the sample or not. Informally, threshold sampling provides a lower bound on the variance achievable by any such scheme, and priority sampling comes close to matching it.
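For comparison, here is threshold sampling as I understand it, again as a hedged sketch rather than the authors' code: fix a threshold tau, keep item i with probability min(1, w_i/tau), and give kept items the adjusted weight max(w_i, tau), which is w_i divided by its sampling probability, so the estimate is unbiased but the sample size is random:

    import random

    def threshold_sample(weights, tau):
        sample = {}
        for i, w in weights.items():
            # Keep item i with probability min(1, w_i / tau).
            if random.random() < min(1.0, w / tau):
                # Adjusted weight is w_i / Pr[i kept] = max(w_i, tau).
                sample[i] = max(w, tau)
        return sample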
What I understand less is the apparent connection to range sum histograms. In the data streaming world, a range sum histogram H is an approximation of an array A[1..n] such that range sum queries on A can be answered approximately using H.
Muthukrishnan and Strauss have a SODA paper from 2003 on this topic, where they show how to construct histograms with a*B buckets that are within some approximation error of the best B-bucket histogram. One key difference is that in their formulation, the average squared error is minimized rather than the worst-case error of a single query. Note also that the subset sum formulation is more general, since ranges are a special kind of subset.
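Since a range [l, r] is just a particular subset of indices, the same estimator answers range sum queries directly. A toy illustration using the sketches above (the weights and parameters are made up):

    weights = {i: float(i % 7 + 1) for i in range(1000)}
    sample = priority_sample(weights, k=50)
    estimate = estimate_subset_sum(range(100, 200), sample)
    exact = sum(weights[i] for i in range(100, 200))
    print(f"range [100, 200): estimate = {estimate:.1f}, exact = {exact:.1f}")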