Suppose you're given a large set of objects $X$, and you know that some subset $I$ is "interesting". A particular example that hits close to (my regular) home is in bug-testing: $X$ is the set of possible inputs to a program, and $I$ is the set of inputs that trigger bugs (this is the downside of talking to +John Regehr too much). We'll assume that if you're given a candidate object, you can easily check whether it's interesting or not (for example, by running the program).
You'd like to find the interesting items, so you consult an expert (in our running example, maybe it's a method to generate inputs that test certain kinds of bugs). The expert produces items that it thinks are interesting. But experts have biases: maybe your expert only cares about certain kinds of interesting items.
So you ask multiple experts, in the hope that their biases are different, and that together they can cover the space of objects. But you don't know anything about their biases except what you can learn from repeatedly asking them for candidate objects.
What's your strategy for asking questions so that at any stage (say after asking $N$ questions), you've found as many interesting items as possible?
This was the topic of a talk by +Sebastien Bubeck at the AMPLab in Berkeley on Tuesday, based on this paper. A key idea in the algorithm design is to make use of estimators of "things I haven't seen yet", in particular the famous Good-Turing estimator (which your humble blogger wrote about many æons ago). Here's how it works.
Formally, let us assume that each "expert" $i$ is a distribution $P_i$ over $X$. At each step, the algorithm will pick some $i$, and ask it for a random draw from $P_i$. Suppose we knew, for each $i$, the probability mass under $P_i$ of the interesting items that had not been seen yet. Then a simple greedy strategy would be to pick the $i$ that had the largest value of this "not-yet-seen" quantity. That's all well and good, as long as we know the mass of items not yet seen.
Here's where the G-T estimator comes in. What it says is that if we're given samples from a distribution, and count the number of items in the sample that occurred exactly once, then this "frequency of frequency", divided by the number of samples, is a good estimate for the mass of items not yet seen. Moreover, it can be shown that this estimator has good concentration properties around the true answer.
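To make the estimator concrete, here is a minimal Python sketch of the rule just described: count the items that appear exactly once and divide by the sample size. The function name `good_turing_missing_mass` and the convention of returning 1.0 for an empty sample are my own choices for illustration, not something from the talk.

```python
from collections import Counter

def good_turing_missing_mass(samples):
    """Estimate the probability mass of items not yet seen.

    Rule: (number of distinct items seen exactly once) / (number of samples).
    """
    n = len(samples)
    if n == 0:
        # Assumed convention: with no samples, everything is still unseen.
        return 1.0
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n

# Usage: "b" and "c" each occur once among four samples.
estimate = good_turing_missing_mass(["a", "b", "a", "c"])
```

Intuitively, items seen only once are the ones you "just barely" saw, so their frequency is a proxy for how often the next draw will be something new.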
So that's what the algorithm does. It maintains estimates (for each expert) of the mass not yet seen, and in each round picks the expert that maximizes this term, corrected by an adjustment coming from the tail of the concentration bound.
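Here is a rough sketch of that loop in Python, under several assumptions of mine: experts are modeled as zero-argument callables that return one draw, `is_interesting` is the cheap check from the setup, and the correction term is taken to be a UCB-style bonus `c * sqrt(log(t + 1) / pulls)`. The exact form and constant of the adjustment in the paper may differ, so treat this as an illustration of the idea rather than the authors' algorithm.

```python
import math
import random
from collections import Counter

def expert_index(seen_counts, pulls, t, c):
    """Good-Turing estimate of unseen interesting mass plus an assumed bonus."""
    if pulls == 0:
        return math.inf  # make sure every expert is tried at least once
    singletons = sum(1 for v in seen_counts.values() if v == 1)
    return singletons / pulls + c * math.sqrt(math.log(t + 1) / pulls)

def discover(experts, is_interesting, rounds, c=1.0):
    """Repeatedly query the expert with the largest index; return finds and pull counts."""
    seen = [Counter() for _ in experts]  # interesting items drawn, per expert
    pulls = [0] * len(experts)           # number of queries made, per expert
    found = set()
    for t in range(1, rounds + 1):
        i = max(range(len(experts)),
                key=lambda k: expert_index(seen[k], pulls[k], t, c))
        item = experts[i]()
        pulls[i] += 1
        if is_interesting(item):
            seen[i][item] += 1
            found.add(item)
    return found, pulls

# Usage with two hypothetical experts on disjoint supports;
# "interesting" here just means the number is even.
random.seed(0)
experts = [lambda: random.randrange(0, 10),
           lambda: random.randrange(100, 1000)]
found, pulls = discover(experts, lambda x: x % 2 == 0, rounds=200)
```

The first expert has only a handful of interesting items, so its estimate should drop quickly once they have all been seen more than once, and the loop then tends to favor the second expert.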
The algorithm is really simple and elegant. The question is, how well does it do? And now things get a little harder. The ideal greedy strategy above (the one that knows the true not-yet-seen mass) is optimal as long as the supports of the experts are disjoint. Under the same assumption (and some other technical ones), it can be shown that the expected difference between the number of items found by the algorithm and the number found by the ideal greedy algorithm is $O(\sqrt{Kt \log t})$, where $K$ is the number of experts and $t$ is the current time step.
It's not clear how to extend these bounds to the case when supports can overlap (or rather, it's not clear to me :)), but the authors show that the algorithm works quite well in practice even when this assumption is broken, which is encouraging.