Following on from the previous post, here is a second application of the "AMS sketch". There, I described how to create a variable z from the stream by summing hash values of the items seen. Squaring this variable gave an unbiased estimate of the sum of squares of the item occurrence counts.
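To make this concrete, here is a minimal toy version in Python. The stored table of random signs stands in for a proper (four-wise independent) hash function, and all the names are my own:

```python
import random

# Toy stand-in for a hash function h mapping items uniformly onto {-1,+1}.
# A dictionary of random signs is used for clarity; the real analysis
# needs four-wise independent hash functions rather than stored tables.
signs = {}

def h(item):
    if item not in signs:
        signs[item] = random.choice((-1, +1))
    return signs[item]

def ams_z(stream):
    """Sum h(item) over every occurrence of every item in the stream."""
    return sum(h(item) for item in stream)

stream = ["a", "b", "a", "c", "a", "b"]   # counts: a=3, b=2, c=1
z = ams_z(stream)
print(z * z)  # unbiased estimate of 3^2 + 2^2 + 1^2 = 14
```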
Now, here's another application of the same data structure. Create z in the same way as before, but now pick some item i and compute z*h(i) (remembering that h maps items uniformly onto {-1,+1}). What is the expectation of this value? We get the number of occurrences of i multiplied by h(i)^2 (which is always 1), plus the frequency of every other item j multiplied by h(i)h(j). These two hash values are uncorrelated, so their product is equally likely to be +1 or -1, and contributes zero in expectation. Hence, in expectation, this estimate is exactly the number of occurrences of i.
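Continuing the toy code above, the point estimate is a one-liner (again, the function name is mine):

```python
def point_estimate(z, i):
    """Estimate the count of item i from a single sketch value z."""
    return z * h(i)

print(point_estimate(z, "a"))  # equals 3 in expectation
```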
Of course, this estimate is not perfect, else we could go through all values of i and recover the number of occurrences of each item exactly. But, by an analysis similar to the sum-of-squares case, we find we can estimate individual item counts up to an additional error that depends on the square root of the sum of squares of all the item counts. By taking more repetitions of this procedure, we can reduce the error to a small fraction of this quantity. So, if the item we are estimating occurs sufficiently many times, we get a (relatively) good estimate. This makes no assumptions about the distribution of item counts, and so works whatever the frequency distribution happens to be.
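As an illustration, here is the repetition step, still continuing the toy code; the full analysis typically averages estimates within groups and takes a median across groups (median of means), but a plain median over independent copies conveys the idea:

```python
import statistics

def repeated_estimate(stream, i, reps=9):
    """Median of several independent estimates; each repetition draws
    a fresh hash function, which the toy code mimics by clearing the
    sign table."""
    estimates = []
    for _ in range(reps):
        signs.clear()                 # fresh random signs = fresh hash
        z = ams_z(stream)
        estimates.append(z * h(i))
    return statistics.median(estimates)

print(repeated_estimate(stream, "a"))
```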
This approach was taken up and improved by Charikar, Chen and Farach-Colton. There, the authors use further hashing techniques to build a better estimator in the same space, by spreading out the 'heavy' items which can otherwise lead to bad estimates. They also show that if the frequency distribution matches a particular form (a skewed, or Zipfian, distribution, which is a good model for a wide variety of data sources) then one can prove improved estimation bounds.
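For illustration, here is a minimal version of that construction in the same toy style: each row hashes items into buckets so that heavy items collide with any fixed item in only a few rows. Random tables again stand in for pairwise-independent hash functions, and the parameter choices are arbitrary:

```python
import random, statistics

class CountSketch:
    """A minimal sketch in the spirit of Charikar, Chen and
    Farach-Colton: each of `depth` rows hashes items into `width`
    signed counters, and the per-row estimates are combined by a
    median to limit the damage from heavy colliding items."""
    def __init__(self, depth=5, width=256, seed=0):
        self.depth, self.width = depth, width
        self.rngs = [random.Random(seed + r) for r in range(depth)]
        self.table = [[0] * width for _ in range(depth)]
        self.bucket = [{} for _ in range(depth)]  # item -> bucket index
        self.sign = [{} for _ in range(depth)]    # item -> +/-1

    def _hashes(self, item, r):
        if item not in self.bucket[r]:
            self.bucket[r][item] = self.rngs[r].randrange(self.width)
            self.sign[r][item] = self.rngs[r].choice((-1, +1))
        return self.bucket[r][item], self.sign[r][item]

    def update(self, item, count=1):
        for r in range(self.depth):
            b, s = self._hashes(item, r)
            self.table[r][b] += s * count

    def estimate(self, item):
        ests = []
        for r in range(self.depth):
            b, s = self._hashes(item, r)
            ests.append(s * self.table[r][b])
        return statistics.median(ests)
```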
This is not the final word on the matter. Some of my recent work has been on trying to give the tightest possible bounds for these frequency estimation problems, with and without assumptions about the input distribution. In particular, one can improve the space dependence on the quality of the estimate using a different sketch technique [warning! link leads to papers by the current author. Be warned that this represents extreme self-indulgence]. The guarantee switches from the square root of the sum of squares of the counts to the sum of the counts; for many problems this is the form of guarantee that is required. There is also ongoing work from practitioners on getting the best possible constant factors, fastest implementations, and so on.
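For contrast, here is a minimal sketch of a structure with that kind of sum-of-counts guarantee, written in the style of the widely used count-min construction; whether this matches the exact technique behind the links above is my assumption:

```python
import random

class CountMinStyle:
    """A minimal sketch with an L1-type guarantee: with good
    probability, the estimate for any item overshoots its true count
    by at most a small fraction of the total count of all items.
    Hashes are again simulated with random tables for illustration."""
    def __init__(self, depth=5, width=256, seed=0):
        self.depth, self.width = depth, width
        self.rngs = [random.Random(seed + r) for r in range(depth)]
        self.table = [[0] * width for _ in range(depth)]
        self.bucket = [{} for _ in range(depth)]  # item -> bucket index

    def _bucket(self, item, r):
        if item not in self.bucket[r]:
            self.bucket[r][item] = self.rngs[r].randrange(self.width)
        return self.bucket[r][item]

    def update(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._bucket(item, r)] += count

    def estimate(self, item):
        # Every counter only overestimates, so the minimum is tightest.
        return min(self.table[r][self._bucket(item, r)]
                   for r in range(self.depth))
```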