Probably everyone knows the standard trick for determining in one pass whether a stream admits an element occurring more than half the time. But what not everyone might know is the origin of this method.
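For concreteness, here is a minimal Python sketch of that trick, as I would write it (names are my own; a second pass over the data is still needed to verify the candidate):

    def majority_candidate(stream):
        # One-pass majority vote: keep a single candidate and a counter.
        # If some element occurs more than half the time, it is
        # guaranteed to be the candidate left standing at the end.
        candidate, count = None, 0
        for x in stream:
            if count == 0:
                candidate, count = x, 1
            elif x == candidate:
                count += 1
            else:
                count -= 1
        return candidate

For example, majority_candidate([1, 2, 1, 1, 3, 1]) returns 1; when no majority exists the output is arbitrary, which is why the verification pass matters.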
Stuart Shieber writes about the algorithm (discovered by Boyer and Moore) and the mishaps it had on the way to publication. The article also describes how the follow-up paper by Misra and Gries came into being. Interestingly, I almost always see the Misra-Gries generalization being cited, and not the original Boyer-Moore method.
(HT Fernando Pereira)
I did my best to reconstruct the history surrounding this area when I surveyed the topic. See Sections 3.1 and 3.2 of
http://www.dimacs.rutgers.edu/~graham/pubs/papers/freqvldbj.pdf
Graham
Cool. What's also interesting is the connection between Boyer/Moore and Misra/Gries, as outlined in the link.
It should be observed that the generalization of the majority algorithm to k counters, as proposed by Misra-Gries, doesn't work in the data stream model.
As Graham points out, Karp et al. discussed this issue, and Demaine et al. actually propose data structures to make the updates happen in O(1) time.
For the performance of the generalized algorithm, we have the bounds of Demaine et al. in the stochastic model and those of Bose et al. in the adversarial model, both of which are tight.
This is clearly a case where authorship (except for the vanilla majority algorithm, which clearly belongs to Boyer and Moore) is hard to assign to a single group.
The k-counter algorithm as we use it today would be best described as the Misra-Gries-Karp-Papadimitriou-Shenker-Demaine-Lopez-Ortiz-Munro-Bose-Kranakis-Morin-Tang frequent items algorithm [or MGKPSDLMBKMT for "short" :) ]
@Anonymous: I am not sure why you believe that the Misra-Gries algorithm, as proposed in the paper, does not work in the streaming model.
The algorithm (4) on page 5 of the technical report considers all elements in the stream, in arbitrary order, and for each element performs the appropriate counter updates.
It is true that the algorithm is not particularly time-efficient: processing each element takes time O(k) or so. Still, this leads to roughly O(1/eps) update time when searching for elements that appear at least eps*n times. This is not too bad if eps is not too small. Of course, we now know how to do this much faster, as per the references you mentioned.
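For concreteness, here is a quick Python sketch of that per-element update as I read it (names are my own, and the O(k) decrement loop is spelled out naively):

    def misra_gries(stream, k):
        # Keep at most k-1 counters; any element occurring more than
        # n/k times in a stream of length n survives in the output.
        counters = {}
        for x in stream:
            if x in counters:
                counters[x] += 1
            elif len(counters) < k - 1:
                counters[x] = 1
            else:
                # Decrementing every counter is the O(k)-per-update
                # cost mentioned above.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters

Running it with k on the order of 1/eps gives the eps*n guarantee above; the data structures in the references you mentioned bring the per-element cost down to O(1).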
"It is true that the algorithm is not particularly time-efficient:"
...and that is exactly the problem. Finding the most frequent element is trivial if time & space are not extremely limited.
@Anonymous: finding the most frequent elements is certainly not trivial in the streaming model, i.e., when the algorithm is allowed to make a single pass and use limited storage. The Misra-Gries algorithm, although simple, is an elegant solution to this problem.