I hope to track down and review some of the papers once they start appearing online. For now, I will entertain you with one of my own (assuming this doesn't break some academic blogosphere code of etiquette ;)).

Deepak Agarwal, Jeff Phillips, Suresh Venkatasubramanian.

The title (or at least the first part) is a tip of the hat towards an earlier paper by Friedman and Fisher titled 'Bump Hunting in High Dimensional Data'. The basic setup is this: you're given points labelled red or blue, and a class of shapes (usually hypercubes, but you could have spheres, pyramids, etc). Given a shape, you count the number of red and blue points in it, and compute a discrepancy function that could be as simple as |#red - #blue|. The goal is to find the shape that maximizes this function (i.e, finds the biggest bump).

Where do the red and blue come from ? One of the more popular applications of this problem is in biosurveillance, where the red points represent incidence of disease, and the blue represents a baseline population (the points may be tagged with weights). A bump is an unusually high or low disease rate with respect to the population.

What discrepancy function do we use ? The "simple" combinatorial discrepancy function described above has actually been looked at before, in the context of learning. Here, the red points are "good" elements of a classifier and the blue points are "bad", and the shape is a hypothesis that has classifies the most points correctly while misclassifying the fewest. Dobkin, Gunopulos and Maass presented exact and approximate algorithms for this problem in arbitrary dimensions.

But in the biosurveillance domain, statistical measures are more commonly used. The idea is that we want the discrepancy function to model a likelihood ratio with respect to some underlying population distribution: how likely was it that I saw as many red and blue points that I did ? The resulting function often resembles a Kullback-Leibler distance, or a chi-squared distance, and one of the most popular measures is the Kulldorff scan statistic, developed by Martin Kulldorff in 1997.

Unfortunately, such measures are painful to work with from an algorithmic point of view. They are not as "nice" as the (linear) combinatorial discrepancy, and thus the best algorithms have been either heuristics or (theoretically) brute force methods that have good empirical behaviour.

What we show is that as long as the discrepancy function is convex (which likelihood-ratio based distances tend to be), you can approximately rewrite it as the envelope of a set of linear functions, which you then whack with the DGM algorithm. The question of course is, how many functions do you need ? This is the tricky part, and involves an asymptotic bound on the eigenvalues of the Hessian of the discrepancy function.

This is a fairly general crank that allows us to solve efficiently (and approximately) a number of (till now) different statistical discrepancy problems (including maximizing with the Kulldorff scan statistic) with the same basic method. There are other results in the paper, which I hope to have online soon.

The title (or at least the first part) is a tip of the hat towards an earlier paper by Friedman and Fisher titled 'Bump Hunting in High Dimensional Data'. The basic setup is this: you're given points labelled red or blue, and a class of shapes (usually hypercubes, but you could have spheres, pyramids, etc). Given a shape, you count the number of red and blue points in it, and compute a discrepancy function that could be as simple as |#red - #blue|. The goal is to find the shape that maximizes this function (i.e, finds the biggest bump).

Where do the red and blue come from ? One of the more popular applications of this problem is in biosurveillance, where the red points represent incidence of disease, and the blue represents a baseline population (the points may be tagged with weights). A bump is an unusually high or low disease rate with respect to the population.

What discrepancy function do we use ? The "simple" combinatorial discrepancy function described above has actually been looked at before, in the context of learning. Here, the red points are "good" elements of a classifier and the blue points are "bad", and the shape is a hypothesis that has classifies the most points correctly while misclassifying the fewest. Dobkin, Gunopulos and Maass presented exact and approximate algorithms for this problem in arbitrary dimensions.

But in the biosurveillance domain, statistical measures are more commonly used. The idea is that we want the discrepancy function to model a likelihood ratio with respect to some underlying population distribution: how likely was it that I saw as many red and blue points that I did ? The resulting function often resembles a Kullback-Leibler distance, or a chi-squared distance, and one of the most popular measures is the Kulldorff scan statistic, developed by Martin Kulldorff in 1997.

Unfortunately, such measures are painful to work with from an algorithmic point of view. They are not as "nice" as the (linear) combinatorial discrepancy, and thus the best algorithms have been either heuristics or (theoretically) brute force methods that have good empirical behaviour.

What we show is that as long as the discrepancy function is convex (which likelihood-ratio based distances tend to be), you can approximately rewrite it as the envelope of a set of linear functions, which you then whack with the DGM algorithm. The question of course is, how many functions do you need ? This is the tricky part, and involves an asymptotic bound on the eigenvalues of the Hessian of the discrepancy function.

This is a fairly general crank that allows us to solve efficiently (and approximately) a number of (till now) different statistical discrepancy problems (including maximizing with the Kulldorff scan statistic) with the same basic method. There are other results in the paper, which I hope to have online soon.

Lots of

Anonymous

ReplyDeletegeometry(16),embedding(9) andgame theory(10, if this includesauctions) results...Posted by

Not related to the subject, but since you used the word "blogosphere", I am curious why the more natural and nicer "blogspace" didn't catch on --- in fact, there now appears to be a smallish company with this name. OTOH, I know that the name itself is at least 3 years old when Ravi Kumar, Jasmine Novak, Prabhakar Raghavan and Andrew Tomkins wrote the paper

D. Sivakumar

ReplyDeleteOn the bursty evolution of Blogspace, where they even had a version of randomized blogspace (I don't know if they conjectured RbL = bL :-)

Siva

Posted by

interesting. for some reason the term 'blogosphere' has become the defacto term for the space of blogs. I am not sure what the origins of this word is, but it's become even more popular as bashing of the 'mainstream media' has become more common.

Suresh

ReplyDeletePosted by

> some academic blogosphere code of etiquette

Maverick Woo

ReplyDeleteI actually think that it is a good idea for the authors to review their own papers. It actually makes sense to me since the authors are supposed to know the paper better and the readers can always use some more informal introduction. I will follow suit and write one myself since my coauthors are not very "bloggy".

Posted by

test

Suresh

ReplyDeletePosted by