A core problem in clustering is this: learn a mixture of $k$ $d$-dimensional Gaussians with different means and covariance matrices from samples drawn from the mixture. A long line of research, starting with a paper by Sanjoy Dasgupta back in 1999, produced variations on the following result:
If the Gaussians are "well separated" and have "nice" shapes, then a "small number" of samples suffices to determine the parameters of the system. Here, "parameters" means the means, the covariance matrix entries, and the relative weights. As time went on, the required separation got smaller, the niceness assumptions got weaker, and the "small number" got smaller.
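To make the setup concrete, the target density is a weighted combination of Gaussian densities, and the "parameters" are everything on the right-hand side:

$$ f(x) \;=\; \sum_{i=1}^{k} w_i \, \mathcal{N}(x;\, \mu_i, \Sigma_i), \qquad \sum_{i=1}^{k} w_i = 1, $$

where $\mathcal{N}(x;\mu,\Sigma)$ is the $d$-dimensional Gaussian density with mean $\mu$ and covariance $\Sigma$. The goal is to recover the $w_i$, $\mu_i$, and $\Sigma_i$ (up to a desired accuracy) from i.i.d. samples drawn from $f$.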
These two papers take a slightly more general tack, in different ways. Both eliminate the conditional nature of prior bounds ("if the Gaussians are well separated"), replacing it with an unconditional bound ("polynomial in a parameter capturing the separation").
The first, by Moitra and Valiant (both Ph.D. students!), measures the separation between the distributions via their statistical distance, and gives a bound for learning in terms of the reciprocal of this distance. Their main insight is to project the distributions onto a few directions (i.e., down to 1D), learn what they can, and then recurse on the data that doesn't appear well separated enough to learn properly. This takes a lot of technical skill to do correctly.
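As a toy illustration of the projection step (and only that: the recursive clustering and the 1D learning subroutine are where the real work in the paper happens), here's a sketch of projecting mixture samples onto a single random direction. The projected data is itself a 1D Gaussian mixture whose component means and variances are $\langle r, \mu_i\rangle$ and $r^\top \Sigma_i r$:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 2-component mixture in d dimensions (parameters chosen arbitrarily).
d, n = 10, 5000
w = np.array([0.4, 0.6])                        # mixing weights
mu = np.stack([np.zeros(d), 2.0 * np.ones(d)])  # component means
cov = np.stack([np.eye(d), 0.5 * np.eye(d)])    # component covariances

# Sample from the mixture: pick a component for each point, then draw from it.
labels = rng.choice(2, size=n, p=w)
X = np.array([rng.multivariate_normal(mu[z], cov[z]) for z in labels])

# Project onto a random unit direction r: each component becomes a 1D Gaussian
# with mean <r, mu_i> and variance r^T Sigma_i r.
r = rng.standard_normal(d)
r /= np.linalg.norm(r)
proj = X @ r

for i in range(2):
    print(f"component {i}: true projected mean {r @ mu[i]:.3f}, "
          f"true projected variance {r @ cov[i] @ r:.3f}, "
          f"empirical projected mean {proj[labels == i].mean():.3f}")
```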
The second paper, by Belkin and Sinha, takes a different approach. They pay more attention to the parameters of the distribution. This can be a problem in general because closeness in distribution does not imply closeness in parameter space. However, they show that the set of all distributions with fixed moments forms a nice algebraic variety, and that the above problem can be quantified by determining how closely this variety folds in on itself (intuitively, if the variety intersects itself, then two far-apart parameter sets yield the same distribution). Using some standard (but not trivial) results in algebraic geometry, they can bound the complexity of the resulting structures (so that they can estimate the number of samples needed to get to a distinct region of the space). There's more detail involved in the estimation of parameters and moments and so on.
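To see where the algebraic variety comes from, here is the simplest possible illustration (a univariate mixture of two Gaussians, not the paper's general setting). If $X$ is drawn from $w\,\mathcal{N}(\mu_1,\sigma_1^2) + (1-w)\,\mathcal{N}(\mu_2,\sigma_2^2)$, then the raw moments $M_j = \mathbb{E}[X^j]$ are polynomials in the parameters:

$$
\begin{aligned}
M_1 &= w\mu_1 + (1-w)\mu_2,\\
M_2 &= w(\mu_1^2 + \sigma_1^2) + (1-w)(\mu_2^2 + \sigma_2^2),\\
M_3 &= w(\mu_1^3 + 3\mu_1\sigma_1^2) + (1-w)(\mu_2^3 + 3\mu_2\sigma_2^2),
\end{aligned}
$$

and so on for higher moments. The set of parameter vectors $(w,\mu_1,\sigma_1,\mu_2,\sigma_2)$ consistent with a fixed list of moments is cut out by polynomial equations like these, and it's the geometry of that solution set (in particular, how close it comes to folding back on itself) that controls how well estimated moments pin down the parameters.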
While both papers are impressive achievements that essentially resolve the theoretical aspects of learning mixtures of distributions, I came away not entirely sure how to compare them. On the one hand, the Moitra/Valiant paper is more explicit and is likely to be less "galactic" in nature, but it appears to be limited to Gaussians. On the other hand, the Belkin/Sinha paper is more general, but uses more abstract machinery that is unlikely to be useful in practice.
So what's still missing is an effective way of learning Gaussians or other families of distributions, as well as a better understanding of how these bounds relate to each other.