Comments on The Geomblog: Choosing the number of clusters I: The Elbow Method

in elbow method how do you define your cost functi...

2012-04-24T13:19:14.206-06:00

In the elbow method, how do you define your cost function? Could you please give more information?
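One common choice, offered here as an assumption rather than as the post's own definition, is the k-means objective: the sum of squared distances from each point to its nearest center, computed for each k and plotted against k. A minimal sketch on hypothetical 1-D data follows; the function name, the toy data and the quantile-based starting centers are all illustrative.

```python
def kmeans_cost_1d(points, k, iters=50):
    """Lloyd's algorithm on 1-D data; returns the within-cluster sum of squares."""
    pts = sorted(points)
    n = len(pts)
    # Deterministic start (an assumption): centers at evenly spaced quantiles.
    centers = [pts[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Move each center to its cluster mean; an empty cluster keeps its center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centers) for p in pts)

# Hypothetical data: three well-separated groups on the line.
data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
costs = {k: kmeans_cost_1d(data, k) for k in range(1, 6)}
for k, cost in costs.items():
    print(k, round(cost, 3))
```

On this toy data the cost drops sharply up to k = 3 and barely changes afterwards, which is the "elbow" one looks for.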

I've also read about an interesting bayesian a...

2010-03-08T09:49:02.042-07:00

I've also read about an interesting Bayesian approach in Bishop's book, under the variational mixture of Gaussians model. A prior distribution over the mixture weights is initially specified with a predetermined number of clusters, which is assumed to be large enough. The EM algorithm then works in such a way that some clusters end up with very small weights and hence are "eliminated", which seems to be some sort of automatic relevance determination.

@cosma: the sieve idea is very interesting, and ye...

2010-03-08T09:46:17.702-07:00

@cosma: the sieve idea is very interesting, and yes, I was a tad careless in calling BIC a nonparametric method.

@Noel I've learnt everything I know about DP from Hal's seminars :)

@both: it seems like the sieve method and the DP are two examples of the same idea: using an infinite mixture model and letting the data decide the answer. I was trying to avoid generative-model-based clustering, but I think I will add another part to the series on letting the data pick the answer.

Yes, I mean things like the Indian buffet process....

2010-03-08T07:46:23.509-07:00

Yes, I mean things like the Indian buffet process. The most relevant work here is the Dirichlet process (DP), which is an extension of the Dirichlet distribution to an infinite dimension. A sample from a DP is thus the mixing distribution for a mixture model with an infinite number of components (with most of the mass concentrated on a few components). With this type of model you don't have to choose k; k is infinite, but only a finite number of components will actually be used given finite data (and generally you get sensible results rather than one component per data point). Hal has published quite a bit in this area.
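The point that only finitely many components get used can be illustrated with the Chinese restaurant process, the distribution over partitions that a DP induces. The sketch below is only an illustration: the function name, the concentration parameter `alpha` and the seed are assumptions, and it simulates cluster sizes only, with no data likelihood.

```python
import random

def crp_cluster_sizes(n, alpha, seed=0):
    """Assign n points to clusters by the Chinese restaurant process; return cluster sizes."""
    rng = random.Random(seed)
    sizes = []
    for i in range(n):
        # Point i joins existing cluster j with probability sizes[j] / (i + alpha)
        # and opens a new cluster with probability alpha / (i + alpha).
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for j, s in enumerate(sizes):
            acc += s
            if r < acc:
                sizes[j] += 1
                break
        else:
            sizes.append(1)
    return sizes

sizes = crp_cluster_sizes(1000, alpha=1.0)
print(len(sizes), sorted(sizes, reverse=True)[:5])
```

Under this process the expected number of occupied clusters grows roughly like alpha * log(n), so with finite data the number of clusters actually used stays far below one per data point.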

Related to the elbow method, are algorithms that c...

2010-03-08T07:15:37.153-07:00

Related to the elbow method are algorithms that compute a permutation of the data such that any prefix is a good clustering under a certain measure. The greedy permutation works well for k-center, and Plaxton has a paper showing how to do this for k-means (or is it k-median) clustering...
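For the k-center case, the greedy permutation is the farthest-first traversal: repeatedly pick the point farthest from those already chosen. A minimal sketch, assuming Euclidean distance and an arbitrary starting point (the function name and the sample points are hypothetical):

```python
import math

def greedy_permutation(points):
    """Farthest-first traversal; returns point indices in the order chosen."""
    if not points:
        return []
    n = len(points)
    order = [0]  # arbitrary start (an assumption; any point may be used)
    chosen = {0}
    # dist[i] = distance from point i to the nearest point chosen so far.
    dist = [math.dist(points[0], p) for p in points]
    while len(order) < n:
        nxt = max((i for i in range(n) if i not in chosen), key=lambda i: dist[i])
        order.append(nxt)
        chosen.add(nxt)
        for i in range(n):
            dist[i] = min(dist[i], math.dist(points[nxt], points[i]))
    return order

pts = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 8)]
order = greedy_permutation(pts)
print(order)
```

The first k points of this ordering, used as centers, are the classical 2-approximation for k-center, so a single pass yields a candidate clustering for every k at once.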

I like this post and am looking forward to the oth...

2010-03-08T06:24:15.855-07:00

I like this post and am looking forward to the others. That said, in the proud academic/Internet tradition, I will show my gratitude by quibbling.

1. At least in statistics, we often use Gaussian (etc.) mixture models without knowing how many components go into the mixture model. Then people use *IC, or cross-validation. (I prefer cross-validation.) My friends Chris Genovese and Larry Wasserman have an interesting paper on using the method of sieves for Gaussian mixtures, though I don't know if anyone's extended it beyond the Gaussian case.

2. How is BIC "nonparametric"? It's all about choosing between finitely-parametrized models. (Admittedly, real Bayesians never choose between models.)
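For reference, the standard definition of BIC makes the second point concrete: its penalty term depends on a finite parameter count. The sketch below uses the usual formula; the helper that counts parameters of a 1-D Gaussian mixture is an illustrative assumption, not something from the post.

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood

def gmm_1d_param_count(k):
    """Free parameters of a 1-D Gaussian mixture: (k - 1) weights, k means, k variances."""
    return 3 * k - 1

# Hypothetical numbers: a 2-component fit with log-likelihood -100 on 100 samples.
score = bic(-100.0, gmm_1d_param_count(2), 100)
print(round(score, 3))
```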

@Noel: am not familiar with Bayesian nonparametric...

2010-03-08T04:15:23.928-07:00

@Noel: I am not familiar with Bayesian nonparametric methods other than BIC. Do you mean things like LDA, the Indian buffet process and the like?

I guess when I was talking about mixture model methods, I was thinking about things like mixture-of-Gaussians models. The main question is whether or not you put a price on the complexity of the model.

Are Bayesian non-parametric methods outside the sc...

2010-03-08T03:58:59.545-07:00

Are Bayesian non-parametric methods outside the scope of what you want to discuss? They provide an elegant way to handle the "how many clusters" problem.

Also, I don't really understand your point about mixture-model clustering ("it makes less sense to what to choose k") given you then talk about AIC/BIC etc. which AFAIK assume a mixture-model clustering setup.