Comments on The Geomblog: Clustering: k-means...

Here's the link to Andrea Vattani's paper ...

2012-12-18T22:25:00.415-07:00

Here's the link to Andrea Vattani's paper on this:

http://cseweb.ucsd.edu/~avattani/papers/kmeans.pdf

Thanks for writing this article. I have written K...

2012-12-18T21:42:35.425-07:00

Thanks for writing this article. I have written K-means a couple of times but was always ignorant of its convergence properties. So when you wrote "... It might take exponential time to converge, even in the plane.", I wondered where I could find the source.

Arthur Chan

Correct me if I am wrong, but the "outliers a...

2011-09-16T13:01:37.940-06:00

Correct me if I am wrong, but isn't the "outliers as centroids" problem minimized with k-means++ (David Arthur and Sergei Vassilvitskii)?
I thought their way of calculating each point's probability of being selected as a centroid takes the density into account as well.
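
For reference, here is a minimal sketch of the seeding step as I understand it from the Arthur and Vassilvitskii paper: each new center is drawn with probability proportional to its squared distance from the nearest center already chosen. There is no explicit density term in this sketch; a dense far-away cloud is favored only because its points add up to more total weight than a single outlier. The function names (`sq_dist`, `kmeanspp_seeds`) are made up for illustration, and the code assumes the data contains at least k distinct points.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centers by D^2 sampling (hypothetical helper).

    Assumes `points` holds at least k distinct points; otherwise all
    weights can become zero and random.choices raises ValueError.
    """
    rng = rng or random.Random()
    # First center: chosen uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Next center: probability proportional to d2, so far-away points
        # (including lone outliers) are more likely to be picked.
        idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
        # Update nearest-center distances with the newly added center.
        d2 = [min(old, sq_dist(p, points[idx])) for old, p in zip(d2, points)]
    return centers
```

A call such as `kmeanspp_seeds(data, 3, random.Random(0))` returns three points of `data` to use as starting centers for the usual k-means iterations.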

Thanks for the blog, really interesting!!!

This is to The Tobacconist and to Parasaran Raman....

2010-08-15T23:02:51.451-06:00

This is to The Tobacconist and to Parasaran Raman. Does either of you have any code for choosing K? I have a dataset for school that has a lot of outliers, but I am interested in the outliers. I just don't know how to pick K to start, or how to seed the clusters.

thanks in advance if you could do that !! neways t...

2009-07-10T13:02:01.972-06:00

Thanks in advance if you could do that!! Anyway, thanks for this too.

It's an excellent point. the question you refe...

2009-07-10T10:27:11.589-06:00

It's an excellent point. The question you refer to is the problem of labelling the clusters with actual meaning.

Traditionally these two problems are viewed as somewhat separate. Or, the combined problem of labelling and clustering is generally termed multi-class classification. Maybe I'll try to discuss that a bit.

hii.. i am new to this clustering concept.. and un...

2009-07-10T07:47:34.888-06:00

Hi, I am new to this clustering concept, and I still don't understand: even after the clustering is done, which group should we put the clusters in?
I mean the name of the group, or some criterion to differentiate between groups!

I tried out the intelligent seeding on few (advers...

2009-07-06T08:38:33.868-06:00

I tried out the intelligent seeding on a few synthetic data sets in 2D/3D (adversarial: with outliers and stray points) and on data from the UCI repository. It turned out that I get much better clusterings than if I just run the MATLAB kmeans code. As for the outliers being picked up, one idea that comes to mind is to neglect points that are unclusterable (stray) and only cluster the clusterable(?) points.

You're right about the problem with outliers. ...

2009-07-03T23:56:37.211-06:00

You're right about the problem with outliers. There are clustering variants that try to address this problem. One way I could imagine the initialization working on this is to bias the sampling by the density: if there's only one point far away, its probability of getting picked is lower than if there's a cloud of points far away.
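
To make the density idea concrete, here is one possible reading of it as a sketch, not a tested method: multiply the usual squared-distance weight by a crude local density estimate (the number of points within some radius, counting the point itself). Everything here is an assumption for illustration: the `radius` parameter, the neighbor-count density, and the names `local_density`, `seed_weights` and `density_biased_seeds`.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def local_density(points, radius):
    """Crude density: how many points lie within `radius` of each point.

    Each point counts itself, so the value is at least 1. O(n^2) time.
    """
    r2 = radius * radius
    return [sum(1 for q in points if sq_dist(p, q) <= r2) for p in points]


def seed_weights(points, centers, density):
    """Weight = (squared distance to nearest center) * (local density)."""
    return [
        min(sq_dist(p, c) for c in centers) * dens
        for p, dens in zip(points, density)
    ]


def density_biased_seeds(points, k, radius, rng=None):
    """Pick k seeds, favoring far-away points that sit in dense regions.

    Assumes at least k distinct points, so the weights are never all zero.
    """
    rng = rng or random.Random()
    density = local_density(points, radius)
    centers = [rng.choice(points)]  # first seed: uniform at random
    while len(centers) < k:
        weights = seed_weights(points, centers, density)
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

With this weighting, a lone point at a given distance gets a smaller weight than a point at the same distance that has neighbors, which is the behavior described above. The choice of `radius` matters a lot and would have to be tuned per data set.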

As for choosing k, I'll have much more to say about that in a future post.

I am enjoying reading this series. I've been w...

2009-07-03T18:05:45.856-06:00

I am enjoying reading this series. I've been working on clustering for a couple of years now and what has struck me is just how popular k-means is within the computer vision community. My guess is that it is easy to implement (and readily available in MATLAB) and maybe there is a view that in the end it isn't a critical component of the bigger vision system. I disagree with this, of course, but it has been hard to find other people who feel the same.

I stumbled upon the Arthur and Vassilvitskii paper too and am implementing it for a k-means implementation I am currently using. It seems like a very good heuristic, though I feel it might tend to pick outliers in the initial iterations, and this may be detrimental to the final clustering.

One of the bigger problems I've had with k-means is in deciding how many clusters there are in the data set. A lot of people just pick a number. Some plot curves and follow it up with voodoo. Recently I was experimenting with some of these methods to pick a K and none of them produced even remotely satisfying results (on both artificial and real data). The general wisdom in using clustering seems to lean towards a preference to "over-segment" clusters and just go with it. I wonder how prevalent this usage of k-means really is.
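
For anyone who wants to try the "plot curves" approach, here is a minimal sketch of how such a curve can be computed: run plain k-means for several values of k, keep the lowest cost over a few random restarts, and look at how the cost drops as k grows. It only produces the numbers; deciding where the "elbow" is remains the voodoo part. The helper names (`kmeans_cost`, `cost_curve`) and the defaults for `iters` and `restarts` are assumptions for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def kmeans_cost(points, k, iters=50, rng=None):
    """Run Lloyd's iterations from a random start; return the final cost.

    Cost = sum of squared distances from each point to its nearest center.
    """
    rng = rng or random.Random()
    centers = rng.sample(points, k)  # k data points as initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old center.
        new_centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def cost_curve(points, ks, restarts=5, seed=0):
    """Best cost over `restarts` random starts, for each k in `ks`."""
    rng = random.Random(seed)
    return {
        k: min(kmeans_cost(points, k, rng=rng) for _ in range(restarts))
        for k in ks
    }
```

Calling `cost_curve(data, range(1, 10))` gives a dictionary from k to cost that can be plotted or inspected by hand. Since the cost generally keeps shrinking as k grows, the curve by itself does not single out one "right" k.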

Thanks for doing this series. It has been fun reading.