Monday, September 03, 2007

NP hardness of Euclidean k-median clustering

Suppose you're given a metric space (X, d) and a parameter k, and your goal is to find k "clusters" such that the sum of cluster costs is minimized. Here, the cost of a cluster is the sum (over all points in the cluster) of their distance to the cluster "center" (a designated point).

This is the well-known k-median problem (which differs from the also popular k-means problem by the use of distances, rather than squares of distances). In a general metric space, the k-median problem is known to be NP-hard, as well as being hard to approximate to within arbitrary constant factor. The current best known approximation ratio for the k-median is 4, due to Charikar and Guha 3 + eps, via a local search heuristic due to Arya, Garg, Khandekar, Meyerson, Munagala and Pandit (thanks, Chandra).

If the underlying metric is the Euclidean metric, then the problem complexity changes: in fact a PTAS for the Euclidean k-median can be obtained, due to the results of Arora, Raghavan and Rao, and then Kolliopoulos and Rao (who provide an almost-linear time algorithm). But as far as I am aware, there is still no NP-hardness proof for the Euclidean k-median problem, and I'd be interested in knowing if I am wrong here.

Note that the related problem of Euclidean k-means is known to be NP-hard from an observation by Drineas, Frieze, Kannan, Vempala and Vinay.

Update: As pointed out in the comments, the NP-hardness of k-median in the plane was shown in 1984 by Megiddo and Supowit. Once again, the wisdom of the internet beats independent research ;)

1. The best known approximation for
k-median in general metric spaces
is 3+\eps due to a local search
algorithm of Arya etal.

2. WoW !! what a coincidence ? I was trying to find an answer to this exact question over this long-weekend. I could not get any NP-hardness proof after three days of crazy googling and acm-search. Thanks for asking this question.

I have a related question. What exactly is the k-variance problem ? I came across couple of different definitions. Is it's complexity open ? Are there any known results for k-variance on euclidean plane.

3. If by k-variance you mean minimizing the sum of squares of distances of intracluster pairs, then there's a paper by Schulman from STOC 2000 that's relevant here.

http://portal.acm.org/citation.cfm?doid=335305.335373

4. A quick comment: the Arora-Raghavan-Rao PTAS holds only for *low-dimensional* Euclidean metrics. Specifically, the running time of the algorithm is doubly exponential in d.

As far as the hardness goes, it is known that the doubly exponential dependence is (likely) necessary for a PTAS, *if* your medians must come from a given set (the "discrete median" case). This follows from a combination of papers, incl. Trevisan'97, see comments in Guruswami-Indyk, SODA'03.
So in particular, it is known that there is no PTAS for the discrete k-median in general Euclidean spaces (unless something strange happens)

Piotr

5. If I am not mistaken, the following paper shows that NP-Hardness of the k-median problem under Eucledean metric.

N. Megiddo and K. Supowit.
On the complexity of some common geometric location problems.
SIAM J Computing, 13(1), 1984.

6. ah ! and I've even read this paper before.

Yes, that indeed is a proof. Specifically they show that k-median in the plane is NP-hard, by a very complicated reduction from 3SAT.

thanks !

7. I thought that "designated point" meant that the cluster median was to be selected among the given points. Megiddo and Supowit proof the version where the medians are freely selectable from real plane.

The version where the medians must be selected from the given points (with Euclidean metrics) was proved to be NP-complete by Papadimitriou [Worst-Case and Probabilistic Analysis of a Geometric Location Problem, SIAM J. Comput. 10(3), 1989].

Anyhow, it's quite annoying when the same name refers to two different problems. I've seen somebody using "Geometric k-Median" to refer to the Megiddo and Supowit's version, but I don't know do I like it. Opinions?

8. I am wondering what is known for the k-median problem in hamming distance space. Any good solutions?