Monday, September 03, 2007

NP hardness of Euclidean k-median clustering

Suppose you're given a metric space (X, d) and a parameter k, and your goal is to find k "clusters" such that the sum of cluster costs is minimized. Here, the cost of a cluster is the sum (over all points in the cluster) of their distance to the cluster "center" (a designated point).

This is the well-known k-median problem (which differs from the also popular k-means problem by the use of distances, rather than squares of distances). In a general metric space, the k-median problem is known to be NP-hard, as well as being hard to approximate to within arbitrary constant factor. The current best known approximation ratio for the k-median is 4, due to Charikar and Guha 3 + eps, via a local search heuristic due to Arya, Garg, Khandekar, Meyerson, Munagala and Pandit (thanks, Chandra).

If the underlying metric is the Euclidean metric, then the problem complexity changes: in fact a PTAS for the Euclidean k-median can be obtained, due to the results of Arora, Raghavan and Rao, and then Kolliopoulos and Rao (who provide an almost-linear time algorithm). But as far as I am aware, there is still no NP-hardness proof for the Euclidean k-median problem, and I'd be interested in knowing if I am wrong here.

Note that the related problem of Euclidean k-means is known to be NP-hard from an observation by Drineas, Frieze, Kannan, Vempala and Vinay.

Update: As pointed out in the comments, the NP-hardness of k-median in the plane was shown in 1984 by Megiddo and Supowit. Once again, the wisdom of the internet beats independent research ;)

Disqus for The Geomblog