## Monday, June 13, 2011

### Clustering with outliers

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here

(ed. note: I'm happy to welcome Sergei Vassilvitskii (of k-means++ and mapreduce-algos fame, among other things) to this series. Sergei has great insights on clustering, and we decided to join forces as we continue this series. Here's his first post, on clustering with outliers)

When abstracting the clustering problem, we often assume that the data is perfectly clusterable and so we only need to find the right clusters. But what if your data is not so perfect? Maybe there's background noise, or a few random points were added to your data by an adversary. Some clustering formulations, in particular k-center or k-means, are not stable -- the addition of a single point can dramatically change the optimal clustering. For the case of k-center, if a single point $x$ is added far away from all of the original data points, it will become its own center in the optimum solution, necessitating that the other points are only clustered with $k-1$ centers.

It is easy to appeal to intuition for the definition of outliers, but formally specifying when a point becomes an outlier is a non-trivial task. Consider for example a set of points generated by a single Gaussian with small variance. Now we pick one data point and slowly start moving it away from this set. We agree that originally that point is not an outlier; as we move it out to infinity, it certainly becomes one. But where exactly did it cross this threshold?

There are two ways to define clustering with outliers. First we can simply chose to ignore some fraction of the points. In the case when there are no outliers, ignoring a small (say 1%) fraction of the points should not materially affect the final clustering results. In the case that outliers are present, however, the algorithm has the choice of ignoring them in computing the final metric. Consider again the example above; although we don't know when the point we are moving becomes an outlier, in some sense we don't care because we are always looking at the optimum clustering without it.

A generalization of the above approach assigns a cost for not classifying a particular point and gives a budget for unclassified points. Thus, if we know ahead of time that some points should be classified and some can be ignored, one can guide the algorithm to focus on the points that matter. Of course assigning each point an identical weight leads exactly to the first formulation.

Algorithms

The ability to ignore a fraction of the points seems like it would make the problem easier, since the algorithm now has explicit permission to 'sweep things under the rug.' It actually makes the problem harder, since the optimum solution can also do this. Put another way, the algorithm needs to not only find the right clustering of points, but also decide which points to cluster in the first place.

There are two extreme approaches to the latter problem: one can first decide which points are outliers and then run the baseline clustering algorithm, or one can chose not to ignore any of the points and decide which once are outliers given a clustering. Unfortunately neither one of these works well -- the first scales poorly, as the number of choices for outlier points grows exponentially; the second leads to poor solutions -- as we saw before ignoring the outliers can materially change the final outcome.

The key to finding clustering algorithms that handle outliers is to determine which points will be ignored during the algorithm execution, and not before or after. For an example, consider the $k$-center problem. Recall we are trying to find a set of centers $\mathcal{C}$ and a radius $r$ so that every point is within $r$ of some point in $\mathcal{C}$. There is a lower bound that says unless $\mathsf{P}=\mathsf{NP}$ no algorithm can find better than a 2-approximation on the radius. There are many ways to achieve a 2-approximation: we previously saw the furthest point algorithm by Gonzales: repeatedly choose the point that is furthest away from the current set $\mathcal{C}$ and add it as s center. The algorithm breaks down if there are outliers --- we will surely choose an outlier as a cluster center.

Here is an alternative algorithm for $k$-center:
1. Guess the optimal radius $r$.
2. While there are uncovered points, chose one point $p$ as a center, and mark all points within distance $2r$ as covered.
It is easy to see that in every iteration we mark as covered all points in the optimal cluster that $p$ belongs to. If after $k$ iterations there are still uncovered points remaining, it must be that the original guess of $r$ was incorrect.

To adapt this algorithm to the setting of outliers, we make two changes. First, we increase the slack on the radius from a factor of 2 to a factor of 3 (this turns out to be necessary; unless $\mathsf{P}=\mathsf{NP}$ no algorithm can find a better than a 3-approximation to the radius). Second, instead of selecting a point arbitrarily, we select the point that has the maximum number of uncovered points within radius $r$ from it.

Thus the final algorithm is as follows:
1. Guess the optimal radius $r$.
2. Repeat $k$ times: select the point $p$ that has the largest number of uncovered points within distance $r$. Mark as covered all points within distance $3r$ of $p$.
The points that are left over in the end are precisely the outliers: they are both far away from any individual cluster center, and are not well clustered themselves (otherwise one of them would have been chosen as a potential cluster). Charikar et al show that this algorithm attains a 3-approximation: if the optimum can cover all but $\ell$ points with radius $r$, then the algorithm above will cover all but $\ell$ points with radius $3r$.