However, perhaps one of the most important heuristics used in practice is rarely discussed, and I have yet to find a source that examines its power fully. This heuristic is "alternating minimization", and it appears "in the wild" in two wildly popular algorithms: the k-means algorithm in clustering and the Iterative Closest Point (ICP) algorithm in model registration.

Suppose you have a cost function D defined over the product of domains P and Q. For concreteness, let's say that P is the space of all assignments of points to clusters, and Q is the space of all assignments of centers to clusters. The standard k-means cost function asks for an assignment of points to clusters, and an assignment of centers to clusters, so that the sum of squared distances of points to their cluster centers is minimized.

In general, such an optimization is hard to perform. However, it is often the case that for a fixed p in P, we can minimize D(p, q) over all q in Q, and vice versa. Continuing the example above, using Euclidean distance as the underlying distance, it's easy to see that for a fixed set of points in a cluster, the minimizing center is their centroid. Moreover, for a fixed set of centroids, the optimal assignment is a straight nearest neighbour assignment.

This is the idea behind alternating minimization. Starting with arbitrary p0 and q0, we repeat the two-step procedure:

- p_i = arg min_{p in P} D(p, q_{i-1})
- q_i = arg min_{q in Q} D(p_i, q)
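For concreteness, here is a minimal sketch of this two-step procedure instantiated for the k-means cost function (i.e., Lloyd's algorithm); the function and parameter names are mine, not standard:

```python
import numpy as np

def alternating_kmeans(X, k, init=None, iters=100, seed=0):
    """Alternating minimization for the k-means cost.

    X: (n, d) array of points; k: number of clusters;
    init: optional (k, d) array of starting centers (q_0).
    """
    rng = np.random.default_rng(seed)
    if init is None:
        centers = X[rng.choice(len(X), size=k, replace=False)]  # arbitrary q_0
    else:
        centers = np.asarray(init, dtype=float)
    for _ in range(iters):
        # p-step: for fixed centers q, the best assignment p is nearest-neighbour
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # q-step: for a fixed assignment p, the best center is the centroid
        new = np.array([X[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):  # fixed point: neither step improves
            break
        centers = new
    cost = ((X - centers[labels]) ** 2).sum()
    return labels, centers, cost
```

Each iteration solves one of the two easy subproblems exactly, so the cost never increases; as discussed below, this yields convergence, though not necessarily to the global optimum.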

Now there are three questions that one usually has to answer for an alternating minimization. The first one is relatively straightforward:

Does it converge? It's often fairly easy to write down a potential function that decreases strictly with each iteration and is bounded below (D itself often serves). Thus, convergence can often be established for this procedure.

Does it converge to the optimal solution? In general, no. This procedure can often get stuck in local minima. However (and this is one of the more attractive aspects of this heuristic), Csiszár and Tusnády showed that as long as a certain "five-point" property holds (a simple inequality involving iterates of the procedure), the process will converge to the optimal solution. In their paper, this result is used to show convergence when P and Q are convex sets of distributions and D is the Kullback-Leibler divergence.

Another useful example where the procedure converges is when P and Q are convex sets and D is the distance between a pair of points, one from each set. In that case, each iterative step projects the current point onto the other convex set, i.e., finds the closest point in that set.
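This alternating-projections scheme can be sketched in a few lines; here is a toy instance with two convex sets in the plane (the unit disk and a halfplane), where the exact projection onto each set is known in closed form. The helper names are mine:

```python
import numpy as np

def closest_pair(proj_P, proj_Q, q0, iters=200):
    """Alternating projections between two convex sets.

    proj_P, proj_Q: exact projection maps onto the sets P and Q.
    Starting from q0 in Q, each step finds the closest point in the
    other set, converging to a pair realizing the distance between them.
    """
    q = np.asarray(q0, dtype=float)
    for _ in range(iters):
        p = proj_P(q)  # closest point in P to the current q
        q = proj_Q(p)  # closest point in Q to the current p
    return p, q

# P = unit disk centered at the origin
def proj_disk(x):
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

# Q = halfplane { (x, y) : x >= 3 }
def proj_half(x):
    return np.array([max(x[0], 3.0), x[1]])

p, q = closest_pair(proj_disk, proj_half, np.array([3.0, 4.0]))
# p approaches (1, 0) and q approaches (3, 0): the closest pair of points.
```

Since both sets are convex and each projection is exact, the distance between the iterates is non-increasing, which is exactly the potential-function argument from the convergence discussion above.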

A variant of the above question is: does it converge to a close-to-optimal solution? Again, in general, one cannot say anything useful about this. However, for k-means, recent work by Arthur and Vassilvitskii, and by Ostrovsky, Rabani, Schulman and Swamy, has shown that choosing starting points carefully can yield provable bounds on the quality of the solution.

Does it converge in polynomial time? Perhaps the most important question about this heuristic is whether it converges in polynomial time. For k-means, it can be shown that there are examples that force the heuristic to take superpolynomially many iterations to converge. However, recent results in smoothed complexity have shown that if the input is perturbed ever so slightly, then the iterative procedure converges in polynomial time (with high probability). In a sense, one can view this result as explaining the success of the method in practice.

Alternating minimization appears in many domains, and is a common trick for "solving" complex optimizations. It should definitely be taught wherever heuristics are discussed.

As another important alternating algorithm for local search, let's not forget the EM algorithm in machine learning.

In coding theory, this approach is called the Blahut-Arimoto algorithm; it is used for (numerically) finding the channel capacity of certain channels.

I didn't. I think of EM and k-means as essentially the same process (k-means is the discrete analog of EM, which itself is a special case of Blahut-Arimoto, with a distance function defined by the induced Bregman distance).

An interesting generalization of EM is the CCCP (the convex-concave procedure, not the country); it's been used to get convergent message-passing algorithms for loopy graph decoding.

Interesting. This seems to be closely related to a new paper by Rifkin et al. on value regularization:

ReplyDeletehttp://www.jmlr.org/papers/volume8/rifkin07a/rifkin07a.pdf

I am working in an applied ML field (speech technology), and we use k-means and EM daily, so the topic of the post is really relevant to me.

Now that we are talking about EM and friends: the problem I have been wondering about for some time is the complexity of the underlying problem in EM (i.e., log-likelihood maximization).

The case of k-means, as Suresh wrote, is a discrete one: we can always just consider the assignments of data vectors to different partitions. So we get a nice combinatorial problem, but in the case of log-likelihood maximization no such discrete structure exists. Even if we restrict the mean vectors of the GMM to be a subset of the data vectors, we still need to somehow select component weights and covariance matrices.

Do we need to use Blum's complexity theory (from Complexity and Real Computation) to analyze this problem?

Hmm, interesting question. I actually don't know what one might do for general log-likelihood maximization: it would be interesting to see if the k-means-type analyses actually extend.

Hi,

interesting post.

Does anybody have a PDF/PS file of the Csiszar/Tusnady paper?

Thanks,

Thomas

Isn't this coordinate ascent/descent?

It's a special case of coordinate descent, yes.
