Sunday, August 30, 2009

Spectral Clustering.

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)
I don't like you, and I like them, but maybe if we change, we can get along.
Spectral clustering is not clustering: it's actually a metric modification method. It's another way of capturing things that are close versus things that are far, using a more global method.

We saw how correlation clustering could be used to capture both the ``I don't like you'' and ``I like them'' aspects of clustering by assigning positive and negative weights to individual element pairs. However, it's not easy to come by such a labelling of pairs. More often than not, all we have is a distance function. We could threshold it, declaring all distances within some parameter r to be positive, and all distances larger to be negative, but this is completely arbitrary and depends on the choice of r.

Let's think of the two-class problem for now: we merely want to divide the points into two clusters. Suppose we were able to guess a function $$f$$ that assigns one label (0) to points in one cluster and another label (1) to points in the other cluster. Then we could write down the cost of the clustering (in a correlation sense) by summing up, over all pairs, the difference $$f(i) - f(j)$$.

In fact, just to make sure it's always positive, we'll square it, and thus will sum up the function $$(f(i) - f(j))^2$$. We'll also assume a weight function $$w_{ij}$$ on the edge $$(i,j)$$, so that what we're really summing up is $$\sum_{ij} w_{ij}(f(i) - f(j))^2$$.

This is where things start to get very interesting. It's not hard to see that this sum can be written in terms of a variant of the graph Laplacian, as the quadratic form $$f^\top L f$$, treating the f values as a vector. If we now allow the f values to be reals, rather than 0 or 1, we get a classical eigenvalue problem on the Laplacian, in effect determining the second smallest eigenvalue of the Laplacian.

Once we do that, the corresponding eigenvector gives us an assignment for f. We can either merely take positive or negative values of f to group the data, or we can use the f values as a feature representation and recluster the data into two groups based on that.


This is the key insight of spectral ``clustering''. We perform a spectral decomposition of the Laplacian, and take the k eigenvectors corresponding to the smallest eigenvalues. Those vectors give us a k-dimensional feature representation of the data in an inner product space, and we can now recluster. The real question is: why should this feature representation help us in any way at all ?
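
To see the whole pipeline in one place, here's a minimal sketch in Python/numpy. This is my own illustration rather than code from any particular paper; the affinity matrix W, the choice of the unnormalized Laplacian L = D - W, and the suggestion of k-means on the embedded rows are all assumptions of the sketch.

import numpy as np

def spectral_embedding(W, k):
    # W: symmetric nonnegative affinity matrix (n x n).
    # Returns the k eigenvectors of L = D - W with the smallest eigenvalues.
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    return vecs[:, :k]                  # n x k feature representation

def spectral_bisect(W):
    # Two-class case: split on the sign of the second-smallest eigenvector.
    f = spectral_embedding(W, 2)[:, 1]
    return (f > 0).astype(int)

# For k clusters, run your favorite clusterer (usually k-means) on the rows
# of spectral_embedding(W, k).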


Here's my take on what's really going on. We know that the graph Laplacian is a discrete equivalent of the Laplace-Beltrami operator on a manifold, which in turn (via the heat equation) captures the ``flow'' of material in a manifold. Return for a second to our cut-based view of the distance function. In essence, two points are close if there are many ways of getting from one to the other (i.e the min cut between them is large) and they are far if not. This is a flow-based way of thinking about distances, and in effect, what the Laplacian does is map the original data into a new metric space in which distance is based on flow, rather than just on metric distance.


This intuition can be formalized: if we switch now to the stochastic viewpoint, in which we're doing random walks on the graph, then the commute time between two vertices turns out to be precisely the Euclidean distance (squared: thanks, dsivakumar) between feature vectors constructed from the eigenvectors of the Laplacian, suitably scaled by the eigenvalues.
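
A small numerical check of that claim (my own sketch, not from the post): on a toy graph, compute commute times from the pseudoinverse of the Laplacian, and compare them to squared distances in the embedding whose coordinates are eigenvector entries scaled by inverse square roots of the eigenvalues.

import numpy as np

# Toy example: the path graph a - b - c with unit edge weights.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W
vol = W.sum()                            # sum of degrees

# Commute time via the Laplacian pseudoinverse (effective resistance).
Lp = np.linalg.pinv(L)
def commute(u, v):
    return vol * (Lp[u, u] + Lp[v, v] - 2 * Lp[u, v])

# The same quantity as a squared Euclidean distance between feature vectors.
vals, vecs = np.linalg.eigh(L)
feat = np.sqrt(vol) * vecs[:, 1:] / np.sqrt(vals[1:])   # drop the zero eigenvalue
def embedded_dist2(u, v):
    return np.sum((feat[u] - feat[v]) ** 2)

print(commute(0, 2), embedded_dist2(0, 2))   # both print 8.0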

... take a deep breath...

There are a number of different concepts swirling around here that are worth keeping track of. The cut-based viewpoint of node similarity in graphs lends itself naturally to a Laplacian-based treatment, which in turn leads us to random walks and the heat equation on manifolds. At a deeper level, the Laplacian of a graph, by virtue of how it acts on functions, tells us something very basic about the structure of the distances involved. For more on this particular aspect, you should check out the discussion around ``Can one hear the shape of a drum?''.


But it's very important to keep in mind that spectral clustering is not so much a clustering technique as a data transformation method. Although k-means is the clustering algorithm of choice once we get down to the feature representation, other methods could be used as well. Also, as relaxations go, this particular one (relaxing from the hypercube to the reals) isn't that great: the ``integrality gap'' can be quite high. It turns out that this doesn't matter terribly though.


As an aside, the idea of using the graph Laplacian to define a ``stretched'' distance that looks Euclidean is not limited to this problem. The most "famous" version of this idea is the Laplacian eigenmaps work by Belkin and Niyogi in the realm of manifold learning. There's also work by Guibas, Ovsjanikov and Sun that uses similar ideas to create signatures of shapes. The key idea, once again, is that if you're on a manifold of some kind, the Laplacian gives you a handle on the shape of this manifold, as well as how to ``stretch it out'' with local coordinates to make it look flat (and therefore Euclidean). It's a useful trick to keep in your data analysis toolbox.

For further reading, Ulrike von Luxburg has a great survey on spectral clustering.

Saturday, August 08, 2009

Negative-type distances and kernels

This is a story of two constructions that actually end up being essentially the same thing, as I recently discovered.

1. Kernels

The story of kernels in machine learning goes somewhat like this:

Take a positive definite function $$K(x,y)$$ that captures some notion of similarity between objects (a standard example of such a kernel is the Gaussian $$K(x,y) = \exp(-\|x - y\|^2)$$). A positive definite function, btw, is like the generalization of a p.d. matrix: the integral $$\int f(x)f(y)K(x,y)\,dx\,dy \ge 0$$ for any (square-integrable) function f.

You can then construct a Hilbert space that captures the structure of the kernel. Specifically, for a fixed set of points S, construct a vector space from the basis $$\{k_x | x \in S\}$$, where $$k_x(y) = K(x,y)$$, and then define an inner product of two vectors in this space in the usual way: If $$v_a = \sum_x a_x k_x$$ and $$v_b = \sum_x b_x k_x$$, then $$v_a \cdot v_b = \sum_{x,y} a_x b_y K(x,y)$$.

You get nice properties from this construction: the so-called reproducing property is one of them, which tells you that $$K(x,y) = k_x \cdot k_y$$. In other words, we can capture the similarity function K(x,y) by a "standard" inner product in a Hilbert space.

What's even neater is that by invoking Mercer's theorem, you can construct an orthogonal basis, and make sure that the Hilbert space is actually a Euclidean space. The squared Euclidean distance in this space can be written in kernel form, as
\[d_K(x,y) = K(x,x) + K(y,y) - 2K(x,y)\]
which is what you'd expect when treating K(.) as an inner product.
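
As a quick illustration (my own snippet, not from the post), here is that kernel-induced squared distance for the Gaussian kernel used above:

import numpy as np

def gaussian_kernel(x, y):
    return np.exp(-np.linalg.norm(x - y) ** 2)

def kernel_dist2(K, x, y):
    # d_K(x, y) = K(x,x) + K(y,y) - 2 K(x,y)
    return K(x, x) + K(y, y) - 2 * K(x, y)

x, y = np.array([0.0, 0.0]), np.array([1.0, 2.0])
print(kernel_dist2(gaussian_kernel, x, y))   # 2 - 2*exp(-5), since ||x-y||^2 = 5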

2. Negative-type distance.
The second story comes from Deza and Laurent. When can a distance function be embedded in Euclidean space ? It turns out that it's more convenient to talk about the square of the distance function, which we'll call D.

There's an elegant characterization of when D can be embedded isometrically into $$\ell^2_2$$: this can be done if and only if D satisfies the negative-type inequality:
\[ \sum_{i,j} b_i b_j D(x_i, x_j) \le 0\]
for all assignments to the $$b_i$$ satisfying $$\sum_i b_i = 0$$.

The proof works via a construction called a covariance mapping that takes $$D$$ to the function $$k_D$$ defined as:
\[ k_D(x,y) = \frac{1}{2}(D(x, x_0) + D(y, x_0) - D(x,y)) \]
which differential geometry folks will recognize as the Gromov product.

The proof completes by showing that the negative-type condition on D implies positive definiteness of $$k_D$$, and this in turn means that $$k_D$$ can be expressed as an R-covariance:
\[ k_D(x,y) = \int f_x(\omega)f_y(\omega) d\mu(\omega) \]
for some measure $$\mu$$ on a suitable space.

Note that the RHS of the equation is an infinite-dimensional inner product.
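
A small numerical sanity check of the construction (mine, not from Deza and Laurent): start with D given by squared Euclidean distances, apply the covariance mapping with base point x_0, and verify that the resulting matrix is a Gram matrix and hence positive semidefinite.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))                 # 10 points in R^3

# D[i, j] = squared Euclidean distance, a negative-type distance.
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

# Covariance mapping with base point x_0 = X[0].
kD = 0.5 * (D[:, [0]] + D[[0], :] - D)

# kD equals the Gram matrix of the shifted points X[i] - X[0] ...
G = (X - X[0]) @ (X - X[0]).T
print(np.allclose(kD, G))                        # True

# ... and is therefore positive semidefinite.
print(np.linalg.eigvalsh(kD).min() >= -1e-9)     # True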

3. Where the two stories come together
The mapping that takes a kernel to a distance is the inverse of the covariance mapping used to map a distance to a kernel. In other words, if we take a kernel K, compute $$d_K$$, and then use the distance-to-kernel mapping to compute $$k_{d_K}$$, we get back K. Further, since we can show that the negative-type condition on D implies a positive-definite condition on $$k_D$$, we can start off with either "D satisfies negative-type inequalities" or "K is a positive definite kernel" and yield the same conclusion on $$\ell_2^2$$ embeddability.

What's interesting is that the two ways of thinking about this don't seem to have been connected together explicitly before.

p.s This is not a fully rigorous correspondence except possibly in the finite dimensional case, but I find that it's often convenient to replace the negative-type argument with a kernel positive-definiteness argument in my head.

Sunday, August 02, 2009

Correlation Clustering: I don't like you, but I like them...

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

Whether it's k-clustering, or any kind of hierarchical clustering, the world we live in is still the world of ``I don't like you'' clustering, where the geometric landscape is defined by distances between points. Consider the following example:


There is an intuitive sense in which the first clustering is not ``real'' and the second one is: the idea of ``well-separatedness'' is a pervasive component of how a good clustering is perceived. But if we only measure distances between points, and only measure the cost of a clustering in terms of how costly each cluster is, we'll never be able to distinguish between these two examples.

What's needed is a way of declaring likes (similarity) as well as dislikes (distance), and then, critically:

penalizing similar items in different clusters AS WELL AS dissimilar items in the same cluster.

By that measure, we'd be able to distinguish the first and second clusterings, because in the first case, presumably elements close to each other that lie in different clusters will make the clustering look more expensive. This point is worth reiterating. Unless we have some way of penalizing mis-separations as well as mis-groupings, we'll always be at the mercy of the tradeoff between k and cost.

Continuing the "clustering via self-help metaphors" theme to these essays, I call this the "I don't like you, but I like them" way of modelling data.

Correlation Clustering

This is where correlation clustering enters the picture. The correlation clustering model is as follows: every pair of elements is assigned a 1 or -1, encoding a similarity or dissimilarity. The goal is to find a clustering (note: no k!) in which any pair of points in a cluster is penalized for being dissimilar, and any pair of points in two different clusters is penalized for being similar. This can be generalized to arbitrary weights: for each pair of elements you assign weights w+ and w-, with the possible caveat that w+ + w- = 1.


So now the goal is merely to minimize the cost of a clustering. The elegance of correlation clustering lies in the natural way that clusters merge or split depending on the number of similar or dissimilar pairs. Do note though that you need to have input data that can be written in that manner: thresholding a distance function will give you the desired input, but is an ad hoc way of doing it, since the thresholding is arbitrary.


There are a number of different algorithms for correlation clustering, and also a very simple one that yields a good approximation guarantee: pick a point at random, and pull in all its neighbors (all points similar to it) to form a cluster. Repeat this process with a new unpicked point, until all points have been assigned. This algorithm gives a 3-approximation (in expectation) for correlation clustering.
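
Here is that algorithm as a short sketch (my own code, assuming the +1/-1 similarities come as a symmetric matrix; this is the randomized ``pivot'' scheme, written for clarity rather than efficiency):

import random

def pivot_correlation_clustering(sim, seed=None):
    # sim[i][j] in {+1, -1}; +1 means "similar". Returns a list of clusters.
    rng = random.Random(seed)
    unassigned = set(range(len(sim)))
    clusters = []
    while unassigned:
        p = rng.choice(sorted(unassigned))        # pick an unclustered pivot at random
        cluster = {p} | {q for q in unassigned if q != p and sim[p][q] == 1}
        clusters.append(cluster)
        unassigned -= cluster
    return clusters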


Correlation clustering is also useful as a way of combining clusterings. We'll talk about this later (ed: how many times have I said that !), but the problem of consensus clustering is to ``cluster the clusterings'', or aggregate them into a single average clustering. An easy way to see the connection is this: given a collection of clusterings of a data set, create an instance of correlation clustering with the positive weight for a pair corresponding to the fraction of clusterings that ``vote'' for that pair being in the same cluster, and the negative weight being the fraction of clusterings that ``vote'' for that pair being separated.
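
In code, the reduction is essentially one broadcasted comparison (again my own sketch; the input clusterings are assumed to be label arrays over the same n items):

import numpy as np

def consensus_weights(labelings):
    # labelings: list of length-n label arrays. Returns (w_plus, w_minus), where
    # w_plus[i, j] is the fraction of clusterings that put items i and j together.
    L = np.asarray(labelings)                  # shape: (num_clusterings, n)
    agree = (L[:, :, None] == L[:, None, :])   # pairwise agreement per clustering
    w_plus = agree.mean(axis=0)
    return w_plus, 1.0 - w_plus

# Thresholding w_plus at 1/2 (say) gives a +1/-1 instance that can be fed to a
# correlation clustering routine such as the pivot sketch above.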

Thursday, July 23, 2009

Clustering: Hierarchical methods

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

k-center, k-median, k-means, k-medoids, .... The list is endless, but that pernicious k comes up everywhere. What can we do about it ? There are really two ways to go:
  1. Figure out the "right" k for a problem. This is a complicated matter, and will be the topic of a later post
  2. Don't choose: give the user a universal representation from which they can decide what k they want.

This latter formulation takes us into the world of hierarchical clustering, the topic of this post.


There are two different ways to think about hierarchical clustering: a representational view and an algorithmic view. The algorithmic view is quite simple: let's try to design a clustering via greedy operations. In the top-down view, we find a good split of the data into two parts, and then recurse on each side. In the bottom-up view, we select two clusters for merging (in the beginning, all items are in separate clusters), and merge our way up.


Perhaps the most well known of the bottom-up methods is the 'single link' clustering algorithm, in which at each step, you merge the two clusters that are closest to each other. If this sounds familiar, it should: this is merely Kruskal's MST algorithm run partially to completion.
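
To make the Kruskal connection explicit, here is a small sketch (mine) with an inlined union-find that runs the merges and stops once k clusters remain:

def single_link(dist, k):
    # dist: n x n symmetric distance matrix. Runs Kruskal until k components remain.
    n = len(dist)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]       # path halving
            x = parent[x]
        return x

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    merges = 0
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj                     # merge the two closest clusters
            merges += 1
            if merges == n - k:
                break
    return [find(i) for i in range(n)]          # cluster label = root of each point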


The "single-link" method is optimistic: it creates clusters via connected components, assuming that transitivity is enough to construct a cluster. The "complete-link" approach is more pessimistic: rather than defining the distance between two clusters as their nearest pair distance, it defines it as the furthest pair distance, ensuring a clique-like structure for clusters. Merging happens as before.


Other variants that are more robust to noise average the pairwise distances to obtain the clustering distance. With the right choice of distance, these amount to comparing the centroids of clusters.


You'll notice that I didn't actually define a problem that these algorithms solve, in keeping with the grand tradition of clustering :). There is a way of defining a general optimization function that single-link clustering solves optimally. For the others though, although we can define a cost function, we can't show that they're optimal.


So much for the algorithmic view of HC. The representational view goes somewhat deeper.


I've had at least one person (you know who you are) tell me that one of the only two clustering algorithms that worked for them is hierarchical clustering. And indeed, the idea is very seductive. Rather than present the user with a single k-clustering - a snapshot, if you will, of the data - we give them a tree of merges, in which there is one leaf for each object. The central idea here, and one that bears emphasizing, because it's so different from how we've thought about clustering thus far, is this:


we understand the structure of data from its transitions, rather than from its states.


What I mean is this: rather than defining a clustering as a static partitioning of the data, we watch the data ``evolve'' as we start merging points together, and use the merge history to figure out what the real structure is. There are a couple of ways of doing this: the most common way is to imagine drawing the tree from top to bottom, and then cutting it by an x-monotone curve (one that intersects any vertical line in exactly one point). This can be done to ensure we have k clusters, or it can be done at points where the merge appears to be ``stable'' (the subtree doesn't merge with any other subtrees for a long time).
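
In practice the ``cut the tree'' step is a library call away; a quick sketch using scipy (my choice of library, not something from the post):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(50, 2)                    # toy data
Z = linkage(pdist(X), method='single')       # the full merge tree

labels_k = fcluster(Z, t=4, criterion='maxclust')      # cut so that k = 4 clusters remain
labels_h = fcluster(Z, t=0.15, criterion='distance')   # or cut at a fixed merge height
# scipy.cluster.hierarchy.dendrogram(Z) draws the picture described below.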


A particularly effective visual metaphor for this is the dendrogram, which is very popular in evolutionary tree analysis.



The dendrogram is visually appealing because it does two things: firstly, it depicts the tree of merges, permuting the nodes so that there aren't crossings. Secondly, and more importantly, it uses the lengths of edges as a visual marker to indicate the 'time of merge' for clusters. If we think of starting at time 0 and merging clusters, then a cluster that sticks around for a long time will have a tall edge connecting it to its parent, in comparison with a cluster that gets merged quickly. (as an aside, visual metaphors are possibly underrated when it comes to thinking about clustering algorithms. My firm belief is that one of the reasons soft clusterings aren't used as much as hard clusterings is because it's hard to visualize them)


Sidebar:
And this is the key operational idea behind ``clustering from the transitions'': clusters that stay unmerged longer are likely to be more interesting than clusters that get merged quickly. Till recently, this was merely a heuristic way of thinking about hierarchical clusterings. However, we now have the idea of persistence that comes from the realm of computational topology. In short, persistence is a way of quantifying topological features of a shape that persist across different scales. Persistence has been studied extensively as a rigorous way of quantifying the triviality (or non-triviality) of features in a shape, and there's a recent paper that applies persistence-like concepts to clustering as well. It's still early days for this direction, but I think it's a promising one.


Returning to hierarchical clustering, one major problem with this approach is that it's local: make the wrong choice of merge early on, and you'll never get to the optimal solution for a k-clustering problem. And since there's no easy way to change your mind (i.e split a clustering), it's hard to reverse bad decisions. If you're so inclined (and I am nowadays), this is the problem of finding monotone paths in the merge-split lattice on partitions of [1..n], but I digress...


Ultimately, the value of hierarchical clustering comes in its ability to represent an entire history of clusterings, rather than just one, and the relative ease with which a bottom-up algorithm can be written. That, coupled with its uses where you really do want to find a tree (evolutionary or otherwise) is what makes it a popular implement in the clustering toolbag.

Wednesday, July 15, 2009

Consistent BibTeX formatting

I try not to write BibTeX by hand any more: too easy to introduce errors. So I usually use either DBLP or the ACM digital library to get BibTeX for papers. Sometimes the journal has BibTeX, or some format that can be converted. As an aside, IEEE is extremely lame: you have to login to their digital library even to get a citation !

For the most part, I don't need to go beyond ACM or DBLP, which is great. But here's the problem: their formats are different ! I needed the BibTeX for a recent paper of mine, and found it on both sites. Here's what ACM gave me:
@inproceedings{1516372,
author = {Ahmadi, Babak and Hadjieleftheriou, Marios and Seidl, Thomas and Srivastava, Divesh and Venkatasubramanian, Suresh},
title = {Type-based categorization of relational attributes},
booktitle = {EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology},
year = {2009},
isbn = {978-1-60558-422-5},
pages = {84--95},
location = {Saint Petersburg, Russia},
doi = {http://doi.acm.org/10.1145/1516360.1516372},
publisher = {ACM},
address = {New York, NY, USA},
}

and here's what DBLP gave me:
@inproceedings{DBLP:conf/edbt/AhmadiHSSV09,
author = {Babak Ahmadi and
Marios Hadjieleftheriou and
Thomas Seidl and
Divesh Srivastava and
Suresh Venkatasubramanian},
title = {Type-based categorization of relational attributes},
booktitle = {EDBT},
year = {2009},
pages = {84-95},
ee = {http://doi.acm.org/10.1145/1516360.1516372},
crossref = {DBLP:conf/edbt/2009},
bibsource = {DBLP, http://dblp.uni-trier.de}
}

@proceedings{DBLP:conf/edbt/2009,
editor = {Martin L. Kersten and
Boris Novikov and
Jens Teubner and
Vladimir Polutin and
Stefan Manegold},
title = {EDBT 2009, 12th International Conference on Extending Database
Technology, Saint Petersburg, Russia, March 24-26, 2009,
Proceedings},
booktitle = {EDBT},
publisher = {ACM},
series = {ACM International Conference Proceeding Series},
volume = {360},
year = {2009},
isbn = {978-1-60558-422-5},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
So as you can see, we have a problem. The formats are not consistent, which means that if I need to get some references from DBLP, and others from the ACM, my references file is going to look very irregular.

Other critiques:
  • I have never understood why DBLP splits up the conference and the paper: with BibTeX, if you cite three or more papers that use the same crossref, the crossref is included itself as a reference, which is just strange.
  • Unless you use double curly braces (or wrap individual words in braces, as in title = {On {Riemannian} manifolds}), capitalizations inside a string get removed, which is mucho annoying: It's "Riemannian", not "riemannian".
  • The DBLP name for the conference is too cryptic: who'd even know what EDBT is outside the database community. On the other hand, the ACM citation is clunky, and is a page-length disaster waiting to happen.
Thoughts ?

Thursday, July 09, 2009

Quick note

(am posting this from 37000 ft. isn't technology cool ?)

The clustering series will be on hiatus for a few days while I wrap up some more pressing deadlines. It will restart two weeks from now.

Congratulations to Adam Smith and Sean Hallgren on getting a PECASE award. See what creating a new blog can do !

NSF EDA Workshop: Wrap up

Today morning was wrap up day. The theory group was "tasked" with coming up with a few slides suggesting avenues for further cooperation. Dick talked about creating (or re-creating) an environment in academia where researchers are building chips and talking to other experts down the corridor (like theoreticians) rather than the 'planted' model where a theoretician is planted in a corporate design environment for 6 months or so.

I talked briefly about what I called "new" theory: the twin ideas of randomization and approximation that have been so powerful in algorithm design (even for intractable problems), and how these play into the problems of dealing with massive high dimensional data.

Vijaya gave a synopsis of cache-obliviousness, multicore research, and her recent work in this area. There's a lot of interest in EDA and multicore work, and she made the argument that theoreticians are learning how to design efficient multicore algorithms and can be of great help in adapting EDA methods.

There were also four sub-panels that addressed various aspects of EDA. What was interesting to me was that many of the panels brought up "EDA needs in theory" that matched some of the things we talked about. For example, there's a pressing need for methods that run in linear (and even sublinear) time, tools that are incremental, in that they can progressively generate better and better solutions given more time, and tools that can parallelize well.

They also talked about providing benchmarks: for example, a 10,000 variable SAT instance and code that solves it. I thought (and said) that this would be an excellent way to get people dabbling in this area.

Randomization was a trickier matter. Although the EDA folks recognize the power of randomization, they are concerned about reproducibility. The DA pipeline is long and involved, and once you've fixed the output of a piece of the pipeline, you'd like to keep it in place and not change from iteration to iteration.

Of course this is something that can be fixed with seed control. For more portability, it's conceivable that you can cache the random coin tosses once you've settled on a reasonable solution to any piece in the pipeline.

Overall, it was quite interesting, although exhausting. The general idea with such workshops is that the findings make their way into the next call for proposals (sometime in October/November), so if you have EDA people you'd like to collaborate with, this might be a good opportunity.

It's time to go home.

NSF EDA Workshop: A comment

Paul Beame had a brilliant comment on my post about the EDA Workshop, and I liked it so much I'm reproducing it in full here:

One fault is that we don't even teach the proper tools in our algorithms courses! There are relatively few algorithms texts that even support the teaching of backtracking as an algorithmic paradigm. Moreover, the sort of very nice work with provable small exponential upper bounds that some theory researchers (e.g., David Eppstein, Richard Beigel, Martin Furer) have done in backtracking algorithms for combinatorial problems is only a very small part of the landscape for algorithmic solutions to hard problems.

Too often, we in theory leave the important hard problems to those in AI and formal methods. One of the standard paradigms in practical verification is to attack problems that in their full generality are PSPACE-hard (or worse), find a useful NP subclass, apply a reduction from these problems to SAT, and use SAT solvers that involve worst-case exponential backtracking (or to a much lesser extent local search) algorithms that are fast in many cases. The key techniques used in these solvers are powerful heuristics that are usually studied in AI but rarely in algorithms research.

The potential for practical impact of improved versions of these algorithms (even with only polynomial speed-up) could be huge. Traditional algorithmic research has occasionally had something useful to say about these topics (such as Moser's recent beautiful analysis of variants of the random walk algorithm on the special classes of CNF formulas satisfying the Lovasz Local Lemma conditions) but it is somehow not viewed as part of the mainstream.

Applications in EDA are only part of the utility of some of the exponential algorithms used in formal methods. There has been a large and surprisingly successful body of work on software model checking that uses quite clever theoretical ideas. One of my favorite examples is the following: Though techniques for checking properties of communicating finite state machines (one of those nasty PSPACE-complete problems) had been in use for EDA since the early 1990's, model checking software presented an additional problem because of the need to handle the call stack. Suppose that one needs to check a safety property (i.e., an invariant). The key observation (which goes back to the early 1960's) is that the language consisting of the possible stack contents of a PDA is always a regular language for which an NFA can be easily constructed from the PDA! One can then apply a small extension the previous algorithm to check safety properties (which must hold for all possible stack contents).

Wednesday, July 08, 2009

NSF Workshop: Electronic Design Automation

I'm currently at an NSF Workshop on Electronic Design Automation (the thing that used to be called VLSI design). I'm here as part of the 'theory audience' along with Vijaya Ramachandran and Dick Lipton (blogger power!).

Thankfully, Dick has posted an extensive summary of the day of talks, so I don't have to. Our mandate here is to listen in on the discussions and come up with a response that suggests avenues where theory folk might have something useful to contribute.

From a purely biased algorithms perspective, one thing that strikes me about the EDA community (at least in the formal methods/verification realm) is their unwillingness to give up in the face of beyond-NP-completeness. What I mean is this: most of my training in algorithms (and this is likely true for you as well) is in the polynomial-time regime: all the algorithmic paradigms we learn are effective at reducing the complexity of an algorithm from one polynomial to another.

When we engage with NP-hardness, we switch modes to approximations, and focus on the issue of quality: even the approximation algorithms themselves run in poly time. There are very few people (David Eppstein comes to mind) who work on algorithms in the exponential/subexponential realm and will worry about (say) reducing the base of the exponent for SAT or graph coloring or other hard problems.

The verification folks don't necessarily solve their (very hard) problems exactly, but they do design all kinds of tricks (and heuristics) to deal with these problems, because they actually need to solve them ! In my view, it wouldn't be a bad idea for students learning algorithms to learn at least a few tricks for designing algorithms that might run in exponential time, but are efficient. Remember that exponential might be better than n^100 for many values of n.

One thing that came to mind as I listened to talks. With the exception of a talk by Rupak Majumdar on faulty computations, and a talk by Ed Clarke (yes, that Ed Clarke) on statistical model checking (based on the Neyman-Pearson hypothesis testing framework), there was little talk of the role that randomization might have to play in the problems of EDA.

A second thought was how the lessons of massive data analysis might be useful in the realm of DA. One speaker described one critical problem as being the degree of complexity associated with current DA tools: there are over 4000 "knobs" to turn in one such tool ! It's believed that these knobs are not independent, and might even be contradictory. If we think of each "run" of the DA tool, outputting some kind of chip layout, as a point in this 4000+ dimensional space, I wonder whether techniques for dimensionality reduction and manifold analysis might be useful to find a set of "core knobs" that control the process.

I have to say that it's nice to attend a workshop with a community that throws out terms like NP-Complete, \Sigma_2, and PSPACE so freely :).

Friday, July 03, 2009

FOCS 2009 results.

The FOCS list is out (with abstracts). 73 papers accepted. Some papers that popped up on my radar screen (post more in the comments since I'll probably miss many good ones):
  • Matt Gibson and Kasturi Varadarajan. Decomposing Coverings and the Planar Sensor Cover Problem.

    This paper continues a line of work that I like to think we started in a SODA paper a few years ago. The problem is this: you're given a collection of sensors that monitor a region. But they have limited battery life, and they're also redundant (many different subsets of sensors have the range to cover the region). Can you schedule the sensors to go on and off so that at all times, the entire region is covered, and the region is covered for as long as possible ?

    Early work on this problem formulated it in a combinatorial setting, which immediately led to things like log n-approximability lower bounds via set cover variants. Our belief was that the geometry of the regions would make things a bit easier. We made some small progress, getting a sublogarithmic approximation ratio for intervals on the line, and a constant factor for ranges with special structure. Subsequent work made steady improvements, and this work has brought things down to a constant for general classes of regions in the plane.
  • Saugata Basu and Thierry Zell. Polynomial hierarchy, Betti numbers and a real analogue of Toda's Theorem

    Betti number computation has been a hot topic in computational geometry of late, but the idea of using Betti numbers to bound the (algebraic) complexity of basic geometric operations dates back to work by Dobkin and Lipton that relates the complexity of certain computations to the number of connected components of "invariant paths" in a computation. Ben-Or generalized these results to more complicated algebraic computation trees, and noting that "number of connected components" is the zero-th Betti number of the set of computations, asked whether higher-order Betti numbers might be useful for proving lower bounds. A series of results by Bjorner, Lovasz and Yao, Bjorner and Lovasz, and finally Yao showed that indeed the higher-order Betti numbers can be used to lower bound the complexity of algebraic computations.

    So Betti numbers are important as a measure of "counting" in algebraic complexity. This paper reinforces this intuition, by relating #P over the reals to the complexity of computing Betti numbers over semi-algebraic sets. Interestingly, this paper mentions the difficult nature of Toda's original proof, and Lance recently tweeted a new very simple proof of Toda's theorem that he just published in ToC.

  • David Arthur, Bodo Manthey and Heiko Roeglin. k-Means has Polynomial Smoothed Complexity

    Read my post on k-means for more on the significance of such results. This paper finally resolves the smoothed complexity issue, giving a polynomial bound (expected).

  • Peyman Afshani, Jeremy Barbay and Timothy M. Chan. Instance-Optimal Geometric Algorithms

    An interesting result that seems to be able to take away order-dependence from some geometric problems.

  • T.S. Jayram and David Woodruff. The Data Stream Space Complexity of Cascaded Norms

    Cascaded norms (where you compute one aggregate on each of a collection of streams, and compute another aggregate on these aggregates) are tricky for streaming algorithms, and this paper provides some interesting results. Again, I'm waiting for the document to show up, and I'd be interested in seeing the kinds of techniques they use.

Papers I'd like to learn more about:


Comments ?

Clustering: k-means...

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

k-means (also known as Lloyd's algorithm) is probably the most well known clustering algorithm, and as such, deserves some discussion of its own. We've already seen that it's part of the "I don't like you" family of clustering formulations. How does the method work ?

We start with a set of points in R^d. The algorithm is really simple: Fix a collection of k centers. Now perform the following alternating optimization till you reach a steady state: (1) assign each point to its nearest center (2) for each group of points assigned to the same center, compute a new center by taking the centroid of the points.
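
Here's the whole loop as a short numpy sketch (my own code; the initialization is deliberately the naive ``pick k random points'', which is exactly the part discussed further down):

import numpy as np

def lloyd(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]    # naive initialization
    for _ in range(iters):
        # (1) assign each point to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # (2) replace each center by the centroid of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                 # steady state reached
            break
        centers = new_centers
    return centers, labels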

More importantly, what problem does this algorithm solve ? The quick answer is: none ! Yes, that's right: k-means gives no answer with any provable guarantee for a generic set of points for any cost measure. Pretty dismal, isn't it ? And yet, it's a really popular algorithm, a state of affairs that generally drives theoreticians to tears. So what's the deal ? The answer, as always, lies in the details, and is a good story of how theoretical work has managed to make some sense of an algorithm that was notoriously hard to analyze properly.

The underlying problem that k-means attempts to solve is the one I mentioned in the last post: let a cluster cost be the sum of squared distances to the cluster center, and minimize the sum of cluster costs over all k-clusterings. Now here are some of the bad things that we CAN say about k-means:
  • It might take exponential time to converge, even in the plane
  • It can get stuck in a local minimum that has an arbitrarily bad cost.

Running Time:
Let's first consider the problem of guaranteeing running time. Although k-means can take exponential time to converge, a smoothed complexity result says that w.h.p., a perturbed input will converge in polynomial time (for those not familiar with the concept, the smoothed complexity of a problem is its (expected) running time after the inputs have been perturbed randomly by a small amount). The smoothed complexity results suggest one reason why k-means converges quickly "in practice". The kinds of careful constructions (high dimensional, and low dimensional) needed to force super polynomial convergence times are very artificial.


A crucial initialization:

So what about the quality of the results produced ? One unspecified aspect of the k-means algorithm is how the initial set of k centers is created. Two recent works have indicated that the key to extracting good performance from k-means is by being very careful with this initial choice. The idea is very neat, and really should be getting more play in the experimental community than I think it does.

Recall the Gonzalez algorithm for the k-center problem: pick a point, and then its furthest neighbor, and then the point whose closest neighbor in this set is as far as possible, and so on. Consider the following randomized variant:
Pick the next candidate center with probability proportional to its minimum distance squared from the current set of centers.
Papers by Arthur and Vassilvitskii, and by Ostrovsky, Rabani, Swamy and Schulman both show that this simple seeding strategy allows for provable guarantees on the quality of the solution returned by k-means (a small sketch of the seeding step appears after the list below). The actual results are slightly different, and give two different views of the behaviour of the algorithm:
  1. The Ostrovsky et al paper starts with the assumption that the point set is epsilon-separated: the intuition behind this definition is that the cost of a k-clustering is significantly better than the cost of a (k-1)-clustering (controlled by a parameter epsilon). Under these conditions, they show that the initial seeding gives a PTAS for the clustering problem.

    I'm eliding a number of details about their algorithm. One interesting feature of their algorithm is what they do after the initial seeding: rather than run Lloyd's iterative step, they fix a radius around each center (different for each center) and compute the centroid of the portion of the set within this ball. They also use a more complicated procedure to convert this into a PTAS, and in my mind that makes their approach a lot less practical.

  2. The Arthur/Vassilvitskii paper starts with the same initialization procedure, but they assume no epsilon-separation result. Their algorithm is simply: run the initialization, and then run regular k-means. The guarantee they get is weaker (a log k approximation ratio), but they also provide empirical results showing the superiority of their approach compared to standard k-means. I think it'd be interesting to compare their approach with the 'ball k-means' approach by Ostrovsky et al to see which works better on benchmark data sets.
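
For reference, here is the shared seeding step by itself as a numpy sketch (mine); the two papers differ in what they run after it:

import numpy as np

def d2_seeding(X, k, seed=0):
    # Pick k centers; each new center is chosen with probability proportional
    # to its squared distance from the nearest center chosen so far.
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

# Arthur/Vassilvitskii: run ordinary k-means (e.g. the lloyd sketch above)
# starting from d2_seeding(X, k).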


Another pleasing property of the k-means algorithm is its relationship to mixture modelling and the EM algorithm. I'll have more to say about EM later on, but the gist of the matter is this: The twin steps of finding new centroids and assigning points to nearest neighbours have direct analogs in the world of maximum likelihood concepts and distribution-generated data. There is a general procedure that takes a given exponential family (Gaussian, Poisson and the like), and generates a distance function d such that the density described by the distribution can be expressed as exp(-d(x, mu)), where mu is the mean of the distribution, and x is the point whose probability we're evaluating. ML folks will recognize the Bregman distances here, and the punchline is that for Gaussian distributions, the corresponding distance function is squared Euclidean distance (the exact function used in the k-means evaluation).

What does this all mean ? It means that the k-means process has a semantic meaning in the context of clustering data drawn from a mixture of Gaussians: what's even neater is that a k-means-like algorithm works for data drawn from other distributions, as long as you replace squared Euclidean distance by the appropriate (Bregman) distance function.
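
A tiny illustration of that correspondence (my own sketch): the Bregman divergence of the squared norm is exactly squared Euclidean distance, and swapping in a different convex function gives the divergence matched to a different exponential family.

import numpy as np

def bregman(phi, grad_phi, x, mu):
    # D_phi(x, mu) = phi(x) - phi(mu) - <grad phi(mu), x - mu>
    return phi(x) - phi(mu) - np.dot(grad_phi(mu), x - mu)

x, mu = np.array([1.0, 2.0]), np.array([0.5, 0.5])

# phi(x) = ||x||^2  ->  squared Euclidean distance (the Gaussian case)
sq = lambda v: np.dot(v, v)
print(bregman(sq, lambda v: 2 * v, x, mu), np.sum((x - mu) ** 2))

# phi(x) = sum x log x  ->  generalized KL divergence (the Poisson case)
ent = lambda v: np.sum(v * np.log(v))
print(bregman(ent, lambda v: np.log(v) + 1, x, mu))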

Apart from the simplicity of the algorithm, there are some aspects that make it attractive, regardless of lack of guaranteed bounds. From a heuristic-design standpoint, k-means is a particularly attractive example of a technique that's often called alternating optimization.

Notes:
  • I was asked if there were standard data sets for benchmarking clustering algorithms. I don't often see papers doing this in a systematic manner, but a good source of data sets for experimentation is the UCI Machine Learning Repository: there aren't too many clustering data sets, but they can be found here (look for clustering under the default task).
Next up: hierarchical methods...

Sunday, June 28, 2009

Clustering: The "I don't like you" view

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

Any clustering problem starts with a data set. And you can basically tell what algorithm you're going to need by asking, 'what structure do I have on the data ?'. The bare minimum you need is some way of taking two items and returning a number that tells you how similar they are, or more usefully, how far apart they are.

Of course, if I know that A and B are a certain distance apart, and B and C are a certain distance apart, I might like to infer something about the distance between A and C. This is what the triangle inequality gives you, and so the most elementary form of clustering takes items and a metric defined on these items (humans actually DON'T think about distances in this manner, as many psychological studies have shown: more on that later on).

I find it convenient to view clustering algorithms from a pop-psychology self-help perspective, and so I'll deem algorithms that operate with a distance metric the 'I don't like you' methods. The general goal with such methods is to group the items into "clusters" where there isn't too much hatred :).

There are many ways to quantify this "lack of hatred". The one most natural to theoreticians is the 'minimize the worst-case angst' viewpoint: in other words, find k clusters such that over all clusters, the maximum pairwise distance between points in a cluster is minimized. This is the well-known k-center problem.

If you don't like the lack of robustness (one point can mess up the cost), you could instead sum up all the pairwise distances to assign the cost for a cluster. This gives a variant of what's normally considered the k-median problem.

A third way of measuring cluster cost is to sum the squares of distances, rather than the distances themselves. This gives you a cost function that looks a lot like what k-means attempts to solve.

It's often useful to define a representative of a cluster: in the k-center view of the world, the center of a cluster is a point whose maximum distance to all points in the cluster is minimum. Note that by triangle inequality, this is within a factor of two of the k-center cluster cost, so it's often convenient to use this as the definition of the cluster cost (the cluster radius, instead of diameter, if you will). In the sum-of-all-costs view, the center correspondingly can be defined as the point whose sum of all distances to all other points is minimized. Similarly for the case of sum-of-square-costs.

Remarks on complexity:
There is an elegant 2-approximation for the k-center problem by Gonzalez. The algorithm itself is the essence of simplicity: pick any point in the space, take its furthest neighbor, take the point furthest from these two, and so on. The first k points picked in this manner yield the desired cluster centers, and a simple triangle inequality argument yields the 2-approximation.
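
The algorithm in code (my sketch, for points in Euclidean space; any metric would do):

import numpy as np

def gonzalez(X, k):
    # Farthest-point traversal: a 2-approximation for the k-center cost.
    centers = [0]                                       # start from an arbitrary point
    d = np.linalg.norm(X - X[0], axis=1)                # distance to nearest chosen center
    for _ in range(k - 1):
        nxt = int(np.argmax(d))                         # farthest point from all centers so far
        centers.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return centers                                      # indices of the chosen centers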

The k-median problem is a little harder to solve. The best known bound in a general metric space is a 3+eps approximation, and it's known that you can't get a PTAS. The approximation algorithm itself is not too hard: it's a local search heuristic that gets closer and closer to 3 as you permit larger and larger sets of centers to be swapped out.

Going to Euclidean space, or what's in a name ?
General metrics come either from some kind of similarity function between pairs of elements or from the induced shortest path metric on a graph (there are other metrics one can induce from a graph, and we'll talk about them later). But what if you have more information ? There's a general principle of clustering that's worth enunciating here: more structure on the data can lead you to more targeted algorithms that will probably work better, so resist the urge to be too general.

The most natural space to consider is a Euclidean space, and it is this space that explains the names 'median' and 'mean'. It's well known that on the line, the point minimizing the sum of distances to a set of points (the 1-median) is given by the actual median of the numbers. This explains why the problem of minimizing sum of distances is called the k-median problem. Similarly, the point that minimizes the sum of squared Euclidean distance is the centroid or the 'mean', giving rise to the k-means problem.
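
A quick numerical check of those two facts on the line (my own sketch, just brute-forcing the best center over a grid):

import numpy as np

x = np.random.randn(501)
c = np.linspace(x.min(), x.max(), 2001)                       # candidate centers
best_l1 = c[np.argmin(np.abs(x[:, None] - c[None, :]).sum(axis=0))]
best_l2 = c[np.argmin(((x[:, None] - c[None, :]) ** 2).sum(axis=0))]
print(best_l1, np.median(x))   # essentially equal: the median minimizes the sum of distances
print(best_l2, np.mean(x))     # essentially equal: the mean minimizes the sum of squared distances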

There's a long line of results on clustering under various measures in Euclidean space. In brief, for most of the above formulations, you can get a PTAS in Euclidean space, but you end up playing whack-a-mole with the parameters n, d, epsilon, and k (that is to say, something ends up exhibiting exponential dependence, and which parameter it is depends on what you care most about).

The Voronoi trick
But there's a general principle that governs much of the algorithm design for these problems. It's the so-called Voronoi trick, and can be summarized by pointing out that you can always improve the clustering cost by making sure that each point is assigned to its nearest neighbor. Effectively this means that the clusters define a Voronoi partition, with the center as a site, and all points within a Voronoi cell being assigned to the corresponding site. First of all, this means that the number of possible outcomes reduces from k^n (the number of ways of partitioning n things into k groups) to n^{kd} (since the number of Voronoi partitions is much smaller than the number of partitions). Secondly, it suggests a simple heuristic for improvement: Take a given clustering: reassign points so that each point is assigned to its nearest cluster center, and then recompute the center for the cluster associated with all points in the (erstwhile) Voronoi cell. Rinse and repeat....

This heuristic is the basis of the k-means algorithm, and is also the basis for a heuristic for the k-median problem clumsily named the k-medoids algorithm.

Notes:
  • All the above algorithms have the parameter k for the number of clusters. It's obvious why: all the above cost measures can be minimized by placing each item in a cluster by itself, but the resulting answer is meaningless, and "no man is an island".
  • In a general metric space defined over a finite set of points, the cluster centers will by default be elements of the input. This is often a desirable property, if you want representatives to be from the input. But in a smooth space like a Euclidean space, there's no reason to limit yourself to data points, and this creates a dichotomy between the so-called discrete and continuous variants of a clustering problem. Often, you can use triangle inequality to argue that the answers don't differ significantly in cost, but it's important to ask yourself which variant you care about. The discrete problem can often be "easier" to solve, because you can enumerate over the choices, but this "easier" can be expensive: for example, minimizing the sum of squares is easy in a (continuous) Euclidean space, but is more expensive in a (discrete) metric space.
  • For some unknown reason, it's common to describe clustering problems by the algorithm used to "solve" them, which causes no end of confusion. I lost count of the number of times I had to point out that k-means is an algorithm that attempts to optimize a specific cost function. This is more than just pedantry, because unless you realize the difference between a problem and a specific solution, you'll never be able to conceive of an alternate approach, and you'll never even try to abstract out what's special about your problem. To theoreticians, this might be obvious, but it isn't really obvious IRL.
Next up: everything you ever wanted to know about k-means.

p.s this is written in a much more breezy style than I'd use for any formal document, but the organization and thinking reflects how I'd like to structure the material. Comments, as always, are welcome.

Friday, June 26, 2009

"Bloggers", the media and science reporting.

Sometimes I read an article that I just cannot understand. The words make sense, and I can understand the logical flow, but the entire premise just escapes me. Consider the following exhibit: an article in Nature News on the "problem" of blogging from a conference. A snippet from the article (which I recommend reading in its entirety):
By disseminating scientific results far beyond the lecture hall, blogging and social networking blurs the line between journalists and researchers. Scientists in competitive fields may be more reluctant to discuss new findings if they can be posted on the Internet within seconds.
and then later:
MacArthur's comprehensive postings were read by many scientists but they irked journalists attending the meeting. The meeting rules stated that reporters had to seek permission from speakers before publishing material on their work, rules that Cold Spring Harbor instituted in part because some journals, such as Nature, discourage scientists from talking to the press before their work is published. But those rules didn't apply to scientist-bloggers like MacArthur and, after he posted details from a few talks, reporters contacted Stewart for clarification on the policies. The complaint was a wake-up call: "For the first time, I became aware that people were blogging about the data at the meeting," Stewart says.
If I understand this correctly, the premise of the article is that blogging from a conference is bad because it
  • "scoops" the poor journalists under embargo ?
  • disseminates information faster than someone might like ?
to which my only response is "HUH ?" Isn't the whole point of a PUBLIC presentation to disseminate results to - you know - THE PUBLIC ? And how on earth does it matter how fast this happens ? And although I feel for the journalists who agree to weird embargo rules by control-freak journals, why on earth should I, as a researcher who happens to blog from conferences, care about this ?

Sometimes I think the media needs to get over itself....

Thursday, June 25, 2009

Clustering: An Occasional Series

Clustering is one of those areas that epitomizes the 'ooze' principle in data mining: it's 3 miles wide and an inch deep. Every year, copious ink gets spilt on some clustering method or another, and yet it's not really clear what key ideas are driving the field forward.

I ran a seminar this spring on clustering. I wanted to see if there was a way of organizing material on clustering in a way that was broad without necessarily being exhaustive, but that hit the "eigenvectors" of the field, so to speak. I was surprised that for a topic as important and broad-based as clustering, there was little out there in the way of textbook-level source material to work from. Most texts that I did find were too specialized for my taste, and didn't have the kind of broad perspective of clustering that I felt was critical to really understanding the area.

So I organized my material, and was surprised to find that there actually was a way of collecting the base material on the topic in a way that could be covered in a semester format without leaving crucial pieces out. What I did not try to do:
  • Cover the details of the best/greatest/fastest algorithm for each problem. In other words, I made a conscious decision to shy away from theoretical overkill.
  • Try to cover all the variants of different clustering methods. I felt that as long as one knew what k-means was, and what spectral clustering was, there was no real need to study papers that tried to put them together in strange ways.
What I did try to do was:
  • Where possible, try to evaluate the practical significance of the methods (so a linear time clustering algorithm that actually had a running time doubly exponential in epsilon would not necessarily make the cut, unless it had some other redeeming features)
  • Try to view the methods from the perspective of the user: given some data, with some structure, what are your options in terms of clustering algorithms ?
The results of this organization can be found here. I was relatively happy with the outcome, and I think the students were too. In fact, many of the discussions in class were interesting enough that I put them together in a collection of notes.

What I want to do in this occasional series is sketch out the "top-level" view of clustering as I see it, and as inspired by our seminar. My vague idea was that these notes might eventually be converted into some kind of monograph, or at the very least a web document that others might find useful. So I'm going to "test-drive" the ideas here, in the hope that my very well-informed readers will add to the discussion, and improve it (as usually happens).

First up: the basic trinity - k-(center/median/means).

Sunday, June 21, 2009

"I, for one, welcome our new TCS overlords"

Via Richard Ladner:
I am pleased to announce the new SIGACT officers for the term beginning July 1, 2009 and ending in three years, June 30, 2012.

SIGACT Chair
Lance Fortnow, Northwestern University

Members-at-large
Cynthia Dwork , Microsoft Research
Anna Lysyanskaya, Brown University
Michael Mitzenmacher, Harvard University
Mike Saks, Rutgers University

Quick hits

  • If you limit yourself to actually reading one Terry Tao post each week (I'm told this frequency prevents brain explosion), then read this one.
  • Joachim Gudmundsson wants to remind you that jausage is much tastier than sausage. Oh yes, and JoCG is open for submissions.
  • Get those SODA papers in (but not too many, I have work to do !)
  • I kid, I kid....
  • I wanted to spend more time talking about Robert Lang's talk on origami. It had deliciously beautiful pictures, and some even more juicy mathematics, and he pitched it at JUST the right level. No slides available, alas.
  • Lecture notes on streaming from a workshop in Barbados...
  • I had to explain to someone that Denmark is not a city in Sweden. And this after watching the Denmark-Sweden world cup qualifying game in a bar with a hundred roaring Danes, and feeling like I had been transported to a Viking battlefield.
And because I can't help myself, "SKÅL"

Friday, June 19, 2009

SoCG 2009: Local search, geometric approximations and PTASs

I planned on this post coming a lot sooner after the last three, but got caught up in my own version of a Danish flu. sigh....

So let's get greedy. In the last post, I talked about new developments in the epsilon-net route to approximating covering/hitting problems. The problem with these approaches though (as Saurabh Ray pointed out) is that they (at best) get you to "some" constant. They don't get you down to a specific constant (unless you can really optimize the net-construction) and they certainly don't give you a PTAS.

So what's a poor approximator to do if their livelihood depends on getting a 1+epsilon-approximation ? Two papers at SoCG 2009 answer this question independently using essentially the same idea, and there's already a third paper that applies this idea as well. You know what they say: "three applications doth a general technique make" (or at least they should say it !), and the technique is so neat that I thought I'd spend some time outlining the basic principle.

But first, a detour. What follows is my own take on the underlying argument, and I'm the only one responsible for all the inaccuracies in interpretation and exposition. Consider the problem of computing a maximum cardinality matching in a bipartite graph. As we all know, this can be solved using a max-flow algorithm. A well-known "spin" on this result is the following:
Suppose we construct a max flow F_k using augmenting paths of length at most 2k+1. Then OPT <= (1 + 1/(k+1)) F_k
Or in other words, F_k is a 1 + 1/(k+1) approximation to the true answer. The proof is fairly easy: consider any augmenting path of length at least 2k+3 that could increase the total matching size. Because we've tried all augmenting paths of length at most 2k+1, it must be that this path alternates between OPT (red) and algorithm (blue) edges, starting and ending in a red edge. In other words, if X is the number of blue edges along this path, then 2X+1 >= 2k+3, so X >= k+1. This means that the number of optimal edges is at most (1 + 1/(k+1)) times the number of algorithm edges. Note that all augmenting paths must be disjoint, so some algebra yields the above result.
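
Spelling out the counting on a single such path, in the notation above:
\[ \frac{\#\text{red (OPT) edges}}{\#\text{blue (ALG) edges}} = \frac{X+1}{X} \le \frac{k+2}{k+1} = 1 + \frac{1}{k+1}, \]
and summing over the (disjoint) alternating paths gives $$OPT \le (1 + \frac{1}{k+1}) F_k$$.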

Another way of viewing the above fact about alternation is that if we think of the edges as vertices of a bipartite graph (colored as the edges were), then each red set is adjacent to a reasonable number of blue vertices (in fact, if this were not the case, the analysis would break). The edge-disjointness then allows us to scale things up to the entire graph.

Notice that the process of generating longer and longer augmenting paths can be viewed as a local search operation: starting from a given matching, we modify it by constructing an alternating path out of the current edges. In other words, we're able to make claims about local search yielding near-optimal solutions because we can control the way OPT interacts with the results of the algorithm, with the interaction parametrized by the "degree of locality".

Now let's return to the papers, keeping this in mind.

Mustafa and Ray apply this idea to get a PTAS for hitting set problems. Specifically, they show that for r-admissible regions in the plane and for halfspaces in R^3, the hitting set problem admits a PTAS that runs in time O(mn^(1/eps^2)) (m: number of regions, n: number of points). A collection of regions in the plane is r-admissible if the boundaries of any pair intersect in at most r points and the pairwise set differences are connected (such collections are often called non-piercing for that reason).

Chan and Har-Peled use the idea to get a PTAS for max-independent-set problems for pseudo-disks in the plane (which are 2-admissible). Their running time is similar.

Both papers run the following algorithm, parametrized by a number b. Take any feasible solution, and call it L. Now check all sets that differ from L in at most b places and see if any of them can be used to improve the quality of L by at least one unit. If so, make the swap and repeat. This process will terminate in at most n steps, and at each stage you need to do a search over at most n^b sets. We'll call such a solution "b-local".
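In pseudo-Python, the scheme looks roughly like this (a bare-bones sketch of the loop just described; `feasible` and `quality` are problem-specific callbacks I'm leaving abstract, e.g. independence and set size for the MIS problem):

```python
from itertools import combinations

def b_local_search(universe, initial, feasible, quality, b):
    """Generic b-local search over subsets of `universe` (a set of objects).
    `feasible` and `quality` are problem-specific placeholders; quality is to
    be maximized (for hitting set, take quality = minus the solution size).
    The returned solution is b-local: no feasible solution differing from it
    in at most b places is better by a full unit."""
    current = set(initial)

    def improving_swap():
        # Search all solutions differing from `current` in at most b places
        # (drop r objects, add a objects, r + a <= b) for a unit improvement.
        for r in range(b + 1):
            for out in combinations(current, r):
                rest = current - set(out)
                for a in range(b - r + 1):
                    for inc in combinations(universe - current, a):
                        cand = rest | set(inc)
                        if feasible(cand) and quality(cand) >= quality(current) + 1:
                            return cand
        return None

    while True:
        better = improving_swap()
        if better is None:
            return current
        current = better
```

Each call to improving_swap examines roughly n^b candidate solutions, and there can be at most n improving steps, which matches the running times quoted above.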

The idea that roughly governs both analyses is this: consider the solution you're currently at (call it red) and the optimal solution (color it blue). Build a bipartite graph (this works slightly differently for the two problems) with the red and blue objects as vertices, and connect two differently colored objects by an edge if there's a range that contains both (for the hitting set problem) or if they intersect (for the MIS); call this a 'locality property', as Mustafa and Ray do. They then show three things:
  1. This graph is planar
  2. It "expands": neighborhoods of blue sets of size at most b are large
  3. As a consequence of 1 and 2, the size of the blue set is at most (1+1/sqrt(b)) times the size of the red set
See the resemblance to the max flow argument ? (or maybe not ;)) A reasonably quick corollary then shows that setting b = 1/eps^2 yields the desired PTAS, since by 3) the optimal solution is reasonably close to the current solution. 2) holds by the locality argument: if it weren't true, there'd be some way to locally improve the red set using a "lookahead" of size at most b. 1) is case-by-case, but works for these problems. The proof of 3) itself requires separator machinery for planar graphs.

An upcoming paper by Gibson, Kanade, Krohn and Varadarajan applies the same idea to the 1.5D terrain guarding problem (one of the few nontrivial art gallery problems that has yielded to approximation attacks). They produce a planar graph that satisfies the locality property, and then apply the rest of the local search machinery as before. What's neat is that the machinery really does apply just as before: "all they have to do" (and I'm not trying to minimize the effort here, just pointing out the smoothness of the technique) is construct the planar graph.

Some thoughts:
  • There's some room for improvement in the running time perhaps, at least down to a linear dependence on 1/epsilon in the exponent. We can't expect a polynomial dependence on 1/epsilon (an FPTAS) because of strong NP-completeness
  • I don't think the weighted versions of these problems would fall to the same "trick": indeed the Chan-Har-Peled paper considers a neat LP-based approach for the weighted MIS.
There have been numerous PTASs for geometric problems: the Arora/Mitchell TSP approximation is the most famous, of course. But this technique is a relatively new kind of generic tool, and it has demonstrated immediate applicability to quite different problems, which makes it noteworthy. It's also a way of reasoning about the efficacy of local search, which is always neat, since local search algorithms are generally not that complicated.

Saturday, June 13, 2009

SoCG 2009: Epsilon-nets and covering/hitting problems.

The 30,000 horsepower engine of approximation algorithms is linear programming. The fact that an LP gives you a bound on the cost of an NP-hard problem has been the major force behind ever more sophisticated methods for proving approximation guarantees.
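As a reminder of how the engine runs in the purely combinatorial setting (a toy sketch, nothing geometric): for set cover you solve the fractional LP relaxation, use its value as a lower bound on OPT, and then round. The snippet below, using scipy's linprog on a made-up instance, just computes that fractional lower bound.

```python
import numpy as np
from scipy.optimize import linprog

# A toy set cover instance: universe {0,1,2,3}, four candidate sets.
sets = [{0, 1}, {1, 2}, {2, 3}, {0, 3}]
n_elems, n_sets = 4, len(sets)

# LP relaxation: minimize sum_S x_S subject to sum_{S containing e} x_S >= 1
# for every element e, with 0 <= x_S <= 1.  Its optimum lower-bounds OPT.
c = np.ones(n_sets)
A_ub = np.array([[-1.0 if e in S else 0.0 for S in sets] for e in range(n_elems)])
b_ub = -np.ones(n_elems)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n_sets, method="highs")
print("fractional lower bound on OPT:", res.fun)   # 2.0 for this instance
```

Randomized rounding of the fractional solution then recovers an integral cover within an O(log n) factor of this bound.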

Ironically enough, given the geometric nature of an LP, this engine has not been particularly useful for geometric approximation problems. The high-level reason (somewhat over-simplified) is that it is generally hard to encode specifics about geometry in a linear program. What this means is that (for example) if you want to solve a geometric covering problem and can't encode the geometry in your LP, you're stuck with the log n approximation that combinatorial set cover gives you, in spite of the fact that there are numerous examples of problems where throwing geometry in actually improves the approximation bound considerably.

To put it mildly, this is a pain in the neck. It basically means that you have to design methods that work for a specific problem, and there are only a few general purpose techniques available, many of them outlined in this survey.

But if your problem can be framed as a covering/hitting problem, then there's a general body of work that has provided better and better results, (and for an added bonus, actually has an LP-formulation). The story dates back to a paper on iterative reweighting by Clarkson, and actually goes back even earlier if you jump to the ML literature: Arora, Hazan and Kale have a nice survey on this general approach.

But I digress: the paper by Bronnimann and Goodrich put the technique in the language we're familiar with today: in short,
Given a range space that admits epsilon-nets of size (1/epsilon)f(1/epsilon), you can solve the hitting set problem to an approximation of f(OPT).
In particular, range spaces that admit epsilon-nets of size 1/epsilon admit constant-factor approximations for the hitting set problem. This led to a "formula" for approximating these problems: find an epsilon-net of an appropriate size, and declare victory. There's been much work on constructing epsilon-nets since then, partly motivated by this.
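For concreteness, here's a schematic Python rendering of the Bronnimann-Goodrich reweighting loop (my own paraphrase, with the verification and doubling details simplified; `build_weighted_net` stands in for whatever weighted epsilon-net construction your range space admits): guess a value c for OPT, and keep doubling the weight of the points in any range the current net misses.

```python
def bg_hitting_set(points, ranges, c, build_weighted_net, max_rounds=10000):
    """Schematic Bronnimann-Goodrich reweighting for hitting set.
    points: list of (hashable) points; ranges: list of sets of points;
    c: a guess for OPT.  build_weighted_net(points, weights, eps) is an
    assumed black box returning a weighted eps-net.  If c >= OPT, the loop
    stops after roughly O(c log(n/c)) doublings with a net that hits every
    range; the net's size then determines the approximation factor."""
    weights = {p: 1.0 for p in points}
    eps = 1.0 / (2 * c)
    for _ in range(max_rounds):
        net = build_weighted_net(points, weights, eps)
        # Find a range not hit by the current net, if any.
        missed = next((R for R in ranges if not (set(R) & set(net))), None)
        if missed is None:
            return net                      # the net hits everything: done
        # The missed range has small weight (less than an eps fraction), so
        # doubling it makes measurable progress against any fixed optimal
        # hitting set that it must intersect.
        for p in missed:
            weights[p] *= 2.0
    return None                             # the guess c was probably too small
```

Wrapped in a doubling (or binary) search over the guess c, this yields the f(OPT)-approximation stated above.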

Another angle to this story was put forward by Clarkson and Varadarajan in their 2005 SoCG paper: the brief summary from that paper was:
Given a range space in which you can bound the complexity of the union of a random subcollection of size r by rf(r), you can find an epsilon net of size (1/epsilon)f(1/epsilon).
This approach gave another tool for finding good approximations for covering problems (and in particular gave a nice constant factor approximation for a thorny guarding problem).

At SoCG 2009, Kasturi Varadarajan presented an improved bound along these lines: his main result improves the transfer complexity: modulo some technical details,
Given a range space in which you can bound the complexity of the union of a random subcollection of size r by rf(r), you can find an epsilon net of size (1/epsilon)log f(1/epsilon).
Although this doesn't improve results in the constant factor regime, where f() is a constant, it makes a significant difference when f() is non-constant.

The interesting story here is that Kasturi's result was simultaneous with another paper by Aronov, Ezra, and Sharir that just appeared in STOC 2009. They get the same bound without the technical details that limit his result; however, their methods are quite different.

Returning to LPs, what I failed to mention is that the epsilon-net construction technique used by Bronnimann and Goodrich can be seen as a byproduct of an LP algorithm, as was shown by Even, Rawitz and Shahar, and by Philip Long before that (thanks, Sariel).

SoCG 2009: A neat lower bound example

I've had a number of posts queued up to write (for those who ask me "how do you get the time?", the answer is sometimes, "I don't !"). This SoCG was full of really interesting papers, and even collections of papers that together moved our understanding forward in very nontrivial ways. Definitely read David E's post from earlier today.

Gabriel Nivasch has been doing some great work on lower bound constructions. He recently had a paper analysing upper and lower bounds for Davenport-Schinzel sequences (it got a best student paper award at SODA 2009), and the kind of tight bounds he gets are very impressive (up to terms involving powers of alpha(n) in the exponent). Here, he presented joint work with Boris Bukh and Jiri Matousek on a new lower bound for weak epsilon-nets.

A central construction here, one that might be useful for other lower bound constructions, is a very fast-growing exponential grid. Roughly speaking, if we think of how it would map to a "regular" grid, then the point (i1, i2, ...) on the [0..m] grid maps to the point whose jth coordinate is x_j = 2^{i_j * [m^{j-1} + m^{j-2} + ... + 1]d}. So things start getting really jumpy as the coordinates go up. A couple of observations do the heavy lifting for this grid:
  • A straight line segment between two points in the stretched grid maps to a collection of d line segments in the "regular" grid: d-1 that go almost straight up from the first (say lower) point, and another that goes across the final dimension (it's easier to think of this recursively)
  • Convex sets in the stretched grid correspond to so-called "stair-convex sets" in the regular grid
  • A diagonal in the stretched grid maps to a curve that doesn't touch many points in the regular grid.
Basically, the grid construction allows them to reason about lower bounds by going back to the regular grid. They use these ideas to prove the lower bound for weak epsilon-nets, but also for some incidence problems (using the third property). It seems to me that this construction might prove useful for other lower bounds in the future as well.

Coming up next: max flows, local search and a plethora of PTASs (try saying that fast)...
