## Monday, November 21, 2011

### On getting rid of ACM affiliation...

As I mentioned earlier, the SoCG steering committee is running a poll on whether to separate SoCG from ACM and run it as an independent symposium with proceedings published by the Dagstuhl LIPIcs series.

At the time I had considered writing a post expressing my own thoughts on the matter, and then promptly forgot about it. The poll is still open (although probably many of you have voted already) and so I thought (with some external nudging :)) that I'd state my position here (one that I summarized in brief for the poll).

If you'd rather not read any more, here's the summary:
I think it's a bad idea.
And here's why

I understand the rationale behind this: it can be a little difficult to work with ACM. ACM does take a contingency fee that increases registration. The cost of proceedings is a significant fraction of the total budget, and the timelines can be a little tricky to work with. These and more are outlined in the poll document, and you can read all the points being made there.

But that's not why I think this is a bad idea (and I do have specific objections to the claims above). I think it's a bad idea because ACM is not just a conference organizing, proceedings publishing enterprise that has bad taste in style files. It's also the home of SIGACT.

For better or worse, SIGACT is the flagship representative body for theoretical computer science in the US. General theory activities like the Theory Matters wiki and the SIGACT committee for the advancement of theoretical computer science were jump-started by SIGACT. Association with SIGACT means that the work you do is 'theoryCS (A)' in some shape of form.

If you're doing geometry in the US, then affiliation with SIGACT is not a bad thing. It means that you're part of the larger theory community. It means that when the theory community makes pitches to the NSF to get more funding for theory research, you're included. It also helps because there aren't other communities (SIGGRAPH?) ready to absorb geometry into their fold.

While de-linking from ACM doesn't mean that we turn in our 'theory badges', it doesn't help the already fraught relationship between computational geometry and the larger theory community. And while the relationship with ACM is not crucial to the survival of the geometry community in the US, being part of a larger theory community that speaks the same language is crucial.

p.s The center of mass in CG is closer to Europe than many might realize. I understand that our European colleagues could care less about US funding and community issues, which is fair for them. But this matter affects US-resident folks more than ACM's surcharges affect the European researchers, and  our community isn't so big that we can withstand shocks to one side of it.

## Tuesday, November 15, 2011

### SODA 2012 student travel support

While it's not a free round trip to Japan for blogging, ....

Well actually it is !

David Johnson asks me to point out that the deadline for applying for student travel support to SODA 2012 has been extended to Nov 29.
The United States National Science Foundation, IBM, and Microsoft have provided funding to support student travel grants. The nominal award is $1000 toward travel expenses to Kyoto, Japan for SODA12, or ALENEX12 or ANALCO12. In special cases of large travel costs somewhat larger grants may be given. Any full-time student in good standing who is affiliated with a United States institution is eligible to receive an award. All awardees are responsible for paying for meeting registration fees. However, registration fees may be included in travel expenses Top priority will be given to students presenting papers at the meeting, with second priority to students who are co-authors of papers. Ties will be broken in favor of students who have not attended a prior SODA. For more information, visit the SIAM travel support page And if you'd also like to blog about the conference once you're there, the cstheory blog is always looking for volunteers ! ### An intuitive argument for the JL lemma ? Hal Daumé (as is his wont) asked me a simple question with no easy answer. His question was Is there a proof of the Johnson-Lindenstrauss Lemma that can be explained to an undergraduate ? While there are different "simple" proofs of the JL Lemma (the Gupta-Dasgupta proof is one such), it's not clear whether these are "undergraduate" level. So instead of answering his original question, I decided to change it to Is there a proof of the JL Lemma that isn't "geometric" ? It's a little odd to even ask the question, considering the intrinsic geometric nature of the lemma. But there's a reasonably straightforward way of seeing how the bound emerges without needing to worry too much about random rotations, matrices of Gaussians or the Brunn-Minkowski theorem. Warning: what follows is a heuristic argument that helps suggest why the bound is in the form that it is: it should not be confused for an actual proof. In its original form, the JL Lemma says that any set of$n$points in$R^d$can be embedded in$R^k$with$k = O(\log n/\epsilon^2)$such that all distances are preserved to within a$1+\epsilon$factor. But the real result at the core of this is that there is a linear mapping taking a unit vector in$R^d$to a vector of norm in the range$1\pm \epsilon$in$R^k$, where$k = 1/\epsilon^2$(the rest follows by scaling and an application of the union bound). Trick #1: Take a set of values$a_1, \ldots, a_n$and set$Y = \sum_i a_i r_i$, where$r_i$is chosen (iid) to be +1 or -1 with equal probability. Then$E[Y^2] = \sum a_i^2$.This can be verified by an easy calculation. So now consider the vector$v$. Let's assume that$v$'s "mass" is roughly equally distributed among its coordinates. Take a random sample of$d/k$of the coordinates of$v$and apply the above trick to the values. Under the above assumption, the resulting$Y^2$will have roughly$1/k$of the total (squared) mass of$v$. Scale up by$k$. This is one estimator of the norm of$v$. It is unbiased and it has a bounded maximum value because of the assumption. This means that we can apply a Chernoff bound over a set of$k$such estimators. Roughly speaking, the probability of deviation from the mean is$\exp(-\epsilon^2 k)$, giving the desired value of$k$. But how do we enforce the assumption ? By applying a random Fourier transform (or actually, a random Hadamard transform). this "spreads" the mass of the vector out among the coordinates (technically by ensuring an upper bound on the$\ell_\infty$norm). That's basically it. Almost all the papers that follow the Ailon-Chazelle work proceed in this manner, with increasing amounts of cleverness to reuse samples, or only run the Fourier transform locally, or even derandomize the process. What distinguishes this presentation of the result from the earlier approaches (which basically boil down to: populate a matrix with entries drawn from a distribution having subGaussian tails) is that it separates the "spreading" step (called the preconditioner) from the latter, more elementary step (the sampling of coordinates). It turns out that in practice the preconditioner can often be omitted without incurring too much error, yielding an extremely efficient (and sparse) linear transform. ## Thursday, November 10, 2011 ### Computational Geometry at the joint math meetings The Joint Mathematics Meetings is billed as the "largest annual mathematics meeting in the world". In 2012 it's in Boston between Jan 4 and 7, and I'm told that there will be over 2000 presenters. Happily, some of these will be geometers. There will also be an all-day AMS special session on Computational and Applied Topology (also on Thursday). So there's lots to do at the JMM if you're interested in geometry. So if you're in the area in January, drop by ! ## Friday, November 04, 2011 ### SoCG and ACM: Relationship status - Complicated. Mark de Berg (secretary of the SoCG steering committee), sent out an email to the compgeom mailing list that starts: Since its start 27 years ago, SoCG has always been affiliated to ACM. This means that the proceedings are published by ACM, and that the symposium is organized “sponsored by ACM” (more precisely, sponsored by ACM SIGACT & ACM SIGGRAPH) or “in cooperation with ACM”. The latter happened only a couple of times, namely when SoCG was in Korea in 2007 and when it was in Denmark in 2009. Being affiliated to ACM has certain advantages, but also certain disadvantages, as detailed below. Hence, at the business meeting of this year’s SoCG in Paris, an alternative was discussed: organizing SoCG as an independent symposium, with the proceedings being published by Dagstuhl in their LIPIcs series (see below). A straw poll was taken, and the vast majority of the participants wanted the Steering Committee to investigate this issue further, which we do through this opinion poll. We hope you want to participate in this important poll. If you have an interest in how the relationship between SoCG and the ACM continues (or doesn't), I'd strongly encourage you to participate in the poll. The poll documents will eventually be are posted on http://computational-geometry.org, and in the meantime here's a google doc you can read. ## Thursday, November 03, 2011 ### Life in a crowd-sourced research world (Sung to the tune of "War", and with a Jackie Chan accent for bonus points) Jo-ur-nals ! What are they good for ! Absolutely nothing ! There are popular tropes in current discussions on crowd-sourcing research. There's the "scientists as Mean Girls" view of the current state of affairs. There's the utopian "Let a thousand papers bloom in the open research garden". There's the anti-capitalist "Down with evil money-grubbing publishers", and there's of course the always popular "Everything tastes better with crowd-sourced reputation points and achievement badges". But have we really thought through the implications of doing away with the current frameworks for "dissemination, verification and attention management" ? Here's a tl;dr a la Cosma Shalizi: A more open research environment, where all work is published before review, and anyone is free to comment on any work in public without repercussions, is both valuable as well as more chaotic and unpleasant than we might be ready for. Consider a pure "publish-then-filter" world, in which you dumped your paper in a public repository that had commenting, reviewing, reputation features, achievement badges and whatever other technological goodies you wanted to throw in. You'd be in a world not unlike the world that writers and musicians live in today. Since big music/book publishers (read "journals") take a big cut of the royalty revenues ("journal subscriptions") in exchange for promotion/marketing (read "stamps of authenticity"), many authors and musicians have developed smaller but successful brands by going on the road themselves, doing online promotions, cultivating their fan base with special material, downloads, T-shirts, event tickets and what not, and relying on underground word-of-mouth to establish a presence. Are you ready to do the same ? It's naive to think that merely putting papers on a repository and waiting for attention to appear will actually work to disseminate your work. Attention is probably the most valuable resource available to us in this connected era, and the one most fiercely fought over by everyone. No one is going to even be able to pay attention to your work unless you promote it extensively, OR unless there are external ways of signalling value. If you think that reputation mechanisms will help, I will merely ask you to look at the attention garnered by the latest Ke$ha single compared to the attention given to <insert name of your favorite underground-not-selling-out-obscure-indie-band-that-will-set-the-world-on-fire here >

Secondly, I think as researchers, we would cringe at the kind of explicit promotion that authors/musicians have to indulge in. Would you really want to sell tickets for the "1.46 approximation to graphic TSP paper tour?". How would you afford it ?

There's a third aspect to living in a crowd-sourced research world: a loss of mental space. While it should be clear to anyone who follows my blog/tweets/G+/comments on cstheory that I enjoy the modern networked world, it's also clear to me that actual research requires some distance.

In Anathem, Neal Stephenson describes a monastery of mathematics, where monks do their own research, and at regular intervals (1/10/100/1000 years) open their doors to the "seculars" to reveal their discoveries to the outside world.

Even with collaborations, skype, shared documents and github, you still need time (and space) to think. And in a completely open research environment where everything you post can be commented on by anyone,  I can assure you that you'll spend most of your time dealing with comment threads and slashdotting/reditting/HNing (if you're lucky). Are you ready to deploy a 24 hour rapid-response team to deal with the flaming your papers will get ?

Let me be very clear about something. I think there are many academic institutions (journals and conferences especially) that are in desperate need of overhauls and the Internet makes much of this possible. I think it's possible (but I'm less convinced) that we are on the cusp of a new paradigm for doing research, and debates like the ones we are having are extremely important to shape this new paradigm if it comes into being. In that context, I think that what Timothy Gowers (here and now here) and Noam Nisan (here and here) are trying to do is very constructive: not just complain about the current state of affairs or defend the status quo, but try to identify the key things that are good AND bad about our current system AND find a path to a desired destination.

But human nature doesn't change that quickly. New ways of disseminating, valuing and verifying research will definitely change what's valued and what's not, and can help open up the research enterprise to those who feel the current system isn't working (i.e most of us). But when you replace one evaluation system by another, don't be too sure that the new system is fairer - it might merely change what gets valued (i.e the peaks might change but not the distribution itself)

## Tuesday, November 01, 2011

### Large-Data Clustering Part I: Clusters of clusters

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here

I don't know why, but I've been struggling with a series of posts on large-data clustering for a while now. Distilling out the key ideas in a way that is intuitive and yet rigorous has been trickier than I thought. But here goes ...

Let's consider a prototypical clustering algorithm: $k$-means. In a $k$-means iteration, you have to scan each point and determine its nearest cluster center. Then you have to organize all the points that share a cluster center and compute a new one by averaging. Each iteration takes $nk$ time, and this isn't even accounting for high dimensions.

This algorithm doesn't scale well to large data, needless to say. But clustering, like most data mining tasks, is needed most acutely when we're dealing with large data. So what do we do ?

In the next series of posts, I'll try to walk you through some of the key principles that get used when designing large data clustering algorithms. As always, the twin principles of sampling and streaming will be the keys to the kingdom.

Clusters can be formed by clustering clusters.

One of the ideas that makes large-data clustering tractable is that you can cluster clusters (say that three times quickly, will ya ?). $k$-means is a particularly good example of this. I can think of a cluster center as a "representative weighted point" and discard the data associated with it. This is critical, because now I can just worry about the cluster centers themselves, and repeat.

The advantage of this is locality. I can take small chunks of data and cluster them efficiently. Then I can combine the cluster reps from each chunk to get a smaller set of reps, and so on.  Many of the "standard" large-data clustering methods (BIRCH, CLARA, CLARANS and the like) are based on this principle. As a concept, it's sound. But does it guarantee anything ?

The catch of course is error propagation. How do I make sure that in this repeated (hierarchical) clustering process, I didn't commit to a bad clustering  early on because of locality ? It seems like I could pay a lot of error at each level, limiting my ability to group data sets into small chunks (because small chunks mean many levels).

It turns out that there are clever ways around this problem by careful merging. The first example of this is a paper from 1997 by Charikar, Chekuri, Feder and Motwani. They look at the $k$-center problem in a general metric space (minimize the maximum radius) and their technique illustrates this principle well.

Their algorithm is quite elegant, and works in phases. In each phase, they first merge cluster centers together to ensure that all resulting cluster centers are far apart. Then they insert new points into clusters as long as they don't get too big, or create new clusters. This happens as long the number of clusters stays below $k$. When it crosses $k$, the phase ends, and then the process repeats.

Note that the algorithm reads in points and processes them on the fly in a single "pass". The reason it works inspite of the repeated mergings is because of a nifty property of the $k$-center algorithm (which lies at the heart of the Gonzalez 2-approximation for $k$-center):
If you can produce $k+1$ points that are all at distance at least $d$ from each other, then the optimal $k$-clustering has diameter at least $d$.
This is because of a 'quantitative pigeonhole principle': At least two of these points must be in a single cluster, which then has diameter at least $d$.

There are some tricks needed to figure out how much you can expand each cluster when inserting, and how far apart cluster centers should be in the phase after the current one. I'll spare you the details, but they lead to an 8-approximation, which can be improved using a neat randomization trick to a $2e$-approximation.

The above trick works because we can build a witness structure that lower bounds OPT, and so the merging can be done carefully so that error doesn't propagate badly. What happens if you change the cost function to the $k$-median though, where the witness isn't as straightforward ?

An excellent source here is the work by Guha, Meyerson, Mishra, Motwani and O'Callaghan. This paper, which combines prior work on both the theory and practice of $k$-median clustering, presents the following idea:
Break the data set into pieces. Cluster each set "loosely", using many more than $k$ clusters. Weight each cluster by the number of points assigned to it. Cluster these weighted cluster centers to find the desired $k$ clusters.
The key observations they make are:
1. The total cost of clustering the pieces isn't too much larger than the overall clustering cost (so it gives a good lower bound). This follows from the triangle inequality.
2. The new "weighted" instance admits a solution that again is not too much larger than the optimal cost (is actually within a factor of 6 for any partition). This again follows from the triangle inequality.
The last piece of this puzzle is: how to cluster each set "loosely" ? They propose using a bicriterion method, whose cost is compared to the optimal $k$-median cost, but is allowed to use more medians. Such bicriterion methods are often easier to design.

Note that there's nothing stopping them from using their own algorithm recursively to cluster the weighted cluster centers. What's neat about this is that it can be adapted to a streaming setting using another standard trick.

Specifically, if you have an algorithm that breaks things into pieces and then recombines them (a two-level scheme), you can first recursively apply it to get a multi-level hierarchical scheme, and then you can implement this in a streaming setting by updating (potentially) an entire side of the computation tree whenever a new item appears. This can also be viewed as a lazy implementation of the hierarchical strategy.