## Monday, December 26, 2011

### On PC members submitting papers

Update: Michael Mitzenmacher's posts (one, two, and three, and the resulting comments) on implementing CoI at STOC are well worth reading (thanks, Michael). The comments there make me despair that *any* change will ever be implemented before the next century, but given that we've been able to make some changes already (electronic proceedings, contributed workshops, and so on), I remain hopeful.

For all but theory researchers, the reaction to the above statement is usually "don't they always?". In theoryCS, we pride ourselves on not having PC members submit papers to conferences. What ends up happening is:
• You can't have too many PC members on a committee because otherwise there won't be enough submissions
• The load on each PC member is much larger than reasonable (I'm managing 41 papers for STOC right now, and it's not uncommon to hit 60+ for SODA)
There's an ancillary effect: because of the first point, theory folks have fewer 'PC memberships' on their CVs, which can cause problems for academic performance review. But this is a classic Goodhart's Law issue, so I won't worry about it.

The main principle at play here is that we don't want potentially messy conflicts, or complex conflict-management machinery, if PC members are allowed to submit papers. However, it seems to me that the way we actually review papers is far removed from this principle.

Consider: I get an assignment of X papers to review if I'm on a conference PC. I then scramble around finding subreviewers for a good fraction of the papers I'm assigned (I used to do this less, but I eventually realized that a qualified subreviewer is FAR better than me in most subareas outside my own expertise, and is better for the paper).

Note (and this is important) my subreviewers have typically submitted papers to this conference (although I don't check) and I rely on them to declare any conflicts as per conference guidelines.

Subreviewers also get requests from different PC members, and some subreviewers might themselves review 3-4 papers.

Compare this to (say) a data mining conference: there are 30+ "area chairs" or "vice chairs", and over 200 PC members. PC members each review between 5 and 10 papers, and often don't even know who the other reviewers are (although they can see their reviews once they're done). The area/vice chairs manage 20-30 papers each, and their job is to study the reviews, encourage discussion as needed, and formulate the final consensus decision and 'meta-review'.

If you set "theory subreviewer = PC member" and "theory PC member = vice chair", you get systems that aren't significantly different. The main differences are:
• theory subreviewers don't typically get to see other reviews of the paper. So their score assignment is in a vacuum.
• theory PC members are expected to produce a review for a paper taking the subreviewer comments into account (as opposed to merely scrutinizing the reviews being provided)
• managing reviewer comments for 30 papers is quite different to generating 30 reviews yourself (even with subreviewer help)
• A downside of the two-tier PC system is that there isn't the same global view of the entire pool that a theory PC gets. But this is more a convention than a rule: there's nothing stopping a PC from opening up discussions to all vice chairs.
• One advantage of area chairs is that at least all papers in a given area get one common (re)viewer. That's not necessarily the case in a theory PC without explicit coordination from the PC chair and the committee itself.
But the main claimed difference (that people submitting papers don't get to review them) is false. Even worse, when submitters do review papers, this is 'under the table' and so there isn't the same strict conflict management that happens with explicit PC membership.

We're dealing with problems of scale in all aspects of the paper review and evaluation process. This particular one though could be fixed quite easily.

## Friday, December 23, 2011

### Thoughts on ICDM II: Social networks

The other trend that caught my eye at ICDM is the dominance of social networking research. There was a trend line at the business meeting that bore this out, showing how topics loosely classified as social networking had a sharp rise among accepted papers in ICDM over the past few years.

There were at least three distinct threads of research that I encountered at the conference, and in each of them, there's something to interest theoreticians.
• The first strand is modelling: is there a way to describe social network graphs using abstract evolution models or random graph processes ? I spent some time discussing this in a previous post, so I won't say more about it here. Suffice it to say that there's interesting work in random graph theory underpinning this strand, as well as a lot of what I'll call 'social network archaeology': scouring existing networks for interesting structures and patterns that could be the basis for a future model.
• The second strand is pattern discovery, and the key term here is 'community': is there a way to express natural communities in social networks in a graph-theoretic manner ? While modularity is one of the most popular ways of defining community, it's not the only one, and has deficiencies of its own. In particular, it's not clear how to handle "soft" or "overlapping" communities. More generally, there appears to be no easy way to capture the dynamic (or time-varying) nature of communities, something Tanya Berger-Wolf has spent a lot of energy thinking about. Again, while modelling is probably the biggest problem here, I think there's a lot of room for good theory, especially when trying to capture dynamic communities.
• The final strand is influence flow. After all, the goal of all social networking research is to monetize it (I kid, I kid). A central question here is: can you identify the key players who can make something go viral for cheap ? is the network topology a rich enough object to identify these players, and even if you do, how can you maximize flow (on a budget, efficiently).
There were many papers on all of these topics -- too many to summarize here. But the landscape is more or less as I laid it out. Social networking research is definitely in its bubble phase, which means it's possible to get lots of papers published without necessarily going deep into the problem space. This can be viewed as an invitation to jump in, or a warning to stay out, depending on your inclination. And of course, the definitive tome on this topic is the Easley-Kleinberg book.

This concludes my ICDM wrap-up. Amazingly, it only took me a week after the conference concluded to write these up.

## Thursday, December 22, 2011

### CGWeek !!!

If you're on the compgeom mailing list, you probably know this already, but if not, read on:

There's an increased interest in expanding the nature and scope of events at SoCG beyond the conference proper. To this end, Joe Mitchell put together a committee titled "CG:APT" (CG: applications, practice and theory) to chalk out a strategy and solicit contributions.

The call for contributions is now out, and the main nugget is this:
Proposals are invited for workshops/minisymposia, or other types of events on topics related to all aspects of computational geometry and its applications. Typical events may feature some number of invited speakers and possibly some number of contributed presentations. Events may feature other forms of communications/presentations too, e.g., via software demos, panel discussions, industry forum, tutorials, posters, videos, implementation challenge, artwork, etc. CG:APT events will have no formal proceedings; optionally, the organizers may coordinate with journals to publish special issues or arrange for other dissemination (e.g., via arXiv, webpages, printed booklets, etc).
In other words, anything goes ! (this is an experiment, after all). Topics are essentially anything that might be of interest geometrically in a broad sense (i.e. not limited to what might appear at the conference itself).

I'd strongly encourage people to consider putting together a proposal for an event. The procedure is really simple, and only needs a two page proposal containing:

1. Title/theme of the workshop/minisymposium/event
2. Organizer(s) (name, email)
3. Brief scientific summary and discussion of merits (to CG) of the proposed topic.
4. A description of the proposed format and agenda
5. Proposed duration: include both minimum and ideal; we anticipate durations of approximately a half day (afternoon), with the possibility that some meritorious events could extend across two half-days.
6. Procedures for selecting participants and presenters
7. Intended audience
8. Potential invited speakers/panelists
9. Plans for dissemination (e.g., journal special issues)
10. Past experience of the organizer(s) relevant to the event
Please note: EVERY COMMUNITY DOES THIS (and now, even theory). The deadline is Jan 13, 2012, and proposals should be emailed to Joe Mitchell (joseph.mitchell@stonybrook.edu). Note that the idea is for such events to be in the afternoon, after morning technical sessions of SoCG.

There are many people out there who grumble about how insular the CG community is. Now's your chance to walk into the lion's den (at Chapel Hill) and tell it off :).

### Thoughts on ICDM I: Negative results (part C)

This is the third of three posts (one, two) on negative results in data mining, inspired by thoughts and papers from ICDM 2011.

If you come up with a better way of doing classification (for now let's just consider classification, but these remarks apply to clustering and other tasks as well), you have to compare it to prior methods to see which works better. (Note: this is a tricky problem in clustering that my student Parasaran Raman has been working on: more on that later.)

The obvious way to compare two classification methods is how well they do compared to some ground truth (i.e. labelled data), but this is a one-parameter system, because by changing the threshold of the classifier (or if you like, translating the hyperplane around), you can change the false positive and false negative rates.

Now the more smug folks reading these are waiting with 'ROC' and "AUC" at the tip of their tongues, and they'd be right ! You can plot a curve of the false positive vs false negative rate and take the area under the curve (AUC) as a measure of the effectiveness of the classifier.

For example, if the y-axis measured increasing false negatives, and the x-axis measured increasing false positives, you'd want a curve that looked like an L with the apex at the origin, and a random classifier would look like the line x+y = 1. The AUC score would be zero for the good classifier and 0.5 for the bad one (there are ways of scaling this to be between 0 and 1).
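This picture is easy to compute directly: sweep the threshold, record the two error rates, and integrate with the trapezoid rule. A plain-Python sketch (with the axes oriented as in the example, so 0 is perfect and 0.5 is random; the toy scores and labels are made up):

```python
def fn_fp_curve(scores, labels):
    """Sweep thresholds from low to high, recording (FP rate, FN rate)."""
    pos = sum(labels)
    neg = len(labels) - pos
    thresholds = [min(scores) - 1] + sorted(set(scores)) + [max(scores) + 1]
    pts = []
    for t in thresholds:
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        pts.append((fp / neg, fn / pos))
    return pts

def area_under(pts):
    """Trapezoidal area under the FN-vs-FP curve; the FP rate only
    decreases along the sweep, so integrate in sweep order."""
    return sum((x1 - x2) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

labels = [0, 0, 0, 1, 1, 1]
# scores that separate the classes perfectly: the L-shaped curve, area 0
good = area_under(fn_fp_curve([0.1, 0.2, 0.3, 0.7, 0.8, 0.9], labels))
# identical scores for everyone: the x + y = 1 line, area 0.5
flat = area_under(fn_fp_curve([0.5] * 6, labels))
print(good, flat)  # -> 0.0 0.5
```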

The AUC is a popular way of comparing methods in order to balance the different error rates. It's also attractive because it's parameter-free and objective: seemingly providing a neutral method for comparing classifiers independent of data sets, cost measures and so on.

But is it ?

There's a line of work culminating (so far) in the ICDM 2011 paper 'An Analysis of Performance Measures For Binary Classifiers' by Charles Parker of BigML.com (sorry, no link yet). The story is quite complicated, and I doubt I can do it justice in a blog post, so I'll try to summarize the highlights, with the caveat that there's nuance that I'm missing out on, and you'll have to read the papers to dig deeper.

The story starts with "Measuring classifier performance: a coherent alternative to the area under the ROC curve" by David Hand (copy of paper here). His central result is a very surprising one:
The AUC measure implicitly uses different misclassification costs when evaluating different classifiers, and thus is incoherent as an "objective" way of comparing classifiers.
To unpack this result a little, what's going on is this. Suppose you have a scenario where correct classification costs you nothing, and misclassification costs you a certain amount (that could be different for the two different kinds of misclassification). You can now write down an overall misclassification cost for any threshold used for a classifier, and further you can compute the optimal threshold (that minimizes this cost). If you don't actually know the costs (as is typical) you can then ask for the expected misclassification cost assuming some distribution over the costs.
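This computation is small enough to sketch. A toy version in plain Python, assuming (purely for illustration) that the two error costs are c and 1 - c with c drawn uniformly from [0, 1] -- Hand's actual analysis allows general cost distributions, and the scores here are invented:

```python
import random

def min_cost(scores, labels, c):
    """Cost of the best threshold when a false positive costs c and a
    false negative costs (1 - c)."""
    thresholds = [min(scores) - 1] + sorted(set(scores)) + [max(scores) + 1]
    return min(
        c * sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        + (1 - c) * sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        for t in thresholds)

def expected_min_cost(scores, labels, trials=2000, seed=0):
    """Average the optimal-threshold cost over randomly drawn costs c --
    the 'expected misclassification cost' in Hand's setup."""
    rng = random.Random(seed)
    return sum(min_cost(scores, labels, rng.random())
               for _ in range(trials)) / trials

labels = [0, 0, 0, 1, 1, 1]
# a perfectly separating classifier achieves zero cost at some threshold
perfect = expected_min_cost([0.1, 0.2, 0.3, 0.7, 0.8, 0.9], labels)
# a classifier whose scores interleave the classes cannot
noisy = expected_min_cost([0.1, 0.8, 0.3, 0.7, 0.2, 0.9], labels)
print(perfect, noisy)
```

Hand's point is about what happens when you relate this expected cost back to the AUC: the implied distribution over c turns out to depend on the classifier's own score distribution.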

If you run this computation through, what you end up with is a linear transformation of the AUC, where the distribution over the costs depends on the distribution of scores assigned by the classifier ! In other words, as Hand puts it,
It is as if one measured person A’s height using a ruler calibrated in inches and person B’s using one calibrated in centimetres, and decided who was the taller by merely comparing the numbers, ignoring the fact that different units of measurement had been used
This is a rather devastating critique of the use of the AUC. While there's been pushback (case in point is an ICML 2011 paper by Flach, Hernandez-Orallo and Ferri which is a very interesting read in its own right), the basic premise and argument are not contested (what's contested is the importance of finding the optimal threshold). Hand recommends a few alternatives, and in fact suggests that the distribution of costs should instead be made explicit, rather than being implicit (and subject to dependence on the data and classifiers).

What Parker does in his ICDM paper is take this further. In the first part of his paper, he extends the Hand analysis to other measures akin to the AUC, showing that such measures are incoherent as well. In the second part of his paper, he unleashes an experimental tour de force of classifier comparisons under different quality measures, showing that
• nearly 50% of the time, measures disagree on which classifier in a pair is more effective. He breaks down the numbers in many different ways to show that if you come up with a new classification algorithm tomorrow, you'd probably be able to cherry pick a measure that showed you in a good light.
• It's the measures perceived to be more "objective" or parameter-less that had the most trouble reconciling comparisons between classifiers.
• It's also not the case that certain classifiers are more likely to cause disagreements: the problems are spread out fairly evenly.
• His experiments also reinforce Hand's point that it's actually better to define measures that explicitly use domain knowledge, rather than trying to achieve some objective measure of quality. Measures that were either point-based (not integrating over the entire range) or domain specific tended to work better.
I'm not even close to describing the level of detail in his experiments: it's a really well-executed empirical study that should be a case study for anyone doing experimental work in the field. It's especially impressive because from personal experience I've found it to be REALLY HARD to do quality methodological studies in this area (as opposed to the "define algorithm, find toy data, profit" model that most DM papers seem to follow).

At a deeper level, the pursuit of objective comparisons that can be reduced to a single number seems fundamentally misguided to me. First of all, we know that precise cost functions are often the wrong way to go when designing algorithms (because of modelling issues and uncertainty about the domain). Secondly, we know that individual methods have their own idiosyncrasies - hence the need for 'meta' methods. And finally, we're seeing that even the meta-comparison measures have severe problems ! In some ways, we're pursuing 'the foolish hobgoblin of precision and objectivity' in an area where context is more important than we as mathematicians/engineers are used to.

## Tuesday, December 20, 2011

### Thoughts on ICDM I: Negative Results (part B)

Continuing where I left off on the idea of negative results in data mining, there was a beautiful paper at ICDM 2011 on the use of Stochastic Kronecker graphs to model social networks. And in this case, the key result of the paper came from theory, so stay tuned !

One of the problems that bedevils research in social networking is the lack of good graph models. Ideally, one would like a random graph model that evolves into structures that look like social networks. Having such a graph model is nice because
• you can target your algorithms to graphs that look like this, hopefully making them more efficient
• You can re-express an actual social network as a set of parameters to a graph model: it compacts the graph, and also gives you a better way of understanding different kinds of social networks: Twitter is a (0.8, 1, 2.5) and Facebook is a (1, 0.1, 0.5), and so on.
• If you're lucky, the model describes not just reality, but how it forms. In other words, the model captures the actual social processes that lead to the formation of a social network. This last one is of great interest to sociologists.
But there aren't even good graph models that capture known properties of social networks. For example, the classic Erdos-Renyi (ER) model of a random graph doesn't have the heavy-tailed degree distribution that's common in social networks. It also misses properties common to large social networks, like densification (the number of edges grows superlinearly in the number of nodes, so the network gets denser over time) and a diameter that stays small, or even shrinks, as the network grows.
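The degree-distribution failure is easy to see in simulation: degrees in G(n, p) follow a Binomial(n - 1, p) distribution, so even the maximum degree sits only a few standard deviations above the mean. A sketch (plain Python, parameters chosen arbitrarily):

```python
import random

def er_degrees(n, p, seed=0):
    """Sample an Erdos-Renyi G(n, p) graph and return all vertex degrees."""
    rng = random.Random(seed)
    deg = [0] * n
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                deg[u] += 1
                deg[v] += 1
    return deg

deg = er_degrees(1000, 0.01, seed=3)
mean = sum(deg) / len(deg)
# degrees concentrate tightly around the mean: the maximum is nowhere near
# the orders-of-magnitude outliers a heavy-tailed network would show
print(mean, max(deg))
```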

One approach to fixing this is to model a social network as a Stochastic Kronecker graph (SKG). You can read more about these graphs here: a simple way of imagining them is that you add an edge to the graph by a random process that does a (kind of) quadtree-like descent down a recursive partitioning of the adjacency matrix and places a 1 at the leaf it reaches. SKGs were proposed by Leskovec, Chakrabarti, Kleinberg and Faloutsos, and include ER graphs as a special case. They appear to capture heavy-tailed degree distributions as well as densification, and have become a popular model for testing algorithms on social networks. They're also the method used to generate benchmark graphs for the HPC benchmark Graph500.
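A rough sketch of the model, collapsing the quadtree descent into its equivalent closed form: each bit position of the two vertex labels independently selects an entry of the 2x2 "initiator" matrix P, and the edge probability is the product (the initiator values here are placeholders, not the parameters any benchmark actually uses):

```python
import random

def skg_edge_prob(u, v, P, k):
    """Edge probability in a stochastic Kronecker graph on 2^k vertices:
    bit i of u and v picks row/column of the initiator P, and the descent
    multiplies the chosen entries together."""
    prob = 1.0
    for i in range(k):
        prob *= P[(u >> i) & 1][(v >> i) & 1]
    return prob

def sample_skg(P, k, seed=0):
    """Sample a directed SKG by flipping one coin per vertex pair."""
    rng = random.Random(seed)
    n = 1 << k
    return [(u, v) for u in range(n) for v in range(n)
            if rng.random() < skg_edge_prob(u, v, P, k)]

# with a uniform initiator every edge probability collapses to p^k,
# i.e. an Erdos-Renyi graph -- the ER special case mentioned above
uniform = [[0.9, 0.9], [0.9, 0.9]]
print(skg_edge_prob(5, 9, uniform, 4))  # -> 0.9 ** 4
```

Skewed initiators (large upper-left entry, small lower-right) are what produce the heavy-tailed behavior; the uniform case is only a sanity check.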

But a thorough understanding of the formal properties of SKGs has been lacking. In "An In-Depth Analysis of Stochastic Kronecker Graphs", Seshadhri, Pinar and Kolda show some rather stunning results. Firstly, they provide a complete analysis of the degree distribution of an SKG, and prove a beautiful result showing that it oscillates between having a lognormal and exponential tail. Their theorems are quite tight: plots of the actual degree distribution match their theorems almost perfectly, and convincingly display the weird oscillations in the degree frequencies (see Figure 2 in the paper).

Secondly, they also formally explain why a noisy variant of SKGs appears to have much more well-behaved degree distribution, proving that a slightly different generative process will indeed generate the desired distribution observed in practice.

Finally, they also show that the graphs generated by an SKG have many more isolated nodes than one might expect, sometimes up to 75% of the total number of vertices ! This has direct implications for the use of SKGs as benchmarks. Indeed, they mention that the Graph500 committee is considering changing their benchmarks based on this paper - now that's impact :)

What I like about this paper is that it proves definitive theoretical results about a popular graph model, and very clearly points out that it has significant problems. So any methodology that involves using SKGs for analysis will now have to be much more careful about the claims it makes.

p.s. There's also more supporting evidence on the lack of value of SKGs from another metric (the clustering coefficient, which measures how many wedges uv, uw are closed by the third edge vw). Real social networks have a high CC, and SKGs don't. This was first mentioned by Sala, Cao, Wilson, Zablit, Zheng and Zhao, and Seshadhri/Pinar/Kolda have more empirical evidence for it as well. (Disclaimer: I was pointed to these two references by Seshadhri: my opinions are my own though :))
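For reference, the (global) clustering coefficient is just a wedge count. A minimal sketch:

```python
def clustering_coefficient(edges, n):
    """Global clustering coefficient: the fraction of wedges (edges uv and
    uw sharing an endpoint u) that are closed by the third edge vw."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    wedges = closed = 0
    for u in range(n):
        nbrs = sorted(adj[u])
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                wedges += 1
                if nbrs[j] in adj[nbrs[i]]:
                    closed += 1
    return closed / wedges if wedges else 0.0

triangle = clustering_coefficient([(0, 1), (1, 2), (0, 2)], 3)
path = clustering_coefficient([(0, 1), (1, 2)], 3)
print(triangle, path)  # -> 1.0 0.0
```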

## Sunday, December 18, 2011

### Thoughts on ICDM I: Negative Results (part A)

I just got back from ICDM (the IEEE conference on Data Mining). Data mining conferences are quite different from theory conferences (and much more similar to ML or DB conferences): there are numerous satellite events (workshops, tutorials and panels in this case), many more people (551 for ICDM, and that's on the smaller side), and a wide variety of papers that range from SODA-ish results to user studies and industrial case studies.

While your typical data mining paper is still a string of techniques cobbled together without rhyme or reason (anyone for spectral manifold-based correlation clustering with outliers using MapReduce?), there are some general themes that might be of interest to an outside viewer. What I'd like to highlight here is a trend (that I hope grows) in negative results.

It's not particularly hard to invent a new method for doing data mining. It's much harder to show why certain methods will fail, or why certain models don't make sense. But in my view, the latter is exactly what the field needs in order to give it a strong inferential foundation to build on (I'll note here that I'm talking specifically about data mining, NOT machine learning - the difference between the two is left for another post).

In the interest of brevity, I'll break this trend down in three posts. The first result I want to highlight isn't even quite a result, and isn't even from ICDM ! It's actually from KDD 2011 (back in August this year). The paper is Leakage in Data Mining: Formulation, Detection, and Avoidance, by Shachar Kaufman, Saharon Rosset, and Claudia Perlich, and got the Best Paper Award at KDD this year.

The problem they examine is "leakage", or the phenomenon that even in a "train model, then test model" framework, it is possible for valuable information to "leak" from the test data or even from other sources into the training system, making a learned model look surprisingly effective and even giving it predictive powers beyond what it really has. Obviously, the problem is that when such models are then applied to new data, their performance is worse than expected, compared to a model that wasn't "cheating" in this way.

They cite a number of examples, including many that come from the data challenges that have become all the rage. The examples highlight different kinds of leakage, including "seeing the future", cross contamination of data from other data sets, and even leakage by omission, where a well-intentioned process of anonymization actually leaks data about the trend being analyzed.
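A toy illustration of the "seeing the future" flavor of leakage (the setup is invented for this example, not taken from the paper): a feature that is secretly derived from the outcome makes the model look nearly perfect, until deployment, when that field no longer exists.

```python
import random

def threshold_accuracy(xs, ys, t=0.5):
    """Accuracy of the trivial classifier 'predict 1 iff feature >= t'."""
    return sum((x >= t) == y for x, y in zip(xs, ys)) / len(ys)

rng = random.Random(42)
labels = [rng.random() < 0.5 for _ in range(1000)]

# a leaky feature: derived (accidentally) from the outcome itself, say a
# field that's only filled in after the label is known
leaky = [y + rng.gauss(0, 0.1) for y in labels]
train_acc = threshold_accuracy(leaky, labels)

# at deployment time the field isn't available, so the 'same' feature is
# just noise -- and the apparently stellar model collapses to chance
fresh = [rng.gauss(0.5, 0.1) for _ in labels]
deploy_acc = threshold_accuracy(fresh, labels)
print(train_acc, deploy_acc)
```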

While there are no results of any kind (experimental or theoretical), the authors lay out a good taxonomy of common ways in which leakage can happen, and describe ways of avoiding leakage (when you have control over the data) and detecting it (when you don't). What makes their paper really strong is that they illustrate this with specific examples from recent data challenges, explaining how the leakage occurs, and how the winners took advantage of this leakage explicitly or implicitly.

There are no quick and dirty fixes for these problems: ultimately, leakage can even happen through bad modelling, and sometimes modellers fail to remove leaky data because they're trying to encourage competitors to build good predictive models. Ironically, it is this encouragement that can lead to less predictive models on truly novel data. But the paper makes a strong argument that the way we collect data and use it to analyze our algorithms is fundamentally flawed, and this is especially true for the more sophisticated (and mysterious) algorithms that might be learning models through complex exploitation of the trained data.

It remains to be seen whether the recommendations of this paper will be picked up, or if there will be more followup work along these lines. I hope it does.

### A rant about school science fair projects

It's science fair time at my son's school. He's in the first grade, so admittedly there's not a lot that he can do without a reasonable amount of *cough* parental *cough* help. But why do we not have a 'mathematics' fair or a 'programming' fair ?

The science fair project format is very confining. You have to propose a hypothesis, tabulate a bunch of results, do some analysis, and discuss conclusions, with nice charts/graphs and other such science cultism. Even if you're interested in something more 'mathematical', there's no real way of shoehorning it into the format. A year or so ago, a colleague of mine was asking me about origami-related projects (because his daughter loves paper folding) but apart from experimenting with knapsack-style algorithms to determine how to fold a ruler into a specified length, we couldn't figure out something that fit into the 'hypothesis-experiment' setting.
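For what it's worth, the ruler-folding question does admit a clean knapsack-style dynamic program (assuming integer segment lengths for simplicity; this is my sketch, not what we actually tried):

```python
def can_fold(segments, length):
    """Can a hinged ruler with these (integer) segment lengths be folded,
    each hinge either flat (180 degrees) or straight, to fit inside an
    interval of the given length? Track every position the free end can
    occupy after placing each segment in turn."""
    positions = set(range(length + 1))  # any starting offset in [0, length]
    for seg in segments:
        positions = {q for p in positions
                     for q in (p + seg, p - seg)
                     if 0 <= q <= length}
        if not positions:
            return False
    return True

print(can_fold([3, 2, 2, 3], 3))  # -> True: zig-zag back and forth
print(can_fold([5, 1], 3))        # -> False: one segment is too long
```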

Granted, it's a science fair. But at this age level, I assume the whole point of participation in science fairs is about learning something about science, rather than conducting rigorous analysis. You could equally well learn about something mathematical and demonstrate that knowledge. But there's no forum to do that in.

## Friday, December 16, 2011

### Simons Fellowship for theory grad students.

The Simons Foundation has been a great boon for theoretical computer science, supporting postdocs galore, and even running a "sooper-seekrit" search for a new TCS institute.

Their latest initiative is at the grad student level. They're offering a 2-year fellowship to grad students "with an outstanding track record of research accomplishments". I think the idea is to support students who've established a good body of work, and could use this to coast towards their graduation and ramp up their research even more.

The support is considerable:
Each award will provide annual support for the following:
• Fellowship stipend as set by the student's institution.
• Tuition and fees at the student's institution, to the extent previously covered by other outside funding.
• $3,000 in additional support for the student to use at his/her discretion.
• $5,000 per year toward the student's health insurance and other fringe benefits.
• $5,000 per year for travel, books, a computer and other supplies.
• $5,000 in institutional overhead allowance.
Fellowships will start September 2012 and end August 2014.

How do you apply ?
Applicants may apply through proposalCENTRAL (http://proposalcentral.altum.com/default.asp?GMID=50) beginning December 7, 2011. The deadline to apply is May 1, 2012. Please coordinate submission of the proposal with the appropriate officials in accordance with institution/university policies. Please see the Application Instructions for further information.
Application Requirements:
• Research Statement (two page limit): A statement summarizing the applicant’s research contributions, research plans for the immediate future, and career goals. References do not need to be included within the page limit, but should not exceed an additional page.
• A curriculum vitae (two page limit), which includes institution, advisor, and a list of publications.
• A letter of support from the Department Chair.
• A letter of support from the student’s advisor.
• A letter from a reference outside the student’s institution. This letter must be submitted directly via proposalCENTRAL by the reference. Please see the Application Instructions for more information.
• Thesis topic.

(HT Sampath Kannan for pointing this out)

## Thursday, December 08, 2011

### ACM Fellows, 2011

Many theoreticians on this year's list of ACM Fellows:
• Serge Abiteboul
• Guy Blelloch
• David Eppstein
• Howard Karloff
• Joe Mitchell
• Janos Pach
• Diane Souvaine
Congratulations to them, and to all the Fellows this year (especially my collaborator Divesh Srivastava)

### SoCG and ACM: The Results Show

From Mark de Berg:
The bottom row of the table [below] gives the number of votes for each of the three options

A.    I prefer to stay with ACM.

B.    If involvement of ACM can be restricted to publishing the proceedings, at low cost for SoCG, then I prefer to stay with ACM; otherwise I prefer to leave ACM.

C.     I prefer to leave ACM, and organize SoCG as an independent conference with proceedings published in LIPIcs.

and it also gives a breakdown of the votes by the number of SoCG’s that the voter attended in the last 10 years.
| SoCG's attended (last 10 years) | A: stay | B: proceedings only | C: leave | total |
|---|---|---|---|---|
| 0 | 4 | 3 | 3 | 10 |
| 1-2 | 6 | 16 | 19 | 41 |
| 3-5 | 11 | 15 | 16 | 42 |
| >5 | 8 | 14 | 9 | 31 |
| total | 29 | 48 | 47 | 124 |

Based on the result of the Poll, the Steering Committee decided to start negotiating with ACM to see if they can offer SoCG an arrangement in which involvement of ACM is limited primarily to publishing the proceedings, with the possible option for greater involvement if the local organizers request/require it.
It will be interesting to see how this plays out.

## Monday, November 21, 2011

### On getting rid of ACM affiliation...

As I mentioned earlier, the SoCG steering committee is running a poll on whether to separate SoCG from ACM and run it as an independent symposium with proceedings published by the Dagstuhl LIPIcs series.

At the time I had considered writing a post expressing my own thoughts on the matter, and then promptly forgot about it. The poll is still open (although probably many of you have voted already) and so I thought (with some external nudging :)) that I'd state my position here (one that I summarized in brief for the poll).

If you'd rather not read any more, here's the summary:
I think it's a bad idea.
And here's why

I understand the rationale behind this: it can be a little difficult to work with ACM. ACM does take a contingency fee that increases registration. The cost of proceedings is a significant fraction of the total budget, and the timelines can be a little tricky to work with. These and more are outlined in the poll document, and you can read all the points being made there.

But that's not why I think this is a bad idea (and I do have specific objections to the claims above). I think it's a bad idea because ACM is not just a conference-organizing, proceedings-publishing enterprise that has bad taste in style files. It's also the home of SIGACT.

For better or worse, SIGACT is the flagship representative body for theoretical computer science in the US. General theory activities like the Theory Matters wiki and the SIGACT committee for the advancement of theoretical computer science were jump-started by SIGACT. Association with SIGACT means that the work you do is 'theoryCS (A)' in some shape or form.

If you're doing geometry in the US, then affiliation with SIGACT is not a bad thing. It means that you're part of the larger theory community. It means that when the theory community makes pitches to the NSF to get more funding for theory research, you're included. It also helps because there aren't other communities (SIGGRAPH?) ready to absorb geometry into their fold.

While de-linking from ACM doesn't mean that we turn in our 'theory badges', it doesn't help the already fraught relationship between computational geometry and the larger theory community. And while the relationship with ACM is not crucial to the survival of the geometry community in the US, being part of a larger theory community that speaks the same language is crucial.

p.s. The center of mass in CG is closer to Europe than many might realize. I understand that our European colleagues couldn't care less about US funding and community issues, which is fair for them. But this matter affects US-resident folks more than ACM's surcharges affect European researchers, and our community isn't so big that we can withstand shocks to one side of it.

## Tuesday, November 15, 2011

### SODA 2012 student travel support

While it's not a free round trip to Japan for blogging, ....

Well actually it is !

David Johnson asks me to point out that the deadline for applying for student travel support to SODA 2012 has been extended to Nov 29.
The United States National Science Foundation, IBM, and Microsoft have provided funding to support student travel grants. The nominal award is $1000 toward travel expenses to Kyoto, Japan for SODA12, ALENEX12, or ANALCO12. In special cases of large travel costs, somewhat larger grants may be given. Any full-time student in good standing who is affiliated with a United States institution is eligible to receive an award. All awardees are responsible for paying for meeting registration fees; however, registration fees may be included in travel expenses. Top priority will be given to students presenting papers at the meeting, with second priority to students who are co-authors of papers. Ties will be broken in favor of students who have not attended a prior SODA. For more information, visit the SIAM travel support page.

And if you'd also like to blog about the conference once you're there, the cstheory blog is always looking for volunteers !

### An intuitive argument for the JL lemma ?

Hal Daumé (as is his wont) asked me a simple question with no easy answer. His question was

Is there a proof of the Johnson-Lindenstrauss Lemma that can be explained to an undergraduate ?

While there are different "simple" proofs of the JL Lemma (the Dasgupta-Gupta proof is one such), it's not clear whether these are at "undergraduate" level. So instead of answering his original question, I decided to change it to

Is there a proof of the JL Lemma that isn't "geometric" ?

It's a little odd to even ask the question, considering the intrinsic geometric nature of the lemma. But there's a reasonably straightforward way of seeing how the bound emerges without needing to worry too much about random rotations, matrices of Gaussians, or the Brunn-Minkowski theorem.

Warning: what follows is a heuristic argument that suggests why the bound takes the form it does; it should not be confused with an actual proof.
In its original form, the JL Lemma says that any set of $n$ points in $R^d$ can be embedded in $R^k$ with $k = O(\log n/\epsilon^2)$ such that all distances are preserved to within a $1+\epsilon$ factor. But the real result at the core of this is that there is a linear mapping taking a unit vector in $R^d$ to a vector of norm in the range $1 \pm \epsilon$ in $R^k$, where $k = 1/\epsilon^2$ (the rest follows by scaling and an application of the union bound).

Trick #1: Take a set of values $a_1, \ldots, a_n$ and set $Y = \sum_i a_i r_i$, where each $r_i$ is chosen (iid) to be +1 or -1 with equal probability. Then $E[Y^2] = \sum_i a_i^2$. This can be verified by an easy calculation.

So now consider the vector $v$. Let's assume that $v$'s "mass" is roughly equally distributed among its coordinates. Take a random sample of $d/k$ of the coordinates of $v$ and apply the above trick to the values. Under the above assumption, the resulting $Y^2$ will have roughly $1/k$ of the total (squared) mass of $v$. Scale up by $k$. This is one estimator of the norm of $v$. It is unbiased, and it has a bounded maximum value because of the assumption. This means that we can apply a Chernoff bound over a set of $k$ such estimators. Roughly speaking, the probability of deviation from the mean is $\exp(-\epsilon^2 k)$, giving the desired value of $k$.

But how do we enforce the assumption ? By applying a random Fourier transform (or actually, a random Hadamard transform). This "spreads" the mass of the vector out among the coordinates (technically, by ensuring an upper bound on the $\ell_\infty$ norm).

That's basically it. Almost all the papers that follow the Ailon-Chazelle work proceed in this manner, with increasing amounts of cleverness to reuse samples, or only run the Fourier transform locally, or even derandomize the process.
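To see the heuristic in action, here's a small Python sketch (illustrative only, with no preconditioning step; the function names are mine, not from any library). It implements Trick #1 and the scaled sampling estimator, so you can check empirically that the estimates average out to the true squared norm.

```python
import random

def signed_sum_sq(a):
    """One sample of Y^2, where Y = sum_i a_i * r_i and each r_i is a random sign.
    By Trick #1, E[Y^2] = sum_i a_i^2."""
    return sum(x * random.choice((-1, 1)) for x in a) ** 2

def norm_sq_estimate(v, k):
    """Average k estimators of ||v||^2. Each estimator samples d/k coordinates,
    applies Trick #1, and scales up by k. Unbiased, but it concentrates only
    if v's mass is spread out (which is what the preconditioner enforces)."""
    d = len(v)
    total = 0.0
    for _ in range(k):
        sample = random.sample(v, d // k)
        total += k * signed_sum_sq(sample)
    return total / k
```

On a "spread out" vector like the all-ones vector the estimates concentrate nicely; on a vector like $(1, 0, \ldots, 0)$ the variance blows up, which is exactly what the preconditioner is there to prevent.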
What distinguishes this presentation of the result from the earlier approaches (which basically boil down to: populate a matrix with entries drawn from a distribution having subGaussian tails) is that it separates the "spreading" step (called the preconditioner) from the latter, more elementary step (the sampling of coordinates). It turns out that in practice the preconditioner can often be omitted without incurring too much error, yielding an extremely efficient (and sparse) linear transform.

## Thursday, November 10, 2011

### Computational Geometry at the joint math meetings

The Joint Mathematics Meetings is billed as the "largest annual mathematics meeting in the world". In 2012 it's in Boston between Jan 4 and 7, and I'm told that there will be over 2000 presenters. Happily, some of these will be geometers. There will also be an all-day AMS special session on Computational and Applied Topology (also on Thursday). So there's lots to do at the JMM if you're interested in geometry. If you're in the area in January, drop by !

## Friday, November 04, 2011

### SoCG and ACM: Relationship status - Complicated.

Mark de Berg (secretary of the SoCG steering committee) sent out an email to the compgeom mailing list that starts:

Since its start 27 years ago, SoCG has always been affiliated to ACM. This means that the proceedings are published by ACM, and that the symposium is organized "sponsored by ACM" (more precisely, sponsored by ACM SIGACT & ACM SIGGRAPH) or "in cooperation with ACM". The latter happened only a couple of times, namely when SoCG was in Korea in 2007 and when it was in Denmark in 2009. Being affiliated to ACM has certain advantages, but also certain disadvantages, as detailed below. Hence, at the business meeting of this year's SoCG in Paris, an alternative was discussed: organizing SoCG as an independent symposium, with the proceedings being published by Dagstuhl in their LIPIcs series (see below).
A straw poll was taken, and the vast majority of the participants wanted the Steering Committee to investigate this issue further, which we do through this opinion poll. We hope you want to participate in this important poll.

If you have an interest in how the relationship between SoCG and the ACM continues (or doesn't), I'd strongly encourage you to participate in the poll. The poll documents will eventually be posted on http://computational-geometry.org, and in the meantime here's a google doc you can read.

## Thursday, November 03, 2011

### Life in a crowd-sourced research world

(Sung to the tune of "War", and with a Jackie Chan accent for bonus points)

Jo-ur-nals ! What are they good for ! Absolutely nothing !

There are popular tropes in current discussions on crowd-sourcing research. There's the "scientists as Mean Girls" view of the current state of affairs. There's the utopian "Let a thousand papers bloom in the open research garden". There's the anti-capitalist "Down with evil money-grubbing publishers". And there's of course the always popular "Everything tastes better with crowd-sourced reputation points and achievement badges".

But have we really thought through the implications of doing away with the current frameworks for "dissemination, verification and attention management" ? Here's a tl;dr a la Cosma Shalizi:

A more open research environment, where all work is published before review and anyone is free to comment on any work in public without repercussions, is both valuable and more chaotic and unpleasant than we might be ready for.

Consider a pure "publish-then-filter" world, in which you dumped your paper in a public repository that had commenting, reviewing, reputation features, achievement badges, and whatever other technological goodies you wanted to throw in. You'd be in a world not unlike the one that writers and musicians live in today.
Since big music/book publishers (read "journals") take a big cut of the royalty revenues ("journal subscriptions") in exchange for promotion/marketing (read "stamps of authenticity"), many authors and musicians have developed smaller but successful brands by going on the road themselves, doing online promotions, cultivating their fan base with special material, downloads, T-shirts, event tickets and what not, and relying on underground word-of-mouth to establish a presence. Are you ready to do the same ?

It's naive to think that merely putting papers on a repository and waiting for attention will actually disseminate your work. Attention is probably the most valuable resource available to us in this connected era, and the one most fiercely fought over by everyone. No one will even be able to pay attention to your work unless you promote it extensively, OR unless there are external ways of signalling value. If you think that reputation mechanisms will help, I will merely ask you to compare the attention garnered by the latest Ke$ha single with the attention given to <insert name of your favorite underground-not-selling-out-obscure-indie-band-that-will-set-the-world-on-fire here>.

Secondly, I think that as researchers, we would cringe at the kind of explicit promotion that authors/musicians have to indulge in. Would you really want to sell tickets for the "1.46-approximation to graphic TSP" paper tour ? How would you afford it ?

There's a third aspect to living in a crowd-sourced research world: a loss of mental space. While it should be clear to anyone who follows my blog/tweets/G+/comments on cstheory that I enjoy the modern networked world, it's also clear to me that actual research requires some distance.

In Anathem, Neal Stephenson describes a monastery of mathematics, where monks do their own research, and at regular intervals (1/10/100/1000 years) open their doors to the "seculars" to reveal their discoveries to the outside world.

Even with collaborations, skype, shared documents and github, you still need time (and space) to think. And in a completely open research environment where everything you post can be commented on by anyone, I can assure you that you'll spend most of your time dealing with comment threads and slashdotting/redditting/HN-ing (if you're lucky). Are you ready to deploy a 24-hour rapid-response team to deal with the flaming your papers will get ?

Let me be very clear about something. I think there are many academic institutions (journals and conferences especially) that are in desperate need of overhauls and the Internet makes much of this possible. I think it's possible (but I'm less convinced) that we are on the cusp of a new paradigm for doing research, and debates like the ones we are having are extremely important to shape this new paradigm if it comes into being. In that context, I think that what Timothy Gowers (here and now here) and Noam Nisan (here and here) are trying to do is very constructive: not just complain about the current state of affairs or defend the status quo, but try to identify the key things that are good AND bad about our current system AND find a path to a desired destination.

But human nature doesn't change that quickly. New ways of disseminating, valuing and verifying research will definitely change what's valued and what's not, and can help open up the research enterprise to those who feel the current system isn't working (i.e. most of us). But when you replace one evaluation system with another, don't be too sure that the new system is fairer - it might merely change what gets valued (i.e. the peaks might change, but not the distribution itself).

## Tuesday, November 01, 2011

### Large-Data Clustering Part I: Clusters of clusters

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

I don't know why, but I've been struggling with a series of posts on large-data clustering for a while now. Distilling out the key ideas in a way that is intuitive and yet rigorous has been trickier than I thought. But here goes ...

Let's consider a prototypical clustering algorithm: $k$-means. In a $k$-means iteration, you have to scan each point and determine its nearest cluster center. Then you have to organize all the points that share a cluster center and compute a new one by averaging. Each iteration takes $nk$ time, and this isn't even accounting for high dimensions.
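To make the cost concrete, here's a minimal sketch of one such iteration in Python (points as sequences of floats; all names are mine, not from any library). The assignment loop is where the $nk$ distance computations happen, and each one also pays a factor of the dimension.

```python
def kmeans_iteration(points, centers):
    """One iteration of Lloyd's algorithm: assign each point to its nearest
    center (k distance computations per point, so nk in all), then recompute
    each center as the mean of its assigned points."""
    k, dim = len(centers), len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        # nearest center under squared Euclidean distance
        j = min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
        counts[j] += 1
        for t in range(dim):
            sums[j][t] += p[t]
    # average each cluster; keep the old center if a cluster went empty
    return [[s / counts[j] for s in sums[j]] if counts[j] else list(centers[j])
            for j in range(k)]
```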

This algorithm doesn't scale well to large data, needless to say. But clustering, like most data mining tasks, is needed most acutely when we're dealing with large data. So what do we do ?

In the next series of posts, I'll try to walk you through some of the key principles that get used when designing large data clustering algorithms. As always, the twin principles of sampling and streaming will be the keys to the kingdom.

Clusters can be formed by clustering clusters.

One of the ideas that makes large-data clustering tractable is that you can cluster clusters (say that three times quickly, will ya ?). $k$-means is a particularly good example of this. I can think of a cluster center as a "representative weighted point" and discard the data associated with it. This is critical, because now I can just worry about the cluster centers themselves, and repeat.

The advantage of this is locality. I can take small chunks of data and cluster them efficiently. Then I can combine the cluster reps from each chunk to get a smaller set of reps, and so on.  Many of the "standard" large-data clustering methods (BIRCH, CLARA, CLARANS and the like) are based on this principle. As a concept, it's sound. But does it guarantee anything ?

The catch of course is error propagation. How do I make sure that in this repeated (hierarchical) clustering process, I didn't commit to a bad clustering early on because of locality ? It seems like I could pay a lot of error at each level, limiting my ability to group data sets into small chunks (because small chunks mean many levels).

It turns out that there are clever ways around this problem by careful merging. The first example of this is a paper from 1997 by Charikar, Chekuri, Feder and Motwani. They look at the $k$-center problem in a general metric space (minimize the maximum radius) and their technique illustrates this principle well.

Their algorithm is quite elegant, and works in phases. In each phase, they first merge cluster centers together to ensure that all resulting cluster centers are far apart. Then they insert new points into clusters as long as they don't get too big, or create new clusters. This happens as long the number of clusters stays below $k$. When it crosses $k$, the phase ends, and then the process repeats.

Note that the algorithm reads in points and processes them on the fly in a single "pass". The reason it works in spite of the repeated mergings is a nifty property of the $k$-center problem (which lies at the heart of the Gonzalez 2-approximation for $k$-center):
If you can produce $k+1$ points that are all at distance at least $d$ from each other, then the optimal $k$-clustering has diameter at least $d$.
This is because of a 'quantitative pigeonhole principle': At least two of these points must be in a single cluster, which then has diameter at least $d$.

There are some tricks needed to figure out how much you can expand each cluster when inserting, and how far apart cluster centers should be in the phase after the current one. I'll spare you the details, but they lead to an 8-approximation, which can be improved using a neat randomization trick to a $2e$-approximation.

The above trick works because we can build a witness structure that lower bounds OPT, and so the merging can be done carefully so that error doesn't propagate badly. What happens if you change the cost function to the $k$-median though, where the witness isn't as straightforward ?

An excellent source here is the work by Guha, Meyerson, Mishra, Motwani and O'Callaghan. This paper, which combines prior work on both the theory and practice of $k$-median clustering, presents the following idea:
Break the data set into pieces. Cluster each set "loosely", using many more than $k$ clusters. Weight each cluster by the number of points assigned to it. Cluster these weighted cluster centers to find the desired $k$ clusters.
The key observations they make are:
1. The total cost of clustering the pieces isn't too much larger than the overall clustering cost (so it gives a good lower bound). This follows from the triangle inequality.
2. The new "weighted" instance admits a solution that again is not too much larger than the optimal cost (is actually within a factor of 6 for any partition). This again follows from the triangle inequality.
The last piece of this puzzle is: how to cluster each set "loosely" ? They propose using a bicriterion method, whose cost is compared to the optimal $k$-median cost, but is allowed to use more medians. Such bicriterion methods are often easier to design.

Note that there's nothing stopping them from using their own algorithm recursively to cluster the weighted cluster centers. What's neat about this is that it can be adapted to a streaming setting using another standard trick.

Specifically, if you have an algorithm that breaks things into pieces and then recombines them (a two-level scheme), you can first recursively apply it to get a multi-level hierarchical scheme, and then you can implement this in a streaming setting by updating (potentially) an entire side of the computation tree whenever a new item appears. This can also be viewed as a lazy implementation of the hierarchical strategy.
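Here's a toy end-to-end sketch in Python of the two-level scheme (one-dimensional points for simplicity, and a farthest-point heuristic standing in for the bicriterion $k$-median subroutine the paper actually calls for; all names and parameter choices, like using $3k$ "loose" centers per chunk, are mine):

```python
def nearest_assign(weighted_pts, centers):
    """Snap each weighted point to its nearest center; return centers with summed weights."""
    w = [0] * len(centers)
    for p, wt in weighted_pts:
        j = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        w[j] += wt
    return [(centers[j], w[j]) for j in range(len(centers)) if w[j] > 0]

def greedy_cluster(weighted_pts, m):
    """Toy 1-d stand-in for a k-median subroutine: farthest-point centers + assignment."""
    pts = [p for p, _ in weighted_pts]
    centers = [pts[0]]
    while len(centers) < min(m, len(pts)):
        far = max(pts, key=lambda p: min(abs(p - c) for c in centers))
        if min(abs(far - c) for c in centers) == 0:
            break  # no new distinct point to add
        centers.append(far)
    return nearest_assign(weighted_pts, centers)

def two_level_cluster(stream, k, chunk_size, cluster=greedy_cluster):
    """Cluster each chunk 'loosely' into weighted representatives (more than k
    centers), then cluster the weighted reps down to k clusters."""
    reps, chunk = [], []
    for p in stream:
        chunk.append((p, 1))  # unit-weight input points
        if len(chunk) == chunk_size:
            reps.extend(cluster(chunk, 3 * k))
            chunk = []
    if chunk:
        reps.extend(cluster(chunk, 3 * k))
    return cluster(reps, k)
```

Nothing stops this routine from being applied again to its own weighted output, which is the recursive/streaming extension described above.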

## Monday, October 31, 2011

### Models for MapReduce

I've been listening to Jeff Phillips' comparison of different models for MapReduce (he's teaching a class on models for massive data). In what follows, I'll add the disclaimer IANACT (I am not a complexity theorist).

There's something that bothers me about the various models floating around that attempt to capture the MapReduce framework (specifically the MUD framework by Feldman et al, the MRC framework by Karloff, Suri and (co-blogger) Vassilvitskii, and the newer Goodrich-Sitchinava-Zhang framework).

For "seasoned" models of computation like the RAM, or PRAM, or even I/O and streaming, there's a sense of identifying some intrinsic feature of the computation that you wish to capture, building a model around it, and then identifying resources that you wish to expend little of. After the dust settled around the different TM/RAM models for effective computation, we now accept (for the most part) the RAM as a good model for sequential computation. The I/O model identifies disk access as the key operation to capture and builds a model of computation where disk access is a limited resource. Streaming identifies "amnesia" as the key limitation on a computation, and builds a model of computation centered around that.

In all these cases, it's possible to grumble that $O(n^5)$ isn't really "efficient" or that galactic algorithms shouldn't count, or that ignoring main memory computations is "cheating" in the I/O model, or even that allowing $n^{1-\epsilon}$ memory storage is useless for streaming. But those discussions are secondary to the model: and in fact history tells us that inevitably models that appear to allow for horribly inefficient computation facilitate algorithm design that is quite efficient !

Which is why I'm puzzled when I look at the different ways in which the MapReduce framework is being modelled. I see a number of choices being made:  number of processors should be $O(n^{2-\epsilon})$, or total memory usage should be $O(n^\epsilon)$ or number of 'rounds' of computation should be polylogarithmic or even constant.

At one level I understand these choices: for example, a memory bound of $n^\epsilon$ appears to lead to constant $(1/\epsilon)$ round algorithms.

But should the model be getting into decisions about what's efficient before even deciding what resources it's trying to capture ?

Ultimately, this for me gets back to a question that I asked at the W8F workshop back in May:
What is the intrinsic computational metaphor that we are trying to capture with these models ?
It's not just parallelism. It's not just distributed computation. It's not even streaming (or more generally sublinear computations). Is it some delicate tradeoff between these three ? Should we think of map-reduce models like the circuit classes that have bounds on both space and depth ? Wouldn't this be an unholy mess ?

I suspect that over time, if the MapReduce computational framework persists, then these issues will get shaken out. But there's a provisional (and exciting!) feel about the models we currently have.

## Wednesday, October 12, 2011

### O() notation breeds laziness

I will often write "We store the $n^2$ pairs of objects in a data structure" when what I really mean is "We store the $\binom{n}{2}$ pairs ..". Of course there's no "real" difference in asymptotic land. But I've often found that students reading a paper will puzzle over asymptotic handwaving wondering why someone wrote $O(n^2)$ when it should really be over all pairs and whether they're missing some key insight. It's even worse when people throw in gratuitous and specific constants just to make an analysis work ("why do we really need to run for 18.5 n steps???").

While I completely understand the reasons people do this (ranging from last-minute deadline insertions to a misguided attempt at clarity), I think that taking these shortcuts often makes a paper harder to read rather than easier. There are places where the asymptotics are all that matter ("Run this randomized strategy $O(\log n)$ times and apply a Chernoff bound") but even there, it doesn't hurt to use $c \log n$ (or a specific constant if you need a specific polynomial bound) instead. And when there actually is a precise expression that conveys meaning (like in the binomial example above), it's far better to use that to signify what's going on.

I think the precept we teach students in algorithm analysis (don't throw in O() notation till the end) is one we'd do well to heed ourselves.

Caveat: I'm not advocating this in ALL instances. I'm advocating a greater sensitivity to when and where the use of O() notation is more or less appropriate.

### SoCG 2012 CFP

It's out ! Please also note the new workshop CG:APT (highlighted) that will attempt to expand the scope of the conference and bring in more applied work. I'm looking forward to hearing more about it.

28th Annual Symposium on Computational Geometry (SoCG 2012)
June 17-20, 2012 (tentative)
University of North Carolina, Chapel Hill
In cooperation with ACM SIGACT and SIGGRAPH

The Twenty-Eighth Annual Symposium on Computational Geometry will be held in University of North Carolina, Chapel Hill. We invite submissions of high-quality papers that describe original research on issues related to computational problems arising from geometric considerations. We also invite submissions of original videos and multimedia on these same themes. This year, SoCG will be collocated with a workshop, Computational Geometry: Applications, Practice, and Theory (CG:APT), whose call for submissions (due Feb 27) will be distributed separately.

The topics of the SoCG 2012 Symposium reflect the rich diversity of research interests in computational geometry. They are intended to highlight both the depth and scope of computational geometry, and to invite fruitful interactions with other disciplines. Topics of interest include, but are not limited to:
• design, analysis, and implementation of geometric algorithms and data structures; lower bounds on the computational complexity of geometric problems;
• mathematical, numerical, algebraic, and experimental issues arising in geometric algorithms and heuristics; discrete and combinatorial geometry; computational topology;
• novel applications of computational geometry in computer graphics, geometric modeling, computer-aided design and manufacturing, scientific computing, geographic information systems, database systems, robotics, computational biology, machine learning, sensor networks, biomedical imaging, combinatorial optimization, statistical analysis, discrete differential geometry, theoretical computer science, graph drawing, pure mathematics, and other fields; novel problems in computational geometry arising from these fields.

Important Dates

November 22, 2011: Paper titles and abstracts (at most 300 words) due (23:59, Honolulu time)
December 2, 2011: Paper submissions due (23:59, Honolulu time)
February 17, 2012: Notification of acceptance/rejection of papers
February 27, 2012 : Submissions of videos and multimedia due (23:59, Honolulu Time)
March 8, 2012: Notification of acceptance/rejection of video/multimedia
March 22, 2012: Camera-ready versions due for papers and video/multimedia abstracts
April 26, 2012: Final versions of video/multimedia due
June 17-20, 2012 (tentative): Symposium in Chapel Hill, North Carolina

Call for Papers

We invite submissions of high-quality papers describing original research on geometric algorithms and data structures, their mathematical foundations and correctness, their implementation, and their applications.

Final versions of accepted papers will be published by ACM in the symposium proceedings. Proceedings will be distributed to symposium participants and will also be available from ACM for purchase and through the ACM digital library. An author of each accepted paper will be expected to attend the Symposium and give a presentation (approximately 20 minutes) of the paper. Authors of a selection of papers from the conference will be invited to submit extended versions of their papers to a special issue of one or more journals. This year we also plan to confer two Best Paper Awards, and a Best Student Presentation Award. The Student Presentation award will be based on the quality of the presentation of a paper by a student during the conference.

Paper Submission

All submissions should be made electronically; see the EasyChair SoCG2012 web page https://www.easychair.org/conferences/?conf=socg2012 for detailed submission instructions (to be available shortly). If electronic submission is not feasible, please contact the program committee chairs, Tamal Dey and Sue Whitesides, well in advance of the submission deadlines.

Submission Guidelines
Papers should be submitted in the form of an extended abstract, which begins with the title and abstract page that contains the title of the paper, each author's name, affiliation, and e-mail address followed by a short abstract. The main body of the extended abstract should begin with a precise statement of the problem considered, a succinct summary of the results obtained (emphasizing the significance, novelty, and potential impact of the research), and a clear comparison with related work. The remainder of the extended abstract should provide sufficient detail to allow the program committee to evaluate the validity, quality, and relevance of the contribution. Clarity of presentation is very important; the whole extended abstract should be written carefully, taking into consideration that it will be read and evaluated by both experts and non-experts, often under tight time constraints.

Submissions should be typeset in single column format, using 11-point or larger font, with at least 1 inch/2.54 cm margins and single line spacing. Excluding the title page and bibliography, the extended abstract must not exceed 10 pages. Submissions deviating from these guidelines risk rejection without consideration of their merits.

All necessary details to verify the results must be provided. If they cannot fit within the 10-page limit, a clearly marked appendix containing omitted details should be included. Appendices are not counted in the 10 page limit, so while they may serve as a reference, they will be read by the program committee members at their discretion. The paper excluding the appendix should clearly describe the results and the approach to achieve them, and give sufficient confidence for their validity. The appendix should then give all the necessary details to verify correctness.

Anticipating the usual high overall quality of submissions, the program committee intends to interpret the scope of the conference broadly, and will seriously consider all papers that are of significant interest to our research community.

Authors must submit the title and an abstract (at most 300 words) of their papers by November 22, 2011. This pre-submission will be used to help make program committee reading assignments. Extended abstracts must be received by December 2, 2011 (23:59, Honolulu time). There will be no extension of these deadlines; late submissions will not be considered. Authors will be notified of acceptance or rejection by February 17, 2012. The final proceedings papers must be formatted in accordance with ACM proceedings guidelines; LaTeX style files will be made available to authors of accepted papers.

Concurrent submission of the same (or essentially the same) abstract to SoCG and to another conference with published proceedings is not allowed. An extended abstract of a paper that is under journal review, or scheduled for publication in a journal after June 2012, may be submitted, when it is clear that the extended abstract differs substantially from the journal version. In such cases, the authors must include the journal version in an appendix that clearly identifies the status of the journal submission.

Program Committee

• Pankaj Agarwal (Duke University)
• Dominique Attali (Gipsa-Lab, Grenoble)
• Gill Barequet (Technion-Israel Institute of Technology)
• Mark de Berg (Technische Universiteit, Eindhoven)
• Danny Chen (University of Notre Dame)
• Tamal Dey (The Ohio State University) (co-chair)
• Vida Dujmović (Carleton University)
• David Eppstein (University of California, Irvine)
• Leonidas Guibas (Stanford University)
• Sylvain Lazard (INRIA, Nancy Grand Est)
• Dinesh Manocha (The University of North Carolina at Chapel Hill)
• Steve Oudot (INRIA Saclay)
• Konrad Polthier (Freie Universität Berlin)
• Edgar Ramos (Universidad Nacional de Colombia, Medellín)
• Jian Sun (Tsinghua University)
• Takeshi Tokuyama (Tohoku University)
• Yusu Wang (The Ohio State University)
• Max Wardetzky (University of Göttingen)
• Sue Whitesides (University of Victoria, British Columbia) (co-chair)

Call for Video and Multimedia Presentations

Video and multimedia presentations are sought for the 21st Annual Video and Multimedia Review of Computational Geometry, to accompany the 28th Annual Symposium on Computational Geometry. This review showcases the use of visualization in computational geometry for exposition and education, for visual exploration of geometry in research, and as an interface and a debugging tool in software development. Algorithm animations, visual explanations of structural theorems, descriptions of applications of computational geometry, and demonstrations of software systems are all appropriate.

Three to five minutes is ideal for most presentations; eight minutes is the upper limit. Accepted video and multimedia presentations will have an abstract in the published conference proceedings; video/multimedia authors will have an opportunity to present their work at the Symposium during a dedicated video session. Accepted presentations will be available online in various formats in a web proceedings. See http://www.computational-geometry.org/ for examples of previous years' proceedings.

Submissions of video clips in QuickTime or MPEG-4 compressed formats (e.g., XviD or DivX version 6) are encouraged. We also encourage submissions of Macromedia Flash, Java applets, and limited forms of other multimedia or software. These formats must come with a script that will allow them to be distributed in both interactive and canned Quicktime or MPEG video formats. In case of doubt, please email the Video and Multimedia Program chair.

Each submission should include a one or two-page description of the material shown in the presentation, and where applicable, the techniques used in the implementation. The final two-page descriptions must be formatted according to the guidelines for proceedings. LaTeX style files will be provided to authors of accepted presentations.

Submission

Submissions should be deposited online where they are accessible through the web or via FTP. Send email to the Video/Multimedia committee chair, Christian Knauer by 23:59 Honolulu Time, Monday, February 27, 2012, with the following information: the names and institutions of the authors, the email address of the corresponding author, and instructions for downloading the submission. For ease of sharing and viewing, we encourage (but do not require) that each submission be uploaded to YouTube, and that the corresponding URL be included with the submission.

We explicitly encourage video/multimedia submissions that support papers submitted to the Symposium. Submitted papers and associated video/multimedia submissions will be treated entirely separately by the respective committees: acceptance or rejection of one will not influence acceptance or rejection of the other.

Authors will be notified of acceptance or rejection, and given reviewers' comments by March 8, 2012. For each accepted submission, the final version of the 2-page textual description will be due by March 22, 2012 (electronically) for inclusion in the proceedings. Final versions of accepted video/multimedia presentations will be due April 26, 2012 in the best format available.

Important Dates

Feb. 27, 2012 (23:59 Honolulu time): Video and Multimedia submissions due
Mar. 8, 2012: Notification for Video/MM submissions
Mar. 22, 2012: Camera-ready video/MM abstracts due
Apr. 26, 2012: Final versions of video/MM presentations due

Video and Multimedia Presentation Program Committee
• Helmut Alt - Freie Universität Berlin
• Sandor Fekete - TU Braunschweig
• Panos Giannopoulos - Universität Bayreuth
• Xavier Goaoc - INRIA Nancy Grand Est
• Rolf Klein - Universität Bonn
• Christian Knauer (chair) - Universität Bayreuth
• Joseph S. B. Mitchell - SUNY Stony Brook
• Bettina Speckmann - TU Eindhoven
• Fabian Stehn - Universität Bayreuth
• Alexander Wolff - Universität Würzburg

SOCG 2012 Local Arrangements

Jack Snoeyink, Chair

## Friday, September 23, 2011

### Finding the right reference for finding the majority element

Probably everyone knows the standard trick for determining in one pass whether a stream admits an element occurring more than half the time. But what not everyone might know is the origin of this method.
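For reference, the trick in question is the Boyer-Moore majority vote; a minimal sketch (if a majority element is not guaranteed to exist, the returned candidate must be verified with a second pass):

```python
def majority_candidate(stream):
    """Boyer-Moore majority vote: one pass, O(1) space.

    Returns the only element that *can* occur more than half the time;
    verify with a second pass if a majority is not guaranteed."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate

# 'a' occurs 4 times out of 7, a strict majority
assert majority_candidate(['a', 'b', 'a', 'c', 'a', 'b', 'a']) == 'a'
```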

Stuart Shieber writes about the algorithm (discovered by Boyer and Moore) and the mishaps it had on the way to publication. The article also describes how the followup paper by Misra and Gries came into being. Interestingly, I almost always see the Misra-Gries generalization being cited, and not the original Boyer-Moore method.
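The Misra-Gries generalization keeps at most k-1 counters and catches every element occurring more than an n/k fraction of the time; a sketch of that one too (again, surviving candidates need a verification pass):

```python
def misra_gries(stream, k):
    """Misra-Gries frequent elements: one pass with at most k-1 counters.

    Every element occurring more than n/k times survives as a key,
    but surviving keys must be verified in a second pass."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # decrement every counter; drop any that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)

# 'a' occurs 5 times in a stream of 12, more than 12/3 = 4
candidates = misra_gries(['a'] * 5 + ['b'] * 4 + ['c', 'd', 'e'], k=3)
assert 'a' in candidates
```

Setting k = 2 recovers the Boyer-Moore majority vote as a special case.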

(HT Fernando Pereira)

## Wednesday, September 07, 2011

### ECML-PKDD Conference Report: Part II

My student Parasaran Raman continues his conference report from ECML-PKDD in Athens. For his first post, see here.

Christopher Bishop gave an excellent plenary talk on the first day about the need to embrace uncertainty. He talked about the Kinect technology, which uses an infra-red camera and sensors to measure the distance of a person from the screen. Instead of modeling the problem as a motion-tracking problem, the folks at MSR treated the challenge as a pattern recognition problem. Each frame is cold-started and understood as a standalone picture, rather than tracked in a video sense. The goal is then to classify sets of pixels enclosed inside an axis-aligned rectangle. The training data was collected as a series of motion-capture videos, by having test subjects wear sensors on the body Hollywood-animation style, and was combined with some synthetic data, corrupted artificially since the synthetic examples are “too pure”. The training set consists of a whopping 1 million examples of body positions.

They then use a decision tree to classify the pixels into one of 31 predefined body parts by maximizing an information-gain function. Another sub-goal Bishop talked about was identifying human joints; for this, they use a kind of mean-shift technique, applying a simple Gaussian smoothing function to the joints.
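The splitting criterion here is the standard one for decision trees; a minimal sketch of computing information gain for a candidate split (an illustration of the generic criterion, not MSR's actual pipeline):

```python
import math

def entropy(labels):
    """Shannon entropy of a label multiset, in bits."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(parent, left, right):
    """Entropy reduction from splitting `parent` into `left` and `right`;
    a decision tree greedily picks the split that maximizes this."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# a perfect split of a 50/50 label mix gains exactly 1 bit
gain = information_gain(['a', 'a', 'b', 'b'], ['a', 'a'], ['b', 'b'])
```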

The second half of the talk revolved around model-based ML, where Bishop recommends constructing the model equivalent of algorithms. He said, and I quote, “Don't look at the algorithms; look at the models”. He talked about an application where the goal is to rank players who play each other in a common environment. They have a new rating system, TrueSkill, that extends the popular Elo rating system [which typically only works in a two-player game like chess] to multi-player team games. TrueSkill performs a kind of Bayesian ranking inference on graphs by local message passing.
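TrueSkill's full Bayesian message passing is beyond a quick sketch, but the Elo baseline it generalizes is easy to write down; a minimal version (the K-factor of 32 and the 400-point logistic scale are conventional choices, assumed here for illustration):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update for a two-player game.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    Returns the updated ratings (r_a', r_b')."""
    # expected score of A under the logistic model
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# equal ratings, A wins: A gains k/2 = 16 points
new_a, new_b = elo_update(1500, 1500, 1.0)
```

TrueSkill replaces the single point rating with a Gaussian (mean plus uncertainty) per player and propagates beliefs through a factor graph, which is what lets it handle teams and multi-player games.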

He also introduced Infer.NET, a framework for running Bayesian inference in graphical models [http://research.microsoft.com/en-us/um/cambridge/projects/infernet/]. This looks like a cool tool to play with, especially since it provides a single platform for classification and clustering problems.

The plenary talk on day two was by Rakesh Agrawal. In a talk focused on a very interesting application, Rakesh introduced the idea of using data mining to enrich education. The key idea is to take concepts from textbooks, map each to a relevant Wikipedia article, and then induce relationships between concepts.

The goal is to enrich sections of the textbooks by recommending illustrative pictures to go with the text. Dispersion [a measure of how unrelated the concepts are] and syntactic complexity play a big role in deciding “which picture should go where”. Since search engines fail when given long queries, one challenge is to find key concepts in a section that can then be matched to the most similar article in Wikipedia. The underlying assumption is that most high school textbook sections will have an associated Wikipedia article.

Rakesh showed results of applying this to a huge word corpus from NCERT, which drafts the high school textbooks used by most schools in India. He also talked about computing an immaturity score to see if a section is well written, and, not surprisingly, subjects like physics and chemistry scored over commerce and economics!

To summarize, two things go into solving the problem of augmenting textbooks with images: “Comity” [where they identify lots of short key queries by concatenating 2 or 3 important concepts at a time] and “Affinity”. Comity ranks images by computing a relevance score for each image in the context of the concepts picked out from a section, and Affinity identifies relevant webpages and computes an importance score. Another challenge is that these techniques cannot operate one section at a time, since the textbooks would probably not like the images to repeat!

## Tuesday, September 06, 2011

### ECML-PKDD conference report: Part I

My student Parasaran Raman is in Athens attending ECML-PKDD and associated workshops, and will be updating this blog regularly with details from the conference. His posts are edited lightly for format and to add links.

I am in Athens [+1 to location-driven-research] to attend ECML PKDD 2011, and I am at the second MultiClust workshop today, held in conjunction with the main conference.

The first talk of the morning was an invited talk by Michael Houle on "Combinatorial approaches to clustering". Michael Houle began the talk with how, in high dimensions, all our intuitions about space go for a toss. One example he gave illustrated how most points are concentrated along the boundaries in high dimensions. The talk revolved around how similarity measures between partitions cannot be fully trusted: though a range of combinatorial and spatial measures have been introduced, each suffers from some kind of drawback. He also talked about constructing secondary similarity measures [“Secondary similarity between two points v and w is defined in terms of data objects in the common intersection of neighborhoods based at v and w, where the neighborhoods themselves are determined according to a supplied primary similarity measure”] and how they can get around the curse of dimensionality.
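A concrete (and common) instance of such a secondary measure is shared-nearest-neighbor similarity: rank neighbors under the primary measure, then compare neighborhoods. A minimal sketch, assuming Euclidean distance as the primary measure:

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] under Euclidean distance."""
    dists = sorted((math.dist(points[i], points[j]), j)
                   for j in range(len(points)) if j != i)
    return {j for _, j in dists[:k]}

def snn_similarity(points, i, j, k=3):
    """Secondary similarity of points i and j: the size of the overlap
    of their k-nearest-neighbor sets under the primary measure."""
    return len(knn(points, i, k) & knn(points, j, k))

# points 0 and 3 are opposite corners of a unit square, far from the outlier
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
overlap = snn_similarity(points, 0, 3, k=3)
```

The appeal in high dimensions is that the overlap depends only on neighborhood *ranks*, not on raw distances, which concentrate and become uninformative.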

One interesting question that was thrown out was "Can we do fully automated clustering?", i.e., can we convert similarity functions into rankings on the neighbors, assume no knowledge of the distribution of the data, and automatically pick 'k'? The talk moved on to approaches for computing the quality of a partition via a significance score for each point, which depends on the correlation between a member of a set and the members of neighboring sets and other members of the same set, along with the size of the cluster and the feature set. Once we know an object's significance, we can use it to change the shape of a cluster by adding and removing points.

Bart Goethals gave another invited talk, titled “Cartification: from similarities to item set frequencies”. He talked about computing the k nearest neighbors of every data point to find the right centers of clusters. The idea is that true 'centers' occur in many k-NN baskets.
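A rough illustration of the counting idea (not the actual cartification algorithm from the talk): form each point's k-NN basket, then count how often each point appears across all baskets; points near dense centers appear often, outliers rarely:

```python
import math
from collections import Counter

def knn_basket_counts(points, k):
    """Form each point's k-nearest-neighbor basket (Euclidean distance),
    then count how often each point index appears across all baskets;
    frequently appearing points are candidate cluster centers."""
    counts = Counter()
    for i in range(len(points)):
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(len(points)) if j != i)
        counts.update(j for _, j in dists[:k])
    return counts

# the outlier at (5, 5) appears in nobody's basket
counts = knn_basket_counts([(0, 0), (0.1, 0), (0, 0.1), (5, 5)], k=2)
```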

There were many interesting ideas out of the other talks at the workshop. I will highlight a few of them:

1. One of the first talks, on “Subjectively interesting alternate clusters”, used an information-theoretic framework to find interesting alternate clusters.
2. Jilles Vreeken gave a very good talk about current approaches in pattern mining and how we can use that knowledge in data mining. In this talk, titled “where pattern met subspace cluster”, he highlighted the similarities between subspace clustering and pattern mining, where the goal is to find informative local structures.
3. The talk on “Factorial clustering” was about using latent variables to generate multiple clusterings. One question they wanted answered was the right way to compute the distance between partitions, and I said, "Hey! Use ours!”
4. Another talk, on “Browsing clustering alternatives", outlined constructing a Merge-Split tree structure on the space of partitions to enable user-driven browsing of partitions.
My talk was the final one of the session. I talked about some recent work with Suresh Venkatasubramanian and Jeff Phillips on exploring the landscape of clusterings by sampling from the space of partitions proportionally to a quality measure. It was good to see a lot of head-nods during the talk. A couple of people from industry agreed that there exists data that is multi-clusterable, and that the hard part is finding it. So presenting a landscape to pick from looked like a useful idea!

The day ended with a rooftop reception with a magnificent view of the Acropolis and the temple of Zeus! I had very useful dinner conversations with Michael Houle and Thomas Seidl about general problems in the area of meta-clustering and how to compute distance metrics on partitions. We chatted about combinatorial distances, spatial distances, and secondary-level distances, where one constructs a ranking on top of an existing distance between many members.

## Monday, September 05, 2011

### FOCS 2011 Registration open

Marek Chrobak informs me that FOCS 2011 registration is open. The conference is being held from Oct 22-25 in Palm Springs CA, home of the Palm Springs Aerial Tramway that does a record 8500 ft (2600 m) elevation gain to give you views of the valley.

When you're not being made dizzy by the views, attend a tutorial or two or three! Cynthia Dwork, Kirk Pruhs and Vinod Vaikuntanathan are the headliners for a day of tutorials on Saturday Oct 22. Shafi Goldwasser will be awarded the 2011 IEEE Emanuel R. Piore Award, given for "outstanding contributions in the field of information processing, in relation to computer science".

I'm told there are also some talks, including some that smash through barriers for the k-server conjecture, metric TSP (in graphs) (twice over), and exact 3SAT.

Register now so that local organizers don't go crazy (I can assure you that they do :)). The deadline is Sep 29 for cheap rates on hotels and registration. And if you'd like to be a FOCS 2011 reporter and write a set of blog posts summarizing the conference for cstheory.blogoverflow.com, please add a note to this post.