Wednesday, December 30, 2009

Reading papers.

It's been a bad few months for blogging, while I've been doing (shudder) "real work". Whenever people ask me how I make time for blogging, I always say that it takes less time than one thinks. This is generally true, but the last few months have been a blizzard of paper and proposal deadlines, and I just never got the time to sit down and pen coherent thoughts. I've also noticed that my tweeting is bleeding thoughts away from the blog: random throw-away comments that might have been assembled into a blog post end up merely getting tweeted.

For the first time in a long time, I took a "true" vacation, where not even a laptop came with me (ok, I took my iPhone, but that's not the same :)). It was only a week, but I appreciated the complete downtime, especially coming off the frenzy of deadlines. It's been hard to get back into the swing of things, but it was a really good way to recharge.

But you don't want to hear about any of that. What you really want to hear about is..... my lament on how to organize paper reading.

When I was a grad student, life seemed simple. STOC/FOCS/SODA (and then SoCG) would come around, we'd peruse the paper list, chatter on about them, and figure out if someone had scooped us or not. Now, things seem more complicated. Especially since I'm the sole algorithms person at the U, I feel the need to catch up on the latest (and not-so-latest) hot topics in theoryCS, at least to keep abreast of things and share the latest news with my students. I also work in a number of more 'applied' areas, which means that there's a SIGMOD/VLDB/PODS/ICDE timeline to keep track of, and more recently a NIPS/ICML/COLT timeline, not to mention even more applied areas like ICCV/CVPR/MICCAI (more on that later).

There's a large and complicated taxonomy of paper reading. Some of the main categories are:
  • Papers I'm reading for technical material when working on a problem - this is the easiest kind to manage, because you have to read it RIGHT NOW.
  • Papers that seem related to the problem I'm working on, and probably need to be cited, but are not in the critical path. This isn't too hard either - things like Mendeley allow me to save (and tag) papers for later use with just a few clicks. It's not a perfect system (what if the paper's only on someone's home page?), but it mostly works.
Then come the more complicated categories:
  • Papers related to a problem that I'm not quite working on right now, but seem relevant. I can sock them away, but I have to remember to read them when I return to the other problem.
  • Papers that everyone's talking about at the latest-greatest conference, but might have nothing to do with my specific suite of problems (like for example the Moser LLL proof).
  • Papers that might help me catch up on an area that is very hot, but which I wasn't following from the beginning (cough cough AGT cough cough)
  • Papers that were just announced in the latest conference/arxiv/eccc issue, that sound worthy of perusal.
There are many technological solutions to squirrel stuff away: I even use Dave Bacon's arxiview app for the iPhone (and btw, I've found the only use for the 2-column format - reading papers on the iPhone). I've also experimented with private wordpress installations to allow me to bookmark interesting papers.

But the real problem, that I have yet to crack, is how to systematically plow through the mountains of reading that I must do to stay abreast of the fields I'm interested in. I've tried "read one paper a day", or "read papers over the weekend" and things like that, but nothing ever seems to stick, and I'm curious about what techniques people have used that actually work. I'm not excluding technological solutions, but I think the problem goes beyond that.

So what say you all ?

Tuesday, November 24, 2009

Locating SODA outside the US

David Johnson forwards a note from the SODA Steering Committee about the possibility of having SODA outside the US. Summary: it can be done, with the kind of bid information that accompanies bids to other conferences with independent local arrangement. What follows is the note:

To those who might be interested in having a future SODA conference take place outside North America:

As you may know, local arrangements for SODA have to date always been handled by SIAM. The conference Department at SIAM has wide experience in organizing conferences in North America, but typically relies on help from local organizers when a conference they sponsor takes place elsewhere.

This presented difficulties when, at last January's SODA, the Business Meeting voted a preference for holding the 2011 SODA in Paris. SIAM and the SODA steering committee were unable to find any French organization that was prepared to handle the meeting in Paris. One organization might have been willing to host the meeting in the Parisian suburbs, but we judged that most of the voters for Paris would not have voted for "suburbs of Paris".

(No hotel was available in the second vote-getter, the Virgin Islands, and so SODA will be back at San Francisco, the 3rd-place choice, in 2011, January 22-25.)

In light of the experience in finding a location for the 2011 SODA, we are going to change the ground rules a bit for the 2012 site selection. Proposals of sites outside of North America will still be welcome, but to be considered, they must include the details of what group will handle local arrangements, where precisely the conference will be held, and what the expected hotel costs will be. In other words, the kind of information typically provided by groups proposing to host STOC or FOCS. (Proposals from groups that can collect registrations and take financial responsibility are preferred because of the added costs of SIAM providing these services in a country outside North America.)

If you are interested in making such a proposal at the SODA 2009 Business Meeting, please contact Kirsten Wilden (wilden@siam.org) for more details about what kinds of information SIAM will need about your proposal.

There has also been some thought that costs for SODA in North America could be reduced if local arrangements were placed in the hands of local volunteers, as they are for FOCS and STOC. If anyone wishes to propose to host SODA 2012 in this way, their detailed proposals will also be considered at the Business Meeting. (Again, contact Kirsten Wilden at SIAM in advance.)

And of course, proposals for sites that the SIAM conference staff can handle on their own can be made as usual.

Hope to see you all at SODA 2009 in Austin, January 16-19.

David Johnson, SODA Steering Committee Chair.

Sunday, November 08, 2009

Anathem, and mathematical stereotypes

Neal Stephenson is (or should be !) a familiar figure in the sci-fi/speculative fiction landscape: his Cryptonomicon is a retelling of the story of Turing, along with much modern day drama involving advanced crypto and security. His new book, Anathem, came out with much fanfare, and is a vast tale set in a place where mathematics is a pursuit conducted in monastery-like places with strong religious overtones.

I'm reading Anathem right now, and am only at the beginning (so no spoilers please), but there's already a beautifully rendered discourse on stereotypes of the mathematician (or scientist in general). Inside the 'concent' (the monastery), these are termed 'Iconographies': formal templates by which to understand how the "saecular" world perceives the mathematicians. I was reminded of this when I was writing the post on Soviet-style mathematics and realized the stereotypes at the heart of the referenced WSJ article.

So here are some of the iconographies discussed early on (you get one point for each one that you've encountered in real life):
  • Tennestrian: seemingly clownish eccentric figures that have a darker side, luring impressionables and innocents into folly (a play on the story of Socrates)
  • Doxan: brilliant, unemotional and cold, but as a consequence subordinate to passionate leaders with true human feeling. (can you say 'Spock'?)
  • Yorran: Criminally insane mad scientist shut up in a lab with plans to conquer the world.
  • Rhetorian: Sinister conspiracy plotters, out to take over the world by planting minions out in the secular world to attain positions of power.
  • Muncostran: Eccentric, lovable, dishevelled theoreticians, absent-minded and good-natured (the subtext being: ignorable)
  • Pendarthan: High-strung nervous meddling know-it-alls, unable to understand the realities of life, always subordinate to more 'masculine' (his words, not mine) secular folk.
  • Klevan: Awesomely wise elder statesman who can solve all the problems of the world (apart from Einstein, I don't know of anyone who achieves such stature in our world)
  • Baudan: Cynical frauds living in luxury at the expense of the common man (this sounds like the viewpoint of letter-writers to the Salt Lake Tribune)
  • Penthabrian: keepers of mystical secrets handed down from above, with 'theory' as a smokescreen used to fool the lay folk (if only...)
  • Moshianic: A combination of Klevan and Penthabrian - viewed as the most dangerous iconograph because of the high expectations placed on the theorist's shoulders.

It's a neat trick to identify the ways in which the outside world perceives the 'theors', as they are called, and in doing so understand where the outsider is coming from, and what kind of threat they pose. I suspect I'm going to start classifying people in "the real world" the same way when I describe what I do.

Saturday, November 07, 2009

Soviet-style mathematics

Via Anand Kulkarni (aka polybot) comes an interesting article in the WSJ by Masha Gessen on Grigori Perelman, Soviet-era mathematics and the question of 'big math'. The premise of the article (Masha Gessen has a book out on Perelman and the Poincare conjecture) is that special environments are needed to prove big results, and that the Soviet-era mathematical enclaves fostered this environment both because of, and in spite of, the Soviet political system.

It is indeed true that amazing work came out of the isolated confines of Soviet mathematical institutes, often parallel to or well before similar work in the Western world. There's a joke that goes around theoryCS circles that for every theorem proved before the 80s in the west, there's an equivalent result proved 10 years earlier by a Russian mathematician. We need look no further than the Cook-Levin theorem, the Koebe-Andreev-Thurston theorem (on circle packings), Kolmogorov-Chaitin-Solomonoff complexity (and according to some, the Cauchy-BUNYAKOVSKY-Schwarz inequality, though this is disputed).

But in the article is a more thought-provoking claim:
The flow is probably unstoppable by now: A promising graduate student in Moscow or St. Petersburg, unable to find a suitable academic adviser at home, is most likely to follow the trail to the U.S.

But the math culture they find in America, while less back-stabbing than that of the Soviet math establishment, is far from the meritocratic ideal that Russia's unofficial math world had taught them to expect. American math culture has intellectual rigor but also suffers from allegations of favoritism, small-time competitiveness, occasional plagiarism scandals, as well as the usual tenure battles, funding pressures and administrative chores that characterize American academic life. This culture offers the kinds of opportunities for professional communication that a Soviet mathematician could hardly have dreamed of, but it doesn't foster the sort of luxurious, timeless creative work that was typical of the Soviet math counterculture.

For example, the American model may not be able to produce a breakthrough like the proof of the Poincaré Conjecture, carried out by the St. Petersburg mathematician Grigory Perelman.

This is a reflection of one of the enduring myths of mathematical research, "a mathematician would be happy in jail if they had paper and pen", with a bit of the 'a mathematician is a solitary (and slightly crazy) genius'. I can see the allure in the idea: mathematics requires great concentration, and removal of distractions would surely make it easier to focus on a big problem.

But is this really impossible to achieve in the Western model of research ? After all, even Perelman's work built heavily on a program first outlined by Richard Hamilton from Columbia. Andrew Wiles proved Fermat's Last Theorem while at Princeton. Ketan Mulmuley has been banging away at P vs NP while shuttling between Chicago and IIT Bombay (yes, I know it's not a perfect comparison because it hasn't been resolved yet). Stephen Cook proved that SAT is NP-Complete while at Toronto. And so on and so forth.

Possibly one argument in favor of the 'isolation: good' theory is that Perelman didn't need to prove himself for 6-7 years, maintain a steady stream of funding, and teach lots of classes in order to "earn" the right to study such a hard problem. It's hard to imagine a researcher in the US being able to do this before they get some kind of job security (tenure, or otherwise).

Monday, November 02, 2009

Innovation in Computer Science

As the polylogblogdogslogglog blog points out, the ICS results are out. 39 papers were accepted in all - at some point I knew the number of submissions, but I've forgotten since.

The ICS folks didn't make life easy for themselves by explicitly stating that they wanted "conceptual contributions". But looking over the list of papers, a few things come to mind:
  • It's a great list of papers. Nothing to complain about really, and any of these could have been a credible paper at FOCS/STOC
  • The Arora et al paper on designing derivatives using NP-hard problems has already received so much chatter, one might argue that the conference mandate has already been satisfied. Similarly for the quantum money followup.
  • Even if the whole 'conceptual contributions' thing doesn't pan out, I see no harm in having a third conference inserted between FOCS and STOC - the more the merrier.
  • I guess innovation = "game theory + crypto + quantum + misc" :)
A side note. 8 of the 12 PC members have papers at the conference. This is a departure from the theory "norm". While I don't necessarily think it's a problem (everyone except theoryCS does this, and this is also a new conference), it's worth discussing. Is this something we'd like to have happen in other theory conferences as well ? Recall that the first year of the SODA 2-page experiment also had PC-submitted papers (and that was mainly to ensure submission volume, from what I recall).

Update: Shiva Kintali has PDFs for the accepted papers.

Sunday, November 01, 2009

What is computational topology ?

Jeff's post on his computational topology class (and the wonderful class outline) prompted me to write about something that I've tried to explain to people before.

Computational topology is arguably the hottest thing in SoCG-land right now, and has been so for a number of years (for curious folk, the "other" hot topic is high dimensional approximate geometry). But if you ask different people, you'll get different answers to the question "what is computational topology ?". I've even had to explain to local students why my computational geometry class is different from the computational topology class being offered. So here goes:

CT I: Using topological ideas to understand the 'shape' of data
This is the version of CT that has taken up the largest fraction of CT mindshare in the SoCG land. Dating back to work by Edelsbrunner and Mucke on alpha-shapes, the field now encompasses a range of ideas for extracting topological structure from data. New topological constructs that have come from this area include various notions of persistence, as well as related work on reconstruction, data smoothing, and visualization.

Afra Zomorodian's thesis (now a book) is a nice introduction to these concepts for a CS audience. Herbert Edelsbrunner is coming out with a book on this topic very soon (Jan 16, 2010! Mark your amazons!), and Valerio Pascucci teaches a class on computational topology at the U.

CT II: Asking computational questions about fundamental topological objects and concepts.

This is closest to the 'Computational X' flavor of work, where a computational perspective is brought to the study of X. There are many interesting problems here, and they have a nice discrete flavor that makes them 'friendly' for theoryCS folk. For example, computing the Betti numbers of a simplicial complex efficiently, or finding homotopy equivalent paths on a surface, or lots of work on graphs on surfaces.

I don't think there's a single book on this topic. JeffE has put together a fantastic course outline (with reading material), and there's also a great book on graphs on surfaces by Mohar and Thomassen. It's worth noting that some of the deeper tools in the Robertson-Seymour graph minor results take us into graphs-on-surfaces-of-large-genus land.

CT III: Using topological ideas in the heart of complexity theory.
Almost no one I know uses 'computational topology' in this sense, and there isn't a coherent and connected body of work to talk about as such. But there are some fascinating results at the core of complexity theory that rely on topological constructions. There's the algebraic complexity work of Yao/Steele/Ben-Or and others, showing that lower bounds for the algebraic complexity of certain problems can be related to the sum of Betti numbers of associated surfaces. There's the Kahn-Saks-Sturtevant (and Chakrabarti-Khot-Shi) work on evasiveness of graph properties, and then the work (that I don't quite understand) on topology in distributed computing (that got Herlihy, Saks, Shavit and Zaharoglou the Gödel Prize).

This is one of the reasons I think a course in topology (of some flavor, whether it be combinatorial, algebraic or point-set) should be required mathematical background for all theoryCS students.

Wednesday, October 28, 2009

Clustering interlude: Time series

In the absence of an actual post on clustering, I encourage all of you to go and read Sorelle's (or should I say ***SORELLE***'s) post on time series clustering.

Monday, October 19, 2009

"Wave"ing...

(while I wait for actual inspiration to strike)

I just acquired an invite to Google Wave (thanks to John Moeller), and have been finding it quite intriguing (note: you can't "get" Google Wave unless you're invited, and no, I don't have any invites to give out).

Google Wave is a mysterious combination of email, chat, wikis and browser extensions that's hard to describe - you can read descriptions of it all over the web, but unless you are able to get in and start playing, it's like trying to buy wine based on text descriptions of the taste.

So far I've used it to:
  • discuss an outline for next semester's seminar with my students (Topics in Graph Algorithms, for those curious)
  • Start working with a collaborator on an outline for a tutorial we want to do
  • set up administrative struts for our research group svn server (I've started using svn for writing papers - it's an amazing experience - more on that some other time)
By far the coolest feature of Wave for me is that you can include a LaTeX extension into any "wave" or conversation, and it automatically converts things between $$...$$ marks into LaTeX, which makes me hopeful that it will be useful for more substantive discussions (since I often wish I had such capabilities in Skype/Google chat).

Although the preview is painfully slow, and is crippled in various ways, the potential is clearly there, and as more people get on it, it will only start getting more effective. I'm looking forward to it.

Monday, October 12, 2009

On the virtue of NOT working on a problem

Semester has hit with a vengeance, and while I've been busy (among other things) with this and this, my clustering series has gone on temporary hiatus, hopefully to return shortly.

In all the pages and pages of advice given to grad students, postdocs, and starting faculty, I think one item tends to get left by the wayside, or at least is not explicitly stated.
You always underestimate the time spent managing a project from start to finish.
What I mean is this: problems (at least in theoryCS) are easy to state, and fun to work on. Sometimes they take a while to crack, and sometimes they give up their secrets easily. But the time you spend on any given project is much more than the actual time spent thinking about it. There's
  • Writing up the first few drafts
  • Iterating to get a polished submission version
  • (...this step repeats until the paper is accepted)
  • Preparing the final (often very cramped) version
  • Making slides for the talk/talks you'll be giving
  • Preparing a full/journal/arxiv version, which often involves simplifying, rewriting, reproving, adding new references, etc etc.
  • Submitting to a journal, and waiting endlessly for updates on its status.
  • Addressing reviewer concerns, and resubmitting
  • And finally, getting it into print.
The few-months thrill of actually thinking about the problem and solving it ends up being a multi-year odyssey filled with many fallow periods punctuated by bursts of activity.

It's not so much the time involved - papers tend to time-multiplex quite well so you're usually in different phases of the above sequence for different papers.

It's more a matter of motivation. I don't think I'm the only person who feels this, but once I have some nice results, and especially if there isn't follow-on work to be done, I get bored with a paper. Having to deal with it for months and months afterwards is then as excruciating as killing off zombies that keep coming back (not to mention what happens if it keeps getting rejected).

So be careful when you choose a project: make sure it can last through at least a few papers, or you'll spend a lot of time cursing yourself for the time you're sinking into it.

Sunday, September 27, 2009

Rajeev Motwani Memorial

I just returned from the technical workshop and memorial in honor of Rajeev Motwani. The workshop was excellently run by Ashish Goel, and designed (extremely well, I thought) as a mix of retrospective and technical content, with one speaker presenting a brief retrospective and introduction, and a technical speaker laying out a body of work.

The topics were varied: they started with Sanjeev Khanna discussing matchings in random graphs (Rajeev's thesis work), and going on to brand-new results in this area: for example, an expected O(n log n) time bound for matchings in d-regular bipartite graphs. Piotr Indyk talked about locality-sensitive hashing, David Karger talked about randomized min cuts, Aneesh Sharma discussed monetization problems in social networks, and Sudipto Guha concluded with an overview of data streams.

The talks were surprisingly technical: if I closed my eyes, I could have easily imagined being in a SODA conference room. The only difference was that people were actually paying attention, as opposed to clicking away on laptops (or tweeting!). It was a large crowd: over 100 people, by my casual count.

There were many retrospectives, given by Dick Karp, Jeff Ullman, Chandra Chekuri, and Ron Conway. Chandra spoke in particular about the experience of being Rajeev's student, and as a former student myself, his words touched me the most. He talked with feeling, compassion and honesty, and drew a compelling picture of a man that we began to know all over again.

There was a beautiful memorial service in the Stanford Church, with words in English and Sanskrit, a hauntingly beautiful hymn from the Vedas sung by Rajeev's elder daughter, and testimonials from colleagues and friends old and new. Don Knuth was the organist for the entire ceremony, and played pieces you didn't think could be played on a church organ. After the service, and a reception, there was a concert by one of Rajeev's favorite bands, Indian Ocean. They played amazing music, and I'm downloading their songs as we speak, but that's a tale for another time.

It was good to go back and meet people who I knew so well for a brief period of time, and then lost touch with. Many (if not all) of Rajeev's former students were there, and there were many others who cohabited the Gates Building along with me. All of us older, a little grayer, but still recognizable :). Spread-apart families often only get together at weddings or at funerals, and this was one of those occasions where it was great to see everyone, but as we all kept murmuring "unfortunately it had to happen like this".

If I had to describe the feeling that dominated my thinking that day, it was a sense of being robbed. Upon hearing testimonial after testimonial, anecdote after anecdote, listening to this divine rock group that Rajeev listened to and loved, I could only wonder at the many sides of this person whom I knew so little of. I wished I had known more about him: that our interactions had been more multidimensional than that of advisor and student, and that I (and my fellow students at the time) had seen more of the ebullience and vivacity that others spoke so vividly of.

By the end, a new picture began to emerge, of a 'hub', a 'connector' and a 'facilitator', someone who had the clarity to know what people really needed to succeed, and the self-effacement to stand back and make it happen, by connecting people together. He helped legions, and legions came to bid him farewell.

It therefore seems oddly fitting that his career in research started with studying random matchings, and ended with new explorations of social networks. His life, one might think, has always been about creating, analyzing and enriching connections.

Thursday, September 24, 2009

Recent items

Deadlines are keeping me busy, and away from blogging.

Speaking of which, the Fall Workshop on Computational Geometry has a submission deadline this Friday. The Fall workshop is a "true" workshop, in that you go, give a talk on whatever you're working on, and there's no messing around with proceedings, publications and things like that. It's a pure meet-and-chat kind of venue, which means the pressure is low, and the visibility is quite high (many of the east-coast geometry folks show up).

This year it's at Tufts, so the location is even better. So get those 2-page abstracts in !

In other news, the FOCS registration deadline is looming. Bill mentions a workshop to celebrate 50 years of FOCS (which dates back to when it was the conference on switching and circuits (!)) and 20 years of the GATech ACO program.

The workshop is NOT like the Fall workshop :). It's a series of invited talks by a star-studded cast: KARP !! YANNAKAKIS !! ALON !! BLUM !! All together, for one brief engagement !!

Sounds like a great idea: the kind of thing we probably need to do more of to get more folks to the conference.

Wednesday, September 16, 2009

Memorial Workshop for Rajeev Motwani: Update

I had mentioned a memorial workshop for Rajeev Motwani to be held at Stanford Friday Sep 25. Registration is now open for the workshop and the memorial service to follow.

Registration is free, but mandatory. So if you plan on attending either the workshop or the service or both, make sure you register.

Beamer + Ipe + views...

At this point, mostly everyone is aware of how to use beamer to make presentations in LaTeX. However, many fewer people (mostly only geometry folk) are aware of Ipe, a kind of next-generation xfig.

You may stop reading right now if
  • you always use powerpoint for slides OR
  • you rarely have to use LaTeX in slides OR
  • you really hate non-WYSIWYG presentation software
Still here ? ok...

I wanted to share a workflow tip that I found quite useful when making slides with step-through animations. Suppose you have a sequence of slides in a presentation that unfold a figure step by step, or animate some algorithm execution etc. Ipe, coupled with a few LaTeX commands, provides a really nifty way of rendering the animation without jumps, misalignments, or misdrawings.

Ipe (and many other drawing programs) has the notion of a layer. More powerfully, Ipe also has the notion of a 'view', which you can think of as a (sub)set of layers. For example, if you have a drawing with layers 1,2,3,4,5, then view 1 might consist of {1,2,3}, and view 2 might consist of {1,2,5}, and so on.

What this means is that when you want to do step-animation, it's really easy. Each time you move to a new step, you create a new view in Ipe (which also usually creates a new layer), and you can select whichever subset of the current set of layers you want to render, as well as draw new ones.

Ipe stores all the views as separate pages in a PDF file, so your final animation consists of a multi-page PDF file. And now comes the cool part.

Suppose you want to include this sequence of views in a beamer slide, with each view appearing after the previous one in response to a mouse click. You need two things:
  • pdftk (which comes standard on most Linux installations) allows you to split a PDF file into multiple files (one per page), with any filename format you specify. For example, I have a command called 'splitpdf' that does this:
    pdftk $1.pdf burst output $1-%d.pdf

    which takes a file name.pdf and splits it into name-1.pdf, name-2.pdf and so on.

  • Next, you need the (standard) LaTeX package 'xmpmulti' which gives you the command \multiinclude. It allows you to include multiple figures that share a common prefix. So for example, to include all the figures created by the previous pdftk command, you merely write
    \multiinclude[start=1,format=pdf]{name}
    The 'start=1' option starts counting from 1 instead of 0, and you can also specify an end counter.

But the best part is when you instead use
\multiinclude[<+>][start=1,format=pdf]{name}
Now, the files are included with an automatic 'replace each by the next' mode (<+> is standard beamer notation for this). At this point, you have a sequence of animations ready to go. In fact, when I give lectures, I have a number of slides that just look like this:
\begin{frame}
\frametitle{Title}
\mi{name}
\end{frame}

where \mi is a macro I defined for the above multiinclude. Ipe does all the layering and viewing work for me, and multiinclude takes care of the rest. This has made making complicated animations really simple and fast.
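For completeness, here's a minimal sketch of what such a macro could look like (this is my reconstruction, not necessarily the exact definition I use; adjust the options to taste):

\newcommand{\mi}[1]{\multiinclude[<+>][start=1,format=pdf]{#1}}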

p.s if you're still wondering why one should use Ipe instead of xfig, the LaTeX integration in Ipe is superb. No nonsense with special flags, and pstex_t and craziness like that. You get WYSIWYG LaTeX inside Ipe, you can use whatever macros you have in your text, and the various nifty alignment tools make an Ipe drawing look really clean.

Friday, September 11, 2009

Memorial Workshop for Rajeev Motwani

Plans have been in the offing for a memorial workshop for Rajeev Motwani, and now the details are available. Below is the announcement (sent by Ashish Goel):
Dear friends,

As you might know, our dear friend and colleague, Rajeev Motwani, passed away in a tragic accident on June 5, 2009. We are holding a technical workshop titled

Randomized Algorithms: Theory and Applications

at Stanford University on Sep 25th, 2009, from 10am - 2:30pm to honor Rajeev's research in the area of algorithms and their applications. If you did not know Rajeev's research, please see http://www.stanford.edu/~ashishg/cgi-bin/RememberingRajeev/ for a brief introduction.

The workshop will take place at the Bechtel Conference Center in Encina Hall. Workshop attendees are also invited to a memorial celebration starting at 4:00pm at the Stanford Memorial church, followed by a performance by one of Rajeev Motwani's favorite bands, Indian Ocean. The registration URL, web-page information, parking directions, and talk titles will follow in a later email.

Registration will be free, but mandatory. Please feel free to bring this to the attention of any colleagues/students who you think might wish to attend, and send me an email if you have any questions.

Workshop program:

10 - 10:45 am
Welcome remarks
Retrospective by Richard Karp, UC Berkeley
Technical talk by Sanjeev Khanna, University of Pennsylvania

10:45 - 11:30 am
Retrospective by Jeff Ullman, Stanford University
Technical talk by Piotr Indyk, Massachusetts Institute of Technology

11:30 am - 12:15 pm
Retrospective by Chandra Chekuri, University of Illinois at Urbana-Champaign
Technical talk by David Karger, Massachusetts Institute of Technology

12:15 - 1:30 pm
Lunch for registered attendees

1:30 - 2:30 pm
Retrospective by Ron Conway, Venture Capitalist
Technical talk by Sudipto Guha, University of Pennsylvania
Technical talk by Aleksandra Korolova, Stanford University

The Scientific committee for the workshop consisted of:
Moses Charikar
Ashish Goel
Richard Karp
Prabhakar Raghavan
Tim Roughgarden
I'll update this post with the registration URL when it becomes available.

Wednesday, September 09, 2009

Geometry at ESA (guest post)

(ed. note: Jeff Phillips is a researcher in computational geometry, data analysis and statistics. He's also one of the newly minted CIFellows, and will be working with me starting in a week. He kindly agreed to file this report from ESA.)

ESA (along with ICALP) is one of the top two European conferences for Theory A. And since the ICALP deadline typically falls between the SoCG submission and notification dates, ESA is home to many of the best computational geometry talks. This year was no exception, with two "geometry" sessions, as well as many other geometry talks mixed into other sessions.

Another interesting pattern at ESA, which makes it different from other algorithms conferences, is the juxtaposition of theoretical and experimental work. The two tracks (A: Design and Analysis, B: Engineering and Applications) have separate submissions, but are not treated separately in the program. This leads to the fun game of trying to determine whether a talk was from track A or track B. Sometimes you are surprised when a talk starts with a nice theoretical result, and then the speaker presents several experiments to show the algorithm works well in practice, or mentions that it has already been added to CGAL.

I think this is a great practice and should be considered by other conferences such as SODA or SoCG. For instance, ALENEX talks could be mixed in with SODA talks. This would encourage more algorithms people to actually implement their algorithms, because it would allow them to submit to a separate track that gives more value to practical work. How would this not be a positive for our community and its perception by other parts of computer science?

A couple of related data points: there were 56 track A papers and 10 track B papers accepted, both at an acceptance rate of about 25%.

The invited talks followed the theme of theory and practice as well. Michael Mitzenmacher began with a very clear talk explaining many open problems in Cuckoo hashing. He has been working on small variations in the algorithm that lead to large differences in performance, and then has been going to great lengths to explain why these variations make such a large difference.

Erik Demaine gave a very entertaining talk describing how a certain line of his work has alternated between art and the theory behind the art. He also performed a couple magic tricks, reminding us that most of the best magic requires lots of research, and sometimes algorithms!

On the third day, Noam Nisan presented a great high level view of how Google's TV ad auctions work. Much of his talk served to point out all of the important details often ignored in theoretical analysis, but critical in implementing such a system effectively.

I'd like to note a few papers I found interesting. This is not by any means an exhaustive list, but just enough to give a taste of the (geometric) results.

Manuel Caroli and Monique Teillaud. Computing 3D Periodic Triangulations.
They presented a set of rules and an algorithm for computing triangulations in 3D which are periodic, that is, they tessellate to fill an infinite space, but can be defined as a complex within a unit cube. These triangulations are needed for many simulations where it is difficult to deal with boundary conditions, which can be avoided here by effectively having no boundary. Despite the usefulness of these triangulations, they had not really been properly formally defined until this paper. Furthermore, their algorithm has been implemented and will soon be in CGAL, if not already.


Mohammad Ali Abam, Mark de Berg, Mohammad Farshi, Joachim Gudmundsson and Michiel Smid. Geometric Spanners for Weighted Point Sets.
They study a size-n Euclidean point set with weights, so that the weighted distance between two points p and q is defined as
d_w(p,q) = w(p) + d(p,q) + w(q)
where d(.,.) is the Euclidean distance and w(.) is the weight of a point. This may, for instance, represent a potential road network where it takes w(p) time to get to the center of a city from its outskirts. This paper shows that typical spanner-type results can be obtained for a weighted point set using this weighted distance: i.e., in R^d a (5+eps)-spanner can be found with O(n) edges, and a (2+eps)-spanner can be found with O(n log n) edges.


Sumit Ganguly and Christian Sohler. d-dimensional Knapsack in the Streaming Model.
This paper studies the d-dimensional knapsack problem: given a set of n items, each with a value v_i and a d-dimensional size s_{i,1}...s_{i,d}, find a set of items I with maximum total value such that for all j \in [1,d], sum_{i in I} s_{i,j} <= 1. This problem is studied in the streaming model, which I found interesting because they are unable to store the actual solution, since it might have linear size. Instead, they approximate the optimal cost and present a "template" solution where they give an approximate size of each element in their approximate solution I'.

Bernhard Haeupler, Siddhartha Sen, and Robert E. Tarjan. Rank-Pairing Heaps.
A data structure paper, but one describing a structure that is very useful in geometry, among other places. This paper does not present any new results from a classical theoretical perspective. They describe a heap data structure which has the same running time for all operations as Fibonacci heaps. The contribution is that their data structure is considerably simpler than any variant of Fibonacci heaps; in particular, the heap never needs to be restructured. As an added benefit, their data structure easily outperforms Fibonacci heaps.
Other notes (mainly from business meeting): See also Michael Mitzenmacher's post.

  • There were 37 computational geometry submissions, 10 were accepted.
  • Mark de Berg (next year's track A chair) declared that next year the page limit will be STRICTLY ENFORCED, and he is not opposed to automatically rejecting papers if they play too many games to squeeze more text into the allotted number of pages. Be warned.
  • Next year, ESA 2010 (and ALGO) will be in Liverpool, England, and it was voted that ESA 2011 will be in Saarbrücken, which won over a bid from Greece. (ed: HOW ON EARTH ? !!!!!!)
  • Proceedings were not included in registration, but could be bought. On the second day, after politely asking Springer, participants were able to download a single PDF of the proceedings from a couple of USB keys. This was very useful, but it would have been better to have it earlier in the conference.

Things I missed (I had to leave partway through the third day).

Attached workshops that are part of ALGO ran on Thursday and Friday.

Friday, September 04, 2009

SODA 2010

Ironically (I'm on the PC), I'm probably the last one to post about the SODA acceptances (although I did tweet it a while ago).

In case you don't already know, one major change this year is the 20 page limit on final versions, brought about by the all-electronic proceedings. It's worth pointing out here that this is a major change that I don't think ANY conference (theoretical or otherwise) has put into place (I can say this because I had nothing to do with it :)). 20 pages is huge: there's no way you can wiggle out of writing full proofs at this point, and I hope that this trend will continue as more of our conferences go all-electronic.

One caveat: I don't know what the final version format is. It seems unnecessary to go with the butt-ugly Sheridan 2-column format, but we'll see.

On the papers:
I'm not going to comment on the papers, since I reviewed many of them. Suffice it to say that (even though it sounds like a cliche), I was impressed by the quality of the submissions, and there were many good papers that unfortunately couldn't make it. We're in the process of writing decision summaries for the authors: these are intended to capture more of the discussion surrounding a paper. Hopefully, they will give a better sense of what the reviewers liked and disliked about the paper than just the actual reviews.

On the process:
I felt a bit more disconnected from the overall deliberations this time, and maybe others felt this way too. I spent a lot of time on "my pile", but I didn't have a lot of time to look over papers that I wasn't in some way responsible for. Given the number of submissions, this is unavoidable I guess.

I actually think the reason I felt this way was because of the software interface. Instead of the (clunky) SIGACT server interface used in years gone by, we used the snazzy HotCRP interface. In almost every way, this is a vastly superior interface (and I've used EasyChair, the Microsoft CMT, and other packages as well). It feels lightweight, it has great searching and tagging capabilities, and most of the interfaces are one or two clicks from the home page. It cleanly combines email notifications and uploads/downloads with web-based editing, and I'd recommend it very strongly for anyone organizing a conference in the future.

The one feature the SIGACT server had that HotCRP doesn't is a landing page where you got a stream of all comments on all papers. It was a serendipitous way of picking up on discussions not related to papers you were in charge of, and I remember in the past getting interested in a discussion and a paper and actually reading and submitting a review myself. In HotCRP, you land at a page containing your reviews, and it doesn't give you a direct view into the global picture (except maybe for the PC chair).

One amazing feat of social engineering that HotCRP also does: at this landing page, it tells you how many reviews you've put in compared to the average. There was a time during the review process where I'd reload the page every few hours and see myself falling behind the PC average, increasing the pressure on me to submit reviews :).

Sunday, August 30, 2009

Spectral Clustering.

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)
I don't like you, and I like them, but maybe if we change, we can get along.
Spectral clustering is not clustering: it's actually a metric modification method. It's another way of capturing things that are close versus things that are far, using a more global method.

We saw how correlation clustering could be used to capture both the ``I don't like you'' and ``I like them'' aspects of clustering by assigning positive and negative weights to individual element pairs. However, it's not easy to come by such a labelling of pairs. More often than not, all we have is a distance function. We could threshold it, declaring all distances within some parameter r to be positive, and all distances larger to be negative, but this is completely arbitrary and depends on the choice of r.

Let's think of the two-class problem for now: we merely want to divide the points into two clusters. Suppose we were able to guess a function f that assigns one label (0) to points in one cluster and another label (1) to points in the other cluster. Then we could write down the cost of the clustering (in a correlation sense) by summing up, over all pairs, the difference $$f(i) - f(j)$$. In fact, just to make sure it's always positive, we'll square it, and thus will sum up the function $$(f(i) - f(j))^2$$. We'll also assume a weight function $$w_{ij}$$ on the edge $$(i,j)$$, so that what we're really summing up is $$\sum_{ij} w_{ij}(f(i) - f(j))^2$$.

This is where things start to get very interesting. It's not hard to see that this can be written in terms of a variant of the graph Laplacian, treating the f values as a vector. If we now allow the f values to be reals, rather than 0 or 1, we get a classical eigenvalue problem, in effect determining the second smallest eigenvalue of the Laplacian.

Once we do that, the corresponding eigenvector gives us an assignment for f. We can either merely take positive or negative values of f to group the data, or we can use the f values as a feature representation and recluster the data into two groups based on that.


This is the key insight of spectral ``clustering''. We perform a spectral decomposition of the Laplacian, and take the top k eigenvectors. Those vectors give us a k-dimensional feature representation of the data in an inner product space, and we can now recluster. The real question is: why should this feature representation help us in any way at all ?


Here's my take on what's really going on. We know that the graph Laplacian is a discrete equivalent of the Laplace-Beltrami operator on a manifold, which in turn (via the heat equation) captures the ``flow'' of material in a manifold. Return for a second to our cut-based view of the distance function. In essence, two points are close if there are many ways of getting from one to the other (i.e the min cut between them is large) and they are far if not. This is a flow-based way of thinking about distances, and in effect, what the Laplacian does is map the original data into a new metric space in which distance is based on flow, rather than just on metric distance.


This intuition can be formalized: if we switch now to the stochastic viewpoint, in which we're doing random walks on the graph, then the commute time between two vertices turns out to be precisely the Euclidean distance (squared: thanks, dsivakumar) between the feature vectors constructed from the eigenvectors and eigenvalues of the Laplacian.

... take a deep breath...

There are a number of different concepts swirling around here that are worth keeping track of. The cut-based viewpoint of node similarity in graphs lends itself naturally to a Laplacian-based treatment, which in turn leads us to random walks and the heat equation on manifolds. At a deeper level, the Laplacian of a graph, by virtue of how it acts on functions, tells us something very basic about the structure of the distances involved. For more on this particular aspect, you should check out the discussion 'Can you hear the shape of a drum'.


But it's very important to keep in mind that spectral clustering is not so much a clustering technique as a data transformation method. Although k-means is the clustering algorithm of choice once we get down to the feature representation, other methods could be used as well. Also, as relaxations go, this particular one (relaxing from the hypercube to the reals) isn't that great: the ``integrality gap'' can be quite high. It turns out that this doesn't matter terribly though.
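To make the pipeline concrete, here is a minimal sketch in Python of the transformation step (a sketch only: it assumes a precomputed symmetric weight matrix W, uses the unnormalized Laplacian, and follows the common convention of taking the eigenvectors with the smallest eigenvalues, skipping the constant one):

import numpy as np

def spectral_embedding(W, k):
    """Map n points to k-dimensional feature vectors via the graph Laplacian.

    W : (n, n) symmetric nonnegative similarity/weight matrix.
    Returns an (n, k) array whose rows are the new feature vectors.
    """
    d = W.sum(axis=1)                  # weighted degrees
    L = np.diag(d) - W                 # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)     # eigenvectors, in ascending eigenvalue order
    return eigvecs[:, 1:k + 1]         # drop the trivial constant eigenvector

# the rows of spectral_embedding(W, k) are then handed to k-means (or any
# other clusterer): the "clustering" part lives entirely in that second step.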


As an aside, the idea of using the graph Laplacian to define a ``stretched'' distance that looks Euclidean is not limited to this problem. The most "famous" version of this idea is the Laplacian eigenmaps work by Belkin and Niyogi in the realm of manifold learning. There's also work by Guibas, Ovsjanikov and Sun that uses similar ideas to create signatures of shapes. The key idea, once again, is that if you're on a manifold of some kind, the Laplacian gives you a handle on the shape of this manifold, as well as how to ``stretch it out'' with local coordinates to make it look flat (and therefore Euclidean). It's a useful trick to keep in your data analysis toolbox.

For further reading, Ulrike von Luxburg has a great survey on spectral clustering.

Saturday, August 08, 2009

Negative-type distances and kernels

This is a story of two constructions that actually end up being essentially the same thing, as I recently discovered.

1. Kernels

The story of kernels in machine learning goes somewhat like this:

Take a positive definite function $$K(x,y)$$ that captures some notion of similarity between objects (a standard example of such a kernel is the Gaussian $$K(x,y) = \exp(-\|x - y\|^2)$$). A positive definite function, btw, is like the generalization of a p.d. matrix: the integral $$\int f(x)f(y)K(x,y)\,dx\,dy \ge 0$$.

You can then construct a Hilbert space that captures the structure of the kernel. Specifically, for a fixed set of points S, construct a vector space from the basis $$\{k_x | x \in S\}$$, where $$k_x(y) = K(x,y)$$, and then define an inner product of two vectors in this space in the usual way: If $$v_a = \sum a_x k_x$$ and $$v_b = \sum b_x k_x$$, then $$v_a \cdot v_b = \sum a_x b_y K(x,y)$$.

You get nice properties from this construction: the so-called reproducing property is one of them, which tells you that $$K(x,y) = k_x \cdot k_y$$. In other words, we can capture the similarity function K(x,y) by a "standard" inner product in a Hilbert space.

What's even neater is that by invoking Mercer's theorem, you can construct an orthogonal basis, and make sure that the Hilbert space is actually a Euclidean space. The squared Euclidean distance in this space can be written in kernel form, as
\[d_K(x,y) = K(x,x) + K(y,y) - 2K(x,y)\]
which is what you'd expect when treating K(.) as an inner product.

2. Negative-type distance.
The second story comes from Deza and Laurent. When can a distance function be embedded in Euclidean space ? It turns out that it's more convenient to talk about the square of the distance function, which we'll call D.

There's an elegant characterization of when D can be embedded isometrically into $$\ell^2_2$$: this can be done if and only if D satisfies the negative-type inequality:
\[ \sum b_i b_j D(x_i, x_j) \le 0, \sum b_i = 0\]
for all possible assignments to $$b_i$$.

The proof works via a construction called a covariance mapping that takes $$D$$ to the function $$k_D$$ defined as:
\[ k_D(x,y) = (1/2)(D(x, x_0) + D(y, x_0) - D(x,y)) \]
which differential geometry folks will recognize as the Gromov product.

The proof completes by showing that the negative-type condition on D implies positive definiteness of $$k_D$$, and this in turn means that $$k_D$$ can be expressed as an R-covariance:
\[ k_D(x,y) = \int f_x(\omega)f_y(\omega) d\mu(\omega) \]
for some measure $$\mu$$.

Note that the RHS of the equation is an infinite-dimensional inner product.

3. Where the two stories come together
The mapping that takes a kernel to a distance is the inverse of the covariance mapping used to map a distance to a metric. In other words, if we take a kernel K, compute $$d_K$$, and then use the distance to kernel mapping to compute $$k_{d_K}$$, we get back K. Further, since we can show that the negative-type condition on D implies a positive-definite condition on $$k_D$$, we can start off with either "D satisfies negative-type inequalities" or "K is a positive definite kernel" and yield the same conclusion on $$\ell_2^2$$ embeddability.
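As a quick numerical illustration of this correspondence (a sketch with numpy; the Gaussian kernel, the random points, and the choice of base point x_0 are all just for demonstration): starting from a positive definite kernel K, the derived squared distance satisfies the negative-type inequality, and the covariance (Gromov product) mapping applied to that distance is again positive semidefinite.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))                  # 20 points in R^3

# a positive definite kernel: the Gaussian
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)

# kernel -> squared distance: d_K(x,y) = K(x,x) + K(y,y) - 2K(x,y)
D = np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K

# negative-type inequality: b^T D b <= 0 whenever sum(b) = 0
b = rng.standard_normal(20)
b -= b.mean()
print(b @ D @ b <= 1e-9)                          # True

# distance -> kernel: the covariance mapping with base point x_0 = point 0
kD = 0.5 * (D[:, [0]] + D[[0], :] - D)
print(np.linalg.eigvalsh(kD).min() >= -1e-9)      # True: kD is p.s.d.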

What's interesting is that the two ways of thinking about this don't seem to have been connected together explicitly before.

p.s This is not a fully rigorous correspondence except possibly in the finite dimensional case, but I find that it's often convenient to replace the negative-type argument with a kernel positive-definiteness argument in my head.

Sunday, August 02, 2009

Correlation Clustering: I don't like you, but I like them...

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

Whether it's k-clustering, or any kind of hierarchical clustering, the world we live in is still the world of ``I don't like you'' clustering, where the geometric landscape is defined by distances between points. Consider the following example:


There is an intuitive sense in which the first clustering is not ``real'' and the second one is: the idea of ``well-separatedness'' is a pervasive component of how a good clustering is perceived. But if we only measure distances between points, and only measure the cost of a clustering in terms of how costly each cluster is, we'll never be able to distinguish between these two examples.

What's needed is a way of declaring likes (similarity) as well as dislikes (distance), and then, critically:

penalizing similar items in different clusters AS WELL AS different items in the same cluster.

By that measure, we'd be able to distinguish the first and second clusterings, because in the first case, presumably elements close to each other that lie in different clusters will make the clustering look more expensive. This point is worth reiterating. Unless we have some way of penalizing mis-separations as well as mis-groupings, we'll always be at the mercy of the tradeoff between k and cost.

Continuing the "clustering via self-help metaphors" theme to these essays, I call this the "I don't like you, but I like them" way of modelling data.

Correlation Clustering

This is where correlation clustering enters the picture. The correlation clustering model is as follows: every pair of elements is assigned a 1 or -1, encoding a similarity or dissimilarity. The goal is to find a clustering (note: no k!) in which any pair of points in a cluster is penalized for being dissimilar, and any pair of points in two different clusters is penalized for being similar. This can be generalized to arbitrary weights: for each pair of elements you assign weights w+ and w-, with the possible caveat that w+ + w- = 1.


So now the goal is merely to minimize the cost of a clustering. The elegance of correlation clustering lies in the natural way that clusters merge or split depending on the number of similar or dissimilar pairs. Do note though that you need to have input data that can be written in that manner: thresholding a distance function will give you the desired input, but is an ad hoc way of doing it, since the thresholding is arbitrary.


There are a number of different algorithms for correlation clustering, and also a very simple one that yields a good approximation guarantee: pick a point at random, and pull in all its neighbors (all points similar to it). Repeat this process with a new unpicked point, until all points have been selected. This algorithm gives a 3-approximation for correlation clustering.
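Here's a minimal sketch of that pivot scheme in Python (my own illustration, with the similarity information supplied as a function on pairs; it's the algorithm as described above, not a tuned implementation):

import random

def pivot_correlation_clustering(points, similar):
    """Repeatedly pick a random pivot and pull in everything similar to it."""
    remaining = set(points)
    clusters = []
    while remaining:
        pivot = random.choice(list(remaining))   # a random unpicked point
        cluster = {pivot} | {u for u in remaining if u != pivot and similar(pivot, u)}
        clusters.append(cluster)
        remaining -= cluster
    return clusters

# toy example: a and b like each other, c dislikes both
labels = {frozenset('ab'): True, frozenset('ac'): False, frozenset('bc'): False}
print(pivot_correlation_clustering('abc', lambda u, v: labels[frozenset((u, v))]))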


Correlation clustering is also useful as a way of combining clusterings. We'll talk about this later (ed: how many times have I said that !), but the problem of consensus clustering is to ``cluster the clusterings'', or aggregate them into a single average clustering. An easy way to see the connection is this: given a collection of clusterings of a data set, create an instance of correlation clustering with the positive weight for a pair corresponding to the fraction of clusterings that ``vote'' for that pair being in the same cluster, and the negative weight being the fraction of clusterings that ``vote'' for that pair being separated.

Thursday, July 23, 2009

Clustering: Hierarchical methods

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

k-center, k-median, k-means, k-medoids, .... The list is endless, but that pernicious k comes up everywhere. What can we do about it ? There are really two ways to go:
  1. Figure out the "right" k for a problem. This is a complicated matter, and will be the topic of a later post
  2. Don't choose: give the user a universal representation from which they can decide what k they want.

This latter formulation takes us into the world of hierarchical clustering, the topic of this post.


There are two different ways to think about hierarchical clustering: a representational view and an algorithmic view. The algorithmic view is quite simple: let's try to design a clustering via greedy operations. In the top-down view, we find a good split of the data into two parts, and then recurse on each side. In the bottom-up view, we select two clusters for merging (in the beginning, all items are in separate clusters), and merge our way up.


Perhaps the most well known of the bottom-up methods is the 'single link' clustering algorithm, in which at each step, you merge the two clusters that are closest to each other. If this sounds familiar, it should: this is merely Kruskal's MST algorithm run partially to completion.


The "single-link" method is optimistic: it creates clusters via connected components, assuming that transitivity is enough to construct a cluster. The "complete-link" approach is more pessimistic: rather than defining the distance between two clusters as their nearest pair distance, it defines it as the furthest pair distance, ensuring a clique-like structure for clusters. Merging happens as before.


Other variants, designed to be more robust to noise, average the pairwise distances between clusters to obtain the inter-cluster distance. With the right choice of distance, this amounts to comparing the centroids of the clusters.
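The nice thing is that all of these variants are the same greedy loop; only the cluster-to-cluster distance changes. A naive (roughly cubic) sketch, purely for illustration:

from itertools import combinations

def cluster_distance(A, B, d, method):
    pair_dists = [d(a, b) for a in A for b in B]
    if method == 'single':
        return min(pair_dists)
    if method == 'complete':
        return max(pair_dists)
    return sum(pair_dists) / len(pair_dists)   # 'average'

def agglomerate(points, d, k, method='complete'):
    # Repeatedly merge the two closest clusters until k remain.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]],
                                                   clusters[ij[1]], d, method))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]   # j > i, so this index is still valid
    return clusters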


You'll notice that I didn't actually define a problem that these algorithms solve, in keeping with the grand tradition of clustering :). There is a way of defining a general optimization function that single-link clustering solves optimally. For the others though, although we can write down cost functions, we can't show that the algorithms optimize them.


So much for the algorithmic view of HC. The representational view goes somewhat deeper.


I've had at least one person (you know who you are) tell me that one of the only two clustering algorithms that worked for them is hierarchical clustering. And indeed, the idea is very seductive. Rather than present the user with a single k-clustering - a snapshot, if you will, of the data - we give them a tree of merges, in which there is one leaf for each object. The central idea here, and one that bears emphasizing, because it's so different from how we've thought about clustering thus far, is this:


we understand the structure of data from its transitions, rather than from its states.


What I mean is this: rather than defining a clustering as a static partitioning of the data, we watch the data ``evolve'' as we start merging points together, and use the merge history to figure out what the real structure is. There are a couple of ways of doing this: the most common way is to imagine drawing the tree from top to bottom, and then cutting it with an x-monotone curve (one that intersects any vertical line in exactly one point). This can be done to ensure we have k clusters, or it can be done at points where the merge appears to be ``stable'' (the subtree doesn't merge with any other subtrees for a long time).
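If you want to play with this, SciPy's hierarchical clustering module does exactly this kind of tree-cutting (a sketch assuming the standard scipy.cluster.hierarchy interface; check the docs for the precise options):

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(50, 2)                  # 50 points in the plane
Z = linkage(pdist(X), method='single')     # bottom-up single-link merge tree

labels_k = fcluster(Z, t=4, criterion='maxclust')     # cut so that we get 4 clusters
labels_h = fcluster(Z, t=0.15, criterion='distance')  # cut at a fixed merge height
# scipy.cluster.hierarchy.dendrogram(Z) draws the dendrogram discussed next.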


A particularly effective visual metaphor for this is the dendrogram, which is very popular in evolutionary tree analysis.



The dendrogram is visually appealing because it does two things: firstly, it depicts the tree of merges, permuting the leaves so that there are no crossings. Secondly, and more importantly, it uses the lengths of edges as a visual marker to indicate the 'time of merge' for clusters. If we think of starting at time 0 and merging clusters, then a cluster that sticks around for a long time will have a tall edge connecting it to its parent, in comparison with a cluster that gets merged quickly. (as an aside, visual metaphors are possibly underrated when it comes to thinking about clustering algorithms. My firm belief is that one of the reasons soft clusterings aren't used as much as hard clusterings is because it's hard to visualize them)


Sidebar:
And this is the key operational idea behind ``clustering from the transitions'': clusters that stay unmerged longer are likely to be more interesting than clusters that get merged quickly. Till recently, this was merely a heuristic way of thinking about hierarchical clusterings. However, we now have the idea of persistence that comes from the realm of computational topology. In short, persistence is a way of quantifying topological features of a shape that persist across different scales. Persistence has been studied extensively as a rigorous way of quantifying the triviality (or non-triviality) of features in a shape, and there's a recent paper that applies persistence-like concepts to clustering as well. It's still early days for this direction, but I think it's a promising one.


Returning to hierarchical clustering, one major problem with this approach is that it's local: make the wrong choice of merge early on, and you'll never get to the optimal solution for a k-clustering problem. And since there's no easy way to change your mind (i.e., split a cluster), it's hard to reverse bad decisions. If you're so inclined (and I am nowadays), this is the problem of finding monotone paths in the merge-split lattice on partitions of [1..n], but I digress...


Ultimately, the value of hierarchical clustering comes in its ability to represent an entire history of clusterings, rather than just one, and the relative ease with which a bottom-up algorithm can be written. That, coupled with its uses where you really do want to find a tree (evolutionary or otherwise) is what makes it a popular implement in the clustering toolbag.

Wednesday, July 15, 2009

Consistent BibTeX formatting

I try not to write BibTeX by hand any more: too easy to introduce errors. So I usually use either DBLP or the ACM digital library to get BibTeX for papers. Sometimes the journal has BibTeX, or some format that can be converted. As an aside, IEEE is extremely lame: you have to log in to their digital library even to get a citation !

For the most part, I don't need to go beyond ACM or DBLP, which is great. But here's the problem: their formats are different ! I needed the BibTeX for a recent paper of mine, and found it on both sites. Here's what ACM gave me:
@inproceedings{1516372,
author = {Ahmadi, Babak and Hadjieleftheriou, Marios and Seidl, Thomas and Srivastava, Divesh and Venkatasubramanian, Suresh},
title = {Type-based categorization of relational attributes},
booktitle = {EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology},
year = {2009},
isbn = {978-1-60558-422-5},
pages = {84--95},
location = {Saint Petersburg, Russia},
doi = {http://doi.acm.org/10.1145/1516360.1516372},
publisher = {ACM},
address = {New York, NY, USA},
}

and here's what DBLP gave me:
@inproceedings{DBLP:conf/edbt/AhmadiHSSV09,
author = {Babak Ahmadi and
Marios Hadjieleftheriou and
Thomas Seidl and
Divesh Srivastava and
Suresh Venkatasubramanian},
title = {Type-based categorization of relational attributes},
booktitle = {EDBT},
year = {2009},
pages = {84-95},
ee = {http://doi.acm.org/10.1145/1516360.1516372},
crossref = {DBLP:conf/edbt/2009},
bibsource = {DBLP, http://dblp.uni-trier.de}
}

@proceedings{DBLP:conf/edbt/2009,
editor = {Martin L. Kersten and
Boris Novikov and
Jens Teubner and
Vladimir Polutin and
Stefan Manegold},
title = {EDBT 2009, 12th International Conference on Extending Database
Technology, Saint Petersburg, Russia, March 24-26, 2009,
Proceedings},
booktitle = {EDBT},
publisher = {ACM},
series = {ACM International Conference Proceeding Series},
volume = {360},
year = {2009},
isbn = {978-1-60558-422-5},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
So as you can see, we have a problem. The formats are not consistent, which means that if I need to get some references from DBLP, and others from the ACM, my references file is going to look very irregular.

Other critiques:
  • I have never understood why DBLP splits up the conference and the paper: with BibTeX, if enough papers (two, with the default min-crossrefs setting) cite the same crossref, the crossref entry itself gets included as a reference, which is just strange.
  • Unless you protect words with an extra level of curly braces, capitalized words in titles get lowercased by many bibliography styles, which is mucho annoying: it's "Riemannian", not "riemannian".
  • The DBLP name for the conference is too cryptic: who'd even know what EDBT is outside the database community. On the other hand, the ACM citation is clunky, and is a page-length disaster waiting to happen.
Thoughts ?

Thursday, July 09, 2009

Quick note

(am posting this from 37000 ft. isn't technology cool ?)

The clustering series will be on hiatus for a few days while I wrap up some more pressing deadlines. It will restart two weeks from now.

Congratulations to Adam Smith and Sean Hallgren on getting a PECASE award. See what creating a new blog can do !

NSF EDA Workshop: Wrap up

This morning was wrap-up day. The theory group was "tasked" with coming up with a few slides suggesting avenues for further cooperation. Dick talked about creating (or re-creating) an environment in academia where researchers are building chips and talking to other experts down the corridor (like theoreticians) rather than the 'planted' model where a theoretician is planted in a corporate design environment for 6 months or so.

I talked briefly about what I called "new" theory: the twin ideas of randomization and approximation that have been so powerful in algorithm design (even for intractable problems), and how these play into the problems of dealing with massive high dimensional data.

Vijaya gave a synopsis of cache-obliviousness, multicore research, and her recent work in this area. There's a lot of interest in EDA and multicore work, and she made the argument that theoreticians are learning how to design efficient multicore algorithms and can be of great help in adapting EDA methods.

There were also four sub-panels that addressed various aspects of EDA. What was interesting to me was that many of the panels brought up "EDA needs in theory" that matched some of the things we talked about. For example, there's a pressing need for methods that run in linear (and even sublinear) time, tools that are incremental, in that they can progressively generate better and better solutions given more time, and tools that can parallelize well.

They also talked about providing benchmarks: for example, a 10,000 variable SAT instance and code that solves it. I thought (and said) that this would be an excellent way to get people dabbling in this area.

Randomization was a trickier matter. Although the EDA folks recognize the power of randomization, they are concerned about reproducibility. The DA pipeline is long and involved, and once you've fixed the output of a piece of the pipeline, you'd like to keep it in place and not have it change from iteration to iteration.

Of course this is something that can be fixed with seed control. For more portability, it's conceivable that you can cache the random coin tosses once you've settled on a reasonable solution to any piece in the pipeline.
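A tiny illustration of the seed-control point (the stage here is a made-up stand-in, not any real DA tool): if each stage of the pipeline draws its coin tosses from a private generator keyed by a fixed seed, reruns are bit-for-bit reproducible.

import random

def run_stage(stage_input, seed):
    # Hypothetical randomized pipeline stage: all randomness comes from a
    # private generator, so fixing the seed fixes the output.
    rng = random.Random(seed)
    return sorted(stage_input, key=lambda _: rng.random())

MASTER_SEED = 20090709
data = list(range(10))
assert run_stage(data, MASTER_SEED) == run_stage(data, MASTER_SEED)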

Overall, it was quite interesting, although exhausting. The general idea with such workshops is that the findings make their way into the next call for proposals (sometime in October/November), so if you have EDA people you'd like to collaborate with, this might be a good opportunity.

It's time to go home.

NSF EDA Workshop: A comment

Paul Beame had a brilliant comment on my post about the EDA Workshop, and I liked it so much I'm reproducing it in full here:

One fault is that we don't even teach the proper tools in our algorithms courses! There are relatively few algorithms texts that even support the teaching of backtracking as an algorithmic paradigm. Moreover, the sort of very nice work with provable small exponential upper bounds that some theory researchers (e.g., David Eppstein, Richard Beigel, Martin Furer) have done in backtracking algorithms for combinatorial problems is only a very small part of the landscape for algorithmic solutions to hard problems.

Too often, we in theory leave the important hard problems to those in AI and formal methods. One of the standard paradigms in practical verification is to attack problems that in their full generality are PSPACE-hard (or worse), find a useful NP subclass, apply a reduction from these problems to SAT, and use SAT solvers that involve worst-case exponential backtracking (or to a much lesser extent local search) algorithms that are fast in many cases. The key techniques used in these solvers are powerful heuristics that are usually studied in AI but rarely in algorithms research.

The potential for practical impact of improved versions of these algorithms (even with only polynomial speed-up) could be huge. Traditional algorithmic research has occasionally had something useful to say about these topics (such as Moser's recent beautiful analysis of variants of the random walk algorithm on the special classes of CNF formulas satisfying the Lovasz Local Lemma conditions) but it is somehow not viewed as part of the mainstream.

Applications in EDA are only part of the utility of some of the exponential algorithms used in formal methods. There has been a large and surprisingly successful body of work on software model checking that uses quite clever theoretical ideas. One of my favorite examples is the following: Though techniques for checking properties of communicating finite state machines (one of those nasty PSPACE-complete problems) had been in use for EDA since the early 1990's, model checking software presented an additional problem because of the need to handle the call stack. Suppose that one needs to check a safety property (i.e., an invariant). The key observation (which goes back to the early 1960's) is that the language consisting of the possible stack contents of a PDA is always a regular language for which an NFA can be easily constructed from the PDA! One can then apply a small extension of the previous algorithm to check safety properties (which must hold for all possible stack contents).

Wednesday, July 08, 2009

NSF Workshop: Electronic Design Automation

I'm currently at an NSF Workshop on Electronic Design Automation (the thing that used to be called VLSI design). I'm here as part of the 'theory audience' along with Vijaya Ramachandran and Dick Lipton (blogger power!).

Thankfully, Dick has posted an extensive summary of the day of talks, so I don't have to. Our mandate here is to listen in on the discussions and come up with a response that suggests avenues where theory folk might have something useful to contribute.

From a purely biased algorithms perspective, one thing that strikes me about the EDA community (at least in the formal methods/verification realm) is their unwillingness to give up in the face of beyond-NP-completeness. What I mean is this: most of my training in algorithms (and this is likely true for you as well) is in the polynomial-time regime: all the algorithmic paradigms we learn are effective at reducing the complexity of an algorithm from one polynomial to another.

When we engage with NP-hardness, we switch modes to approximations, and focus on the issue of quality: even the approximation algorithms themselves run in poly time. There are very few people (David Eppstein comes to mind) who work on algorithms in the exponential/subexponential realm and will worry about (say) reducing the base of the exponent for SAT or graph coloring or other hard problems.

The verification folks don't necessarily solve their (very hard) problems exactly, but they do design all kinds of tricks (and heuristics) to deal with these problems, because they actually need to solve them ! In my view, it wouldn't be a bad idea for students learning algorithms to learn at least a few tricks for designing algorithms that might run in exponential time, but are efficient. Remember that exponential might be better than n^100 for many values of n.

One thing that came to mind as I listened to talks: with the exception of a talk by Rupak Majumdar on faulty computations, and a talk by Ed Clarke (yes, that Ed Clarke) on statistical model checking (based on the Neyman-Pearson hypothesis testing framework), there was little talk of the role that randomization might have to play in the problems of EDA.

A second thought was how the lessons of massive data analysis might be useful in the realm of DA. One speaker described one critical problem as being the degree of complexity associated with current DA tools: there are over 4000 "knobs" to turn in one such tool ! It's believed that these knobs are not independent, and might even be contradictory. If we think of each "run" of the DA tool, outputting some kind of chip layout, as a point in this 4000+ dimensional space, I wonder whether techniques for dimensionality reduction and manifold analysis might be useful to find a set of "core knobs" that control the process.

I have to say that it's nice to attend a workshop with a community that throws out terms like NP-Complete, \Sigma_2, and PSPACE so freely :).

Friday, July 03, 2009

FOCS 2009 results.

The FOCS list is out (with abstracts). 73 papers accepted. Some papers that popped up on my radar screen (post more in the comments since I'll probably miss many good ones):
  • Matt Gibson and Kasturi Varadarajan. Decomposing Coverings and the Planar Sensor Cover Problem.

    This paper continues a line of work that I like to think we started in a SODA paper a few years ago. The problem is this: you're given a collection of sensors that monitor a region. But they have limited battery life, and they're also redundant (many different subsets of sensors have the range to cover the region). Can you schedule the sensors to go on and off so that at all times, the entire region is covered, and the region is covered for as long as possible ?

    Early work on this problem formulated it in a combinatorial setting, which immediately led to things like log n-approximability lower bounds via set cover variants. Our belief was that the geometry of the regions would make things a bit easier. We made some small progress, getting a sublogarithmic approximation ratio for intervals on the line, and a constant factor for ranges with special structure. Subsequent work made steady improvements, and this work has brought things down to a constant for general classes of regions in the plane.
  • Saugata Basu and Thierry Zell. Polynomial hierarchy, Betti numbers and a real analogue of Toda's Theorem

    Betti number computation has been a hot topic in computational geometry of late, but the idea of using Betti numbers to bound the (algebraic) complexity of basic geometric operations dates back to work by Dobkin and Lipton that relates the complexity of certain computations to the number of connected components of "invariant paths" in a computation. Ben-Or generalized these results to more complicated algebraic computation trees, and noting that "number of connected components" is the zero-th Betti number of the set of computations, asked whether higher-order Betti numbers might be useful for proving lower bounds. A series of results by Bjorner, Lovasz and Yao, Bjorner and Lovasz, and finally Yao showed that indeed the higher-order Betti numbers can be used to lower bound the complexity of algebraic computations.

    So Betti numbers are important as a measure of "counting" in algebraic complexity. This paper reinforces this intuition, by relating #P over the reals to the complexity of computing Betti numbers over semi-algebraic sets. Interestingly, this paper mentions the difficult nature of Toda's original proof; Lance recently tweeted about a new, very simple proof of Toda's theorem that he just published in ToC.

  • David Arthur, Bodo Manthey and Heiko Roeglin. k-Means has Polynomial Smoothed Complexity

    Read my post on k-means for more on the significance of such results. This paper finally resolves the smoothed complexity issue, giving a polynomial bound (expected).

  • Peyman Afshani, Jeremy Barbay and Timothy M. Chan. Instance-Optimal Geometric Algorithms

    An interesting result that seems to be able to take away order-dependence from some geometric problems.

  • T.S. Jayram and David Woodruff. The Data Stream Space Complexity of Cascaded Norms

    Cascaded norms (where you compute one aggregate on each of a collection of streams, and then compute another aggregate on these aggregates) are tricky for streaming algorithms, and this paper provides some interesting results. Again, I'm waiting for the document to show up, and I'd be interested in seeing the kinds of techniques they use.

Papers I'd like to learn more about:


Comments ?

Clustering: k-means...

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

k-means (also known as Lloyd's algorithm) is probably the most well known clustering algorithm, and as such, deserves some discussion of its own. We've already seen that it's part of the "I don't like you" family of clustering formulations. How does the method work ?

We start with a set of points in R^d. The algorithm is really simple: Fix a collection of k centers. Now perform the following alternating optimization till you reach a steady state: (1) assign each point to its nearest center (2) for each group of points assigned to the same center, compute a new center by taking the centroid of the points.
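In code, the whole thing fits in a few lines (a minimal numpy sketch; the initialization here is just k random input points, which is exactly the step the rest of this post worries about):

import numpy as np

def lloyd(X, k, iters=100, seed=0):
    # X is an (n, d) array. Alternate between (1) assigning each point to
    # its nearest center and (2) moving each center to the centroid of its
    # assigned points, until nothing changes.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign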

More importantly, what problem does this algorithm solve ? The quick answer is: none ! Yes, that's right: k-means gives no answer with any provable guarantee for a generic set of points for any cost measure. Pretty dismal, isn't it ? And yet, it's a really popular algorithm, a state of affairs that generally drives theoreticians to tears. So what's the deal ? The answer, as always, lies in the details, and is a good story of how theoretical work has managed to make some sense of an algorithm that was notoriously hard to analyze properly.

The underlying problem that k-means attempts to solve is the one I mentioned in the last post: let a cluster cost be the sum of squared distances to the cluster center, and minimize the sum of cluster costs over all k-clusterings. Now here are some of the bad things that we CAN say about k-means:
  • It might take exponential time to converge, even in the plane
  • It can get stuck in a local minimum that has an arbitrarily bad cost.

Running Time:
Let's first consider the problem of guaranteeing running time. Although k-means can take exponential time to converge, a smoothed complexity result says that w.h.p., k-means on a perturbed input will converge in polynomial time (for those not familiar with the concept, the smoothed complexity of a problem is its (expected) running time after the inputs have been perturbed randomly by a small amount). The smoothed complexity results suggest one reason why k-means converges quickly "in practice": the kinds of careful constructions (high dimensional, and low dimensional) needed to force superpolynomial convergence times are very artificial.


A crucial initialization:

So what about the quality of the results produced ? One unspecified aspect of the k-means algorithm is how the initial set of k centers is chosen. Two recent works have indicated that the key to extracting good performance from k-means is being very careful with this initial choice. The idea is very neat, and really should be getting more play in the experimental community than I think it does.

Recall the Gonzalez algorithm for the k-center problem: pick a point, and then its furthest neighbor, and then the point whose closest neighbor in this set is as far as possible, and so on. Consider the following randomized variant:
Pick the next candidate center with probability proportional to its squared minimum distance from the current set of centers.
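A sketch of this seeding rule (this is the initialization from the Arthur-Vassilvitskii paper, usually called k-means++; the code follows the numpy style of the Lloyd sketch above):

import numpy as np

def dsquared_seeding(X, k, seed=0):
    # First center uniformly at random; each later center is sampled with
    # probability proportional to its squared distance to the nearest
    # center chosen so far.
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

Plugging these centers into the Lloyd iteration, in place of a uniformly random initialization, is all that the second paper below requires.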
Papers by Arthur and Vassilvitskii, and by Ostrovsky, Rabani, Swamy and Schulman both show that this simple seeding strategy allows for provable guarantees on the quality of the solution returned by k-means. The actual results are slightly different, and give two different views of the behaviour of the algorithm:
  1. The Ostrovsky et al paper starts with the assumption that the point set is epsilon-separated: the intuition behind this definition is that the cost of a k-clustering is significantly better than the cost of a (k-1)-clustering (controlled by a parameter epsilon). Under these conditions, they show that the initial seeding gives a PTAS for the clustering problem.

    I'm eliding a number of details about their algorithm. One interesting feature of their algorithm is what they do after the initial seeding: rather than run Lloyd's iterative step, they fix a radius around each center (different for each center) and compute the centroid of the portion of the set within this ball. They also use a more complicated procedure to convert this into a PTAS, and in my mind that makes their approach a lot less practical.

  2. The Arthur/Vassilvitskii paper starts with the same initialization procedure, but they assume no epsilon-separation result. Their algorithm is simply: run the initialization, and then run regular k-means. The guarantee they get is weaker (an O(log k) approximation ratio), but they also provide empirical results showing the superiority of their approach compared to standard k-means. I think it'd be interesting to compare their approach with the 'ball k-means' approach of Ostrovsky et al to see which works better on benchmark data sets.


Another pleasing property of the k-means algorithm is its relationship to mixture modelling and the EM algorithm. I'll have more to say about EM later on, but the gist of the matter is this: the twin steps of computing new centroids and assigning points to their nearest centers have direct analogs in the world of maximum likelihood concepts and distribution-generated data. There is a general procedure that takes a given exponential family (Gaussian, Poisson and the like), and generates a distance function d such that the density described by the distribution can be expressed (up to normalization) as exp(-d(x, mu)), where mu is the mean of the distribution, and x is the point whose probability we're evaluating. ML folks will recognize Bregman divergences here, and the punchline is that for Gaussian distributions, the corresponding distance function is squared Euclidean distance (the exact function used in the k-means evaluation).

What does this all mean ? It means that the k-means process has a semantic meaning in the context of clustering data drawn from a mixture of Gaussians: what's even neater is that a k-means-like algorithm works for data drawn from other distributions, as long as you replace squared Euclidean distance by the appropriate (Bregman) distance function.
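A sketch of that last claim (generalized KL divergence is the Bregman divergence I'd reach for with nonnegative, count-like data; the fact that the arithmetic mean remains the right centroid holds for any Bregman divergence, which is what makes the recipe so clean):

import numpy as np

def generalized_kl(x, mu, eps=1e-12):
    # Generalized KL divergence: a Bregman divergence suited to nonnegative
    # data; squared Euclidean distance is the Gaussian analogue.
    x, mu = x + eps, mu + eps
    return np.sum(x * np.log(x / mu) - x + mu, axis=-1)

def bregman_kmeans(X, k, divergence, iters=100, seed=0):
    # Same alternating scheme as plain k-means; only the assignment distance
    # changes. The update step is still the arithmetic mean.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.stack([divergence(X, c) for c in centers], axis=1)
        assign = d.argmin(axis=1)
        centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                            else centers[j] for j in range(k)])
    return centers, assign

Calling bregman_kmeans(X, 3, generalized_kl) clusters count-like data the same way lloyd(X, 3) clusters Euclidean data.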

Apart from the simplicity of the algorithm, there are some aspects that make it appealing, regardless of the lack of guaranteed bounds. From a heuristic-design standpoint, k-means is a particularly attractive example of a technique that's often called alternating optimization.

Notes:
  • I was asked if there were standard data sets for benchmarking clustering algorithms. I don't often see papers doing this in a systematic manner, but a good source of data sets for experimentation is the UCI Machine Learning Repository: there aren't too many clustering data sets, but they can be found here (look for clustering under the default task).
Next up: hierarchical methods...

Sunday, June 28, 2009

Clustering: The "I don't like you" view

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

Any clustering problem starts with a data set. And you can basically tell what algorithm you're going to need by asking, 'what structure do I have on the data ?'. The bare minimum you need is some way of taking two items and returning a number that tells you how similar they are, or more usefully, how far apart they are.

Of course, if I know that A and B are a certain distance apart, and B and C are a certain distance apart, I might like to infer something about the distance between A and C. This is what the triangle inequality gives you, and so the most elementary form of clustering takes items and a metric defined on these items (humans actually DON'T think about distances in this manner, as many psychological studies have shown: more on that later on).

I find it convenient to view clustering algorithms from a pop-psychology self-help perspective, and so I'll deem algorithms that operate with a distance metric the 'I don't like you' methods. The general goal with such methods is to group the items into "clusters" where there isn't too much hatred :).

There are many ways to quantify this "lack of hatred". The one most natural to theoreticians is the 'minimize the worst-case angst' viewpoint: in other words, find k clusters such that over all clusters, the maximum pairwise distance between points in a cluster is minimized. This is the well-known k-center problem.

If you don't like the lack of robustness (one point can mess up the cost), you could instead sum up all the pairwise distances to assign the cost for a cluster. This gives a variant of what's normally considered the k-median problem.

A third way of measuring cluster cost is to sum the squares of distances, rather than the distances themselves. This gives you a cost function that looks a lot like what k-means attempts to solve.

It's often useful to define a representative of a cluster: in the k-center view of the world, the center of a cluster is a point whose maximum distance to all points in the cluster is minimized. Note that by the triangle inequality, this is within a factor of two of the k-center cluster cost, so it's often convenient to use this as the definition of the cluster cost (the cluster radius, instead of diameter, if you will). In the sum-of-all-costs view, the center correspondingly can be defined as the point whose sum of distances to all other points is minimized. Similarly for the case of sum-of-squared-costs.

Remarks on complexity:
There is an elegant 2-approximation for the k-center problem by Gonzalez. The algorithm itself is the essence of simplicity: pick any point in the space, take its furthest neighbor, take the point furthest from these two, and so on. The first k points picked in this manner yield the desired cluster centers, and a simple triangle inequality argument yields the 2-approximation.
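For completeness, a short sketch of that farthest-point procedure (the point list and the metric d are assumed inputs):

def gonzalez_k_center(points, d, k, start=0):
    # Farthest-point traversal: repeatedly add the point farthest from the
    # current set of centers; the first k picks are the centers.
    centers = [points[start]]
    dist_to_centers = [d(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=lambda i: dist_to_centers[i])
        centers.append(points[idx])
        dist_to_centers = [min(dist_to_centers[i], d(points[i], points[idx]))
                           for i in range(len(points))]
    return centers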

The k-median problem is a little harder to solve. The best known bound in a general metric space is a 3+eps approximation, and it's known that you can't get a PTAS. The approximation algorithm itself is not too hard: it's a local search heuristic that gets closer and closer to 3 as you permit larger and larger sets of centers to be swapped out.

Going to Euclidean space, or what's in a name ?
General metrics come either from some kind of similarity function between pairs of elements or from the induced shortest path metric on a graph (there are other metrics one can induce from a graph, and we'll talk about them later). But what if you have more information ? There's a general principle of clustering that's worth enunciating here: more structure on the data can lead you to more targeted algorithms that will probably work better, so resist the urge to be too general.

The most natural space to consider is a Euclidean space, and it is this space that explains the names 'median' and 'mean'. It's well known that on the line, the point minimizing the sum of distances to a set of points (the 1-median) is given by the actual median of the numbers. This explains why the problem of minimizing sum of distances is called the k-median problem. Similarly, the point that minimizes the sum of squared Euclidean distance is the centroid or the 'mean', giving rise to the k-means problem.

There's a long line of results on clustering under various measures in Euclidean space. In brief, for most of the above formulations, you can get a PTAS in Euclidean space, but you end up playing whack-a-mole with the parameters n, d, epsilon, and k (that is to say, something ends up exhibiting exponential dependence, and which parameter it is depends on what you care most about).

The Voronoi trick
But there's a general principle that governs much of the algorithm design for these problems. It's the so-called Voronoi trick, and can be summarized by pointing out that you can always improve the clustering cost by making sure that each point is assigned to its nearest neighbor. Effectively this means that the clusters define a Voronoi partition, with the center as a site, and all points within a Voronoi cell being assigned to the corresponding site. First of all, this means that the number of possible outcomes reduces from k^n (the number of ways of partitioning n things into k groups) to n^{kd} (since the number of Voronoi partitions is much smaller than the number of partitions). Secondly, it suggests a simple heuristic for improvement: Take a given clustering: reassign points so that each point is assigned to its nearest cluster center, and then recompute the center for the cluster associated with all points in the (erstwhile) Voronoi cell. Rinse and repeat....

This heuristic is the basis of the k-means algorithm, and is also the basis for a heuristic for the k-median problem clumsily named the k-medoids algorithm.

Notes:
  • all the above algorithms have the parameter k for the number of clusters. It's obvious why: all the above cost measures can be minimized by placing each item in a cluster by itself, but the resulting answer is meaningless, and "no man is an island"
  • In a general metric space defined over a finite set of points, the cluster centers will by default be elements of the input. This is often a desirable property, if you want representatives to be from the input. But in a smooth space like a Euclidean space, there's no reason to limit yourself to data points, and this creates a dichotomy between the so-called discrete and continuous variants of a clustering problem. Often, you can use the triangle inequality to argue that the answers don't differ significantly in cost, but it's important to ask yourself which variant you care about. The discrete problem can often be "easier" to solve, because you can enumerate over the choices, but this "easier" can be expensive: for example, minimizing the sum of squares is easy in a (continuous) Euclidean space, but is more expensive in a (discrete) metric space.
  • For some unknown reason, it's common to describe clustering problems by the algorithm used to "solve" them, which causes no end of confusion. I lost count of the number of times I had to point out that k-means is an algorithm that attempts to optimize a specific cost function. This is more than just pedantry, because unless you realize the difference between a problem and a specific solution, you'll never be able to conceive of an alternate approach, and you'll never even try to abstract out what's special about your problem. To theoreticians, this might be obvious, but it isn't really obvious IRL.
Next up: everything you ever wanted to know about k-means.

p.s this is written in a much more breezy style than I'd use for any formal document, but the organization and thinking reflects how I'd like to structure the material. Comments, as always, are welcome.
