Monday, December 31, 2007

Comte de Buffon, and a happy new year !

My personal resolution for the new year is that I will actually try to write down a larger fraction of the blog posts that are floating around in my head, especially the many that were born while I was sitting in my algorithms seminar.

The NYT has a travel feature on visiting the birthplace of Comte de Buffon, a contemporary and competitor to Carolus Linnaeus, who by this account played Hooke to Linnaeus' Newton. Of course, I (and probably most of my readers) know Buffon not for his work in biology and the natural world, but for the famous problem that bears his name and by some accounts gave rise to the area of geometric probability.

The Buffon needle problem is this: Suppose we drop a needle of length l onto a grid with lines spaced d units apart. What is the probability that the needle touches a grid line ? (let's assume for now that l < d; if l > d, the expression is more involved).
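For l < d, the classical answer is 2l/(pi*d), which is why needle-dropping can be used to estimate pi. Here's a quick Monte Carlo sketch of the experiment in Python (my own illustration of the setup, nothing canonical about the names):

```python
import math
import random

def buffon_crossing_probability(l, d, trials=1_000_000):
    """Estimate the probability that a needle of length l <= d crosses a grid line (spacing d)."""
    crossings = 0
    for _ in range(trials):
        y = random.uniform(0, d / 2)            # distance from the needle's center to the nearest line
        theta = random.uniform(0, math.pi / 2)  # acute angle between the needle and the lines
        if y <= (l / 2) * math.sin(theta):      # the needle reaches across the line
            crossings += 1
    return crossings / trials

# For l <= d the exact probability is 2l/(pi*d), so this should print roughly 2/pi = 0.6366...
print(buffon_crossing_probability(1.0, 1.0))
```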

Buffon's needle problem is a textbook example of geometric probability: doing probability theory on entities defined geometrically. This is a harder thing than one might imagine: as Bertrand demonstrated much later on, the very notion of a uniform random sample can be called into question. He showed this by posing a simple question:

Draw an equilateral triangle and circumscribe a circle around it. Pick a chord at random. What is the probability that the chord length is longer than the side of the triangle ?

It turned out that the "Pick a chord at random" step leads to all kinds of problems. Bertrand presented three different equally reasonable ways of picking a chord, and showed that each of the three ways led to a different probability. At the time, this was viewed as a paradox: it seemed strange that what appeared to be a well defined question would lead to three different answers. Of course, the modern view is that there is no paradox, merely a different choice of measure space to work with in each case.

But if that's the case, is there a "natural" choice of measure ? One elegant way to think about this is to view the random choice as a physical process that must obey certain transformation invariances. This leads inexorably to notions like the Haar measure: can we assign a measure to subsets of a group that is invariant under left (or right) group actions ?

The Haar measure has practical utility for anyone working in geometry. Consider the problem of choosing a random rotation (this comes up when doing Johnson-Lindenstrauss embeddings). In 2 dimensions, this is easy, since the space of rotations is the unit circle. However, in three dimensions or higher, it gets a bit trickier to define the correct measure to sample with respect to. Since rotations form a group, we really need a measure that is invariant under the group action, and a (unique up to scaling) Haar measure is guaranteed to exist in this case, yielding the desired measure to sample from (it's actually fairly easy to generate a random rotation algorithmically: the easiest way is to populate a matrix with independent normally distributed random variables, and then orthogonalize it).
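As an aside, here's a minimal numpy sketch of that last recipe (Gaussian matrix plus orthogonalization). The sign correction on the diagonal of R is the standard fix needed so that the QR decomposition actually yields a Haar-distributed matrix rather than one biased by the solver's sign conventions; the final determinant flip just lands us in SO(d) rather than O(d).

```python
import numpy as np

def random_rotation(d, rng=None):
    """Sample a d x d rotation matrix uniformly with respect to Haar measure on SO(d)."""
    rng = np.random.default_rng() if rng is None else rng
    G = rng.standard_normal((d, d))          # i.i.d. Gaussian entries
    Q, R = np.linalg.qr(G)
    Q = Q * np.sign(np.diag(R))              # rescale columns so the distribution is Haar
    if np.linalg.det(Q) < 0:                 # force det = +1 (a rotation, not a reflection)
        Q[:, 0] = -Q[:, 0]
    return Q

R = random_rotation(3)
print(np.allclose(R @ R.T, np.eye(3)), np.isclose(np.linalg.det(R), 1.0))
```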

But it all goes back to a needle, a grid, and a frustrated naturalist. Happy new year !

Wednesday, December 26, 2007

DARPA Mathematical challenges

Via Ars Mathematica and Not Even Wrong, a DARPA-issued list of challenges in mathematics for the 21st century. Some that puzzled me:
  • Computational Duality: Duality in mathematics has been a profound tool for theoretical understanding. Can it be extended to develop principled computational techniques where duality and geometry are the basis for novel algorithms?

    I'm not sure what this is trying to say, or whether I'm reading it wrong, because the story of linear programming, primal-dual schemes, and the Lagrangian is the story of using "duality and geometry as the basis for novel algorithms".
  • What are the Physical Consequences of Perelman’s Proof of Thurston’s Geometrization Theorem?
    Can profound theoretical advances in understanding three-dimensions be applied to construct and manipulate structures across scales to fabricate novel materials?

    Thurston's geometrization conjecture talks about the structure of 3-manifolds: i.e. spaces that locally look like 3-dimensional Euclidean space. What kinds of materials could be fabricated using this ?
  • Computation at Scale: How can we develop asymptotics for a world with massively many degrees of freedom?

    This sounds like there's a nice computational question lurking somewhere, but I'm not quite sure where.
Nice to see algorithmic origami and self-assembly also mentioned. I am particularly intrigued by the reference to the geometry of genome spaces.

Tuesday, December 11, 2007

On the dominance of the Ivies..

Via Cosmic Variance:

Exhibit A:

“One thing we all must worry about — I certainly do — is the federal support for scientific research. And are we all going to be chasing increasingly scarce dollars?” says Drew Gilpin Faust, Harvard’s new president.

Not that Faust seems worried about Harvard or other top-tier research schools. “They’re going to be—we hope, we trust, we assume—the survivors in this race,” she says. As for the many lesser universities likely to lose market share, she adds, they would be wise “to really emphasize social science or humanities and have science endeavors that are not as ambitious” as those of Harvard and its peers.

Exhibit B:
Mario Capecchi: 2007 Nobel Laureate, Ph.D Harvard (1967). Associate Prof. Biochemistry (Harvard) (1969-1971)
Moved to U. Utah (1973).

Through a series of bold experiments begun in the 1980s, Capecchi demonstrated that he could alter any gene in a mouse cell by replacing it with a modified version. At the time, scientists were skeptical that such altered DNA could be targeted to a particular gene. But Capecchi was not to be deterred. Indeed, his studies demonstrated that it is possible to replace an intact, functional gene with a modified version that can zero in on the corresponding DNA in the chromosome.
And finally:

Some 50 universities are located in the Boston area. Rather than collaboration, Capecchi felt that the thousands of researchers were working in isolation on projects that promised "immediate gratification." As he explained, "Everyone is so aware of what everyone else is doing. 'What's new?' was asked every day. That limits you to short-term returns, posing questions that you know can be answered in six months."

In contrast, the University of Utah in Salt Lake City offered "a relaxed atmosphere, where you could work on projects whose outcome may take 10 years. The relative isolation tends to make you more focused on the biological question you're working on.

"It was a good choice," said Capecchi of his decision to relocate to the U of U in 1973.

Friday, December 07, 2007

VLDB dogfight, and peer review

I guess this had to happen sooner or later, given how easy it is to publish a critique. I'm only surprised that more people aren't doing it.

Via Daniel Lemire comes a link to a page by John Wu where he rakes a VLDB 2006 paper over the coals and, in doing so, mounts a serious criticism of the review process that led to the paper being accepted. I don't have a dog in this fight, and know next to nothing about the area, but the criticisms are prima facie plausible, even if mitigated by the fact that he's the "injured party" in this argument. One idea that he proposes, and that Daniel endorses, is this:
Our first recommendation is for the program committee to assign one member to be responsible for a paper. To encourage committee members to take this responsibility seriously, the accepted paper should be printed with the name of the responsible person. Many journal articles are printed with a footnote to the effect of “Accepted on the recommendation of So-and-So.” Similar footnote could be printed on the final copy of the conference proceedings.
It's an interesting idea. I've heard this being proposed before, and apart from the fact that papers can often slip in without strong champions, I don't see too much that's wrong with it.

Neighborly polytopes and compressed sensing

I ran a seminar on approximate high dimensional geometry this semester, and there are many posts queued up based on interesting observations from the papers that were presented. This post is about polytopes, compressed sensing, and lower bounds for convex hulls.

Compressed sensing is a fascinating topic that (depending on how you look at it) is influenced by approximation theory, sketching techniques, compression, and geometry. Posts by Terry Tao and Igor Carron are good places to learn more about this, and the paper repository at Rice is another great resource. What I wanted to talk about in this post however, is an interesting result by Donoho and Tanner that connects back to structures that geometers are intimately familiar with, as well as helping me understand the nature of convex hull lower bounds.

Let's briefly sketch the motivating problem. Suppose you're given a signal s that you want to express in terms of linear combinations of basis functions drawn from a dictionary. A key idea in compressed sensing is that the dictionary itself is redundant, in that there are multiple ways of representing a signal via linear combinations of basis functions. A useful analogy is that of barycentric coordinates. We can represent a point in the plane uniquely via an affine combination of three non-degenerate points (i.e. a simplex), but this stops being true if we have more than three points that form the "basis".

Given that there are multiple representations, which is the best one ? The next important idea is that of sparsity: namely, that the number of coefficients needed is very small compared to the size of the dictionary. Again, returning to our analogy, it would be as if the dictionary had 100 points in it, and all we needed to determine was a representation using a linear combination of three points.

So one rendition of the compressed sensing problem is: Suppose there's a signal that you interact with via a 'measurement matrix' (which means you supply a vector of coefficients, and get the value of the inner product of the signal coefficients with the measurement coefficients; each row of the matrix is one such measurement). You retain the output values y; formally, you can write this as y = Ax, where x is the unknown signal vector. Can you find an x with the fewest number of nonzero coefficients (i.e. can you minimize the l_0 norm) that satisfies this equation ?

This problem is NP-hard; intuitively, you can encode SUBSET SUM in it by setting up an appropriate y vector. Suppose, however, that we relax the optimization and, rather than minimizing the l_0 norm, ask to minimize the l_1 norm of the target vector x. It turns out that this relaxed problem can be solved via linear programming.
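To make the relaxation concrete, here's a small sketch of the standard trick for turning min ||x||_1 subject to Ax = y into a linear program, by introducing auxiliary variables t with -t <= x <= t. It uses scipy's generic LP solver purely for illustration; a real compressed sensing code would use a specialized solver, and the function name is mine.

```python
import numpy as np
from scipy.optimize import linprog

def l1_minimize(A, y):
    """Basis pursuit: minimize ||x||_1 subject to Ax = y, via an LP over the variables [x; t]."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])   # objective: sum of the t_i
    I = np.eye(n)
    A_ub = np.block([[I, -I], [-I, -I]])            # x - t <= 0  and  -x - t <= 0, i.e. |x_i| <= t_i
    b_ub = np.zeros(2 * n)
    A_eq = np.hstack([A, np.zeros((m, n))])         # the measurement constraints A x = y
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * n + [(0, None)] * n)
    return res.x[:n]
```

When the neighborliness condition discussed below holds, the x returned here is exactly the sparsest solution.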

Under what conditions then can the l_1 solution yield the l_0 solution ? Suppose that we could reconstruct the vector x uniquely from y. In that case, it doesn't really matter what cost function we use: if there's only one solution, then we'll find it either way.

This is where neighborly polytopes enter the picture. A polytope in R^d is said to be m-neighborly if every set of m vertices spans a face of the polytope. Note that the simplex is always d-neighborly: essentially this definition tries to capture the idea of a "simplicial core" of a polytope: crudely put, an m-neighborly polytope has an m-simplex encoded within it somewhere.

If we think of the measurement y as being a point reconstructed from m coefficients, then if y lies on the m-simplex, there are unique coefficients we can assign to the vertices of the simplex to reconstruct it. Donoho and Tanner showed that this is essentially sufficient: namely that the polytope AT^d (where T^d is the d-simplex) is m-neighborly if and only if there's a unique reconstruction of y (implying that the l_1 optimization yields the desired sparse vector).

This is quite an elegant result in its own right: further, they generalize this to a "robust" variant, where the polytope might be approximately m-neighborly, in which case they show that with high probability a unique solution can be reconstructed (this latter result resembles the Johnson-Lindenstrauss theorem). What's interesting about these results is that they connect a very old concept in geometry (neighborly polytopes) to a reconstruction algorithm in compressed sensing.

A side effect of understanding neighborly polytopes from this perspective is that I finally saw why cyclic polytopes "work" as a lower bound for estimating the size of a convex hull. We can ask what the largest value of m can be for which some d-dimensional polytope is m-neighborly. It turns out that the answer (for anything other than a simplex) is m = d/2 (rounded down), and the example is the cyclic polytope.

Why is this interesting ? If we go back to the intuition of m-neighborly polytopes containing an m-dimensional simplicial core, it becomes clear. All vertices of a simplex lie on the convex hull. If a polytope on n vertices is m-neighborly, every choice of m vertices spans a face, so it should come as no surprise that its convex hull has size roughly n^m. We thus recover the n^{d/2} lower bound on the size of a convex hull in d dimensions from this fact.
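You can see this growth empirically by putting points on the moment curve and counting facets. Here's a quick sketch using scipy's qhull bindings (the range of the parameter t is arbitrary, and larger inputs or dimensions may need qhull's joggling options):

```python
import numpy as np
from scipy.spatial import ConvexHull

def moment_curve_points(n, d):
    """n points on the moment curve t -> (t, t^2, ..., t^d); their hull is the cyclic polytope."""
    t = np.linspace(1.0, 2.0, n)
    return np.column_stack([t ** i for i in range(1, d + 1)])

for d in (2, 3, 4):
    pts = moment_curve_points(20, d)
    print(d, len(ConvexHull(pts).simplices))   # facet count grows like n^{floor(d/2)} for fixed d
```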

Thursday, December 06, 2007

Leibniz Prize

Via an anonymous commenter comes the news that Susanne Albers is one of the 2008 Leibniz Prize winners. This is the most prestigious science award in Germany (and that's all of science, not just computer science). Going back over the list of past winners, it seems (and everything's in German, so I could be wrong here) that the last theory prize winner was Gunter Ziegler, back in 2001. Susanne Albers, for those who don't know, works in online algorithms, approximations, and algorithmic game theory.

From the Leibniz Prize page:
The Leibniz Programme, established in 1985, aims to improve the working conditions of outstanding scientists and academics, expand their research opportunities, relieve them of administrative tasks, and help them employ particularly qualified young researchers. Up to ten prizes are awarded annually with a maximum of €2.5 million per award. The prize is awarded from a slate of nominations put forward by third parties. The prizewinner is selected by the Joint Committee on the basis of a recommendation from the Nominations Committee for the Leibniz Programme.
The Leibniz Prize is administered by the DFG, which according to Wikipedia translates to the German Research Foundation, which would make it the equivalent of the NSF, and the Leibniz Prize roughly comparable to the NSF's Waterman Prize.

Update: here's the announcement (in German)

Academia vs Industry: Hierarchies...

Universities, and the people within them, often like to project the image of an egalitarian society. "This is your university", "A university is made up of professors", "WE run the place", and so on. Nowadays, in the era of 'university-as-business-enterprise', these buzzphrases are often supplemented by "we're all partners in this enterprise" and other business metaphors.

There's one obvious way in which this is fiction: the tenured/non-tenured divide is vast, and dominates the thinking of the untenured in many ways. Continuing the business metaphors, the tenured/untenured divide is like the partner-associate divide at law firms or investment banks.

But factoring that out, there are more subtle hierarchies: the administrative pathway (chair to dean to provost) is one obvious one, and there are class hierarchies based on funding output, fame, and other parameters like that.

In itself, this is not particularly noteworthy: where humans exist, hierarchies are sure to follow. What I find interesting to observe is how the lack of an institutionalized hierarchical structure (that we see in a company) removes some of the obvious external cues that help us slot into our various positions in the hierarchies. In other words, you have to be reasonably socially aware to pick up on the signals that indicate hierarchical patterns in academia, and once you do, you have to constantly remind yourself that they exist, otherwise you run the risk of silently crossing lines that you shouldn't be crossing.

The key issue here is that the natural tendency to form hierarchies is in this case going counter to the professed semi-fiction of equality, and it's the conflict between the two views that makes for an interesting social dynamic.

In all the ink that gets spilled about academia vs industry, this is one of the more subtle issues that crops up. I doubt it makes a difference to anyone's decision on which way to go, but it's curious nonetheless.

Wednesday, December 05, 2007

Tenure-track position in geometric computation at McGill

Contact McGill Geometry for more details. Here's the summary:
The School of Computer Science at McGill University invites applications for one tenure-track position at the assistant professor level, to begin August 1, 2008, in the general area of geometric computation.

The successful candidate must have a strong theoretical foundation and should be active in research on geometric problems and their applications.

Complete pdf format applications, including a curriculum vitae, a list of publications with copies of one or two sample reprints, a research statement as well as a teaching statement, and the names and e-mail addresses of three references should be sent to geometry@cs.mcgill.ca.

Applications will be reviewed as soon as they are received. No applications will be accepted after January 15, 2008.

The School of Computer Science offers a collegial environment with opportunities for interaction with world class researchers in areas including (but not limited to): computational geometry, discrete mathematics, mobile robotics, computer vision, computer graphics, bioinformatics, cryptography and quantum information, reasoning and learning, and scientific computing.

For further information on the School, see: http://www.cs.mcgill.ca.
Deadline is Jan 15, 2008.

2007 ACM Fellows

A huge crop of algorithms/complexity folks have been named ACM Fellows in this year's batch of awards: among them
  • Rajeev Alur
  • Avrim Blum
  • Andrei Broder
  • Danny Dolev
  • Rodney Downey
  • Tao Jiang (thanks, Anon)
  • Lance Fortnow
  • Rajeev Motwani
Congratulations to all !

(12/6): A special congratulations also to Rajeev Motwani (my research "father") and Lance (my blog "father") :)

Monday, December 03, 2007

A question on quantum debugging

As my algorithms class winds down, I've been doing some "topics" lectures, and today I attempted not to make a fool of myself while giving students a very high-level view of quantum computing, with much help from Dave Bacon's and Umesh Vazirani's lecture notes, as well as the Nielsen/Chuang book.

One question came up that I thought was interesting, and I'm hoping that Pontiffs in Shtetls can help answer it :). One of the "features" of programming a randomized algorithm with pseudorandom bits generated by a seed is that you can reproduce a program run by reusing the same seed, and hence the same trace. This is of course handy when debugging.

Is there any way of doing this with a quantum algorithm ? I.e. is there any way to run a quantum algorithm twice and get the same answer ? Maybe using entangled bits in some strange way ?

Friday, November 30, 2007

The true meaning of peer review

from Cosma Shalizi:
passing peer review is better understood as saying a paper is not obviously wrong, not obviously redundant and not obviously boring, rather than as saying it's correct, innovative and important.
Of course, the contrapositive of this is that an obviously wrong, redundant and boring paper has some nonzero chance of being rejected ;)

Interesting new blogspam method ....

When you've been writing a blog for a while, you'll discover what appear to be automated blogs that shamelessly steal your posts and repost them as new posts verbatim. I can only imagine that they do this to get high hits (and therefore advertising revenue) from Google, although why someone would take *my* posts for this purpose escapes me.

A new form of blog plagiarism combines the best of spam techniques with post copying, so as to evade the obvious duplicate detection methods. Consider the following examples:

The original post:
There's been a lot of grief over the reduction in price for iPhones only a few months after they were released. Wired magazine interviewed people who don't regret paying the higher price for the virtue of being an early adopter/arbiter-of-cool, and this comment caught my eye
The modified post (and no, I'm not going to link to it):
There's been a number of visibility at the decline master p scholomance for online in very couple weeks and we were released. Wired magazine interviewed people who don't regret paying a low frequency of the world as both an early riser and perhaps comment and the eye
The original post:
Suppose you're given a metric space (X, d) and a parameter k, and your goal is to find k "clusters" such that the sum of cluster costs is minimized. Here, the cost of a cluster is the sum (over all points in the cluster) of their distance to the cluster "center" (a designated point).
The modified post:
Lil romeo you're given a multidimensional space (X, d) and a parameter k, and a buff being use find k "clusters" such as lump- end of the effect is minimized. Here, lil romeo use 70million a target wikipedia, the sum of these items out the amount of the need codify the cluster "center" (a designated point).
And so on. What's more mysterious is that this isn't straight plagiarism: the post credits my post as the "source", although given the garbage that's generated, I'm not sure I want that "source" credit :).

Very mystifying...

Saturday, November 24, 2007

Black Friday -- online

I'm on a cable modem, and thus I share my network connection with my neighbours. Ever since Friday morning, my connection has been awfully slow during the day (and much faster in the early AM). I wonder if the whole 'Black Friday, crowded mall' phenomenon is moving online.

Tuesday, November 20, 2007

Author contributions

In this conference season, it seems like a good time to think about the issue of author contributions. Here are some of the methods for author listing used in computer science:
  • Alphabetical (the norm in theoryCS and communities heavily influenced by it)
  • Advisor-student: student name comes first, advisor comes last.
  • Work-ordering: first author did all the work, last author provided the funding, intermediate authors ordered by contribution (common in systems work)
Of course, there are combinations: I've seen alphabetical modified by advisor-student ordering, or alphabetical with sub-level work-ordering pieces etc, although the latter is not common in theoryCS. The arguments for and against the various methods are well known I imagine:
  • alphabetical ordering conceals who did the real work ("but the cream rises to the top")
  • Work ordering allows advisors to slap their names on anything coming out of their lab ("But they did all the work needed to fund this research: grant writing, selling, etc")
  • alphabetical gives undue eminence to authors whose names begin with 'A', (but see here for a counter-theory)
  • alphabetical ordering makes authors with names in the end of the alphabet lazy (Who're you looking at, punk ?)
  • theory papers can't be work-ordered, because ideas "float in the air without ownership" (yes, I have heard this argument many times, so don't snicker)
But what is perhaps more interesting is the problem of assigning author contributions, especially in systems like the alphabetical one where author contributions are not signalled by placement. Luca Aceto points out an unusually detailed description of author contributions, and speculates as to what a similar note might look like on theory papers.

He also links to an interesting set of axioms that Hardy and Littlewood developed before their famous collaboration started. I actually admire this approach. As a general rule, theory folks tend to be very allergic towards detailed discussions about contributions and what-not, but in collaborations, it's critically important to lay out the ground rules in advance, and the Hardy-Littlewood axioms are good because they lay out rules in advance that eliminate any problems later on: for example,
And, finally, the fourth, and perhaps most important axiom, stated that it was quite indifferent if one of them had not contributed the least bit to the contents of a paper under their common name . . .
Agreeing to such an axiom requires great trust between the collaborators, because of the immense potential for abuse, and maybe that's exactly the point: a good collaboration requires trust, and you can't agree to such an axiom unless trust already exists.

Monday, November 19, 2007

My first ipelet: 1-medians

I just wrote my first ipelet, and it was an amazingly pleasant experience. Not to mention the thrill I got from marking a collection of points and whoosh! getting the answer right there on the screen.

The ipelet is for computing the discrete and geometric 1-median of a set of points. The geometric 1-median of a set of points (also called the Fermat-Weber-Steiner-Torricelli-< insert your name here > point) is the point that minimizes the sum of distances to a set of points. The discrete 1-median is the same minimization, restricted to points in the input.

Computing the discrete 1-median of a set of points in the plane is trivial in n^2 time, and I didn't do anything clever there. Computing the geometric 1-median is decidedly nontrivial though: there is no closed-form solution beyond 4 points, and it can be shown that the point cannot in general be computed using arithmetic and k-th roots.

There are various approximation schemes: the most pertinent one here (in the plane) is due to Bose, Maheshwari and Morin, and uses a cone decomposition of the plane. In practice, researchers often use an iterative method developed by Weiszfeld; it's a simple scheme built around the fact that the optimal solution must be a point where the sum of unit vectors from it to the input points is zero (formally, this is the condition that the gradient vanishes).

The cost function is convex and differentiable except at the input points themselves. Thus, the Weiszfeld iteration is guaranteed to give the optimal solution as long as it doesn't get stuck at one of the data points. This algorithm is often referred to as the "founding algorithm of location science", and there are numerous modifications, extensions, and refinements. The ipelet implements the basic Weiszfeld scheme with a fairly brain-dead termination condition.
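For the curious, here's roughly what the iteration looks like, as a Python sketch (the same idea as the ipelet, not its actual source, and with an equally brain-dead termination condition):

```python
import numpy as np

def weiszfeld(points, iters=1000, eps=1e-7):
    """Approximate the geometric 1-median by Weiszfeld's iteration."""
    pts = np.asarray(points, dtype=float)
    x = pts.mean(axis=0)                          # start at the centroid
    for _ in range(iters):
        d = np.linalg.norm(pts - x, axis=1)
        if np.any(d < eps):                       # (almost) sitting on an input point: stop
            break
        w = 1.0 / d                               # weights from the vanishing-gradient condition
        x_new = (pts * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < eps:       # brain-dead termination condition
            x = x_new
            break
        x = x_new
    return x

print(weiszfeld([(0, 0), (1, 0), (0, 1), (2, 2)]))
```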

What possessed me to do such a thing ? Well, I'm trying to find counter examples for a conjecture involving 1-medians, and needed a way of visualizing examples :).

The code is here: feel free to use it, comment on it, modify it, etc.

IKEA = NP ?

Brian Hayes, in a post with a very catchy title, makes this painful, and yet not-off-the-mark comment about complexity class nomenclature:

The sad truth is, the naming conventions for furniture at Ikea make for a more consistent language than those of complexity theory.

Ouch ! For more on this, read Lance's post from way back when.

Tuesday, November 13, 2007

A day in the life...

Here's a list of things I did today:
  • I taught one lecture of my class
  • I attended a departmental committee meeting
  • I had a meeting with collaborators to work out a paper outline for something we're submitting
  • I met with my student and did some nontechnical advising
  • I had a (brief) discussion with a collaborator about a pending grant proposal.
In other words, I covered the entire gamut of activities one might expect of an academic - with one significant exception. Did you notice it ?

Nowhere in there was any actual research done ! Gaaaah !

Tuesday, November 06, 2007

"As good as your best result"

Mihai makes the argument that in theoryCS, you're as famous as your best result. I think the number of good results does matter a bit more than that, but getting well known for your best result is the best way to make a first splash and get on the radar.

I've actually heard the opposite claim made by systems researchers, and this has been extended to other empirical researchers as well. Namely, "you're as good as your worst result" (presumably thresholded to ignore grad school). The rationale here appears to be that in the conservative empirical sciences, where a badly designed experiment can cast a shadow on all your later work, it's your worst result that matters.

I can see how this works (sort of): in the more mathematical disciplines, a "proof" can to a large extent be validated independent of the author, so mistakes in past proofs can be tolerated (though if this becomes an endemic problem, then ...). Louis de Branges is famous for his proof of a conjecture in complex analysis known as the Bieberbach conjecture, and is currently well known for his claimed proof of the Riemann Hypothesis (Karl Sabbagh's book on this has more on de Branges). His proof of the Bieberbach conjecture was initially met with some skepticism from other mathematicians, because of some earlier false starts. However, it was soon seen as correct, and has opened up new areas of mathematics as well. As a consequence, his claimed proof of the Riemann hypothesis has received more cautious consideration as well (although I am not aware of the current status of his claims).

On the other hand, validating empirical work is largely on trust: you trust that the scientists have made faithful measurements, have kept proper lab notes etc, and I can see how a weakening of that trust can lead to much higher skepticism.

In this light, it's interesting to read Malcolm Gladwell's latest article for the New Yorker. He profiles the "profilers": the psych experts who help the FBI track down serial killers, and comes to this conclusion:
He seems to have understood only that, if you make a great number of predictions, the ones that were wrong will soon be forgotten, and the ones that turn out to be true will make you famous.

Sunday, November 04, 2007

Wednesday, October 24, 2007

Catchy: "Put the Turing in Manufac-turing"

From an op-ed piece by Ken Goldberg and Vijay Kumar:

But U.S. manufacturing today is where database technology was in the early 1960's, a patchwork of ad hoc solutions that lacked the rigorous methodology that leads to scientific innovation. That all changed in 1970 when Ted Codd, an IBM mathematician, invented relational algebra, an elegant mathematical database model that galvanized federally funded research leading to today's $14 billion database industry.

Manufacturing needs the same treatment. Just as the method to add two numbers together doesn't depend on what kind of pencil you use, manufacturing abstractions can be wholly independent of the product one is making or the assembly line systems used to assemble it. Another precedent is the Turing Machine, an elegant abstract model invented by Alan Turing in the 1930s, which established the mathematical and scientific foundations for our now-successful high-tech industries. Without Turing's theoretical work, the system that typeset this line wouldn't exist.

What's needed today is an analogy to the Turing Machine for design, automation and manufacturing. Recent developments in computing and information science have now made it possible to model and reason about physical manufacturing processes, setting the stage for us to "put the Turing into Manufacturing". The result, as was the case with databases and computers, would be higher quality, more reliable products, reduced costs, and faster delivery

As they say, RTWT.

Untangling a geometric graph

Planarity is a rather addictive game developed by John Tantalo. The idea is that you're given a planar graph, drawn with lots of crossings, and your goal is to move vertices around so as to recover a planar embedding. The game can get surprisingly challenging, and has even spawned a photoshop tribute called, 'Planarity, My Anti-drug'.

If we put on our theory hats, the first question that pops out is: how many vertices might you need to move in order to go from an arbitrary embedding of a planar graph to a planar embedding ? This process, of moving vertices around to remove crossings, is called untangling. Technically, H is an untangling of G if H is crossing-free and is isomorphic to G (in other words, untangling might allow more general moves than the sliding allowed by Planarity). A vertex of G is said to be fixed if it is mapped (by the isomorphism) to a vertex of H with the same coordinates.

Back in 1998, it was first asked whether a graph could be made planar while keeping a constant fraction of vertices fixed. The answer, shown by Pach and Tardos, was no, via a graph in which at most roughly n^{2/3} of the vertices could stay fixed (which is one way of explaining the difficulty of Planarity). Of course, the next question was whether it was the case that all planar graphs could be untangled by fixing at least some n^epsilon vertices.

A recent paper by Bose, Dujmovic, Hurtado, Langerman, Morin and Wood answers this question in the affirmative, showing that any geometric planar graph can be untangled while keeping at least n^{1/4} vertices fixed. The proofs involved are intricate, but are graph-theoretic and "elementary" in nature.

Wolfram (2,3) machine exists

Back in May, Bill and David discussed the new Wolfram challenge: Show that a particular 2-state 3-color TM is universal. Apparently, this question has now been resolved affirmatively, by Alex Smith, an undergraduate at the University of Birmingham.

Read more about it here, and here.

(via BB)

Monday, October 15, 2007

I foresee mass murders of cell phone users...

From the NYT:
It’s been a long time, but Led Zeppelin, one of the last superstar acts to refrain from selling its music online, is finally offering its catalog to digital-music fans. The shift by Led Zeppelin, whose reunion concert in London next month has already incited a frenzy for tickets, highlights the clout of digital sales in the music market as mass merchants reduce the shelf space devoted to compact discs. Under a series of new agreements expected to be announced today, the band will make its songs available first as ringtones and similar mobile features starting this week in an exclusive deal with Verizon Wireless.
And if you don't believe me, watch Wayne's World.

Thursday, October 11, 2007

Popularizing computing

Brian Hayes writes for the American Scientist (and on his blog, bit-player), and in my mind is one of the best writers of popular material about computing. Every time I read something he's written, I learn something new, even if it's about a topic I'm familiar with. What's even more impressive is that he often focuses on the mathematical aspects of computing and finds interesting ways to explain them to his audience.

Two recent articles of his are exemplars:

1. Sorting out the genome, from September-October 2007. Here's a key extract:
Given two arrangements of a set of genes—say a b c d e f g and f e b a g c d—how do you determine what sequence of reversals produced the transformation? This example has a three-step solution, which you can probably find in a few minutes with pencil and paper. For larger genomes and longer chains of reversals, however, trial-and-error methods soon falter. Is there an efficient algorithm for identifying a sequence of reversals that converts one permutation into another?

The genetic reversal problem lies at the intersection of biology, mathematics and computer science. For some time, the prospects for finding a simple and efficient solution seemed dim, even with the most powerful tools of all three disciplines. But the story has a happy ending. A little more than a decade ago, computing gene reversals was still a subtle research problem; now it can be done with such ease that it's a matter of routine technology. If you need to know the "reversal distance" between two genomes, you can go to a Web site and get the answer in seconds.
The article proceeds to lay out the entire history of sorting with reversals, pancake sorting, and related problems. It's a great case study of algorithm design in an application domain: how to model a problem, come up with algorithms, discover that the model doesn't quite capture what you want, rethink the model, come up with more algorithms, and so on. It touches on dynamic programming, greedy algorithms, NP-hardness, and approximations. I liked it so much I'll be devoting a lecture in my algorithms class to it, to illustrate how algorithms research works "in the real world".
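As a toy illustration of sorting by reversals, here's a greedy sketch of the simplest variant, unsigned prefix reversals (i.e. pancake sorting), rather than the signed, arbitrary reversals that the genomic problem actually involves:

```python
def pancake_sort(perm):
    """Sort a sequence using only prefix reversals; returns the sorted list and the flip sizes used."""
    a = list(perm)
    flips = []
    for size in range(len(a), 1, -1):
        i = a.index(max(a[:size]))                # position of the largest unplaced element
        if i == size - 1:
            continue                              # already in place
        if i != 0:
            a[:i + 1] = reversed(a[:i + 1])       # flip it to the front...
            flips.append(i + 1)
        a[:size] = reversed(a[:size])             # ...then flip it into its final position
        flips.append(size)
    return a, flips

print(pancake_sort([4, 2, 1, 3]))                 # ([1, 2, 3, 4], [4, 3, 2])
```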

2. Conquering divide, bit-player, October 9th, 2007.
The "hook" for this article is is Eric Allender's review article on the true complexity of division (short summary: Division is complete for DLOGTIME-uniform TC^0), which is a must-read in its own right. What Hayes discusses in his article is the reason to even ask this question, starting with the general lack of symmetry between grade-school methods for multiplication and division. I found this particularly insightful:

Why is division so hard?

At times I feel this is a dumb question, and that the universe doesn’t owe us an answer to such questions. Some things are just harder than others; that’s the way the world works. We can count ourselves lucky that multiplication has a polynomial-time algorithm; we have no right to expect that our luck will hold when we turn to division.

But then I get to thinking about physical systems that multiply and divide, where the two operations seem perfectly symmetrical. An example is an electrical circuit governed by Ohm’s law.

If we attach the terminals of a voltmeter to opposite ends of a resistor, we observe Ohm’s law in the form E = IR: Voltage is equal to the product of current and resistance. If we install an ammeter in series with the resistor, we see the effects of the law I = E/R: Current is equal to voltage divided by resistance. If you prefer pipes to wires, you can do the hydraulic version of the experiment, and there are also lots of mechanical schemes, where the same principle is illustrated with levers, gears, pulleys and such. In all of these systems, nature seems to divide just as easily as it multiplies. Why can’t we do the same with pencil and paper, or transistors, or neurons?

Saturday, October 06, 2007

CoM #18

The latest installment of the Carnival of Mathematics is out, hosted by jd2718. It's an amazing effort, with 52 posts from 49 different authors. The first few installments of the CoM were quite sparse, and for a while I wondered whether there was enough chatter in the blathosphere to sustain it. The answer, looking at the variety and quality of posts that Jonathan has put together, is clearly yes !

Well done !

Sunday, September 30, 2007

"Related Work" vs "Prior Work"

(ed: As deadlines draw near, I get increasingly caught up with the minutiae of paper-writing. Be warned)

"Related work" or "Prior work" ? I've seen both section titles used when describing all the work that came before the magnificient specimen of research that I'm reading right now. Personally, I prefer 'Prior work': it seems somehow closer to the spirit of 'I stand on the shoulders of giants', even if that was really just a dig at Hooke. 'Related Work' to me has the tone of a grudging 'yeah yeah, these guys did something that might vaguely resemble what we do here, and I guess I should include it, but don't imagine that their work in any way influences my brilliant thoughts, which are sui generis'.

An issue that appears to be related to this is the placement of the Related/Prior Work section. I personally prefer to have prior work discussions just after the introduction, unless discussing prior work requires the use of notation that I introduce in a definitions section. In many papers (and some of mine as well), the prior work gets shunted off to the no-man's land before the conclusions. I wonder if there's some correlation between the choice of the title, 'Related Work' and the banishing of the section to the very end, a kind of double dismissal of the value of the cited works.

Wednesday, September 26, 2007

The "Voronoi trick"

Only a few days after talking about work on redistricting based on computing a geometric clustering, I came across (via Andrew Gelman) yet another paper on automatic redistricting by Roland Fryer and Richard Holden.

It's a very interesting paper, and at the risk of sounding patronizing, I have to say that I'm impressed at their use of power diagrams and the Voronoi trick, something that many theoryCS folks may not have encountered. Once again, the problem they're looking at is redistricting: can you design voting districts to satisfy certain fairness criteria ?

One specific question that they ask is: can you evaluate the quality of a state's redistricting plan ? Their answer is in the form of an approximation ratio. They fix a measure of quality for a redistricting, and take the ratio of the actual plan cost to the cost of the optimal plan. Their cost, in this case, is a well known clustering measure: the cost of a cluster (in this case, a district) is the all-pairs sum of squared distances, and the cost of a clustering (in this case, a redistricting) is the sum of cluster costs. One extra twist is that they require all clusters to be equally sized up to rounding errors. That is, a 2-clustering of 9 people must have one cluster with 5 and one cluster with 4 people.

The problem of computing an optimal clustering is NP-hard if k, the number of clusters, is part of the input. However, if not, then there's an elegant method for solving these kinds of problems which I'll call the "Voronoi trick".

The Voronoi Trick:

A clustering of n items into k clusters is a partitioning problem, and as such, there are roughly k^n ways of assigning items to clusters (k^n comes from treating all clusters as distinct; in general, the count might be reduced by a factor of up to k!). But for geometric problems, we fully expect that not all partitions are likely to be optimal, because of proximity constraints and the triangle inequality. Is there some way of restricting the set of possible partitions ?

Consider the following set of possible partitions of a set of n points in a Euclidean space. We put down k centers, and assign each point of the input to its nearest center. Now not all placement of k centers lead to distinct partitions: I can always perturb one center slightly without changing the combinatorial structure of the partition. We'll call such a partition a Voronoi partition, from the connection to Voronoi diagrams.

Now it's clear that not all partitions of the input are realizable as Voronoi partitions. For example, if we consider k = 2 centers in the plane, then there are only O(n^2) distinct partitions (each one induced by the perpendicular bisector of the two centers), rather than the 2^n ways of partitioning a set of items into two groups. In general, for a set of n points in d dimensions, there are roughly n^{kd} distinct Voronoi partitions, which is polynomial in n, rather than being exponential.

Now here comes the trick: for many clustering problems, including the one mentioned above, we can show that the optimal clustering must be realizable as a Voronoi partition. This kind of result has appeared in many different forms: Inaba et al showed that for two different kinds of minimum variance clustering (one where the cost of a cluster is the sum of squared distances from the center, and the other where the cost is the pairwise sum of squared distances of points in the cluster), the optimal clustering is a Voronoi partition. Other forms of this trick were used by Aurenhammer et al in their paper on least-squares partitioning where they showed that we could compute the optimal RMS matching between two point sets using Voronoi partitions.

One aspect of this trick that I elided: often, the Voronoi partitioning doesn't come from using a straight distance function, but from using a weighted distance function. One way this is done is to associate a weight w(s) with each site, and define the Voronoi bisector as the set of points satisfying d^2(x,s) - w(s) = d^2(x,s') - w(s'), rather than the usual d(x,s) = d(x,s'). If you recall the view of a Voronoi diagram as the projection of the lower envelope of a collection of paraboloids z = d^2(x,s), one erected over each site s, then this weighted diagram corresponds to shifting each paraboloid down by its weight w(s) (which can even be negative). This diagram is also called a power diagram.
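In code, the assignment rule for a power diagram is essentially a one-liner (a sketch of my own; in the balanced-clustering application one then searches for weights that make all the cells have equal sizes):

```python
import numpy as np

def power_assign(points, sites, weights):
    """Assign each point to the site minimizing the power distance ||x - s||^2 - w(s)."""
    pts = np.asarray(points, dtype=float)       # shape (n, d)
    s = np.asarray(sites, dtype=float)          # shape (k, d)
    w = np.asarray(weights, dtype=float)        # shape (k,)
    d2 = ((pts[:, None, :] - s[None, :, :]) ** 2).sum(axis=2)   # all squared distances, shape (n, k)
    return np.argmin(d2 - w[None, :], axis=1)   # index of the power cell owning each point
```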

The Voronoi trick and clustering:
Apart from the fact that the Voronoi trick allows us to reduce the space of possible partitions to consider, it also allows us to isolate the influence of points on clusters. This idea was first exploited in the above-mentioned paper by Inaba et al, and has since been used in approximate clustering work by Badoiu, Har-Peled and Indyk, as well as the recent linear-time k-means algorithms by Kumar, Sabharwal and Sen. The argument goes something like this:

Since we know that each center is "influenced" by points in its Voronoi cell, we can focus on the problem of finding a single center out of the k. This is typically solved using a (simpler) 1-median or 1-means algorithm, run on a random sample of points that is likely to contain a large number of points from a single cluster. Once this is done, we can "peel" off this center, and repeat. The ability to do random sampling and peeling comes from the idea of irreducibility: namely, that either a k-clustering can be approximated using a k-1-clustering, or if not, it means that the k^th cluster is "reasonably well-separated" from the rest, and can be isolated using the above ideas.

Returning to redistricting:
Coming back to the redistricting paper, what they do is essentially show that their all-pairs measure of the cost of a cluster reduces to the 'cost to the centroid' measure, since all clusters are required to have the same size. In this case, they can make use of the power diagram to search efficiently for solutions. If I may nitpick, I wonder why they didn't look into the many approximation schemes available for these problems (but this is really a nitpick).

Postscript:
There are other tricks that are also worthy of being called a 'Voronoi trick'. One such idea dates back to a paper by Dobkin, Drysdale and Guibas from 1983 ("D.P. Dobkin, R.L. Drysdale, III, and L.J. Guibas. Finding smallest polygons. In Advances in Computing Research, Vol. 1, JAI Press, 1983, 181-214."), and works as follows:

Suppose you want to find the k points from a point set of size n that optimize some function: for example, you want the k points with the smallest enclosing ball, or the k points with the smallest convex hull area, and so on. Further, let's say you have some expensive algorithm that can solve this problem in time O(n^c). For many of these functions, it can be shown that the optimal set of k points must lie within a single cell of the O(k)^th order Voronoi diagram of the set of points. In case you hadn't encountered this notion before, the k^th order Voronoi diagram is what you get if instead of taking the lower envelope of your cones, you take the k^th level of the arrangement. Intuitively, it consists of cells in which for any point, its k nearest neighbours are a fixed set of points.

The k^th order Voronoi diagram of points in the plane has complexity O(kn). Using the above fact, we can enumerate over all cells, and run the expensive algorithm over the O(k) points in each cell. This gives a running time of O(k^c · kn) = O(k^{c+1} n), rather than O(n^c). This idea works up to a point: see this paper by fellow bloggers David and Jeff for improvements that bypass the Voronoi construction in favor of iterated nearest neighbors.

Friday, September 21, 2007

Manifold learning and Geometry

In June, Partha Niyogi gave the invited lecture at SoCG. The subject was manifold learning:
Given a collection of (unlabelled) data inhabiting some high dimensional space, can you determine whether they actually lie on some lower dimensional manifold in this space ?
This talk was a great window into an area of machine learning with strong geometric connections, an area where geometers could profitably play and make useful contributions.

Now NIPS (one of the major machine learning conferences) is returning the favor, with a workshop devoted to the topic of Topology Learning:

There is a growing interest in Machine Learning, in applying geometrical and topological tools to high-dimensional data analysis and processing.

Considering a finite set of points in a high-dimensional space, the approaches developed in the field of Topology Learning intend to learn, explore and exploit the topology of the shapes (topological invariants such as the connectedness, the intrinsic dimension or the Betti numbers), manifolds or not, from which these points are supposed to be drawn.

Applications likely to benefit from these topological characteristics have been identified in the field of Exploratory Data Analysis, Pattern Recognition, Process Control, Semi-Supervised Learning, Manifold Learning and Clustering.

However it appears that the integration in the Machine Learning and Statistics frameworks of the problems we are faced with in Topology Learning, is still in its infancy. So we wish this workshop to ignite cross-fertilization between Machine Learning, Computational Geometry and Topology, likely to benefit to all of them by leading to new approaches, deeper understanding, and stronger theoretical results about the problems carried by Topology Learning.

The list of invited speakers bridges geometry, topology and learning. It looks like a great forum for continuing the machine learning-computational geometry cross-talk that Partha kicked off at SoCG.

Thursday, September 20, 2007

Modelling via the algorithmic lens

Much of my time is spent on the interface between algorithms research and other areas, in the dreaded trenches of 'algorithmic modelling'. Algorithmic modelling can be a tremendously useful tool, if wielded skillfully, and in the few cases where one hits the jackpot, you have an instant argument for the value of theoryCS outside computer science. Most importantly, it provides a kind of rationale for the intrinsic value of the algorithmic method: that thinking algorithmically leads to new ways of thinking about computational problems, and hooks problems in other domains into the engine of algorithmic improvements, so that better algorithm design leads quickly to better solutions to "real-world" problems.

Such work however can be an endless source of frustration. Most commonly, there are cultural differences between areas, and terminology barriers that need to be overcome. But most of all, there's a sense of apathy towards any effort at producing formal guarantees for computational methods (either for running times, or quality). In general, methods based on formal algorithmic analysis must still prove themselves in the experimental setting, rather than being valued intrinsically for being correct, or for having running time and quality guarantees.

This in itself can be defended: if the main bottlenecks towards providing an effective practical solution to a problem are not computational (which they often are though), then it's not clear that a practitioner needs to care about esoteric notions like formal guarantees (which can often be quite weak). What is harder to surmount is the 'I've dug my hole and I like it' phenomenon.

For example, the RMS (root-mean-square) distance is commonly used to compare two 3D protein models. But I've never heard any fundamental argument as to why this measure is superior to other possible candidates for matching shape. What's worse, the RMS measure is harder to minimize (under transformations) than other equally plausible measures of shape similarity. However, as the status quo, it's hard to do anything to change this, short of voluminous studies that argue for the relative merits of a different measure based on comparisons over real shapes.

Lest it seem that I'm merely whining, Barbie-style, because 'Modelling is hard', let me summarize my main point:

The computational efficacy and guarantees associated with a particular modelling technique are not in and of themselves sufficient to convince a practitioner that this approach is valid, even if in all other respects, the method is identical to other, more entrenched methods.

This makes me wonder about the best way to do algorithmic modelling, if one chooses to do it at all.

Sunday, September 16, 2007

Sunday afternoon MathTubing

Via John Baez, learning category theory on YouTube ! Now what we need is a YouTube video for Scott Aaronson's soap film demonstrations.

The videos are listed under the category, 'How-to and DIY'. Indeed ! Make your own monads, in 9 minutes or less !

Saturday, September 15, 2007

The hardness of fair redistricting

Through a series of pointers, I ended up at Brian Olson's page on redistricting, where he proposes an algorithm to do "fair redistricting" of districts so as to avoid problems with gerrymandering. The measure proposed appears to be a discrete k-median variant: namely, cluster "blocks" into k districts such that the average distance to the center of the district (defined as the centroid of the collection of blocks) is minimized. His code is online, and appears to use a GA-type heuristic. This is essentially a planar k-median problem, and is solvable approximately by a variety of methods.

In reading some of the (critical) comments on some of the original posts, (most of the form (a) this has been proposed ever since the early 60s (b) automation is not the hardest problem in redistricting), I was led to an award-winning political science dissertation by Micah Altman, in which (among other things) he shows that most of the reasonable redistricting formulations (although not the one above) are NP-hard (they tend to be variants of set packing). It was interesting to read a lengthy discourse on computational complexity and NP-hardness in the middle of such a dissertation.

Monday, September 10, 2007

An interesting blog

Derrick Coetzee writes a blog at MSDN titled, 'Developing for Developers'. There's lots of stuff related to code development, but also articles on general computing and theory. Some recent posts:
None of this is necessarily new to people reading this blog, but it's heartening to see an audience for this material in the developer community. And of course, anyone who says,
In reality, computing has nothing to do with computers. Computing is fundamentally about solving problems using strictly defined processes under resource constraints. The theory of computing, by design, has little regard for what device you use to carry those processes out.
will always have a special spot in my blogroll :)

Sunday, September 09, 2007

The TheoryCS blogosphere flourishes !

Jeez. I leave my computer for two hours, and four different bloggers announce the SODA accepts. If for some extremely obscure reason you haven't yet seen the postings, here's the list.

Saturday, September 08, 2007

Carnival of Mathematics XVI

is up, courtesy Kurt van Etten @ Learning Computation. The theme is 'Good, Bad and Ugly', and for some reason, computer science is "Ugly" :). Given that two of the entries involve mathematical feuds, maybe this is fitting.

Physician, heal thyself ?

There's been a lot of grief over the reduction in price for iPhones only a few months after they were released. Wired magazine interviewed people who don't regret paying the higher price for the virtue of being an early adopter/arbiter-of-cool, and this comment caught my eye (emphasis mine):
"If they told me at the outset the iPhone would be $200 cheaper the next day, I would have thought about it for a second - and still bought it," said Andrew Brin, a 47-year-old addiction therapist in Los Angeles. "It was $600 and that was the price I was willing to pay for it."
I don't know: I think there are some other people who should get their money back.

Wednesday, September 05, 2007

Heuristics for NP-hard problems

A while back, Michael was discussing the teaching of heuristics in his undergraduate algorithms class, and had this request:
I wish the standard textbooks would devote a solid chapter on heuristic techniques; in real life, most students are probably more likely to have to implement a heuristic for some NP-hard problem than to have to implement a minimum spanning tree algorithm.
If you remove the word, 'standard', he may just have got his wish. From the press blurb for the Handbook of Approximation Algorithms and Metaheuristics:
Delineating the tremendous growth in this area, the Handbook of Approximation Algorithms and Metaheuristics covers fundamental, theoretical topics as well as advanced, practical applications. It is the first book to comprehensively study both approximation algorithms and metaheuristics. Starting with basic approaches, the handbook presents the methodologies to design and analyze efficient approximation algorithms for a large class of problems, and to establish inapproximability results for another class of problems. It also discusses local search, neural networks, and metaheuristics, as well as multiobjective problems, sensitivity analysis, and stability.
Browsing the table of contents, the reader encounters an entire section on heuristics, with chapters on local search, stochastic local search, neural networks, genetic algorithms, tabu search, simulated annealing, ant colony optimization and the like. The book is published by CRC, which stands for "Organization-That-Singlehandedly-Destroys-Rain-Forests-While-Making-Your-Biceps-Extremely-Strong".

I generally have mixed feelings about CRC books: they are good reference texts, but tend to skim over content too lightly, and CRC has this unhealthy obsession with producing handbooks for everything. But for teaching heuristics, this might be an invaluable reference.

I should mention that the vast majority of the handbook deals with "proper" approximation algorithms, as well as applications to specific areas. The problem is that approximation algorithms is a fast-moving field (though maybe not as fast as in the mid-90s), and so a handbook, being a static snapshot in time, will get dated quickly.
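Coming back to the heuristics chapters: for a flavor of what they cover, here is a bare-bones simulated annealing skeleton (my own sketch, not taken from the handbook). The `cost` and `neighbor` functions are placeholders you would supply for your particular NP-hard problem; for a TSP instance, say, `neighbor` might reverse a random segment of the tour and `cost` would be the tour length.

```python
import math
import random

def simulated_annealing(initial, cost, neighbor, t0=1.0, cooling=0.995, steps=10000, seed=0):
    """Generic skeleton: accept a worse move with probability exp(-delta / T)."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t0
    for _ in range(steps):
        candidate = neighbor(current, rng)
        delta = cost(candidate) - current_cost
        if delta < 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, current_cost + delta
        if current_cost < best_cost:
            best, best_cost = current, current_cost
        t *= cooling                     # geometric cooling schedule
    return best, best_cost
```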

Tuesday, September 04, 2007

Teacher's Day

I've been out of India for far too long, or so it seems. I had to be reminded by this article that Wednesday, Sep 5, is Teacher's Day in India. The day was chosen to coincide with the birthday of India's second president, S. Radhakrishnan, a statesman famous for his erudition. According to Wikipedia, only 21 countries have an official celebration of Teacher's Day. Much to my surprise, the US is among them, with the day being part of a week called Teacher Appreciation Week.

Teacher's day in my school was always a big event; teachers did not teach that day, and came to school mainly to be entertained and feted in an assembly with variety shows/skits/gentle mockery. Senior students would mind the junior classes. When I was younger, this meant a day of partying in school. When I became one of the seniors, it meant I could come to school in "civilian clothing" (most schools in Delhi had strict uniform policies) and boss the younger students around. Good times...

Now I'm a teacher myself (after a fashion), and find it both easy (and fashionable) to grumble about the lack of respect shown by students in the US, as compared to students in India. The truth of course is far more complicated than that. Schools in Delhi appear to have acquired many of the "bad traits" of the worst high schools here; my mother has taught in Delhi schools for nearly 20 years and has seen them acquire more "western" habits (which is a shorthand for more boy-girl interaction, less "studiousness", less fear/reverence for the teachers, take your pick). And ultimately, as a university professor, I'm not even on the frontlines of education here in the US, and have no business complaining about the largely excellent students I teach.

In any case, here's a cheer for the teachers I had; the ones who tolerated my brashness, general arrogance, and constant questions, and helped me reach the teaching pedestal I find myself at today. And here's hoping that one day there'll be some student who might think as fondly of me as I think of all my teachers long past.

Monday, September 03, 2007

NP hardness of Euclidean k-median clustering

Suppose you're given a metric space (X, d) and a parameter k, and your goal is to find k "clusters" such that the sum of cluster costs is minimized. Here, the cost of a cluster is the sum (over all points in the cluster) of their distance to the cluster "center" (a designated point).

This is the well-known k-median problem (which differs from the also-popular k-means problem by using distances rather than squared distances). In a general metric space, the k-median problem is known to be NP-hard, and also hard to approximate to within an arbitrary constant factor. The best known approximation ratio was 4, due to Charikar and Guha; the current best is 3 + eps, via a local search heuristic due to Arya, Garg, Khandekar, Meyerson, Munagala and Pandit (thanks, Chandra).
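For intuition about the local search approach, here is a toy version of the single-swap heuristic (my own sketch, not the algorithm as analyzed in the paper): start from any k centers and, as long as swapping one center for one non-center lowers the k-median cost, make the swap. The single-swap version is only known to give a factor-5 guarantee; the 3 + eps bound requires swapping several centers at a time, plus the careful analysis in Arya et al.

```python
def kmedian_cost(points, centers, d):
    return sum(min(d(p, c) for c in centers) for p in points)

def local_search_kmedian(points, k, d):
    """Toy single-swap local search for k-median in a metric given by d(p, q)."""
    centers = list(points[:k])          # arbitrary initial solution
    cost = kmedian_cost(points, centers, d)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for p in points:
                if p in centers:
                    continue
                trial = centers[:i] + [p] + centers[i + 1:]   # swap center i for p
                trial_cost = kmedian_cost(points, trial, d)
                if trial_cost < cost:
                    centers, cost, improved = trial, trial_cost, True
                    break
            if improved:
                break
    return centers, cost

# example usage with Euclidean distance in the plane:
# d = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
# local_search_kmedian([(0, 0), (1, 0), (10, 10), (11, 10)], 2, d)
```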

If the underlying metric is the Euclidean metric, then the problem complexity changes: in fact a PTAS for the Euclidean k-median can be obtained, due to the results of Arora, Raghavan and Rao, and then Kolliopoulos and Rao (who provide an almost-linear time algorithm). But as far as I am aware, there is still no NP-hardness proof for the Euclidean k-median problem, and I'd be interested in knowing if I am wrong here.

Note that the related problem of Euclidean k-means is known to be NP-hard from an observation by Drineas, Frieze, Kannan, Vempala and Vinay.

Update: As pointed out in the comments, the NP-hardness of k-median in the plane was shown in 1984 by Megiddo and Supowit. Once again, the wisdom of the internet beats independent research ;)

Tenure committee is to Colombian drug cartel as ...

Via HWTW, a hilarious advice piece for new profs by Phil Ford at Inside Higher Ed, adapted from the Notorious B.I.G. track "The Ten Crack Commandments". The article is great, but the comments are even funnier.

"Data Mining" = voodoo science ?

On the Statistical Modeling blog, Aleks Jakulin has a rant on the virtues of data mining:

I view data analysis as summarization: use the machine to work with large quantities of data that would otherwise be hard to deal with by hand. I am also curious about what would the data suggest, and open to suggestions. Automated model selection can be used to list a few hypotheses that stick out of the crowd: I was not using model selection to select anything, but merely to be able to quantify how much a hypothesis sticks out from the morass of the null.

The response from several social scientists has been rather unappreciative along the following lines: "Where is your hypothesis? What you're doing isn't science! You're doing DATA MINING !"
I had almost the same reaction a while back when I was visiting JPL: the climatologists there were horrified at the idea of trolling for patterns in climate data, and to a person asked me the dreaded "But what is the science question?" question. Of course, given the general hot-potato-ness of climatology right now, one might sympathize with their skittishness.

Data mining is a tricky area to work in, and I've discussed this problem earlier. It's a veritable treasure chest of rich algorithmic problems, particularly in high-dimensional geometry and over large data sets. However, it's often difficult to get a sense of forward progress, since the underlying analysis questions can seem like elaborate fishing expeditions.

In that context, the distinction Aleks makes between confirmatory data analysis (check whether the data validates or invalidates a hypothesis) and exploratory data analysis (play with the data to build a non-uniform distribution over plausible hypotheses) is quite helpful. It also emphasizes the interactive and very visual nature of data mining: good interactive tools and visualizations matter as much as the underlying analysis machinery.

Update: Chris Wiggins points me to some of the earlier references to 'data mining'. One of the most vituperative is a 1983 paper by Michael Lovell in The Review of Economics and Statistics. This paper drips with scorn for 'data miners', but makes a point that is at the very least worthy of consideration: because of the large dimensionality of the space of hypotheses that a data mining application typically explores (here couched in terms of explanatory variables for a regression), patterns that appear highly significant might not actually be significant at all (stated another way, in high-dimensional spaces, many seemingly rare patterns aren't rare).
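Lovell's point is easy to see in a toy simulation (mine, not from the paper): generate a pure-noise response and a few hundred pure-noise "explanatory" variables, and several of them will look significant at the 5% level purely by chance.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spurious_hits(n=50, num_vars=200, seed=1):
    """Count pure-noise variables that look 'significant' at the 5% level."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n)]
    hits = 0
    for _ in range(num_vars):
        x = [rng.gauss(0, 1) for _ in range(n)]
        r = pearson(x, y)
        # under the null, t = r * sqrt(n-2) / sqrt(1-r^2) is t-distributed with n-2 df;
        # for n = 50, |t| > 2.01 corresponds roughly to a two-sided p-value below 0.05
        t = abs(r) * math.sqrt(n - 2) / math.sqrt(1 - r * r)
        if t > 2.01:
            hits += 1
    return hits   # expect about 0.05 * num_vars (~10) spurious "discoveries"
```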

Thursday, August 30, 2007

Parallel algorithms: A resurgence ?

My department has a fairly strong systems and hardware group, and some of the buzz I hear involves the research opportunities in multicore architectures (I'm talking MANY MANY cores, not 2 or 4), and how parallel computation is relevant once again. In fact, I've had one request to include basic material on parallel algorithms in my graduate algorithms class.

Juxtaposed with that is the fact that the Stein-enhanced version of Introduction to Algorithms does NOT contain the chapter on parallel algorithms that the old CLR used to have. In fact, I'll probably use the chapter outline from CLR if I cover parallel algorithms at any level. I wonder what the thinking was behind removing the chapter on parallel methods, and whether this might be reversed in future editions.

Update: (that was quick! - thanks, Cliff): Tom Cormen writes in to explain the decision to remove the chapter on parallel algorithms:
Cliff Stein pointed me to this blog. Suresh asked why we removed the chapter on parallel algorithms when we wrote our second edition.

The chapter on parallel algorithms in the first edition was exclusively on PRAM algorithms. We wrote the second edition mostly during the late 1990s, into early 2001. At the time, PRAM algorithms were yesterday's news, because the PRAM model was too far from what could be built realistically. We considered including algorithms for some other parallel computing model, but there was no consensus model at the time. We felt that the best decision was to just leave out parallel algorithms and use the pages for other new material.
I'm not surprised. It *was* true that by the time the 2nd edition arrived, PRAMs were yesterday's news: in fact, streaming and external memory methods were "hotter" at the time. It'll be interesting to see whether multicore actually does spur new interest in parallel algorithms, and not just parallel architectures. In recent months I've participated in discussions about algorithm design for things like the Cell and the implied architecture of nVidia's CUDA, and it's surprising how often standard parallel algorithm techniques like pointer jumping are being reinvented.

What makes the new platforms more interesting is that they have features (especially streaming features) that make the mapping to "standard" PRAM models not so obvious. It may not merely be a matter of shoehorning new systems into parallel models, but of extending and modifying the models themselves.
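As a reminder of what those standard methods look like, here is a tiny sequential simulation of pointer jumping for list ranking (my own illustrative sketch, not taken from CLR): every node repeatedly adds in its successor's rank and then jumps over it, so the whole list is ranked in O(log n) rounds of (conceptually parallel) work.

```python
def list_rank(succ):
    """succ[i] is the index of i's successor; the tail points to itself.
    Returns rank[i] = number of hops from i to the tail, plus the round count.
    Each iteration of the while loop corresponds to one parallel PRAM step."""
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    nxt = list(succ)
    rounds = 0
    while any(nxt[i] != nxt[nxt[i]] for i in range(n)):
        # "in parallel": add the successor's rank, then jump over the successor
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
        rounds += 1
    return rank, rounds   # rounds is O(log n)

# example: the list 0 -> 1 -> 2 -> 3 -> 4 (node 4 is the tail)
# list_rank([1, 2, 3, 4, 4]) returns ([4, 3, 2, 1, 0], 2)
```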

Tuesday, August 28, 2007

Streaming Summer School Report

(Ed: This post is by Piotr Indyk, who is always willing to be your near neighbor)

Greetings from the Kingdom of Denmark! The country of Vikings, meatballs, and football teams that just refuse to win hosted the Summer School on Data Stream Algorithms last week (August 20-23). The school was organized under the banner of MADALGO, a new research center dedicated to MAssive DAta ALGOrithms, set up at Aarhus University. The inauguration ceremony for the center took place on August 24, with several people giving invited lectures.

Muthu (one of the invited lecturers) has covered the inauguration ceremony, so I will skip the detailed description. Suffice it to say, it was a pleasure to see that the Danish Research Foundation (or as the locals like to say, Grundforskningsfond) is eager to support an algorithmic research center with a budget of roughly $10M over 5 years, while its US counterpart spends about $7M per year on the entire Theory of Computing program. Did I mention that the population of Denmark is roughly 2% of that of the US?

Anyway, back to the summer school. We had 70+ participants altogether, including 5 lecturers. The school covered the following topics:
  • The dynamic Sudipto Guha gave two lectures. The first lecture was on algorithms for clustering. Massive amounts of data were clustered, including metric data, graph data, and a few careless participants sitting in the first row. In the second lecture, Sudipto covered the "random stream model", where the elements are assumed to be arriving in a random order, which circumvents the usual worst-case paranoia.
  • The twin duo of T.S. Jayram and Ravi Kumar covered lower bounds: communication complexity, information complexity, and generally "everything you wanted to know but were afraid to ask". It was the first time I had seen the details of the linear-space lower bound for estimating the L_infty distance, and I am happy to report that I understood everything, or at least that is what I thought at the time. Jayram and Ravi also occasionally ventured into the land of upper bounds, covering algorithms for longest increasing subsequences and for probabilistic data streams.
  • The scholarly Martin Strauss gave an overview of the algorithms for finding frequent elements, heavy hitters (sometimes on steroids) and their more recent versions used in compressed sensing.
  • I covered the basic upper bounds for L_p norm/frequency moment estimation (a toy version of the classic F_2 estimator appears right after this list), as well as algorithms for geometric data (clustering, MST, matching), notably those based on core-sets. The latter topic was originally supposed to be covered by Sariel Har-Peled; however, the dark forces (or, per Sariel's corrections, the highly enlightened and insightful geniuses) of the INS jeopardized his plans. I guess the force was not strong enough with this one...
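Here is that toy F_2 estimator (an illustrative sketch only; a real streaming implementation would use 4-wise independent hash functions and a median-of-means combination rather than storing an explicit random sign for every distinct item). The idea, due to Alon, Matias and Szegedy, is to keep a counter that adds a random ±1 sign per item; its square is an unbiased estimate of F_2 = sum_i f_i^2.

```python
import random
from collections import defaultdict

def ams_f2_estimate(stream, repetitions=100, seed=0):
    """Average several independent AMS counters to estimate F_2 = sum_i f_i^2."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(repetitions):
        signs = defaultdict(lambda: rng.choice((-1, 1)))  # random +/-1 per distinct item
        z = 0
        for item in stream:
            z += signs[item]          # one counter update per stream element
        estimates.append(z * z)       # E[z^2] = F_2
    return sum(estimates) / repetitions

# example: the stream "aaabbc" has frequencies 3, 2, 1, so F_2 = 9 + 4 + 1 = 14;
# ams_f2_estimate("aaabbc") should be close to 14 on average
```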
We also had an open problem session. Some of the problems were copy-pasted from the "Kanpur list", but a few new problems were posed as well. The list will be available shortly on the school website, so sharpen your pencils, prepare your napkins, pour some coffee, and ... give all of this to your students!

The lecture slides are also available on-line. If you spot any typos, let the lecturers know.

Overall, I think the school was a success, perhaps with the notable exception of the weather: it started to rain promptly after the school began, and stopped when the school ended. One has to admire the timing, though.

SOCG 2009 will be held in Aarhus. See you then!

(Ed: But what about the beer report ?)

Friday, August 24, 2007

Infinite trees

If you followed Andy Drucker's series of posts on infinite trees and were craving more, be prepared for more foliage:

Via Ars Mathematica, a pointer to Keith Devlin's column on a bizarre counter-version of König's tree lemma: namely, if you have an uncountable tree in which every level is countable, it is not necessarily true that an uncountable path must exist.

Sunday, August 12, 2007

For the love of fonts....

Continuing my part-time obsession with fonts (Helvetica II: Arial strikes back !), allow me to present this delightful tale about how the right font can save lives:
What I saw, Pietrucha knew, was what we all may see soon enough as we rush along America’s 46,871 miles of Interstate highways. What I saw was Clearview, the typeface that is poised to replace Highway Gothic, the standard that has been used on signs across the country for more than a half-century. Looking at a sign in Clearview after reading one in Highway Gothic is like putting on a new pair of reading glasses: there’s a sudden lightness, a noticeable crispness to the letters.
It's a fascinating account of the ten years of research and lobbying that went into replacing the font used on federal highway signs across the U.S. I was driving along I-15 today and almost got into an accident trying to tell whether the local highway sign fonts had changed. Apart from the inside baseball of how to design a font, always guaranteed to send a shiver down my spine, the story describes how the team of designers got together, designed the typeface, and managed, after repeated lobbying, to convince the Department of Transportation to replace the original fonts with the new ones.

There was an amusing line in the article about American-style engineering:
The letter shapes of Highway Gothic weren’t ever tested, having never really been designed in the first place. “It’s very American in that way — just smash it together and get it up there,” says Tobias Frere-Jones, a typographer in New York City who came to the attention of the design world in the mid-1990s with his Interstate typeface inspired by the bemusing, awkward charm of Highway Gothic. “It’s brash and blunt, not so concerned with detail. It has a certain unvarnished honesty.”
If you remain unconvinced about the difference, look at the accompanying slideshow.

Thursday, August 09, 2007

SGERs being replaced ?

The new NSF news feeds are great. I get to hear all kinds of interesting tidbits about the NSF, including some recent chatter on encouraging "transformative research". One direct consequence of this chatter is this (emphasis mine):

[NSF Director] Bement yesterday proposed a three-pronged strategy before the task force on transformative research. It was unanimously adopted by the task force on Tuesday and then unanimously adopted by the Board on Wednesday. Bement's plan will:

1. Infuse support of potentially transformative research throughout NSF and all of its programs;

2. Learn how to facilitate potentially transformative research; and

3. Lead the community through opportunities for potentially transformative research proposal submissions.

[...]

To lead the community, Bement will embark on a three-year trial, during which NSF will replace small grants for exploratory research with a two-tiered "early-concept" award mechanism. Tier I will call for limited funding grants that are internally reviewed. Tier II will entail larger grants requiring additional levels of review. Further, NSF will create a working group to recommend implementation details; a mechanism to monitor and track impact and lessons-learned; and a method to advertise the new approach to the community.

Monday, August 06, 2007

Things that make you pull your hair out in despair

I was recently visiting AT&T for a few weeks, and was lucky enough to catch a talk by Amit Chakrabarti on lower bounds for multi-player pointer jumping. A complexity class that figured prominently in his talk was ACC0, which consists of constant-depth, unbounded-fanin, polynomial-sized circuits with AND, OR, NOT, and MOD m gates, for all m.

Suppose we make our life simple by fixing m to be a single prime, like 3. It turns out that the corresponding class AC0[m] is strictly contained in NC1: this follows from results of Razborov and Smolensky in the 80s. Now suppose that instead of a prime, we pick a number like 6, the product of two distinct primes. We do not even know whether AC0[6] = NP or not!
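In symbols (my own summary of the standard picture, not anything specific from the talk):

```latex
% known (Razborov--Smolensky, 1980s): for every prime p,
\mathsf{AC}^0[p] \subsetneq \mathsf{NC}^1 .
% open: even though
\mathsf{AC}^0[6] \subseteq \mathsf{ACC}^0 \subseteq \mathsf{NC}^1 \subseteq \mathsf{P} \subseteq \mathsf{NP},
% we cannot currently rule out that the two ends of this chain coincide, i.e. whether
\mathsf{AC}^0[6] \stackrel{?}{=} \mathsf{NP} .
```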

Wednesday, August 01, 2007

Saving lives with exact algorithms

It's not often you get to say this in a paper:
We aim for exact algorithms [because] ... any loss of optimality could lead to unnecessary patient deaths.
Anyone who's gone through an algorithms class will at some point hear about stable marriage algorithms, and how the method is used to match hospitals and newly minted MDs looking for residencies.

Consider now the far more serious problem of matching kidney transplant candidates to potential donors. Because transplant lists are long, and cadaver donors are few, marketplaces matching healthy donors to recipients have sprung up in the US. For complicated ethical reasons (which are not without controversy), such exchanges are not made for money, and are viewed as gifts.

So what happens if a donor kidney doesn't match the potential recipient? Ordinarily, nothing. Suppose, however, that there were another donor-recipient pair with a similar mismatch, and that if only the donors were swapped, both transplants could go through. What if a 3-way cycle of matches could be found? This is the "market clearing" problem, and is the subject of a paper by Abraham, Blum and Sandholm to appear in EC.

I'll get into the problem statement in a second. What's far more important is that the results of this paper have already been used to find potential transplant matches where none existed! The Alliance for Paired Donation has been using this method since last December, and has already found matches that prior methods would have missed. This is an incredible achievement: working on a problem abstracted from a real life-or-death scenario, and actually taking the results back to the source and making a difference.

Technically, the problem is easily stated. Given a directed graph with weights on edges, and a parameter L, find a maximum-weight collection of vertex-disjoint cycles, each of length at most L. Vertices are agents with items (in this case, transplant recipients paired with donors). The weight on an edge represents the utility to the source of obtaining the sink's item (a donor transfer). The constraint on L reflects reality: all transplants in a cycle have to be performed simultaneously (so that no donor can back out after their partner has received a kidney), and it isn't feasible to perform more than a few (typically 3) transplants at once.

The line I quoted at the beginning of the post comes as part of an explanation as to why they want exact algorithms for the problem (it's NP-hard for L >= 3). The technical contributions include finding a way to run an integer-linear program at scale for graphs with hundreds of thousands of nodes.
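To make the combinatorial core concrete, here is a tiny brute-force version of the cycle formulation (purely illustrative; the paper's contribution is making an integer program over such cycle variables scale to huge graphs, which this sketch emphatically does not do): enumerate all directed cycles of length at most L, then pick a maximum-weight vertex-disjoint subset of them.

```python
from itertools import permutations

def short_cycles(w, L):
    """Enumerate directed cycles of length <= L in a graph given by a weight
    dict w[(u, v)]; returns (vertex_set, total_weight) pairs."""
    vertices = sorted({u for u, _ in w} | {v for _, v in w})
    cycles = []
    for k in range(2, L + 1):
        for tour in permutations(vertices, k):
            if tour[0] != min(tour):          # canonical rotation, to avoid duplicates
                continue
            edges = list(zip(tour, tour[1:] + (tour[0],)))
            if all(e in w for e in edges):
                cycles.append((frozenset(tour), sum(w[e] for e in edges)))
    return cycles

def best_exchange(w, L):
    """Max-weight collection of vertex-disjoint short cycles (exponential time)."""
    cycles = short_cycles(w, L)

    def solve(i, used):
        if i == len(cycles):
            return 0, []
        best = solve(i + 1, used)                 # option 1: skip cycle i
        verts, weight = cycles[i]
        if not (verts & used):                    # option 2: take cycle i if disjoint
            sub_w, sub_c = solve(i + 1, used | verts)
            if weight + sub_w > best[0]:
                best = (weight + sub_w, [verts] + sub_c)
        return best

    return solve(0, frozenset())

# toy example: two 2-cycles sharing vertex 2, so only one can be chosen
# w = {(1, 2): 1, (2, 1): 1, (2, 3): 5, (3, 2): 5}
# best_exchange(w, 2) -> (10, [frozenset({2, 3})])
```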

(Via NSF Current)

Tuesday, July 31, 2007

New TheoryCS Blog Alert !

Mihai Pătraşcu, cause of numerous flame wars on the complexity blog, as well as lonely campaigner for the rights of esoteric data structures everywhere, has a blog, self-billed as 'Terry Tao without the Fields Medal'.

Enjoy !

Thursday, July 26, 2007

A counting gem

Consider the following puzzle:
Given 2n items, determine whether a majority element (one occurring at least n+1 times) exists. You are allowed one pass over the data (which means you can read the elements in sequence, and that's it), and exactly TWO units of working storage.
The solution is incredibly elegant, and dates back to the early 80s. I'll post the answer in the comments tomorrow, if it hasn't been posted already.

Update: A technical point: the problem is a promise problem, in that you are promised that such an element exists. Or, in the non-promise interpretation, you are not required to return anything reliable if the input does not contain a majority element.

Tuesday, July 24, 2007

STOC/FOCS/SODA: The Cage Match (with data!)

(Ed: This post, and the attendant web page and data, was initiated as a joint effort of Piotr Indyk and myself. Many others have helped improve the data, presentation and conclusions.)

Inspired by Michael Mitzenmacher's flamebait post on SODA/STOC/FOCS, we decided to roll up our sleeves, and resolve the biggest outstanding issue in Theoretical Computer Science, namely the great "STOC/FOCS vs. SODA" debate ("P vs. NP" is a tainted second, sullied by all that money being offered to solve it). We have some interesting preliminary observations, and there are many interesting problems left open by our work ;)

The hypothesis:
There is a significant difference in citation patterns between STOC/FOCS and SODA
The plan:

First, we obtained the list of titles of conference papers appearing in STOC, FOCS and SODA in the last 10 years (1997-2006). We deliberately excluded 2007 because FOCS hasn't happened yet. We got this list from DBLP (Note: DBLP does not make any distinction between actual papers and tutorials/invited articles; we decided to keep all titles because there weren't that many tutorials/invited papers in any case).

For each title, we extracted the citation count from Google Scholar, using a process that we henceforth refer to as "The Extractor". Life is too short to describe what "The Extractor" is. Suffice it to say that its output, although not perfect, has been verified to be somewhat close to the true distribution (see below).

The results, and methodology, can be found at this link. The tables and graphs are quite self-explanatory. All source data used to generate the statistics are also available; you are welcome to download the data and make your own inferences. We'll be happy to post new results here and at the webpage.
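If you do download the data and want to compute your own summaries, something like the following is enough. The CSV column names here (conference, year, title, citations) are an assumption for illustration, so adapt them to however you store the files.

```python
import csv
from collections import defaultdict
from statistics import median

def per_year_stats(path):
    """Median and total citation counts per (conference, year) pair."""
    buckets = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            buckets[(row["conference"], int(row["year"]))].append(int(row["citations"]))
    return {key: {"papers": len(v), "median": median(v), "total": sum(v)}
            for key, v in sorted(buckets.items())}

# e.g. per_year_stats("citations.csv")[("SODA", 2000)]
```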

OBSERVATIONS:

The main conclusion is that the hypothesis is valid: a systematic discrepancy between citation counts of SODA vs. STOC/FOCS does appear to exist. However, the discrepancy varies significantly over time, with years 1999-2001 experiencing the highest variations. It is interesting to note that 1999 was the year when SODA introduced four parallel sessions as well as the short paper option.

Although most of the stats for STOC and FOCS are quite similar, there appears to be a discrepancy at the end of the tail. Specifically, the 5 highest citation counts per year for STOC (years 1997-2001) are all higher than the highest citation count for FOCS (year 2001). (Note: the highest cited STOC article in 2001 was Christos Papadimitriou's tutorial paper on algorithms and game theory). The variation between SODA and STOC/FOCS in the 1999-2001 range shows up here too, between STOC and FOCS themselves. So maybe it's just something weird going on these years. Who knows :)

Another interesting observation comes from separating the SODA cites into long and short paper groups (for the period 1999-2005). Plotting citations for short vs long papers separately indicates that the presence of short papers caused a net downward influence on SODA citation counts, but as fewer and fewer shorts were accepted, this influence decreased.

There are other observations we might make, especially in regard to what happens outside the peak citations, but for that we need more reliable data. Which brings us to the next point.

VALIDATION OF THE DATA:

To make sure that the output makes sense, we performed a few "checks and balances". In particular:
  • we sampled 10 random titles from each of FOCS, STOC and SODA for each of the 10 years, and for each title we checked the citation count by hand. Results: there were 7 mistakes in FOCS, 9 in STOC, and 11 in SODA, indicating a current error rate in the 9-10% range.
  • for each of FOCS, STOC, SODA, we verified (by hand) the values of the top 10 citation numbers, as reported by The Extractor
  • we compared our stats for the year 2000 with the stats obtained by Michael. The results are pretty close:

          Median (Mike's/Ours)    Total (over all 10 years) (Mike's/Ours)
    FOCS  38/38                   3551/3315
    STOC  21/21                   3393/2975
    SODA  14/13                   2578/2520

A CALL FOR HELP:
Warning: the data displayed here is KNOWN to contain errors (our estimate is that around 10% of citation counts are incorrect). We would very much appreciate any efforts to reduce the error rate. If you would like to help:
  1. choose a "random" conference/year pair (e.g., STOC 1997)
  2. check if this pair has been already claimed in this blog; if yes, go to (1)
  3. post a short message claiming your pair (e.g., "CLAIMING STOC 1997") on the blog.
  4. follow the links to check the citations. For each incorrect citation, provide two lines: (1) paper title (2) a Google Scholar link to the correct paper
  5. Email the file to Suresh Venkatasubramanian.
(and yes, we know that this algorithm has halting problems). Many thanks in advance. Of course, feel free to send us individual corrections as well.

Citation guidelines: The world of automatic citation engines is obviously quite messy, and sometimes it is not immediately clear what the "right" citation count of a paper is. The common "difficult case" is when you see several (e.g., conference and journal) versions of the same paper. In this case, our suggestion is that you ADD all the citation counts that you see, and send us the links to ALL the papers that you accounted for.


Acknowledgements: Adam Buchsbaum, for suggesting the idea of, and generating the data for, the short vs. long SODA comparison. David Johnson, Graham Cormode, and Sudipto Guha, for bug fixes, useful suggestions, and ideas for further plots (which we can look into after the data is cleaned up).
