Ruminations on computational geometry, algorithms, theoretical computer science and life

Prove or disprove:

There exists no pentagon in the plane all of whose lengths (sides and diagonals) are rational. (Passed on to me by a friend. And no, I don't know the answer.)

One fact that is known: no regular pentagon in the plane can have integer coordinates.

True episode:

I was looking for a paper in the university online journal listing. It's a SIAM paper published prior to 1997, so LOCUS is where it resides. Unfortunately, our online journal listing didn't seem to have LOCUS, and so I clicked on one of the help buttons to talk to a librarian (on live chat). A few minutes later, she points out to me that this particular issue of the journal is physically available in our math library.

Till that point, it had not even occurred to me to check the physical stacks.

p.s. I'm just back from the CG PC meeting and the McGill workshop on limited visibility in Barbados. I had to leave early, so discussions continue. Barbados (and later, Bellairs itself) was off the internet till I left on Monday, so I have a collection of posts that will dribble out over time.

A posting on comp.theory:

I just started a new class recently and am in need of a good class project. The course is a graduate course in computational geometry. I think I will like this course but, overall, I don't find myself too much interested in teh material. I figure a good project will help motivate me through the course and, well, a project is required anyway! :) My main interests are in number theory and cryptography. Any suggestions for a nice little project that could be done in the span of several weeks by a beginning graduate student? The idea could either be theoretical or an implementation in code or perhaps a bit of both.

I have to ask: how does this person know they will like the course if they are not interested in the material ?

I'm teaching CG this semester, and it's a lot of fun to go back and read some of the older texts and papers. Preparata and Shamos is actually quite a neat book, chock full of useful tricks and techniques; if only they had better ways of describing their algorithms ...

The real challenge is designing good assignment problems, and I've almost given up on this. With the number of courses available on the web, and the amount of information accessible via Google, I'm hard pressed to write down a question that can't be solved using Google and a few keywords. Even the trick of digging up interesting lemmas from related papers doesn't work if you mention the paper in class. Or maybe I'm underestimating my students' potential desire to solve the problem themselves.

On Harvard's search for a new president:

The growing financial importance of research also could pressure Harvard to tap a scientist, something it hasn't done since 1933. But Harvard also could go the other way -- picking a nonscientist who could rise above turf battles and reassure the rest of the school that America's oldest and richest university isn't becoming a giant science lab.

I don't get it. Being a giant science lab is a BAD thing ?

We don't need no stinkin' longest common subsequence ! We'll use FOOTBALL to explain dynamic programming:

The footballcommentary.com Dynamic Programming Model is intended to provide guidance for certain decisions that arise during a game, such as two-point conversions and going for it on fourth down. This article explains the main ideas of the Model in simplified form. [...]

The Model is built around the idea that in making decisions, we are trying to maximize our team's probability of winning the game, and the opponents are trying to minimize that probability. There are three types of situations, called states, in which the Model explicitly evaluates our probability of winning. The first type of state is when one team or the other has just gained possession. The second type is when a team has just scored a touchdown, but has not yet tried for the extra point (or points). The third type is when a team is about to kick off.
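The flavor of the Model reduces nicely to a toy expectimax computation. The sketch below is emphatically not the footballcommentary.com Model; it's a minimal dynamic program for one classic decision (trailing by 14 with two touchdowns still to come: kick or go for two?), with hypothetical round-number probabilities.

```python
# Toy illustration (NOT the footballcommentary.com model): all probabilities
# below are hypothetical round numbers chosen for the example.
P_TWO = 0.48   # two-point conversion success
P_KICK = 0.98  # extra-point success
P_OT = 0.5     # overtime win probability

def win_prob(deficit, tds_left):
    """Max win probability, assuming we score `tds_left` more touchdowns and
    the opponent doesn't score. State = score deficit before the next TD."""
    if tds_left == 0:
        if deficit < 0:   # we lead: win
            return 1.0
        if deficit == 0:  # tied: overtime
            return P_OT
        return 0.0        # still trailing: lose
    d = deficit - 6  # touchdown scored; now choose the conversion try
    kick = P_KICK * win_prob(d - 1, tds_left - 1) + (1 - P_KICK) * win_prob(d, tds_left - 1)
    two = P_TWO * win_prob(d - 2, tds_left - 1) + (1 - P_TWO) * win_prob(d, tds_left - 1)
    return max(kick, two)

print(round(win_prob(14, 2), 3))
```

With these (made-up) numbers the DP prefers going for two after the first touchdown: succeeding early lets you win in regulation, while failing early leaves time to adjust, which is exactly the kind of non-obvious conclusion such models are built to surface.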

'Tis the season.

McGill University is looking to hire in geometric computing and bioinformatics. I can think of at least three reasons for any new Ph.D. to apply:

* It's Canada ! Everyone gets funded by the government ! Need I say more ?

* It's in Montreal: where else can you get the feeling you're in a strange dream where you're in Paris but everyone speaks English ?

* You can "do research" in Barbados whenever you like, and definitely in the winter.

and most importantly,

* they have a job opening for people who do geometry for a living. How civilized is that ?

Us conference goers are an ornery lot. We remember the disastrous under-construction DC hotel for SODA 2001, the bizarrely smelling Miami hotel for FOCS 1997, and the strange San Francisco hotel with the low ceilings and no lounge space for SODA 1998. There's an urban legend that FOCS can never go back to Las Vegas because of some unspecified incidents that happened in 1993; the truth is probably far more boring.

Special opprobrium is reserved for the Hotel Pennsylvania in NYC, where FOCS 1999 was held. This was an old hotel; very, very old. It had the kind of rooms you'd describe as "charming" or "quaint" in publicity material. We all know what that means.

SODA 2009 is slated to be in NYC, assuming that votes are not mysteriously erased from Hal's PowerPoint slides. I am happy to announce that the Hotel Pennsylvania will in all likelihood NOT be one of the candidates for hosting the conference: it is being demolished to make way for a multi-story office complex (story via BB).

After many years of dormant recruiting policies, and after finally getting rid of some deadwood, AT&T Labs is now hiring across the board, in multiple areas; almost all of the mentioned areas have algorithmic angles to them. So now that your faculty job applications have all been sent, send your resume to AT&T.

And in case all the mergers and rebrandings confuse you, Stephen Colbert is here to help.

We know that ~~in the comparison model,~~ computing a planar convex hull takes Ω(n log n) time (or Ω(n log h), if you're picky) in any reasonable algebraic decision tree model. ~~The proof (at least for n log n) is a simple reduction from sorting.~~ There's a more involved proof by Yao that shows that even generating the set of vertices of the convex hull takes at least n log n time (i.e., ordering the vertices isn't the only hard part). Yao's proof is in the quadratic test model: tests on objects are allowed to be signs of quadratic polynomials, and such tests characterize all known convex hull algorithms. [Addendum: Ben-Or generalizes these results to higher-degree algebraic tests as well]

My question is: is it known whether estimating the size of the convex hull is also lower bounded by n log n (or n log h) in this model ? It seems like this should be true: I don't see an obvious way of using a size-estimation oracle to actually compute the hull (or its vertices) in linear time, but I don't know if there's a proof of this. Yao's proof doesn't appear to yield any direct insight into this question.

It's worth pointing out, in the light of recent results by Chan (and by Pătraşcu on point location), that we can solve the 2D convex hull problem in sub-(n log n) time: we pick our favorite machine model, sort the x-coordinates in sub-(n log n) time, and then run the Graham scan in linear time. Chan's paper achieves sub-(n log n) bounds for the 3D convex hull and related problems.
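For concreteness, here's a sketch of that sort-then-scan pipeline, using Andrew's monotone-chain variant of the Graham scan; once the points are sorted by x-coordinate (by whatever method the model allows), the scan itself is linear.

```python
def cross(o, a, b):
    """Signed area of the triangle (o, a, b); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone-chain hull: one sort, then two linear scans."""
    pts = sorted(set(points))  # the only super-linear step
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]  # hull vertices in counterclockwise order

print(convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]))
```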

I am sorry to say that the business meeting was so dull we couldn't even muster up the energy to argue about old controversies, let alone create new ones. Here are some of the highlights:

- Attendance was 276 (96 of them students), comparable to SODA 2004 in New Orleans.
- 796 distinct accepted authors, from 29+ countries.
- Milan Ružić was awarded the best student paper prize for "Making Deterministic Signatures Quickly".
- The domain gmail.com had a 44% acceptance rate, compared to yahoo.com's 14% acceptance rate. Yahoo's stock price fell 3% on hearing this news. Google's stock price fell 5% on worries that researchers were wasting their time writing papers.
- Manufactured abstract from a random selection of words in accepted paper abstracts:

We present arbitrary coinciding factors that are hierarchical and use predecessors as well as important jobs. We show there exist revenues that are treasure sales and independent for identical parametric desires.

- Hal Gabow makes some attempts to shake things up with suggestions about pandering to the discrete math community, prompting this from Ian Munro:

"The notion of a discrete math community is an interesting one and somewhat perverse"

- He tries to resurrect short papers, prompting this, from an unnamed PC member (but we know who you are :))

"Please don't exhume this dead horse just to kick the rotting bones around. Again"

- SODA 2008 is in San Francisco; Shang-Hua Teng is chair.
- Major discussion on locations for SODA 2009. Three candidates emerge (NYC/Puerto Rico/Las Vegas). After an attempt at vote-rigging that would put Diebold to shame, the organizers do a straight vote and NYC wins !!

David has been posting detailed impressions of specific papers. Do check them out.

He mentions the constant-degree expander result of Alon, Schwartz and Shapira. I was at that talk this morning, and it's a very neat result. Expander graphs have a long and fascinating history (see the Linial-Wigderson course notes for details), and I won't try to summarize it here. Basically, the goal for a long time has been to construct reasonably sparse, constant-degree expanders that also have "simple" proofs of expansion. Many of the constructions to date were either non-constructive (like the random expanders) or had very sophisticated proofs of expansion. Luca Trevisan had two very detailed posts that give an idea of how complicated some of these arguments can be.

The main contribution of the Alon/Schwartz/Shapira paper is the explicit construction of an expander that is both small (constant degree) and has a "simple" proof of expansion. I should mention that the celebrated zig-zag product of Reingold-Vadhan-Wigderson already does this. However, their proof (and basically all proofs of expansion) relies on spectral analysis of the graphs, using the relation between expansion and the gap between the first and second eigenvalues of the adjacency matrix of a graph.
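To make the spectral view concrete, here's a small illustrative sketch (the graphs and sizes are arbitrary, and the complete graph is of course not constant-degree): it computes the second-largest eigenvalue of the degree-normalized adjacency matrix for a cycle, a poor expander, and for a complete graph, an excellent one. A small second eigenvalue corresponds to a large spectral gap and hence good expansion.

```python
import numpy as np

def second_eigenvalue(adj):
    """Second-largest eigenvalue (in absolute value) of the degree-normalized
    adjacency matrix of a regular graph; smaller means better expansion."""
    d = adj.sum(axis=1)[0]               # common degree of the regular graph
    eig = np.linalg.eigvalsh(adj / d)    # matrix is symmetric, so eigvalsh is safe
    return sorted(abs(eig), reverse=True)[1]

n = 20
cycle = np.zeros((n, n))
for i in range(n):
    cycle[i, (i + 1) % n] = cycle[(i + 1) % n, i] = 1
complete = np.ones((n, n)) - np.eye(n)

print(second_eigenvalue(cycle))     # close to 1: the cycle expands poorly
print(second_eigenvalue(complete))  # tiny: K_n has a huge spectral gap
```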

This paper uses a graph product called a replacement product, and presents an elementary (i.e., combinatorial) proof that the replacement product of two expanders is also an expander. With that in hand, they construct three graphs, and with two replacement products, get a constant-degree expander.

The invited talk today was by Monika Henzinger, of Google and EPFL. This talk was the "applied" talk (SODA usually has one such); Monika talked about algorithmic success stories at Google, discussing PageRank (and HITS), detecting document duplicates, and load balancing queries on servers. Each of these topics deserves a post of its own (the work on detecting duplicates has some particularly elegant ideas), so I don't want to go into detail here.

There's a point worth making here. The success of PageRank and other methods for search is really a success story for algorithmic modelling, rather than algorithms per se.

What do I mean by this ? I mean that the key success of PageRank, to take an example, was the idea that pages could be viewed as nodes, and edges as transitions in a Markov chain, and that relevance (or PageRank) could be modelled as a kind of "return probability". Of course, once you do this, all your theoretical machinery kicks in, and you can prove bounds on convergence, design fast algorithms, etc. But the main step was the modelling step, where you took the raw data (web pages) and viewed them in a certain abstract framework. For those of you who don't remember this, the existing paradigm of search at the time was text-based IR, and Altavista was the main exemplar of this. What Google was proposing was a very different way of viewing documents and the problem of relevance.
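As a toy version of that modelling step, here's a minimal power-iteration sketch on a hypothetical four-page graph; the link structure and damping factor are made up for illustration, not taken from any real system.

```python
def pagerank(links, damping=0.85, iters=100):
    """Power iteration for PageRank: repeatedly redistribute rank mass along
    outlinks, mixed with a uniform 'teleport' term of weight (1 - damping)."""
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - damping) / n] * n
        for i, outs in enumerate(links):
            for j in outs:
                new[j] += damping * rank[i] / len(outs)
        rank = new
    return rank

# A hypothetical 4-page web: links[i] lists the pages that page i points to.
# Page 3 links out but nobody links back to it, so it should rank last.
links = [[1, 2], [2], [0], [2]]
ranks = pagerank(links)
print([round(r, 3) for r in ranks])
```

Page 2, which collects links from everyone else, ends up with the highest rank, which is the whole point: relevance falls out of the graph structure, not the page text.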

This is a common, and yet widely misunderstood, aspect of doing "applied" algorithms. You can define all kinds of problems, and write all kinds of papers proving results about these problems. But the mathematical tools developed for solving these problems will always take a backseat to the essentially scientific question of whether the problems and models fit the data correctly or not.

There are many domains where this is not true; cryptography is one domain where provable guarantees are not just nice, but are the crucial element of a secure system. But the success of algorithms in Web search comes not from knowing how to simulate a Markov chain efficiently, but from realizing that web documents are essentially nodes in a gigantic graph, and that the problem of ranking pages can be translated to a mathematical abstraction on graphs. As an aside, one of the things that the Kleinberg/Tardos textbook appears to do well is walk students through the process of problem abstraction, via the extended real-world exercises.

Understanding this aspect of data modelling changes the questions somewhat. The issue now is not "Is this the most efficient algorithm for the problem?", but rather, "Is this the right problem for the data?" The first question will become relevant only when the second one is answered satisfactorily, more akin to a scientific discovery of truth than a mathematical one.

Outtakes:

- (Thanks to Vijay Kumar) You could, at some point, buy a watch on Amazon.com for the heavily discounted (50% off) price of $499,999. The comments on the product page are hilarious.
- What's the title of Britney Spears' only SODA paper ? "Stable Marriage is Hard"

I didn't always know I wanted to do algorithms; in fact, I came to Stanford thinking I was more interested in AI. One of my most embarrassing moments in graduate school was when I went to talk to my advisor-to-be. He told me he worked in the design and analysis of algorithms, and I asked him if he did design or analysis.

(Note to graduate students everywhere; when someone tells you that no question is a stupid question, don't believe it)

But after attending Philippe Flajolet's talk today on "Analytic Combinatorics", and after hearing Luc Devroye's talk yesterday, I'm not so sure that my question was off the mark.

A bit of background. What we refer to as "the analysis of algorithms" is usually associated with Don Knuth and the Art of Computer Programming. It referred to the initial methods being developed to analyze structures like hash tables, search trees and the like. Most computer scientists have taken some kind of discrete math class, and have seen the Graham-Knuth-Patashnik "Concrete Mathematics" textbook, and much of its content (basic combinatorics, recurrence relations, generating functions, etc.) was used for early algorithm analysis.

These methods were quite precise. It's not uncommon to look back at papers from the 70s and see detailed constants in front of running times for algorithms; Bob Sedgewick mentioned one such calculation in his introduction to Flajolet's talk. People didn't use O() notation like a sledgehammer, the way we do today.

Over time, we became more comfortable with O() notation; algorithms became more sophisticated, and it became harder to figure out actual constants. It didn't seem to matter as much. After all, when you were coming up with elaborately complicated algorithms that ran in exotic times like O(n^(11/7)), it hardly seemed to matter what the constant was. This was, and has continued to be, the Age of Design.

But all along, with people like Flajolet, Knuth, Sedgewick, and many many others, the work of really analyzing algorithms continued on. ANALCO is an offshoot of this long effort; a way to attract people working on some of the considerably hard problems in this area, while creating some cross-fertilization with the design crowd at SODA. Of course, by no means do I claim that algorithm designers don't analyze; of course we do. But it's fair to say that the sophisticated analysis methods and sophisticated design methods do appear to have diverged.

Which brings us to the talk today. The rough theme of his talk was an overview of how the combinatorial problem of counting structures (trees, steps in a linear probe, etc.) can be transformed into a generating function, to which the methods of real and, more recently, complex analysis can then be applied. There's some pretty heavy mathematical machinery being thrown around here: we saw large deviation theory in yesterday's talk, for example, and there are things Flajolet talked about that I have only the barest inkling of.
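A miniature version of that pipeline is the textbook exercise of counting binary trees: the generating function T(z) = Σ t_n zⁿ satisfies T = 1 + zT² (a tree is empty, or a root with two subtrees), and its coefficients are the Catalan numbers t_n = C(2n, n)/(n + 1), from which the analytic machinery extracts asymptotics like t_n ~ 4ⁿ/(n^{3/2}√π). A quick sketch checking the counting recurrence against the closed form:

```python
from math import comb

def trees_by_recurrence(n_max):
    """Number of binary trees with n nodes, via the convolution that the
    functional equation T = 1 + z*T^2 encodes coefficient-by-coefficient."""
    t = [1]  # one empty tree
    for n in range(1, n_max + 1):
        t.append(sum(t[k] * t[n - 1 - k] for k in range(n)))
    return t

counts = trees_by_recurrence(10)
closed = [comb(2 * n, n) // (n + 1) for n in range(11)]  # Catalan numbers
print(counts)
print(counts == closed)
```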

Doing such analysis is hard; it's not as if we're suddenly going to abandon O() notation. But, as Piotr Indyk pointed out when we discussed this later, computers aren't getting any faster, and data is getting larger and larger, and it's more and more true that the actual constants in front of a running time matter, sometimes even more than the asymptotic bound. If more sophisticated analysis methods allow us to reveal algorithm running times more transparently, this also helps repair some of the "bad press" theoreticians can get with more applied folk.

So the analysis of algorithms takes on its original meaning again; there is a conference as well, now in its 13th year, and there's an upcoming book by Flajolet and Sedgewick that covers much of the mathematics Flajolet refers to in his talk. I looked at it briefly (it's 753 pages and counting!), and I hope that when it does come out, we learn more about how to use methods from analytic combinatorics to improve analysis techniques for even our run-of-the-mill algorithms.

Outtakes:

- I've quite enjoyed the talks I've attended thus far. I haven't written much about them, but that's mostly due to laziness on my part. I've been quite torn having to navigate the multiple parallel sessions; human cloning, where art thou ?
- [From a random research paper session] It's funny to see a speaker struggling with their desire to display EVERY SINGLE SLIDE that they made, when faced with a remorseless clock ticking down to zero. Word of advice: no one really cares if you go through all your slides, or even flip through them frantically while muttering very fast. They do care if you go over time and drag things ON and ON and ON.
- Contrary to the general confusion being spread around, the wireless DOES work and it IS free.
- I don't like hotels with two towers; especially when I'm in one and the conference is in the other, and ESPECIALLY when the only connection between the two towers is at the lobby.

Getting into New Orleans at 1 am because of "mechanical trouble" meant that I haven't been at my best so far. But I've already heard one amazing talk today.

Luc Devroye gave the ANALCO plenary lecture on "Weighted Heights of Random Trees", based on work with his students Erin McLeish and Nicolas Broutin. After having sat through many talks with titles like this, I generally approach them with great caution and with a clear escape route. But...

This was an amazing exposition of a topic that could have become dry, terse, and essentially incomprehensible within a slide or two. He had jokes (that were funny), a global plan for the material, enough technical material that I went away feeling like I'd learnt something, and intuition galore. And the work itself is very beautiful.

So what was it all about ? The problem is really quite simple to state. Suppose I give you a (random) weighted binary tree, where nodes attach to parents randomly, and edges may have weights chosen randomly. What is the maximum height of such a tree ?

The standard application for such a tool is in analyzing binary search trees. The height of such a tree controls the running time of an algorithm that needs to use it. And there's now a vast literature analyzing both the asymptotics of the height distribution (basically, it's sharply concentrated around 2 log n) and the specific constants (the maximum height of a random binary search tree is roughly 4.3 log n, and the minimum is around 0.37 log n).
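A quick (and entirely unofficial) simulation makes the concentration visible; note that at practical values of n the ratio height/ln n is still noticeably below the asymptotic maximum-height constant of about 4.31, since the convergence is slow.

```python
import math
import random

def random_bst_height(n, rng):
    """Height of an (unbalanced) binary search tree built by inserting a
    random permutation of 0..n-1; children are tracked in two dicts."""
    keys = list(range(n))
    rng.shuffle(keys)
    root, left, right = keys[0], {}, {}
    height = 0
    for key in keys[1:]:
        node, depth = root, 0
        while True:
            child = left if key < node else right
            if node in child:
                node, depth = child[node], depth + 1
            else:
                child[node] = key
                height = max(height, depth + 1)
                break
    return height

rng = random.Random(0)
n = 2000
heights = [random_bst_height(n, rng) for _ in range(20)]
avg = sum(heights) / len(heights)
print(avg / math.log(n))  # well below 4.31 at this modest n
```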

The "master goal" that Devroye described in his talk was this: Suppose I have a general way of attaching nodes to parents (that leads to a general distribution on subtree sizes), and a general way of attaching weights to edges (rather than being deterministically 1 for binary search trees). Such a general model captures the analysis of tries (trees on strings that are very important in text searching), geometric search structures like k-d trees, and even restricted preferential attachment models in social network analysis (think of the edges as hyperlinks, and the height of the tree as the diameter of a web tree).

Is there a generic theorem that can be applied to all of these different situations, so that you can plug in a set of distributions that describes your process, and out pops a bound on the height of your tree? It turns out that you can (with some technical conditions). The method uses two-dimensional large-deviation theory: can you estimate the probability of a sum of random variables being bounded by some function, while at the same time ensuring that some other sum of random variables (that might depend slightly on the first) is also bounded?

An example of a 1D large deviation result is of course a Chernoff bound. Devroye showed that a 2D large deviation bound for the height of such trees can be expressed in a similar form using the so-called Cramér exponent, something that will probably not be surprising to experts in large deviation theory. After that, the analysis for any tree process becomes a whole lot easier. You have to analyze the corresponding Cramér function for your distributions, and a bound (with a constant; no big-O nonsense here!) pops out.
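For concreteness, here is the shape of the one-dimensional statement (my paraphrase of the standard Cramér/Chernoff bound, not the notation from the talk), together with the kind of joint event the two-dimensional version controls:

```latex
% Classical 1D large deviations (Cram\'er/Chernoff): for i.i.d.
% $X_1,\dots,X_n$ with $S_n = \sum_i X_i$ and $a > \mathbb{E}[X_1]$,
\[
  \Pr[S_n \ge an] \le e^{-n \Lambda^*(a)},
  \qquad
  \Lambda^*(a) = \sup_{t \ge 0}\bigl(ta - \log \mathbb{E}[e^{tX_1}]\bigr),
\]
% where $\Lambda^*$ is the Cram\'er exponent: the Legendre--Fenchel
% transform of the log moment generating function. The 2D question
% asks for the decay rate of a joint event such as
\[
  \Pr[\,S_n \ge an,\; T_n \ge bn\,],
\]
% where $T_n$ is a second sum whose summands may be correlated with
% those of $S_n$.
```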

He also talked about a neat extension of this method to analyzing the "skinniness" of k-d tree decompositions, showing that for a kind of "relaxed" k-d tree construction, the skinniest cell can be extremely skinny (having a super-linear aspect ratio). It's the kind of result that I imagine would be very difficult to prove without such a useful general theorem.

"Whatever comes in sufficiently large quantities commands the general admiration." -- Trurl the Constructor, from Stanislaw Lem's Cyberiad.

I've been reading Malcolm Gladwell's masterful article on the Enron scandal, and he frames it with the device of 'puzzles' vs 'mysteries':

The national-security expert Gregory Treverton has famously made a distinction between puzzles and mysteries. Osama bin Laden’s whereabouts are a puzzle. We can’t find him because we don’t have enough information. The key to the puzzle will probably come from someone close to bin Laden, and until we can find that source bin Laden will remain at large.

The problem of what would happen in Iraq after the toppling of Saddam Hussein was, by contrast, a mystery. It wasn’t a question that had a simple, factual answer. Mysteries require judgments and the assessment of uncertainty, and the hard part is not that we have too little information but that we have too much. The C.I.A. had a position on what a post-invasion Iraq would look like, and so did the Pentagon and the State Department and Colin Powell and Dick Cheney and any number of political scientists and journalists and think-tank fellows. For that matter, so did every cabdriver in Baghdad. [....]

If things go wrong with a puzzle, identifying the culprit is easy: it’s the person who withheld information. Mysteries, though, are a lot murkier: sometimes the information we’ve been given is inadequate, and sometimes we aren’t very smart about making sense of what we’ve been given, and sometimes the question itself cannot be answered. Puzzles come to satisfying conclusions. Mysteries often don’t.

There is a fundamental problem that comes up when you start messing with "data". Our training in algorithms makes us instinctively define a "problem" when working with data, or any kind of applied domain. Many of the problems in clustering, like k-center, k-median, k-means, or what-have-you, are attempts to structure and organize a domain so we can apply precise mathematical tools.

In a sense, we treat these problems like puzzles to be solved. The game is then to find the best solution, the fastest, the most accurate; but the structure of the puzzle has been set. We can change the game (and we often do), but once again, the goal is to crack the puzzle.

But when you get down and dirty with data, you start seeing the problems that Gladwell describes. If your goal is to "understand" the data, then more is not necessarily better; it can cause more confusion, and what you need are interpretative skills, rather than number-crunching or even problem-solving skills.

This is what makes data mining so hard and exasperating, and yet so important. The need is clearly there, and there are mysteries to mine. But we've been attacking data mining problems as puzzles, and realizing fairly quickly that solving a puzzle doesn't reveal the mystery of the data.

I've often likened data mining research to an ooze; it's thin and spreads horizontally, without too much depth. But I think that's because the puzzles that we solve are of limited range, and not terribly deep. What we seem to need more are interpretative frames rather than algorithmic frames; frames that tell us about invariances in the data, rather than about quirks of representations.


(WARNING: personal information ahead. If you prefer to think of the Geomblog as written by robotic monkeys pounding on millions of keyboards, read no further)

One of the reasons the Geomblog has been silent these past few weeks is that I've been busy moving, and falling sick, and unpacking, and unpacking, and unpacking, and...

Now, where was I?

Oh yes, moving. After many years of cloistered comfort at AT&T, I've decided to take the plunge into the exciting and dangerous waters of academia, at the University of Utah (30, count 'em, 30 minutes from the best powder skiing imaginable).

Why the move? Many people have asked me this, and the answer is actually simple: because I finally wanted to. AT&T has been a wonderful place for me to work (and they're hiring next year, so get those applications ready), but I realized that the kinds of things I wanted to do (teach, initiate my own research programs, guide students, and participate in the academic conversation in general) were better done at this point in a university setting.

It's not a decision I made easily. It is said that the real value of a workplace is in the colleagues you have, and from that point of view, leaving AT&T has been hard. Leaving for a real job after completing a Ph.D. felt like a natural rite of passage, much as leaving India for grad school did. But leaving the labs was a purely elective decision, and as such it makes the transition a little more jarring.

And now here I am, in Salt Lake City (technically, I'm in the Cincinnati airport waiting for a much delayed flight to New Orleans, but I digress...), preparing for my geometry class, working on a proposal, and doing my research. On the one hand, I have the basic day-to-day business of research more or less under control, and work and collaborations go on seamlessly. On the other hand, I often feel like a fresh Ph.D. at his first job, managing myriad things that seem new and foreign. It's a strange feeling.

But I'm genuinely excited to be teaching, and am looking forward to interacting with students; something that I sorely missed at AT&T, except for the occasional summer. It will be an exciting adventure.
