Friday, July 06, 2007

NSF RSS Feeds

The NSF now has RSS feeds for important sections of their website. I don't know how new this is, but it's great. Some of the useful feeds:
The NSF publications link has two new items that might be of interest. The first one is about a change in formatting for proposals. If I read it correctly, Times can no longer be used for writing proposals; instead, Computer Modern (at 10pt or larger) is the approved font for TeX users.
a. Use of only the approved typefaces identified below, a black font color, and a font size of 10 points or larger must be used:
  • For Windows users: Arial, Helvetica, Palatino Linotype, or Georgia
  • For Macintosh users: Arial, Helvetica, Palatino, or Georgia
  • For TeX users: Computer Modern

A Symbol font may be used to insert Greek letters or special characters; however, the font size requirement still applies;
It's a little confusing why Windows and Mac are of the same "type" as TeX.

The second document is a statistical overview of immigrant scientists and engineers in the US. Lots of interesting statistics to muse over, including this: fully 15% of all immigrant scientists and engineers in the US are Indian-born, followed by nearly 10% Chinese-born. Compare this to 19% for all of Europe.

SODA 2008 Server Issues ?

Piotr asks if anyone else has been having problems with the SODA 2008 server shortly before 5pm ET (i.e around 90 minutes ago).

Update (7/7/07): At least one person thinks the server was on (and accepting submissions) for a few hours after the deadline. If true, this would be quite interesting: I've never heard of this happening before (though if the server did crash, staying open past the deadline is a rational response, of course).

P.S. Maybe the server was out watching fireworks.

Thursday, July 05, 2007

Here's your algorithmic lens

In studies of breast cancer cells, [senior investigator Dr. Bert] O'Malley and his colleagues showed how the clock works. Using steroid receptor coactivator-3 (SRC-3), they demonstrated that activation requires addition of a phosphate molecule to the protein at one spot and addition of an ubiquitin molecule at another point. Each time the message of the gene is transcribed into a protein, another ubiquitin molecule is chained on. Five ubiquitins in the chain and the protein is automatically destroyed.
A counter on a separate work tape: neat !

Main article at sciencedaily; link via Science and Reason.

Wednesday, July 04, 2007

FOCS 07 list out

The list is here. As I was instructed by my source, let the annual "roasting of the PC members" begin !

63 papers were accepted, and on a first look there appears to be a nice mix of topics: it doesn't seem as if any one area stands out. Not many papers are online though, from my cursory random sample, so any informed commenting on the papers will have to wait. People who know more about any of the papers are free to comment (even if you're the author!). Does anyone know the number of submissions this year ? I heard it was quite high.

What's an advanced algorithm ?

And he's back !!!

After a rather refreshing (if overly hot) vacation in India, I'm back in town, forced to face reality, an impending move, and the beginning of the semester. Yep, the summer is drawing to a close, and I haven't even stuffed my face with hot dogs yet, let alone 59.5 of them (is there something slightly deranged about ESPN covering "competitive eating" as a sport?).

Anyhow, I digress. The burning question of the day is this:
What is an advanced algorithm ?
It might not be your burning question of the day, but it certainly is mine, because I am the proud teacher of the graduate algorithms class this fall.

Some explanation is in order. At most schools, undergrads in CS take some kind of algorithms class, which for the sake of convenience we'll call the CLRS class (which is not to say that it can't be a KT class, or a DPV class, or a GT class, or....). At any rate, this is the first substantial algorithms class that the students take, and often it's the last one.

Now when you get into grad school, you often have to complete some kind of algorithms breadth requirement. In some schools, you can do this via an exam, and in others, you have to take a class: namely, a graduate algorithms class. Note also that some students taking this class might not have taken an undergrad algorithms class (for example, if they were math/physics majors).

What does one teach in such a class ? There are a number of conflicting goals here:
  • Advertising: attract new students to work with you via the class. Obviously, you'd like to test them out first a little (this is not the only way of doing this)
  • Pedagogy: there are lots of things that someone might need even if they don't do algorithms research: topics like approximations and randomization are not usually well covered in an intro algorithms class
  • Retaining mindshare: you don't want to teach graduate algorithms as a CLRS class because you'll bore all the people who took such a class. On the other hand, you zoom too far ahead, and you lose the people who come to CS from different backgrounds. And the Lord knows we're desperate nowadays for people to come to CS from different backgrounds (more on this in a later post).
This is obviously not a new problem, and there are numerous models for how to teach a graduate algorithms course, all with their own pros and cons. Some of the extreme points that define the spectrum of solutions are:
  • Teach to the top: dive straight into advanced topics; anyone who needs to catch up does so on their own time. This is probably the most appealing solution from a lecturer perspective, because it covers more interesting topics. It's also most likely to excite the few students who care. Be prepared for bumpy student evals, especially from disgruntled folks who just need a grade so that they can forget all about algorithms for the rest of their Ph.D
  • Teach to the bottom: Do a CLRS-redux, with maybe a few advanced topics thrown in at the end (NP-hardness might count as advanced). You'll bore the good ones, but you'll bring the entire class along.
A 'split-the-difference' solution of covering basic material quickly, and then jumping ahead into advanced topics, is also a possibility. Like most solutions of this kind though, it runs the risk of satisfying nobody while trying to satisfy everyone.

All of this elides the most important point though: what constitutes advanced material ? I've seen graduate students struggle with NP-hardness, but this is covered in almost any undergraduate algorithms class (or should be). What about general technique classes like approximation algorithms and randomization, both of which merit courses in their own right and can only be superficially covered in any algorithms class (the same is true for geometry) ? Advanced data structures can span an entire course, and some of them are quite entertaining. Topics like FFTs, number-theoretic algorithms, (string) pattern matching, and flows also seem to fit into the purview of advanced topics, though they don't cohere as well with each other.

It seems like an advanced algorithms class (or the advanced portion of a graduate algorithms class) is a strange and ill-defined beast, defined at the mercy of the instructor (i.e ME !). This problem occurs in computational geometry as well; to use graph-theoretic parlance, the tree of topics fans out very quickly after a short path from the root, rather than having a long and well-defined trunk.

Friday, June 08, 2007

My Biased Coin: A New Blog !!

It's been a while since I've been able to announce a new CS blog. Please point your bookmarks/RSS newsreaders/browsers to Michael Mitzenmacher's new blog, My Biased Coin. Michael has occasionally posted over here, and has often had comments and suggestions for me. When he's not brokering peace agreements between power law distributions and log normal distributions, Michael corrupts young grad students at Harvard for a living, filling their minds with all kinds of random bits that are occasionally error corrected.

SoCG 2007: CUP takes over all of geometry...

The good news from today is that the Demaine-O'Rourke book on folding and linkages is finally out. Erik had a copy of the book at the conference, and it's a handsome volume that will hopefully soon reside on my bookshelf.

The authors have a web page associated with the book, and it has a number of applets that go along with constructions from the text.

The publisher of this book is Cambridge University Press, which is great, because I love their fonts :). CUP clearly has a plan for domination of the entire computational geometry catalog: along with the folding book, they are publishing:
Phew ! Looks like CUP will be making a tidy sum off of me. Now if only bloggers got review copies (hint hint hint).

Wednesday, June 06, 2007

SoCG 2007: Approximate Clustering

I was listening to a couple of talks that improve known bounds for various kinds of approximate clustering in high dimensions, and I got to thinking.

One of the revolutions in geometry over the last 10 years has been the development of nontrivial tools for dealing with approximations in high dimensions. This is of course necessitated by the curse of dimensionality, and the hardness of most high-D data analysis problems (most exact solutions are exponential in the dimension). So rather than computing the optimal solution to some problem on n points in d dimensions, we compute a (1+e)-approximation to the optimal solution.

One problem this creates is that every algorithm is now described by a complicated expression involving three parameters (n, d, e). Some algorithms are exponential in 1/e, but polynomial in d. Some are poly in 1/e, and exponential in d. The exponent could have a base of n, or 2, or even d, or 1/e.

In short, it's an unholy mess of strictly incomparable methods that trade off different parameters against each other.

This makes life for me, the "user" of approximation technology, rather difficult. What I'd really like to understand are the gadgets that transform one kind of bound to another (and there are many such gadgets: discretization, enumeration, random sampling, etc). But it's hard to gather these from the papers directly: these gadgets (the really useful tools) are buried deep inside lemmas, and inside people's heads.

What I'd like to see is some kind of "taxonomization" or classification of the different "tricks of the trade" in high-dimensional geometric approximation, with some sense of which techniques apply when, and why. In fact, I like this idea so much I might even try to run a seminar on it..

SoCG 2007: "Magic Hausdorff Lions Have Happy Endings"

(It's at the point now where people complain if the business meeting post is NOT up 5 minutes after the actual meeting. Sigh...)

Summary for people in a hurry:
  • Attendance this year was a record low
  • Next PC Chair: Monique Teillaud
  • SoCG 2008 will be at the University of Maryland (rather than Virginia: long story)
  • SoCG 2009 (by a 4-1 margin) will be in Aarhus, Denmark. BEER !!!!!!!!!
And now for the details:

Local News:
Otfried Cheong was in charge of local arrangements. I have to say that the hotel we are at is quite amazing: it's on Bomun Lake, and right outside the hotel is this lake-side path that leads to all the restaurants one can imagine, a tourist village with acres of motorized scooters (!), and some not-disgusting coffee. The hotel facilities themselves are excellent (again, modulo the coffee), and much better, at a cheaper price, than a comparable US hotel. And as I sit here writing this, I can hear the rehearsals for our Korean classical music concert tonight.

Unfortunately, there was no one here to enjoy it ! Attendance is down tremendously (111 people), a factor exacerbated by FCRC, being held in a day or two in the exotic locale of San Diego (fish tacos ! blech!), and other related conferences. The relative remoteness of Gyeongju played no small role in this, I imagine.

There was much discussion about whether we should continue to be sponsored by ACM or not (the main issue is cost, and logistics when organizing in a non-US location); no resolution on this point.

PC Stuff: (obvious caveat: I was on the PC this year)
Jeff Erickson presented the usual stats (45/139 papers accepted, out of 286 submitting authors, and with 115+ external reviews). It turns out that the formula for success at SoCG this year involves
submitting seven papers, with 4 co-authors, from an email address in Pakistan.
The title of this post was composed from words in accepted papers.

Lots of other stats, nothing particularly enlightening.

The Near Future.
Monique Teillaud will chair the SoCG 2008 program committee. The committee has 17 people on it, an unusually large number. She's hoping to get 4 reviews per paper, so maybe that's why.

David Mount explained why we had to move from Alexandria, Virginia to UMD for SoCG 2008. Apparently the main hotel we would have used was bought out and is now much more expensive. Boo hoo. On the bright side, UMD will be a lot cheaper in terms of accommodation.

SoCG 2009.

We had two competing bids for SoCG 2009. Aarhus (Beer ! Lego ! Beer ! City life ! Beer!) and Portland (Roses ! Beer ! More Roses ! Powell's Books ! Microbreweries!).

No, we are not a bunch of boozing alcoholics.

Lars did a bang up job with his Aarhus presentation. John Hershberger did a great presentation on Portland (a great city to visit, btw), but it was no contest. By a 4-1 margin, Aarhus it is !

Open Discussion.
It was getting rather late by the time we got to open discussions. Guenter Rote initiated the question, "Should we have a rebuttal process for reviewing papers?", apparently in response to some aggrieved set of rejected authors.

We had a heated discussion on this point, and although we ultimately went nowhere, it was felt that we should continue things electronically (i.e on blogs). I'll post at greater length on this issue later, but to summarize the main points pro and con:

Pro:
  • Creates an improved sense of fairness
  • A safety net for potential errors
Cons:
  • Time waste for reviewers
  • Time waste for authors (!) (according to Jack Snoeyink), but I am not convinced that this is a valid argument
  • Won't make a significant difference
I'm personally dubious whether there's any measurable benefit to a rebuttal process, but I'm mildly in favor primarily because of the "placebo effect" it has.

SoCG 2007: Geometric Views of Learning

I've been fairly remiss in my SoCG blogging; I'll blame it on having session chair duties, and not wanting to lug my laptop around.

The invited talk was by Partha Niyogi from the University of Chicago on 'A Geometric Perspective on Machine Learning'. You may remember his work from my earlier post on the estimation of the surface area of a convex body (read the comments). More importantly, he is part of the group that developed a method known as Laplacian Eigenmaps for learning a manifold from a collection of data.

Manifold learning is a new set of problems in machine learning that has interesting connections to algorithms and geometry. The basic problem is as follows. Given a collection of (unlabelled) data inhabiting some high dimensional space, can you determine whether they actually lie on some lower dimensional manifold in this space ?

Why do we care ? The problem with any data analysis problem in high dimensions is the rather poetically named 'curse of dimensionality', which basically says that any interesting data analysis algorithm runs in time exponential in the dimension of the data. For data that lives in 100s of dimensions, this is rather bad news.

However, "data that lives in 100 dimensions" is really an artifact of the way we represent data, slapping on dimensions willy-nilly for every attribute that might be relevant. What one often expects is that data doesn't really lie in 100 dimensions, but in some lower dimensional manifold of it. A beautiful example of this was given by Partha in his talk: he described the problem of inferring a sound wave signal generated at one of a tube by listening in at the other hand. By Fourier analysis, you can think of both signals as living in an infinite dimensional space, but suppose we now vary the tube length, for a fixed input signal. Then the output signal varies smoothly along a curve (i.e a 1-d manifold) in this infinite dimensional space.

"So what ?" one might ask. The problem is that the standard method of doing data analysis is to translate the problem of interest into some property of the distances between points in the high dimensional space. If the data really lies on some kind of curved surface, then the "distance in ambient space" does not reflect the true structure of the data. What we really need is "distance along the manifold", which we could do if we could reconstruct the manifold !

The key idea of the Laplacian Eigenmaps work is this: if you set up an appropriate weighted graph on the data points (where each edge has a weight that is exponentially related to the distance between the points) and compute the Laplacian of this graph, you get an approximation that converges (as the data size increases) to the Laplacian of the underlying manifold !! This assumes that the data was sampled uniformly (or mostly uniformly) from the manifold. Moreover, the convergence happens at a rate that depends only on the dimension of the manifold, rather than the dimension of the ambient space.
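To make the construction concrete, here's a minimal sketch in Python (the function name, the heat-kernel width sigma, the dense all-pairs graph, and the use of the unnormalized Laplacian are all simplifying choices of mine; the actual method uses a nearest-neighbour graph and a generalized eigenproblem):

    import numpy as np

    def laplacian_eigenmap(X, sigma=1.0, out_dim=2):
        """Toy Laplacian Eigenmaps sketch: X is an (n, d) array of sample points."""
        # Gaussian (heat-kernel) weights on all pairs of points
        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-sq_dists / (2 * sigma ** 2))
        np.fill_diagonal(W, 0.0)
        # Unnormalized graph Laplacian L = D - W
        L = np.diag(W.sum(axis=1)) - W
        # Eigenvectors for the smallest eigenvalues give the embedding;
        # the first (constant) eigenvector is discarded.
        vals, vecs = np.linalg.eigh(L)
        return vecs[:, 1:out_dim + 1]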

There are many ramifications of this idea, that connect to shape modelling, homology, and even volume estimation. But it reinforces the idea of the Laplacian as a key geometric construct that can be modelled combinatorially in a consistent way.

Monday, May 21, 2007

Creating a new publication model

I like to gripe about our current conference/journal system, and how ungainly a method it is for publishing and disseminating quality work. Not surprisingly, other areas of computer science appear to have similar issues as well.

Hal Daume, my colleague at the U. of U., has sparked off a rather interesting discussion with his post on NLPers about the need for a new open-access journal for NLP research. The original motivation for his post is partly NLP-specific, but the evolving discussion over at his wiki provides a closeup view of the process of designing a new publication model that sits somewhere between traditional journals and traditional conferences.

It's too early to tell what will come of this, but if their model gains some traction, it might provide a good exemplar for future reform in other disciplines in computer science.

Sunday, May 20, 2007

Numb3rs wins NSF award

I used to blog every episode of Numb3rs in its first season, and got bored of doing this soon after. However, I've continued to watch the show; I don't have the heart to avoid watching a show that involves mathematics (and more often than not, computer science). It's been surprisingly decent in its three years, as long as you're willing to concede all kinds of miraculous data mining feats performed in the span of a few hours.

However, I always wondered if it could make it on a mainstream network for very long. After all, ratings are king, and how long could such a tech-heavy series last ? Moreover, the crime procedural aspects of the show are hardly unique, and with the monster success of CSI, ever more scientifically oriented crime series were clogging up the airwaves.

Thus, I was pleasantly surprised to hear that Numb3rs is now the "most-watched show" on Friday nights, clocking in at around 12 million viewers. It's been picked up for a fourth season, and ended season three with a dramatic moment, as well as interesting existential angst for the main characters. In fact, I suspect one of the reasons Numb3rs has done so well is that they've done a fairly realistic job of portraying some of the tensions inherent in the life of the researcher (even if the mathematics itself, and the way the characters talk about it, is still somewhat cringeworthy).

The NSF (or technically, the NSB, the board that oversees the NSF) has recognized the success of Numb3rs as well. The show and its creators were just given a public service award for
...extraordinary contributions to increase public understanding of science. Recipients are chosen for their contributions to public service in areas such as: increasing the public's understanding of the scientific process and its communication; contributing to the development of broad science and engineering policy; promoting the engagement of scientists and engineers in public outreach; and fostering awareness of science and technology among broad segments of the population.

Friday, May 18, 2007

8th carnival of mathematics

It's back-to-school night at the carnival of mathematics. Time to revisit, reflect, and ponder on things we think we already know.

Estraven at Proving Theorems wishes that mathematicians today spent more time revisiting, and bringing up to date, older and more inaccessible works of mathematics. This is discussed in the context of a re-publication of some of Grothendieck's early work on schemes.

There's a lovely quote at the end of the article, one that I cannot help but share:
I could of course keep writing paper after paper. But there is so much beautiful mathematics out there that I don't know yet, and I want to read at least some of it before I die
I remember first sensing beauty when I first learnt Burnside's Lemma: Leadhyena Inrandomtan at Viviomancy has a detailed post on this simple, and yet elegant result in combinatorics.

Burnside's Lemma ultimately deals with counting symmetries: Ian Stewart has a new book out on the history of symmetry titled "Why Beauty is Truth: A History of Symmetry". In a post at the Britannica blogs, he describes why he decided to write this book.

In an ongoing series, John Armstrong continues his unapologetic look at coloring knots. I'd say any topic that has phrases like 'involutory quandle' is worth studying. Of course, the original tying-yourself-in-knots proof was the diagonalization method of Cantor, which Mark Chu-Carroll is kind enough to explain to us, while he takes a timeout from beating errant creationists with sticks of logic. Mark notes that he's been going back to some old books on set theory and is thoroughly enjoying them, another example of the theme of the day :)

Godel's incompleteness theorem is another of the tying-yourself-in-knots variety, and this year's Godel Prize winner is a result in the same vein, showing why certain "natural proof structures" cannot be used to prove P != NP. Siva and Bill Gasarch have the scoop.

Walt at Ars Mathematica catches up to the Bulletin of the AMS, highlighting some nice expository articles that should make Estraven happy.

It's time to get educational now. Dave Marain helps high school math teachers everywhere with lesson plans for teaching quadratic systems and tangents without calculus. Mikael Johansson shows bravery far above and beyond the call of duty, explaining fundamental groupoids to a group of 9th graders. Heck, I don't even understand what a groupoid is, and I read John Baez all the time !

jd2718 shows his students how Fermat is connected to them. It would be remiss of me not to give a shout out to the Mathematics Genealogy Project. We've all heard the horror stories about the students who write "dy/dx = y/x"; Dan Greene at The Exponential Curve talks of students who write "log 5/log 2 = 5/2" and wonders if we need to change the notation for log to something more "mathematical".

I am delighted, and surprised, to see the quality of reports produced in the undergraduate
automata class that Leo Kontorovich is TAing. Read more about what his students did here. Andy Drucker weaves tales (tails?) of dogs eating steak in mineshafts, and cats climbing GMO-compromised trees, to explain some of the subtler details of computability theory.

We're almost ready to wrap up this edition of the carnival. Chew on some simple brainteasers that will exercise regions of your brain that you may not have known existed ! And review the history of calculating devices, with some speculation on what future calculators might look like.

Finally, no carnival of mathematics would be complete without a cartoon by XKCD:



And that's all, folks ! Keep those posts coming. The next Carnival of Mathematics is in two weeks, hosted by JD2718.

Tuesday, May 15, 2007

This year's Godel Prize

The 2007 Godel Prize goes to Alexander Razborov and Steven Rudich for their paper, "Natural Proofs" (JCSS, Vol. 55, No. 1, 1997).

I'd like to think that Godel himself would have approved of this particular award, given the mind-bending nature of the Razborov-Rudich result. To use Scott Aaronson's loose formulation, their result can be summarized as
The reason P != NP is so difficult to prove is that P != NP
Computational Complexity (is it the Lance-blog, or the Bill-blog, or a Lance-post in the Bill-blog, or a Lance-post in the Bill-blog-that-used-to-be-Lance's-blog?), and In Theory have detailed explanations of the RR result, and do a far better job than I can do here.

To put it in perspective, the Razborov-Rudich result is a "meta-result": it says that no proofs of a certain kind (the so-called natural proofs) can be used to prove P != NP, unless other hardness assumptions break to begin with. Another example of a meta-result is the (older) result by Baker, Gill and Solovay that the P=NP question does not relativize (i.e cannot be proved using methods that hold under the application of oracles).

Razborov-Rudich really put a shaft in methods for separating P and NP (see this pessimistic post by Lance). If only all the P vs NP cranks actually tried examining proofs like this before wasting all those valuable bits on comp.theory ;)

(HT: [LB, UB] and Process Algebra Diary).

Friday, May 11, 2007

Reminder: Deadline for the 8th Carnival of Mathematics in a week

Post your submissions here, or email me.

Probabilistically checkable presentations

My ex-advisor, long time VC, and the M in ALMSS, Rajeev Motwani, will be one of the experts assisting to select 20 startups to fund as part of the TechCrunch20 conference.

Maybe he'll sample a few slides from each presentation....

Wednesday, May 09, 2007

Tracking changes in LaTeX

A rather long time ago, I had asked how to track changes in Emacs/LaTeX, along the lines of what Word can do. There were many suggestions at the time, mostly involving CVS, and one person had even mentioned 'texdiff'.

I finally got around to trying latexdiff today, and it works quite well. Given an old and a new LaTeX document, it outputs a "diff" TeX file that, when compiled, marks changes in color: blue for insertions, red for deletions. People I know use this all the time for managing changes with collaborators.

The program itself can be run as a standalone Perl script, so it's quite portable. I plan to use it more often.

Tuesday, May 08, 2007

Faure sequences and quasi-Monte Carlo methods

Fix a number n, and consider the following recursive procedure to construct a permutation $\sigma(n)$ of the numbers 0..n-1. We'll view the permutation as a sequence, so the permutation (0 2 1) maps 0 to 0, 1 to 2 and 2 to 1. When convenient, we'll also treat the sequences as vectors, so we can do arithmetic on them.
  1. If n is even, then $\sigma(n) = s_1 \circ s_2$, where $s_1 = 2\sigma(n/2), s_2 = 2\sigma(n/2)+1$.

    For example, $\sigma(2) = (0 1)$. $\sigma(4) = (0 2) \circ (1 3) = (0 2 1 3)$
  2. If n is odd, then $\sigma(n) = s_1 \circ (n-1)/2 \circ s_2$, where $s_1$ and $s_2$ are constructed by taking the sequence for $\sigma(n-1)$, incrementing all elements that are at least $(n-1)/2$, and then splitting the sequence into two parts of equal length.

    For example, $\sigma(4) = (0 2 1 3)$. Let n = 5. Then incrementing all elements at least 2 gives us $(0 3 1 4)$. Splitting and inserting (2) gives us $\sigma(5) = (0 3 2 1 4)$
Here's the question: given n and j, can we return the jth entry of $\sigma(n)$ using fewer than $\log(n)$ operations ? Note that writing down $j$ takes $\log n$ bits already, so the question is couched in terms of operations. For reasons explained later, an amortized bound is not useful (i.e we can't compute the entire sequence first and then pick off elements in constant time)
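(For reference, here's the construction transcribed directly into Python; the base case $\sigma(1) = (0)$ is my assumption, and the code builds the entire sequence in linear time, which is exactly what the question wants to avoid.)

    def faure_perm(n):
        """Build the full permutation sigma(n) using the two rules above."""
        if n == 1:
            return [0]                      # assumed base case
        if n % 2 == 0:
            half = faure_perm(n // 2)
            return [2 * x for x in half] + [2 * x + 1 for x in half]
        prev = faure_perm(n - 1)            # n odd: start from sigma(n-1)
        k = (n - 1) // 2
        bumped = [x + 1 if x >= k else x for x in prev]
        return bumped[:k] + [k] + bumped[k:]

    # faure_perm(4) == [0, 2, 1, 3] and faure_perm(5) == [0, 3, 2, 1, 4], matching the examples.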

For the backstory of where this sequence, known as a Faure sequence, comes from, read on...

One of the applications of geometric discrepancy is in what are called quasi-Monte Carlo methods. If you want to estimate the integral of some complicated function over a domain, (for simulation purposes for example), one way of doing this is to sample at random from the domain, and use the function evals at the sampled points to give an estimate of the integral. Typically, the function is expensive to evaluate as well, so you don't want to pick too many points.

Of course in practice, you're picking random points using a random number generator that is itself based on some pseudorandom generation process. An important area of study, called quasi-Monte Carlo methods, asks if this middleman can be eliminated. In other words, is it possible to generate a deterministic set of points that suffices to approximate the integral ? This is of course the question of finding a low discrepancy set, something that we've used in computational geometry to derandomize geometric algorithms (especially those based on $\epsilon$-net constructions).

There are many ways of constructing good sequences, and a good overview can be found in Chazelle's discrepancy book (read it free here). One of the more well known is due to Halton, and works by what is called radical inversion: given a base p, write down the numbers from 1 to n in base p, and then reverse the digits, and map back to a number less than one. For example, using a base of 2, we can write 6 as 110, which reversed becomes 011, or after adjoining a decimal point, 0.011 = 3/8. This specific sequence, using base 2, is called a van der Corput sequence and is also used to construct a Hammersley point set. For sampling in d-dimensional space, the trick is to let the jth coordinate of the ith point be the ith entry in a Halton sequence using some base $p_j$ (usually a prime).
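Radical inversion is only a few lines of code; here's a sketch to pin down the example above (the function names and the particular bases in halton_point are illustrative choices of mine):

    def radical_inverse(i, p):
        """Reverse the base-p digits of i about the 'decimal' point."""
        x, scale = 0.0, 1.0 / p
        while i > 0:
            x += (i % p) * scale    # least significant digit lands right after the point
            i //= p
            scale /= p
        return x

    def halton_point(i, primes=(2, 3, 5)):
        """The ith point of a (here 3-dimensional) Halton sequence."""
        return [radical_inverse(i, p) for p in primes]

    # radical_inverse(6, 2) == 0.375: 6 = 110 in base 2, reversed to 0.011 = 3/8, as in the text.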

It can be shown that these sequences have low discrepancy; however, they can have "aliasing" problems, or repeated patterns, if the primes are not large enough. One way around this problem is by observing that if we ignore the decimal point, all we're really doing is constructing a permutation of the numbers from 0 to $p^d-1$ (for a d-digit number in base p). Thus, if we could scramble the permutation further, this might allow us to preserve the discrepancy properties without the annoying aliasing effects. The process described above is a well known scramble called the Faure sequence. It maintains the desired properties of the sequence, and is quite popular in the quasi-MC community.

Note that if we were to preprocess all needed sequences, we'd need n*d space, where n is the number of samples, and d is the dimension of the space. This is not desirable for large dimensional sampling problems, and hence the question about the direct evaluation of coordinates.

Friday, May 04, 2007

Silvio Micali elected to the NAS

The National Academy of Sciences just announced 72 new members, among them Silvio Micali from MIT. For those who may not know, he's famous for his work in pseudorandomness, cryptography, and interactive proof systems. Among his many awards are the Godel Prize in 1993 and being named a Fellow of the AAAS in 2003.

The 7th Carnival of Mathematics is out

Read all about it.

I'll be hosting the next one, so if you have a submission you'd like to be publicized, send it to sureshv REMOVE at THIS gmail AND dot THIS com. Or, you can submit it at the master blog carnival site.

Thursday, May 03, 2007

Hazardous Jobs #523: Dean of Undergraduate Studies

From Bob Sloan's rumination on the joys of being an NSF program manager:
When I left for NSF, I was Director of Undergraduate Studies for a department that had well over 600 majors at the height of the dot com boom. The previous two holders of that position had only ended their time in it by leaving the country in one case and dying in the other—both far more extreme than going to NSF for a couple of years

The joy of AMS Notices.

For the past few months, I've been reading articles from the Notices of the AMS, something I never did earlier. This is a direct consequence of reading In Theory and Ars Mathematica, both of which will regularly highlight articles from the latest issue of the Notices.

I'm amazed at how many of the articles I find interesting enough to peruse, or even print out to read offline. Most of these are technical articles, on specific topics in mathematics. Quite often, there will be discussions on topics tangentially related to theoryCS or even things that I'm interested in. For example, the May issue of the Notices has an article titled, "How to generate random matrices from the classical compact groups". Anyone who messes around in geometry and graphics will immediately see the connection to the standard problem of generating a random rotation, and in fact there's a very interesting explanation of the group theory underlying the standard procedure for generating a random orthogonal matrix.
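As far as I understand it, the standard procedure amounts to "fill a matrix with Gaussians, orthogonalize, and fix the signs"; here's a sketch (using QR in place of explicit Gram-Schmidt, with the sign correction the article is careful about; the function name is mine):

    import numpy as np

    def random_orthogonal(n, rng=np.random.default_rng()):
        """Sample an n-by-n orthogonal matrix, approximately Haar-distributed."""
        z = rng.standard_normal((n, n))    # i.i.d. Gaussian entries
        q, r = np.linalg.qr(z)
        # Rescale each column of q by the sign of the corresponding diagonal entry
        # of r, so the result does not inherit the QR routine's sign convention.
        return q * np.sign(np.diag(r))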

A second article with the whimsical title, "If Euclid had been Japanese" discusses the constructibility of points using folding operations rather than "Euclidean" constructions. Apart from the obvious origami connections, this is particularly interesting to me because I've been experimenting with a short presentation that tries to explain algorithms via origami (folds are like steps in a program, etc..).

I could go on like this. Every issue has at least one article that is both accessible and interesting to me, and I'm not even a mathematician ! Why can't we have such a delightful magazine for computer science, or even for theoryCS ?

SIGACT News does a valiant job. And it's hard work to manage a newsletter, along with all one's other activities. But it comes out far too rarely, and one doesn't often find the kind of short vignettes that entertain and illuminate at the same time. I enjoy the algorithms column and the geometry column. I will make valiant attempts to read the complexity column, but it's often too "structural complexity" for me. But in a way I am unable to articulate, I find SIGACT News somehow less stimulating than the Notices. I wonder if others share this view as well ?

Update (5/3/07): I think I just had an epiphany. The above post is the equivalent of my standing in front of the Model T, wondering why my horse-buggy can't go any faster. In other words, we already have excellent publications that write expository surveys and cute vignettes that connect the community together. They're called blogs !!

Wednesday, May 02, 2007

MAD about ALGOs

The new BRICS Center for Massive Data Algorithmics (MADALGO) is kicking off with a summer school on data streams. It's a 4-day affair between Aug 20-23, in Aarhus, Denmark, and I am told on good authority that along with learning the ins and outs of stream algorithms, you will also take a mandatory course on how to open one beer bottle with another one.

[Side note: you might ask a Danish person, "What do I do if I have only one beer bottle?". Danish Person: < long silence >]

The set of topics is just what you'd expect for a data stream course:
  • Algorithms for metric and geometric data streams
  • Randomized sketching and compressed sensing
  • Histograms, norms and other statistics of data streams
  • Algorithms for ordered data
  • Lower bounds and communication complexity
Registration is free, and accommodation has been blocked at fairly congenial rates. Especially if you're a grad student and have the wherewithal to go, this would be a great opportunity. The school is open to all researchers. The deadline for registration is June 1.

And when you're done streaming, you can go to Legoland, or experience the rather nonstandard (and NSFW) techniques the Danes employ to slow traffic down (NSFW).

Sunday, April 29, 2007

Estimating the surface area of a polytope

One of the more striking examples of the power of randomization is its use in the estimation of the volume of a polytope. Given an oracle that declares (for a given point) whether it's inside or outside the polytope, a randomized algorithm can get an arbitrarily good approximation to the volume, even though the problem is #P-Complete. A deterministic algorithm, on the other hand, will fail miserably.

A related question that I was curious about is the surface area estimation problem. Given a polytope defined by the intersection of hyperplanes (we can assume that the input promises to yield a bounded polytope for now), can we estimate its surface area efficiently ? One can use a volume oracle on each hyperplane to determine the area of the face it contributes to (if at all), but I was wondering if there was a more direct method, since that approach seems far more wasteful than necessary, particularly in small dimensions.

There's even a practical reason to do fast surface area estimation, at least in 3D: this is used as a heuristic for building efficient space partitionings for ray tracing.

Thursday, April 26, 2007

Things not stated on the Korean visa form

If you're travelling to Korea for SoCG, and you need a visa, then there are a few things not mentioned on the Korean consulate web page:
  • If you're a permanent resident, you have to attach a copy of your green card
  • If you work for a company, you need a letter from your employer. If you are exalted enough to be employed by a university in the honorable profession of "professor", you don't need one.
  • Processing time is about one week.

Tuesday, April 24, 2007

It's the Sans Serif smackdown

In the right corner: HELVETICA !!
Helvetica, which had its world premiere at the conference, presents the life story of something all of us encounter on a daily (or even hourly) basis. Created in 1957 by the Swiss modernist designer Max Miedinger as a response to the cluttered typography and design of the postwar era, Helvetica's clean neutrality and balanced use of the empty space surrounding letters quickly made it a go-to font for public signage, advertising, corporate logos and works of modernist design around the world

[...]

Filmmaker Gary Hustwitt revels in his fascination with something so commonplace that it blends almost entirely into a context-less background, becoming a detective of sorts to unveil the myriad everyday places Helvetica is hiding (“It's a disease,” Hustwitt said of his obsessive font-spotting).
In the left corner: COMIC SANS MS:

Earz: I found a weird website on typography, it was written in Italian I think, and had images of a gravestone lettered in comic sans. What does that say to you?

That would only be appropriate if the deceased were a clown or comedian, but other than that, I'd come back to haunt whoever did that if I were the dead guy.
Personally, I prefer Trebuchet MS.

p.s In the top corner, ARIAL:
It's been a very long time since I was actually a fan of Helvetica, but the fact is Helvetica became popular on its own merits. Arial owes its very existence to that success but is little more than a parasite—and it looks like it's the kind that eventually destroys the host. I can almost hear young designers now saying, "Helvetica? That's that font that looks kinda like Arial, right?"

Monday, April 23, 2007

Miscellaneous links

It's a good day for math blogging. Gil Kalai, guest blogging for Terence Tao, explains the weak epsilon-net conjecture for convex sets (and can I say that I'm very jealous of the impeccable LaTeX typesetting on wordpress.com). Leo Kontorovich presents an interesting example of the breakdown of discrete intuition from the Steele book on the Cauchy-(Bunyakovsky)-Schwarz inequality. Andy Drucker talks about why convexity arises naturally when thinking about multi-objective optimization.

Friday, April 20, 2007

FOCS/STOC Reviewing

Bill Gasarch has a long list of questions about paper review processes at FOCS/STOC (what happened to SODA?) following up on Vijay Vazirani's not-radical-enough guest post about submitting videos along with submissions. Since it's tedious to post a list of answers to so many questions in comments, I'll post it here, and let Blogger trackback work its magic.

  1. Is the community really driven by these conferences? An Underlying assumption of these discussions has been that someone judges us based on the number of STOC/FOCSs we have. Who is this mysterious someone? Is it our departments? Our Colleges? Ourselves? Granting agencies?

    Well it's not really that mysterious: I'd say all of the above.

  2. Is it bad that we are so judged?? PRO: It's good to have a central place where you know the good papers are. CON: The rest of the items on this list are about the problems there are in judging quality. CON: Some of these papers are never put into proper journal form. CAVEAT: Is the journal-refereeing system all that good that we should decry its absence here?

    Nothing wrong with being judged. However, the assumption that FOCS/STOC is the repository of all "good papers" is problematic. And personally, I'm tired of people complaining all the time about journal refereeing. The SAME people who review papers cursorily for conferences are the ones who do it for journals. If conference reviewing is "OK", what makes journal refereeing so terrible?

  3. Other fields do not have high-prestige conferences: why do we, and is it a good thing? Our field moves fast so we want to get results out fast. It is not clear that FOCS/STOC really do this. Using the web and/or blogs can get the word out. Important new results in theory get around without the benefit of conferences. For results just below that threshold it's harder to say.

    I'm sorry. I'm having a hard time getting through the fog of hubris around this statement. I fail to understand how we can have the gall to stand up and say that CS theory moves a lot faster than any field that abjures conferences. Bio journals publish every few weeks, and there are mountains of papers that show up. For a closer-to-home example of the herd mentality that often drives research in theoryCS, papers in string theory appear at high rates without needing the acceleration of conferences.

  4. Are the papers mostly good?

    As I was saying to someone the other day, it's unlikely that you'll find more than a couple bad FOCS/STOC papers in each proceedings. So, Yes.

  5. Is there a name-school bias? Is there a name-person bias? Some have suggested anonymous submissions to cure this problem.

    After a recent SODA, someone levelled this complaint against the committee, both in terms of committee composition, as well as paper acceptance. There is some info on this matter here.

  6. Is there an area bias? There are several questions here: (1) is the list of topics on the conference announcement leaving off important parts of theory? (2) is the committee even obeying the list as is? (3) have some areas just stopped submitting?

    I'd argue that there is (mostly) faddism rather than bias; certain topics gain a certain cachet in certain years, and often papers that in other years might not have cracked a FOCS/STOC list do so. However, (3) is definitely true to a degree for computational geometry and data structures. I've lost count of the number of people who claim that they only submit to STOC/FOCS because of tenure pressure. It's not clear to me whether actions match words in this regard, but the prevailing sentiment, at least in these two areas, is strong.

  7. Is there a hot-area bias?

    YES: see above.
  8. Is there a mafia that controls which topics get in?

    I've never been on a FOCS/STOC committee, but I'd strongly doubt this. I think our community is full of strong personalities, and it's hard to imagine that many people working together with such cohesion :)

  9. Is there a bias towards people who can sell themselves better? Towards people who can write well?

    Undoubtedly so. However, this cannot be viewed as a problem per se. It's just a function of human nature; if you understand something better, or it seems clearer, you will often view it favorably. And there's nothing wrong in creating an evolutionary pressure towards better writing and "framing".
  10. Is there a bias towards making progress on old problems rather than starting work on new problems?
    Not really. I think there's a mix of preferences, and that reflects individual interests.

  11. Is there a bias towards novel or hard techniques?

    Again, I suspect people have their own viewpoints; some like beautiful proofs, some like novel techniques, some are impressed by technical wizardry, and so on.
  12. Is it just random? Aside from the clearly good and clearly bad papers, is it random? Is even determining the clearly good and clearly bad also random? One suggestion is to make it pseudo-random by using NW-type generators. This solves the problem in that, since it really is random, it is less prestigious and most of the problems on this list go away. It would also save time and effort since you would not need a program committee.

    Well since the vast majority of submissions to any conference are in the mushy middle, we have to invoke the proposition that I think is attributed to Baruch Awerbuch: the probability of a paper being accepted is a random variable whose expectation is a function of the paper quality, and whose variance is a function of the program committee.
  13. Are there many very good papers that do not get in? It has been suggested that we go to double sessions so that more get in. If the quality of papers has been going up over time this might make sense and would not dilute quality.

    This goes back to the dual conflicting roles of a conference: quality stamping and dissemination. You will almost certainly dilute the first by increasing paper acceptances, but you will only help the second.

  14. Is 10 pages too short for submissions? This was part of Vijay's Video Suggestion. Are figures and diagrams counted for those 10 pages? If they are they shouldn't be.

    Too short for what ? Reviewers, with the number of papers they need to review, can hardly be expected to read 20 page submissions. And as an aside, if you can spend the time on a video review to explain the paper, why can't you use the same time to write a GOOD outline of the main ideas in one section of the paper itself ?

  15. Are many submissions written at the last minute and hence badly written?

    YES
  16. Are many submissions written by the authors taking whatever they have by the deadline and shoving it into a paper?

    OFTEN
  17. Since the conference is about all of theory, can any committee do a good job? Vijay was partially addressing this problem by trying to find a way to make their job easier.

    Committees try, with extensive outsourcing, but it's essentially an impossible task, and one should not expect perfection.
  18. Do other conferences have these problems? That is, do the more specialized conferences have similar problems? Why or why not?

    Allowing PC members to submit changes the review dynamics tremendously. It increases the reviewer pool, reduces reviewer load, but also reduces the quality of review ratings. Most other communities allow PC submissions, so it's really hard to compare. I am NOT advocating going this route.
  19. Do you actually get that much out of the talks? If not, is it still valuable to go for the people you meet in the hallways?

    Schmoozing is much more effective for me than attending talks. But that's just me. Others may have different viewpoints.
  20. For all the items above, even if true, are they bad? Some have suggested that bias towards big-names is okay.

    Bias towards "big-names" is braindead. Bias towards quality is perfectly fine. A correlation between "big-names" and quality does not imply that one should be replaced by the other.

Tuesday, April 17, 2007

On judgements

Paul Graham has a short note on judgements which seems apropos to dealing with rejection of various kinds (papers, jobs, grants, etc):
Our early training and our self-centeredness combine to make us believe that every judgement of us is about us. In fact most aren't. This is a rare case where being less self-centered will make people more confident. Once you realize how little most people judging you care about judging you accurately—once you realize that because of the normal distribution of most applicant pools, it matters least to judge accurately in precisely the cases where judgement has the most effect—you won't take rejection so personally.

Monday, April 16, 2007

Factoidals

Brian Hayes has a very interesting article in American Scientist on the 'factoidal', a stochastic version of the factorial that he coined. It's defined as follows: for input n, pick numbers at random between 1 and n, multiplying them together until you pick a 1.
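In code the definition is only a few lines; here's a sketch (the function names are mine, and whether the final 1 gets multiplied in makes no difference):

    import math
    import random

    def factoidal(n, rng=random.Random()):
        """One sample: multiply uniform draws from {1,...,n} until a 1 is drawn."""
        product = 1
        while True:
            k = rng.randint(1, n)
            product *= k
            if k == 1:
                return product

    def geometric_mean(samples):
        """The statistic that does converge, per the article."""
        return math.exp(sum(math.log(s) for s in samples) / len(samples))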

He expands at some length on the stochastic properties of this function: unsurprisingly, the arithmetic mean of samples of this function doesn't converge as you take more samples, but the geometric mean does. What's a little more surprising is that the distribution isn't log-normal. Since taking logs means we're averaging a collection of random numbers between 1 and log n, one might expect the average of the logs to display normality, but that doesn't happen.

The explanation is a nice take-away nugget from this article. In short, work by William Reed and Barry Hughes from 2002 shows (the example is taken from their abstract) that if you take an exponentially growing process and truncate it at a random time given by an exponential process, then the resulting random variable exhibits a power law distribution.
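In symbols (the rates $\mu$ and $\nu$ below are my notation, not necessarily theirs): if the process grows as $e^{\mu t}$ and is stopped at an exponentially distributed time $T$ with rate $\nu$, then

$$\Pr[e^{\mu T} > x] = \Pr[T > (\ln x)/\mu] = e^{-\nu \ln x / \mu} = x^{-\nu/\mu},$$

a power-law tail with exponent $\nu/\mu$.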


Thursday, April 12, 2007

Latex For Blogger

This is a neat plugin. And just to test it, here's the Johnson-Lindenstrauss Lemma:

For any $0 < \epsilon < 1$ and any integer $n$, let $k$ be a positive integer such that

$$k \ge 4\left(\epsilon^2/2 - \epsilon^3/3\right)^{-1} \ln n.$$

Then for any set $V$ of $n$ points in $\mathbb{R}^d$, there exists a map $f : \mathbb{R}^d \to \mathbb{R}^k$ such that for all $u, v \in V$,

$$(1-\epsilon)\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1+\epsilon)\|u - v\|^2.$$
Not bad at all !

Tuesday, April 10, 2007

Knuth Prize

Nancy Lynch from MIT has been awarded this (one and a half) year's Knuth Prize for her work on reliability in distributed computing. I remember first hearing about her work when I studied distributed computing and Byzantine agreement problems back at IITK.

This has been a good (and well deserved) year for women in computing: first Frances Allen, and now Nancy Lynch.

The Cauchy-BUNYAKOVSKY-Schwarz inequality

Episode #2583 of "By the law of large numbers (of Russian mathematicians), your theorem has been proved by some Russian before you even thought of it":
Viktor Bunyakovsky worked on Number Theory as well as geometry, mechanics and hydrostatics. He discovered the Cauchy-Schwarz inequality 25 years before Cauchy or Schwarz.

Measure concentration

I've been preparing a lecture on the Johnson-Lindenstrauss Lemma and dove deep into the beautiful world of measure concentration. Matousek's chapter on the topic is excellent in this regard, but I found that the flow of the results (from measure concentration towards the JL Lemma) was not to my taste either personally or pedagogically (it's possible to discuss the lemma algorithmically using elementary proofs without having to uncover the deep measure concentration machinery chugging below).
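Since the algorithmic version really is just "multiply by a random Gaussian matrix", here's a sketch (the function name is mine; I've used the Dasgupta-Gupta constant for the target dimension, but any $k = O(\log n/\epsilon^2)$ works):

    import numpy as np

    def jl_embed(points, eps, rng=np.random.default_rng()):
        """Project an (n, d) array of points down to k = O(log n / eps^2) dimensions.
        Pairwise distances are preserved to within (1 +/- eps) with high probability."""
        n, d = points.shape
        k = int(np.ceil(4 * np.log(n) / (eps ** 2 / 2 - eps ** 3 / 3)))
        proj = rng.standard_normal((d, k)) / np.sqrt(k)   # scaled Gaussian projection
        return points @ proj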

In that respect, Alexander Barvinok's lecture notes on measure concentration prove to be quite useful. He starts from simple measure concentration on the sphere, builds up to the JL Lemma, and then carries on to Levy's theorem, Dvoretzky's theorem and the Brunn-Minkowski inequality. He starts with a good intuitive definition of measure concentration:
Any "reasonable" function defined on a "large" probability space takes values close to its average with high probability
which nicely ties it together (in my mind) with other large deviation results in information theory as well (he also has a section on rate distortion and Cramer's work).

Another useful nugget from Barvinok's notes is a thorough drilling on the use of the "Laplace trick": the technique of taking the exponential of a random variable and applying the Markov inequality in order to prove tail bounds (the standard proof of the Chernoff bound is one example of this).
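For the record, the trick in one line: for any $\lambda > 0$,

$$\Pr[X \ge t] = \Pr[e^{\lambda X} \ge e^{\lambda t}] \le e^{-\lambda t}\, \mathbb{E}[e^{\lambda X}],$$

and one then optimizes over $\lambda$; when $X$ is a sum of independent terms, the moment generating function factors, which is where Chernoff-type bounds come from.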

Wednesday, April 04, 2007

Inventor of 'idf' passes on

Karen Sparck-Jones, professor at Cambridge University, winner of the Lovelace Medal and the Allen Newell Award (among other awards), died today. She was a pioneer in NLP and information retrieval, and is also the inventor of one of the most fundamental notions in information retrieval: the concept of IDF (or inverse document frequency).

Information retrieval methods are at the heart of web search today, even in the age of link structures and PageRank. One of the most elementary and yet most enduring ideas in information retrieval is the concept of a document as a "bag of words"; namely, that even if we encode a document as a (weighted) set of the terms occurring in it, we can extract meaningful information from it. The 'bag-of-words' model itself dates back at least to Gerald Salton's IR book in 1968; it is probably much older.

Once you think of documents as bags of "terms", you need a way of weighting these terms in a way that allows "similar" documents to have a similar set of weights. Term frequency (TF) is one way of doing this: weight a term with a number proportional to how often it appears. This makes some sense; a document about soccer might use the word 'football' or 'soccer' fairly often, as compared with a document on cricket.

However, words like 'a', 'an' and 'the' also appear very often in a document. How does one prevent these terms from washing away any semantic context one might have ? One idea is to weight the term frequency by something called 'Inverse Document Frequency' (IDF). For a given collection of documents, count in how many documents a fixed term appears. This number is the term's document frequency. Clearly, a term that appears in all documents can't be a useful distinguisher of a document and so the larger this number is, the less relevant the term. Therefore, by multiplying the term frequency by a function inversely related to the document frequency (if p is the fraction of documents a term appears in, the IDF is often set equal to log(1/p)), you get a more accurate estimate of the importance of the term.
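A toy version of the whole computation, just to fix the definitions (the tokenization is assumed done, and the exact normalization is an arbitrary choice of mine):

    import math
    from collections import Counter

    def tf_idf(docs):
        """docs: a list of token lists. Returns one {term: weight} dict per document."""
        n = len(docs)
        df = Counter(term for doc in docs for term in set(doc))   # document frequency
        weights = []
        for doc in docs:
            tf = Counter(doc)
            weights.append({t: (c / len(doc)) * math.log(n / df[t])   # tf * log(1/p)
                            for t, c in tf.items()})
        return weights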

This to me is a very modern way of dealing with documents, and yet the idea itself was proposed by Karen Sparck-Jones back in 1972, well before we could imagine search engines, and fast computers that could crunch these numbers in a jiffy. I'm not an expert on IR methods, but I dare say that a simple TF-IDF computation is, even today, a first preprocessing step when organizing document collections.

(Via Hal Daume)

Tuesday, April 03, 2007

Generalized(?) Dyck Paths

A Dyck path is a useful metaphor for dealing with Catalan numbers. Suppose you are walking on an n X n grid, starting at (0,0). You can make a horizontal move to the right or a vertical move upwards. A Dyck path is a path in this grid where
  • the path ends at (n,n)
  • the path may touch (but never goes over) the main diagonal
The number of Dyck paths is precisely the Catalan number $C_n$. One way of seeing this is that every Dyck path that touches the diagonal at some interior point can be written as the concatenation of two Dyck paths of shorter length (one from the origin to that point, and one from that point to (n,n)). Dyck paths are also related to counting binary trees, and strings of balanced parentheses (called Dyck languages).

People have studied "generalized" Dyck paths, where the grid is now rectangular (n X m), and the step lengths are appropriately skewed. However, what I'm interested in is the following seemingly simple generalization:
Let a (n,k)-Dyck path be a Dyck path with the modification that the path, instead of ending at (n,n), ends at (n,k) (k <=n). What is the number of (n,k)-Dyck paths ?
It seems like this should have been looked at, but I've been unable so far to find any reference to such a structure. I was wondering if readers had any pointers ? Note that this number is at most $C_n$, since any such path can be completed into a unique Dyck path.
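In the meantime, here's a brute-force counter for experimenting with small cases (a plain dynamic program over the grid, following the "never rise above the diagonal" convention from the definition above; the function name is mine):

    from functools import lru_cache

    def count_nk_dyck(n, k):
        """Count monotone paths from (0,0) to (n,k) that satisfy y <= x at every point."""
        @lru_cache(maxsize=None)
        def ways(x, y):
            if y > x or x > n or y > k:      # crossed the diagonal or left the grid
                return 0
            if (x, y) == (n, k):
                return 1
            return ways(x + 1, y) + ways(x, y + 1)
        return ways(0, 0)

    # count_nk_dyck(n, n) recovers the ordinary Dyck path counts (the Catalan numbers).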

Tuesday, March 27, 2007

Visiting JPL

I'm at JPL today and tomorrow, visiting a colleague (and hopefully collaborator) at NASA, and attending the first day of the AIRS Science Meeting.

AIRS stands for the Atmospheric Infrared Sounder, an instrument that sits aboard NASA's Aqua satellite. Along with other sensors onboard Aqua, AIRS provides detailed atmospheric sensing, especially of the greenhouse gases that drive climate change.

In fact, the AIRS science meeting sits at the center of some of the climate change research that we mostly hear about through a political lens. It's fascinating to sit in the JPL cafeteria and listen to atmospheric scientists talk about some of the fundamental open problems in climate science (clouds, apparently, are a BIG mystery in terms of how they influence global temperature).

So what's a computer scientist doing in this super-charged atmosphere ? The instrument produces gargantuan amounts of data, and there are all kinds of interesting data analysis questions that need to be addressed at scale. What makes these questions especially appealing is that, unlike in much of data mining, there is a physical ground truth that we can use to validate any interesting patterns a mining process might uncover.

Sunday, March 25, 2007

An ending...

They say, "all good things come to an end". What they don't say is, "if good things didn't end, you wouldn't realize how good they were".

Lance announces today that he's shutting down the Computational Complexity blog. As far as I know, he's the ur-TCS blogger, and was the direct inspiration for my starting the Geomblog. I've always had the Complexity blog on my must-read list, and have felt both pressured (to write better posts) and inspired (by the rich content) every time I've seen a new posting.

Another aspect of Lance's blog that I envy is the rich commenter community: he was able to create a forum for discussion of TheoryCS, on matters both technical and non-technical, that will be hard to replace.

So thanks for all the posts, Lance, and hopefully you'll be back one day !

4th Carnival of mathematics up

Read all about it.

Wednesday, March 21, 2007

Job announcements

For people who don't subscribe to compgeom-announce:

Meta-posts

I was browsing my referrer logs and found a callback to Lance's blog. In the comments for the post, I see this comment (emphasis mine):
Here's one that I've found very vexing: Let A(d) denote the least surface area of a cell that tiles R^d via the translations in Z^d. Improve on the following bounds:

sqrt(pi e / 2) sqrt(d) <= A(d)/2 <= d - exp(-Omega(d log d)).

(It's a little more natural to consider A(d)/2 than A(d).)

The lower bound is by considering a volume-1 sphere, the upper bound by monkeying around a little with a cube's corner.

Any proof of A(d)/2 >= 10sqrt(d) or A(d)/2 <= d/10 would be very interesting.

For motivation: Either pretend I posted on Suresh's blog (so it's just a problem in geometry); or -- it's related to finding the best rate for parallel repetition of a simple family of 2P1R games.
It's a meta post !!! And by the way, there's no such thing as "JUST a problem in geometry". Harumph !!

;)

Tuesday, March 20, 2007

On ranking journals

Via Cosma Shalizi comes a new way of ranking journals developed by Jevin West, Ben Althouse, Martin Rosvall, Ted Bergstrom, and Carl Bergstrom. It works on the PageRank principle: pick a random paper from a journal, follow a random walk on the citation graph, and use the stationary distribution of that walk to rank journals. The result is eigenfactor.org, a website that catalogues many journals and other publications, and provides both a ranking by their "pagerank"-style score and a "value for money" measure that might be useful when deciding which journals to purchase for a library.
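
The basic mechanic is easy to sketch. Here's a bare-bones power iteration on a made-up three-journal citation matrix; the real Eigenfactor computation is more involved (it uses a citation window, handles self-citations and dangling journals, and normalizes by article counts), so treat this only as the skeleton of the random-walk idea.

    # Bare-bones power iteration for a random walk on a toy citation graph.
    # cites[i][j] = number of citations from journal i to journal j (made up).
    cites = [
        [0, 30, 10],   # journal A cites B heavily
        [20, 0, 5],    # journal B
        [5, 15, 0],    # journal C
    ]
    n = len(cites)

    # Row-stochastic transition matrix: from journal i, follow a random
    # outgoing citation.
    totals = [sum(row) for row in cites]
    P = [[cites[i][j] / totals[i] for j in range(n)] for i in range(n)]

    rank = [1.0 / n] * n
    for _ in range(100):
        rank = [sum(rank[i] * P[i][j] for i in range(n)) for j in range(n)]

    print([round(r, 3) for r in rank])   # stationary distribution ~ journal influence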

Measuring impact has become a popular parlor game, for journals as well as for individual researchers (via indices like the 'h-index'). There are all kinds of problems with these measurements in general, but Eigenfactor does provide a way of eliminating some of the usual difficulties with measuring impact across multiple communities that have different citation practices, community sizes, and so on.

Eigenfactor has a few top-10 lists for different areas (science, social science, etc.); here's my informal list of the top computer science algorithms (and friends) journals, ranked by article impact factor (all scores are percentiles over the universe of 6000+ journals considered):
  • SIAM Review: 98.35
  • J. ACM: 97.91
  • IEEE Trans. Inf. Theory: 95.05
  • Machine Learning: 93.92
  • SICOMP: 93.04
  • JCSS: 90.93
  • Journal of Algorithms: 90.31*
  • DCG: 85.29
  • CGTA: 81.96
  • Algorithmica: 79.13

* The ACM Transactions on Algorithms, which absorbed most of the editorial board of J. Alg, is too new to show up. This ranking probably reflects the historical standing of J. Alg more than its current state.

Monday, March 19, 2007

John Backus, 1924-2007

John Backus died on Saturday. He was the father of FORTRAN, a Turing Award winner, and one of the two names behind the Backus-Naur form (Peter Naur, the 2005 Turing Award winner, was the other).

No matter how much you might make fun of FORTRAN today, the fact is that it remains the language of choice for many scientific computing packages. Even if you're writing in C, you're probably interfacing with routines first written in FORTRAN.

His Turing Award lecture is also a remarkable read. Titled "Can Programming Be Liberated from the von Neumann Style?", it laid the groundwork for functional programming and anticipated many of the stream programming paradigms we see today, from GPU programs to Google's MapReduce system. I'm always humbled when I read articles from so many years ago that still resonate today; I can only hope that anything I do will have value even ten years later.

Saturday, March 17, 2007

Anonymizing PDF

File this under 'late-night LaTeX griping':

Is there any way of stripping metadata from a PDF file ? I'm writing a referee report for a journal, and used PDFLaTeX to create the report. When I open it in acroread, there's all kinds of metadata that could identify me.

Now pdftk is a useful package that can strip out some of the simple metadata like 'creator'. However, pdftex adds "Advanced" fields, and one of them is the full pathname of the original LaTeX file. If your (UNIX) filesystem is anything like mine, then part of that pathname is the /<username>/ component, which in many instances is an almost unique identifier. This also happens with dvipdfm, and with ps2pdf (which uses the gs backend to create the PDF file). pdftk cannot strip out these fields, because it doesn't appear to see them.

I suspect that if I owned a copy of the very-not-free Acrobat, I could meddle with this metadata. Obviously I could submit the review as a Postscript file, but in general I prefer to stick with PDF. Further, the same problem arises when doing due diligence for conferences with double-blind review, and sometimes PS isn't even an option.
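
In the meantime, here's a low-tech sanity check (a throwaway script, not a fix) that I can at least run before submitting: scan the raw bytes of the PDF for strings that would identify me. It doesn't look inside compressed streams, so a clean result is necessary rather than sufficient.

    # Throwaway sanity check: look for identifying strings in the raw bytes of
    # a PDF. Compressed streams are not decoded, so this can miss leaks; it
    # only catches the obvious ones, like an uncompressed pathname field.
    import sys

    # Replace these with your own username, home directory, hostname, etc.
    SUSPECTS = [b"/home/", b"/Users/", b"myusername"]

    def scan(path):
        data = open(path, "rb").read()
        clean = True
        for needle in SUSPECTS:
            start = 0
            while (idx := data.find(needle, start)) >= 0:
                clean = False
                print(f"{needle!r} at byte {idx}: {data[idx:idx + 60]!r}")
                start = idx + 1
        if clean:
            print("no suspect strings found (in uncompressed data, at least)")

    if __name__ == "__main__":
        scan(sys.argv[1])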

Thursday, March 15, 2007

Dr. Karp and The TCS Lens

or How I learned to stop worrying and love complexity.

[ed. note: This note was prompted by Aaron Clauset's series of articles titled 'Unreasonable effectiveness' (the title itself a riff on Eugene Wigner's famous essay). Further inspiration came from reading the various summaries of Karp's 'algorithmic lens' idea.]

I believe in the gospel of St. Bernard. We are about to enter the age of the algorithm. The only question is: what kind of algorithm will it be ?

We're all familiar with the steps. Define the problem. Design an algorithm. Analyze the algorithm. Rinse and repeat, till we have a matching lower bound, or till we hit an even harder problem. Algorithms are designed from the top down, by a coffee-swilling researcher with a whiteboard, with the elegance of a Swiss watch, or the clumsiness of a sledgehammer. Some of them are so beautiful they make you weep. Some are algorithms "in name only"; no one in their right mind would ever dream of implementing one of these beasts.

The algorithmic lens flips this on its head: complexity (and the solution) emerges from a set of simple rules. It's like Spore; set up the rules, start the game, and see what happens. The ingenuity here is not in the construction of a solution, but in the design of the rules. It's declarative, not procedural. It's bottom up, not top down. It's distributed, not centralized.

It has the flavor of (computational) complexity: I'll give you a resource; a guess, a random coin, a scratch pad, an AskJeeves clone. You tell me what I can do with how little. But it's still a lens. Why ? Things we're familiar with show up as distributed, bottom up, emergent constructions: A Voronoi diagram is a battle between equally (or unequally) matched forces. A Steiner tree is a soap bubble diagram. A clustering algorithm is a local "I'll go to my closest friend"; a regular language is the consequence of amnesiac entities bouncing around between states.

It's unsettling: we aren't the masters of our domain any more. We're not GOD, intelligently designing beautiful watches. We're just the rule-makers for a complex system. It's like being the parent of a young child !

What we can do is this: find out which rules cause which solutions to emerge. We study social networks, and watch complexity emerge from a set of simple pairwise interactions. In other words, the algorithm IS the social network. You can't get much more Web 2.0 than that...

If you've heard this before, it's because you have: things like the Ising model, phase transitions, statistical mechanics, Conway's Game of Life, Stephen Wolfram's New Kind of Science. So what makes these concepts so exciting now ?

Maybe it's all about scale. We preach that immense scale requires theoretical analysis; the N_0 in the "for all n > N_0" of O() notation is finally here. But maybe linear-time algorithms, streaming algorithms, and small-space sketching are all soon-to-be-futile attempts to stave off the inevitable, horrible failure of our programs to manage the complexity of the data we deal with. Suddenly, local, adaptive algorithms that continuously evolve a solution look quite appealing.

It's an exciting time to do algorithms, whether you're designing watches, or setting up an ecosystem.

Wednesday, March 14, 2007

A movie about a font is my kind of movie

One of the curses of making web pages and writing documents is an unhealthy obsession with fonts. People who've written papers with me have heard my paeans to the Euler math font with barely suppressed irritation, and I often have to stop myself from calling out the names of familiar fonts when walking down the street :).

Sounds like this is the movie for me:
Helvetica, which had its world premiere at the conference, presents the life story of something all of us encounter on a daily (or even hourly) basis. Created in 1957 by the Swiss modernist designer Max Miedinger as a response to the cluttered typography and design of the postwar era, Helvetica's clean neutrality and balanced use of the empty space surrounding letters quickly made it a go-to font for public signage, advertising, corporate logos and works of modernist design around the world

[...]

Filmmaker Gary Hustwitt revels in his fascination with something so commonplace that it blends almost entirely into a context-less background, becoming a detective of sorts to unveil the myriad everyday places Helvetica is hiding (“It's a disease,” Hustwitt said of his obsessive font-spotting).

Tuesday, March 13, 2007

And he's OOOUUUTTT !!!!

Yep, the World Cup is now finally on. After a long time, I'm in the same time zone as the matches, which is not as nice as it might seem, given that they all happen during the workday. I was trying to explain this to my wife, and she said, 'Aren't cricket matches always at 3 am ?'. Considering that the last time she saw cricket was when I dragged her out to see the India-Aus final 4 years ago, she has a point :).

As people who watch cricket in the US know, Kelly Broadcasting has had a stranglehold over cricket rights in North America for some time now. This year, however, there are many more options for watching cricket. You can continue to watch the WC on dishTV, but now DirecTV has moved into the biz with a $200 package for the entire cup. If that's too pricey for your taste (and you can't spend all day at home watching on TV!), then sgstream.com has a streaming package for $75, which to my amazement works quite well (so far; let's see what happens during the first India game).

Right now, WI is 140-3 after 36 overs against Pakistan.

Monday, March 12, 2007

Order polytopes

One view of computational (and combinatorial) geometry is that its techniques live in the delicate tension between the discrete (combinatorial) and the continuous (geometric). Whether it be Davenport-Schinzel sequences and lower envelopes, or VC-dimension and intersections of geometric sets, a common step in making geometric structures tractable is finding a combinatorial structure that captures the aspect of the geometry relevant to the problem at hand.

It's nice to return the favor sometimes. Combinatorics and topology have strong connections, and one doesn't have to go too far from home to find instances of "lifting" a combinatorial problem to a topological domain in order to solve it; Lovasz's proof of the Kneser conjecture, and the various methods for proving evasiveness properties, are among the better-known examples. In fact, distributed computing is replete with topological hammers for proving properties of protocols.

Of course, one can view the entirety of linear programming as an example of lifting combinatorics to geometry; I'm deliberately ignoring it because you have to go through the integer lattice to get combinatorial problems, rather than having the geometry characterize the structure directly. This distinction is of course totally ad hoc :)

What I describe next is an extremely elegant geometric view of a purely combinatorial problem. Suppose you're sorting numbers, and you have some rough idea of the sorted order. Maybe you can rank some of the elements, but not all of them (for example, maybe you're integrating results from different web searches). At any rate, what you have is a partial order, and you'd like to get a sense of its 'orderedness'. Specifically, you'd like to estimate the number of different ways this partial order can be completed (by inserting orderings between as-yet unordered elements without disturbing existing order). Such a completion (or extension) is called a linear extension, because you can think of the final total order as being points on a line, numbered from lowest to highest rank.

For example, suppose you have the four elements (a,b,c,d) and you know that (a < b) and (c < d). There are then six ways of constructing a linear extension from this partial order.

Suppose now that comparisons are costly, and we'd like to choose the next comparison so as to reduce the number of potential linear extensions as much as possible. If we could guarantee that each comparison cut the number of remaining extensions down to at most an α fraction of what it was, then we could complete the sorting using roughly log_{1/α} E comparisons, where E is the number of linear extensions we start with. Since each comparison merely partitions the current set of linear extensions into two parts, the answer we receive may leave us with the larger part, which contains at least half of them; so α can never be below 1/2.

It's not hard to see that we can't do better than α = 2/3: consider the partial order {a < b} with a third element c incomparable to both. Whatever comparison we ask next, there's an answer that leaves at least two of the three extensions intact. The best known bound to date is roughly α = 0.7236, and this bound (like earlier ones) makes use of an elegant geometric characterization of linear extensions via order polytopes.
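
At this scale everything can be checked by brute force. The snippet below (purely illustrative) enumerates linear extensions as permutations consistent with the relations: it confirms the six extensions of {a < b, c < d} above, and shows that for {a < b, c} every informative comparison retains 2/3 of the extensions in the worst case.

    # Brute-force checks for the small posets above.
    from itertools import combinations, permutations

    def linear_extensions(elements, relations):
        """All permutations of `elements` consistent with the given relations."""
        return [p for p in permutations(elements)
                if all(p.index(x) < p.index(y) for x, y in relations)]

    # a < b and c < d: six linear extensions, as claimed.
    print(len(linear_extensions("abcd", [("a", "b"), ("c", "d")])))   # -> 6

    # The poset {a < b, with c incomparable}: any informative comparison can
    # be answered so that 2 of the 3 extensions survive.
    exts = linear_extensions("abc", [("a", "b")])
    for x, y in combinations("abc", 2):
        below = sum(1 for p in exts if p.index(x) < p.index(y))
        worst = max(below, len(exts) - below) / len(exts)
        print(f"compare {x},{y}: worst case retains {worst:.2f}")
    # a,b retains everything (the answer is already known); a,c and b,c each
    # retain 2/3 in the worst case.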

Given n elements in the partial order, let the order polytope be the set of points in [0,1]^n whose coordinates satisfy the partial order: if there's a constraint of the form a_i < a_j, then the corresponding coordinates of any point x in the polytope satisfy x_i ≤ x_j.

It's not hard to see that this is indeed a polytope: it's the intersection of the unit cube with a finite collection of halfspaces, and so is bounded. What is more intriguing is that the polytope decomposes into simplices, one for each linear extension. For the three-element example {a < b, c}, writing coordinates as (x_a, x_b, x_c), the simplex with vertices (0,0,0), (0,1,0), (0,1,1), and (1,1,1) corresponds to the order a < c < b. All these simplices are congruent and internally disjoint, and each has volume 1/n!. This implies the remarkable result that the number of linear extensions of a partial order is n! times the volume of its order polytope.
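
The volume connection is easy to test numerically. The sketch below (again just illustrative; rejection sampling scales horribly with n) estimates the volume of the order polytope of {a < b, c < d} and multiplies by n!, recovering (approximately) the six linear extensions counted earlier.

    # Monte Carlo check of "number of linear extensions = n! * volume of the
    # order polytope", for a < b, c < d on four elements.
    import math
    import random

    relations = [(0, 1), (2, 3)]   # indices: a=0, b=1, c=2, d=3
    n, trials = 4, 200_000

    hits = 0
    for _ in range(trials):
        x = [random.random() for _ in range(n)]
        if all(x[i] <= x[j] for i, j in relations):
            hits += 1

    print(hits / trials * math.factorial(n))   # hovers around 6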

The favor gets returned as well. The 1991 paper "Counting linear extensions is #P-complete", by Brightwell and Winkler, yields as a corollary that computing the volume of a polytope is #P-hard.

For more on this, Matousek's Lectures on Discrete Geometry is a wonderfully readable source.

Wednesday, March 07, 2007

Computer Scientist:Programming::Mathematician:Arithmetic

One of the things that continues to exasperate me on a regular basis is the conflation of computer science with programming. Consider two recent gems (emphasis mine):
  1. From a press release for Microsoft TechFest 2007:
    Boku, a virtual robot in a simulated world, debuted as a research project to teach kids basic programming skills in a fun and entertaining way. “There is an ongoing and deepening crisis in computer science,” Rashid said. “Our goal is to stem the tide by showing young kids the magic of software programming.”
  2. From an article on changing perceptions of computer science at college and K-12 level:
    East Allen County Schools is working to make sure students are exposed to computer careers, whether they think they might be interested or not. All students are required to take a computer course before graduating, and those who know they are interested can take in-depth courses, including training on Cisco computer networks...
Sigh. Ok people, say after me, slowly: Computer Science IS NOT programming. How many musicians do you think you're going to attract by preaching the exquisite beauty of scales and arpeggios to little kids?

As Lance mentions, the closure of stores like CompUSA is a harbinger of the end of computer science as "television science". The more familiar people get with computers, the more they treat them as appliances rather than as complex devices worthy of worship.

What does this mean ? You aren't going to attract people to a field by saying, "Lookee here! here's a neat television ! Let me show you how to build one. It's FUN!!!!". First of all, people ain't stupid. Secondly, there's a lot more to computer science than programming.

Thankfully, we do see here and there the signs of a manifesto for computer science that doesn't involve actually programming a computer. From Jeannette Wing's CACM article:
Computer science is the study of computation: what can be computed and how to compute it.
Amen to that. And notice how different it sounds from the version you might get from a random person on the street:
Computer science is the study of computers.
If I had to preach the gospel of computer-science-as-computation, I'd probably riff off three things:
'Nuff said.

p.s Chazelle is quickly becoming the poet-laureate for 21st century computer science: check out the table of contents for his course titled, "What do your DNA and your iPod have in common ?"

Simulacrums all the way down...

You can already meet politicians, autograph books, have all kinds of kinky virtual sex, and complain about virtual denizens who spend their time in their virtual houses playing virtual video games. And now you can even get pregnant, virtually speaking (this from the creator of Fable, an Xbox RPG):
“There is the appreciation the wide world feels toward your character as he lives and fights in their world. There is the ability to make love and make babies. Yes, you can be both a man or a woman and if you're a woman, you can get pregnant. A first, he believes, for a main character in an RPG.”

(HT: Penny-arcade)

Friday, March 02, 2007

So so true....

From xkcd, the only geek comic you need ever read:

Monday, February 26, 2007

Dating a result..

The PCP results appeared in conferences in 1992, Sanjeev Arora's thesis appeared sometime after, and the "journal version" appeared in 1998. Similarly, the PRIMES in P result was announced in 2002, but was published in the Annals of Math in 2004.

If I'm trying to write something that talks about the history of a set of results (let's say I'm talking about PCP), it is generally recommended that I cite the definitive version (i.e the journal paper). So I'd talk about a PCP result "published in 1998", which seems silly given that everyone knows it appeared much earlier. Given the lag time of CS journals, this is more of a problem than in other areas.

Is there any clean way out of this dilemma ? Should I cite the journal, but attempt to avoid any mention of dates in the text ? Should I use the "first published date" in the text, and cite both the earlier, conference version, AND the journal version ?

Maybe it doesn't matter, because the only results worth announcing well before publication are so famous that such questions are moot, and "everyone knows when it appeared". I don't have answers here: I'm just confused.

Wednesday, February 21, 2007

Presidential candidate wants to repeal undecidability.

From CNET, via BB, and Ed Felten:
Sen. Sam Brownback (R-Kansas) on Tuesday reintroduced the Truth in Video Game Rating Act, first proposed last September. It calls for requiring video game rating organizations to play all games "in their entirety" before issuing labels and prohibiting game developers from withholding any "hidden" game content from raters. It would also punish ratings groups that "grossly mischaracterize" any game's content.
That's right: he wants to have games played till they terminate, in order to have them certified. I guess it's time to dust off that "Pi = 22/7" legislation.

Fran Allen, Turing Award, 2006

Fran Allen, Fellow Emerita at IBM, has been named this year's Turing Award winner. She is also the first woman to win this award.

Monday, February 19, 2007

WADS deadline fast approaching

For those of you nursing your poor SoCG/STOC/PODS rejects, the WADS deadline is fast approaching. WADS (the Workshop on Algorithms and Data Structures) alternates annually with SWAT, the Scandinavian Workshop on Algorithm Theory, and is being held this year in Halifax, Canada. Lest the title fool you, WADS is actually an honest-to-goodness conference; proceedings are published as an LNCS volume.

The deadline is Feb 23, more than enough time to even invent a problem, solve it, write a snappy intro, and send it off. So get cracking !

Update: The deadline has been moved to Mar 2. Heck, you could write TWO papers in that time.

p.s Not that the WADS folks are asking for my opinion, but I don't like it when conferences shift deadlines.
