Friday, September 21, 2007

Manifold learning and Geometry

In June, Partha Niyogi gave the invited lecture at SoCG. The subject was manifold learning:
Given a collection of (unlabelled) data inhabiting some high dimensional space, can you determine whether they actually lie on some lower dimensional manifold in this space ?
This talk was a great window into an area of machine learning with strong geometric connections, an area where geometers could profitably play and make useful contributions.

Now NIPS (one of the major machine learning conferences) is returning the favor, with a workshop devoted to the topic of Topology Learning:

There is a growing interest in Machine Learning, in applying geometrical and topological tools to high-dimensional data analysis and processing.

Considering a finite set of points in a high-dimensional space, the approaches developed in the field of Topology Learning intend to learn, explore and exploit the topology of the shapes (topological invariants such as the connectedness, the intrinsic dimension or the Betti numbers), manifolds or not, from which these points are supposed to be drawn.

Applications likely to benefit from these topological characteristics have been identified in the field of Exploratory Data Analysis, Pattern Recognition, Process Control, Semi-Supervised Learning, Manifold Learning and Clustering.

However it appears that the integration in the Machine Learning and Statistics frameworks of the problems we are faced with in Topology Learning, is still in its infancy. So we wish this workshop to ignite cross-fertilization between Machine Learning, Computational Geometry and Topology, likely to benefit to all of them by leading to new approaches, deeper understanding, and stronger theoretical results about the problems carried by Topology Learning.

The list of invited speakers bridges geometry, topology and learning. It looks like a great forum for continuing the machine learning-computational geometry cross-talk that Partha kicked off at SoCG.

Thursday, September 20, 2007

Modelling via the algorithmic lens

Much of my time is spent on the interface between algorithms research and other areas, in the dreaded trenches of 'algorithmic modelling'. Algorithmic modelling can be a tremendously useful tool, if wielded skillfully, and in the few cases where one hits the jackpot, you have an instant argument for the value of theoryCS outside computer science. Most importantly, it provides a kind of rationale for the intrinsic value of the algorithmic method: that thinking algorithmically leads to new ways of thinking about computational problems, and hooks problems in other domains into the engine of algorithmic improvements, so that better algorithm design leads quickly to better solutions to "real-world" problems.

Such work, however, can be an endless source of frustration. Most commonly, there are cultural differences between areas, and terminology barriers that need to be overcome. But most of all, there's a sense of apathy towards any effort at producing formal guarantees for computational methods (whether on running time or on solution quality). In general, methods based on formal algorithmic analysis must still prove themselves in the experimental setting, rather than possessing any intrinsic value by virtue of being correct, or having running time and quality guarantees.

This in itself can be defended: if the main bottlenecks in providing an effective practical solution to a problem are not computational (though they often are), then it's not clear that a practitioner needs to care about esoteric notions like formal guarantees (which can often be quite weak). What is harder to surmount is the 'I've dug my hole and I like it' phenomenon.

For example, the RMS (root-mean-square) distance is commonly used to compare two 3D protein models. But I've never heard any fundamental argument as to why this measure is superior to other possible candidates for matching shape. What's worse, the RMS measure is harder to minimize (under transformations) than other equally plausible measures of shape similarity. However, since it is the status quo, it's hard to change this, short of voluminous studies arguing for the relative merits of a different measure based on comparisons across many shapes.
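To make the "minimize under transformations" point concrete, here's a minimal numpy sketch of the usual RMS computation after optimal rigid alignment (the standard Kabsch/SVD procedure). The function name and the toy example are mine; this is an illustration, not anything resembling production structural-biology code.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two N x 3 point sets after optimal rigid superposition.
    A minimal sketch of the standard Kabsch procedure: center both sets,
    find the best rotation via an SVD, then measure the residual."""
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    # covariance matrix and its SVD
    M = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(M)

    # correct for a possible reflection so R is a proper rotation
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    diff = Pc @ R.T - Qc
    return np.sqrt((diff ** 2).sum() / len(P))

# toy example: Q is a rotated copy of P, so the RMSD should be ~0
rng = np.random.default_rng(0)
P = rng.random((10, 3))
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
Q = P @ Rz.T
print(kabsch_rmsd(P, Q))
```

The point is that even for RMS, the "easy" case, the minimization over transformations already requires this SVD machinery; for other shape measures the corresponding optimization can be considerably messier.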

Lest it seem that I'm merely whining, Barbie-style, because 'Modelling is hard', let me summarize my main point:

The computational efficacy and guarantees associated with a particular modelling technique are not in and of themselves sufficient to convince a practitioner that this approach is valid, even if in all other respects, the method is identical to other, more entrenched methods.

This makes me wonder about the best way to do algorithmic modelling, if one chooses to do it at all.

Sunday, September 16, 2007

Sunday afternoon MathTubing

Via John Baez, learning category theory on YouTube ! Now what we need is a YouTube video for Scott Aaronson's soap film demonstrations.

The videos are listed under the category, 'How-to and DIY'. Indeed ! Make your own monads, in 9 minutes or less !

Saturday, September 15, 2007

The hardness of fair redistricting

Through a series of pointers, I ended up at Brian Olson's page on redistricting, where he proposes an algorithm for "fair redistricting" designed to avoid the problems of gerrymandering. The measure proposed appears to be a discrete k-median variant: namely, cluster "blocks" into k districts such that the average distance from a block to the center of its district (defined as the centroid of the collection of blocks) is minimized. His code is online, and appears to use a GA-type heuristic. This is essentially a planar k-median problem, and is approximately solvable by a variety of methods.
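For a sense of what "a variety of methods" could look like, here's a bare-bones Lloyd-style heuristic for the centroid-based objective described above. To be clear, this is not Olson's GA code: the block coordinates are made up, and real redistricting would also have to enforce contiguity and population balance, which this sketch ignores.

```python
import numpy as np

def cluster_blocks(blocks, k, iters=100, seed=0):
    """Lloyd-style heuristic for the centroid-based districting objective:
    assign each block to its nearest center, move each center to the
    centroid of its district, repeat. Purely an illustration of the
    objective, not a real redistricting tool."""
    rng = np.random.default_rng(seed)
    centers = blocks[rng.choice(len(blocks), size=k, replace=False)]
    for _ in range(iters):
        # distance from every block to every center
        d = np.linalg.norm(blocks[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        new_centers = np.array([blocks[assign == j].mean(axis=0)
                                if np.any(assign == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    avg_dist = np.linalg.norm(blocks - centers[assign], axis=1).mean()
    return assign, centers, avg_dist

# toy instance: 200 random "blocks" in the unit square, 4 districts
blocks = np.random.default_rng(1).random((200, 2))
assign, centers, cost = cluster_blocks(blocks, k=4)
print(cost)
```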

In reading some of the (critical) comments on the original posts (mostly of the form (a) this has been proposed since the early 60s, and (b) automation is not the hardest problem in redistricting), I was led to an award-winning political science dissertation by Micah Altman, in which (among other things) he shows that most of the reasonable redistricting formulations (although not the one above) are NP-hard (they tend to be variants of set packing). It was interesting to read a lengthy discourse on computational complexity and NP-hardness in the middle of such a dissertation.

Monday, September 10, 2007

An interesting blog

Derrick Coetzee writes a blog at MSDN titled, 'Developing for Developers'. There's lots of stuff related to code development, but also articles on general computing and theory.
None of this is necessarily new to people reading this blog, but it's heartening to see an audience for this material in the developer community. And of course, anyone who says,
In reality, computing has nothing to do with computers. Computing is fundamentally about solving problems using strictly defined processes under resource constraints. The theory of computing, by design, has little regard for what device you use to carry those processes out.
will always have a special spot in my blogroll :)

Sunday, September 09, 2007

The TheoryCS blogosphere flourishes !

Jeez. I leave my computer for two hours, and four different bloggers announce the SODA accepts. If for some extremely obscure reason you haven't yet seen the postings, here's the list.

Saturday, September 08, 2007

Carnival of Mathematics XVI

is up, courtesy Kurt van Etten @ Learning Computation. The theme is 'Good, Bad and Ugly', and for some reason, computer science is "Ugly" :). Given that two of the entries involve mathematical feuds, maybe this is fitting.

Physician, heal thyself ?

There's been a lot of grief over the reduction in price for iPhones only a few months after they were released. Wired magazine interviewed people who don't regret paying the higher price for the virtue of being an early adopter/arbiter-of-cool, and this comment caught my eye (emphasis mine):
"If they told me at the outset the iPhone would be $200 cheaper the next day, I would have thought about it for a second - and still bought it," said Andrew Brin, a 47-year-old addiction therapist in Los Angeles. "It was $600 and that was the price I was willing to pay for it."
I don't know: I think there are some other people who should get their money back.

Wednesday, September 05, 2007

Heuristics for NP-hard problems

A while back, Michael was discussing the teaching of heuristics in his undergraduate algorithms class, and had this request:
I wish the standard textbooks would devote a solid chapter on heuristic techniques; in real life, most students are probably more likely to have to implement a heuristic for some NP-hard problem than to have to implement a minimum spanning tree algorithm.
If you remove the word 'standard', he may just have gotten his wish. From the press blurb for the Handbook of Approximation Algorithms and Metaheuristics:
Delineating the tremendous growth in this area, the Handbook of Approximation Algorithms and Metaheuristics covers fundamental, theoretical topics as well as advanced, practical applications. It is the first book to comprehensively study both approximation algorithms and metaheuristics. Starting with basic approaches, the handbook presents the methodologies to design and analyze efficient approximation algorithms for a large class of problems, and to establish inapproximability results for another class of problems. It also discusses local search, neural networks, and metaheuristics, as well as multiobjective problems, sensitivity analysis, and stability.
Browsing the table of contents, the reader encounters an entire section on heuristics, with chapters on local search, stochastic local search, neural networks, genetic algorithms, tabu search, simulated annealing, ant colony optimization and the like. The book is published by CRC, which stands for "Organization-That-Singlehandedly-Destroys-Rain-Forests-While-Making-Your-Biceps-Extremely-Strong".
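To give a flavor of what those chapters cover, here's a toy simulated annealing run on a tiny max-cut instance. This is my own bare-bones illustration of the metaheuristic template (random moves, accept improvements, occasionally accept bad moves while "hot"), not an excerpt from the handbook, and it is not tuned for real instances.

```python
import math
import random

def anneal_max_cut(edges, n, steps=20000, t0=2.0, seed=0):
    """Toy simulated annealing for max-cut: flip one vertex at a time,
    always accept improvements, and accept worsening moves with a
    probability that decays as the temperature cools."""
    rng = random.Random(seed)
    side = [rng.randint(0, 1) for _ in range(n)]
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    value = sum(w for u, v, w in edges if side[u] != side[v])
    best, best_side = value, side[:]
    for step in range(steps):
        t = t0 * (1.0 - step / steps) + 1e-9   # simple linear cooling
        v = rng.randrange(n)
        # change in cut value if we flip vertex v
        delta = sum(w if side[v] == side[u] else -w for u, w in adj[v])
        if delta >= 0 or rng.random() < math.exp(delta / t):
            side[v] = 1 - side[v]
            value += delta
            if value > best:
                best, best_side = value, side[:]
    return best, best_side

# 4-cycle plus one chord; the optimal cut has weight 4
edges = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1), (0, 2, 1)]
print(anneal_max_cut(edges, n=4)[0])
```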

I generally have mixed feelings about CRC books: they are good reference texts, but tend to skim over content way too easily, and CRC has this unhealthy obsession with handbooks for everything. But for teaching heuristics, this might be an invaluable reference.

I should mention that the vast majority of the handbook deals with "proper" approximation algorithms, as well as applications to specific areas. The problem is that approximation algorithms is a fast-moving field (though maybe not as fast as in the mid-90s), and so a handbook, being a static snapshot in time, will get dated quickly.

Tuesday, September 04, 2007

Teacher's Day

I've been out of India for far too long, or so it seems. I had to be reminded by this article that Sep 5, Wednesday, is Teacher's Day in India. This day was chosen because it coincides with the birthday of India's second president, S. Radhakrishnan, a statesman famous for his erudition and learning. According to Wikipedia, only 21 countries have an official celebration of Teacher's Day. Much to my surprise, the US is among them, with the day being part of a week called Teacher Appreciation Week.

Teacher's day in my school was always a big event; teachers did not teach that day, and came to school mainly to be entertained and feted in an assembly with variety shows/skits/gentle mockery. Senior students would mind the junior classes. When I was younger, this meant a day of partying in school. When I became one of the seniors, it meant I could come to school in "civilian clothing" (most schools in Delhi had strict uniform policies) and boss over the younger students. Good times...

Now I'm a teacher myself (after a fashion), and find it both easy and fashionable to grumble about the lack of respect shown by students in the US, as compared to students in India. The truth of course is far more complicated than that. Schools in Delhi appear to have acquired many of the "bad traits" of the worst high schools here; my mother has taught in Delhi schools for nearly 20 years and has seen them acquire more "western" habits (which is shorthand for more boy-girl interaction, less "studiousness", less fear/reverence for the teachers, take your pick). And ultimately, as a university professor, I'm not even on the front lines of education here in the US, and have no business complaining about the largely excellent students I teach.

In any case, here's a cheer for the teachers I had; the ones who tolerated my brashness, general arrogance, and constant questions, and helped me reach the teaching pedestal I find myself at today. And here's hoping that one day there'll be some student who might think as fondly of me as I think of all my teachers long past.

Monday, September 03, 2007

NP hardness of Euclidean k-median clustering

Suppose you're given a metric space (X, d) and a parameter k, and your goal is to find k "clusters" such that the sum of cluster costs is minimized. Here, the cost of a cluster is the sum (over all points in the cluster) of their distance to the cluster "center" (a designated point).

This is the well-known k-median problem (which differs from the also-popular k-means problem by using distances rather than squared distances). In a general metric space, the k-median problem is known to be NP-hard, as well as being hard to approximate to within arbitrarily small constant factors (so no PTAS is possible). The current best known approximation ratio for k-median is 3 + eps, via a local search heuristic due to Arya, Garg, Khandekar, Meyerson, Munagala and Pandit, improving on the earlier factor of 4 due to Charikar and Guha (thanks, Chandra).
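Since local search comes up here, this is roughly what the single-swap version looks like on a toy metric. The sketch below is my own illustration of the move structure analyzed by Arya et al.; it omits the (1 + eps)-improvement threshold needed for their polynomial running-time bound, and the instance is made up.

```python
import itertools
import random

def k_median_local_search(dist, k, seed=0):
    """Single-swap local search for discrete k-median: start from k
    arbitrary centers and repeatedly swap one center for one non-center
    whenever that lowers the total assignment cost. dist is a full
    n x n distance matrix."""
    n = len(dist)
    rng = random.Random(seed)
    centers = set(rng.sample(range(n), k))

    def cost(cs):
        return sum(min(dist[i][c] for c in cs) for i in range(n))

    improved = True
    while improved:
        improved = False
        current = cost(centers)
        for out_c, in_c in itertools.product(list(centers), range(n)):
            if in_c in centers:
                continue
            cand = (centers - {out_c}) | {in_c}
            if cost(cand) < current:
                centers, current, improved = cand, cost(cand), True
                break
    return centers, cost(centers)

# toy metric: points on a line in three obvious groups
pts = [0, 1, 2, 10, 11, 12, 20]
dist = [[abs(a - b) for b in pts] for a in pts]
print(k_median_local_search(dist, k=3))   # should pick one center per group
```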

If the underlying metric is the Euclidean metric, then the problem complexity changes: in fact a PTAS for the Euclidean k-median can be obtained, due to the results of Arora, Raghavan and Rao, and then Kolliopoulos and Rao (who provide an almost-linear time algorithm). But as far as I am aware, there is still no NP-hardness proof for the Euclidean k-median problem, and I'd be interested in knowing if I am wrong here.

Note that the related problem of Euclidean k-means is known to be NP-hard from an observation by Drineas, Frieze, Kannan, Vempala and Vinay.

Update: As pointed out in the comments, the NP-hardness of k-median in the plane was shown in 1984 by Megiddo and Supowit. Once again, the wisdom of the internet beats independent research ;)

Tenure committee is to Colombian drug cartel as ...

Via HWTW, a hilarious advice piece to new profs by Phil Ford at Inside Higher Ed, adapted from a Notorious B.I.G rap, "The Ten Crack Commandments". The article is great, but the comments are even funnier.

"Data Mining" = voodoo science ?

On the Statistical Modeling blog, Aleks Jakulin has a rant on the virtues of data mining:

I view data analysis as summarization: use the machine to work with large quantities of data that would otherwise be hard to deal with by hand. I am also curious about what would the data suggest, and open to suggestions. Automated model selection can be used to list a few hypotheses that stick out of the crowd: I was not using model selection to select anything, but merely to be able to quantify how much a hypothesis sticks out from the morass of the null.

The response from several social scientists has been rather unappreciative along the following lines: "Where is your hypothesis? What you're doing isn't science! You're doing DATA MINING !"
I had almost the same reaction a while back when I was visiting JPL: the climatologists there were horrified at the idea of trolling for patterns in climate data, and to a person asked me the dreaded "But what is the science question?" question. Of course, given the general hot-potato-ness of climatology right now, one might sympathize with their skittishness.

Data mining is a tricky area to work in, and I've discussed this problem earlier. It's a veritable treasure chest of rich algorithmic problems, particularly in high-dimensional geometry over large data sets. However, it's often difficult to get a sense of forward progress, since the underlying analysis questions can seem like elaborate fishing expeditions.

In that context, the distinction Aleks makes between confirmatory data analysis (check whether the data validates or invalidates a hypothesis) and exploratory data analysis (play with the data to create a non-uniform distribution over plausible hypotheses) is quite helpful. It also emphasizes the interactive and very visual nature of data mining; interactive tools and visualizations are as important as the underlying analysis machinery.

Update: Chris Wiggins points me to some of the earlier references to 'data mining'. One of the most vituperative is a 1983 paper by Michael Lovell in The Review of Economics and Statistics. This paper drips with scorn for 'data miners', but makes a point that is at the very least worthy of consideration: namely, that because of the large dimensionality of the space of hypotheses that a data mining application typically explores (here couched in terms of explanatory variables for a regression), patterns with apparently significant (i.e., small) p-values might not actually be that significant (or stated another way, in high dimensional spaces, there are many seemingly rare patterns that aren't that rare).
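A quick simulation makes Lovell's point concrete: test enough pure-noise "explanatory variables" and a predictable fraction of them will clear the usual significance bar. The numbers below are made up purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m = 50, 1000      # 50 observations, 1000 candidate explanatory variables
y = rng.standard_normal(n)             # the "response" is pure noise
hits = 0
for _ in range(m):
    x = rng.standard_normal(n)         # so is each candidate predictor
    _, p = stats.pearsonr(x, y)
    if p < 0.05:
        hits += 1
print(hits, "of", m, "noise predictors pass the 5% significance bar")
# roughly 0.05 * m = 50 of them will, which is exactly the point:
# mine enough hypotheses and nominal p-values overstate significance.
```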

Thursday, August 30, 2007

Parallel algorithms: A resurgence ?

My department has a fairly strong systems and hardware group, and some of the buzz I hear involves the research opportunities in multicore architectures (I'm talking MANY MANY cores, not 2 or 4), and how parallel computation is relevant once again. In fact, I've had one request to include basic material on parallel algorithms in my graduate algorithms class.

Juxtaposed with that is the fact that the Stein-enhanced version of Introduction to Algorithms does NOT contain the chapter on parallel algorithms that the old CLR used to have. In fact, I'll probably use the chapter outline from CLR if I cover parallel algorithms at any level. I wonder what the thinking was behind removing the chapter on parallel methods, and whether this might be reversed in future editions.

Update: (that was quick! - thanks, Cliff): Tom Cormen writes in to explain the decision to remove the chapter on parallel algorithms:
Cliff Stein pointed me to this blog. Suresh asked why we removed the chapter on parallel algorithms when we wrote our second edition.

The chapter on parallel algorithms in the first edition was exclusively on PRAM algorithms. We wrote the second edition mostly during the late 1990s, into early 2001. At the time, PRAM algorithms were yesterday's news, because the PRAM model was too far from what could be built realistically. We considered including algorithms for some other parallel computing model, but there was no consensus model at the time. We felt that the best decision was to just leave out parallel algorithms and use the pages for other new material.
I'm not surprised. It *was* true that by the time the 2nd ed arrived, PRAMs were yesterday's news: in fact, streaming and external memory methods were "hotter" at the time. It'll be interesting to see if multicore actually does spur new interest in parallel algorithms, and not just parallel architectures. In recent months I've participated in discussions about algorithm design for things like the Cell and the implied architecture of nVidia's CUDA, and it's surprising how often standard parallel algorithm techniques like pointer jumping are being reinvented.
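For readers who haven't seen pointer jumping, here's the canonical example (list ranking), with the parallel rounds simulated sequentially in a plain loop; on a PRAM or a many-core machine, each round would be a single parallel step, for O(log n) rounds overall. This is a textbook-style sketch, not Cell or CUDA code.

```python
def list_rank(successor):
    """Pointer jumping / list ranking: each node learns its distance to the
    end of a linked list by repeatedly replacing its pointer with its
    pointer's pointer, doubling the distance covered each round.
    successor[i] is the next node, or i itself at the tail."""
    n = len(successor)
    nxt = list(successor)
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    # O(log n) rounds of jumping; each round is one parallel step on a PRAM
    while any(nxt[i] != nxt[nxt[i]] for i in range(n)):
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
    return rank

# list 0 -> 1 -> 2 -> 3 -> 4 (4 is the tail)
print(list_rank([1, 2, 3, 4, 4]))   # [4, 3, 2, 1, 0]
```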

What makes the new platforms more interesting is that there are features (especially streaming features) that make the mapping to "standard" PRAM models not so obvious. It may not merely be a matter of shoehorning new systems into parallel models, but of extending and modifying the models themselves.

Tuesday, August 28, 2007

Streaming Summer School Report

(Ed: This post is by Piotr Indyk, who is always willing to be your near neighbor)

Greetings from the Kingdom of Denmark! The country of Vikings, meatballs, and football teams that just refuse to win hosted the Summer School on Data Stream Algorithms last week (August 20-23). The school was organized under the banner of MADALGO, a new research center dedicated to MAssive DAta ALGOrithms, set up at Aarhus University. The inauguration ceremony for the center took place on August 24, with several people giving invited lectures.

Muthu (one of the invited lecturers) has covered the inauguration ceremony, so I will skip the detailed description. Suffice it to say, it was a pleasure to see that the Danish Research Foundation (or as the locals like to say, Grundforskningsfond) is eager to support an algorithmic research center with a budget of roughly $10M over 5 years, while its US counterpart spends about $7M per year on the entire Theory of Computing program. Did I mention that the population of Denmark is roughly 2% of that of the US ?

Anyway, back to the summer school. We had 70+ participants altogether, including 5 lecturers. The school covered the following topics:
  • The dynamic Sudipto Guha gave two lectures. The first lecture was on algorithms for clustering. Massive amounts of data were clustered, including metric data, graph data, and a few careless participants sitting in the first row. In the second lecture, Sudipto covered the "random stream model", where the elements are assumed to be arriving in a random order, which circumvents the usual worst-case paranoia.
  • The twin duo of T.S. Jayram and Ravi Kumar covered lower bounds: communication complexity, information complexity, and generally "everything you wanted to know but were afraid to ask". It was the first time I had seen the details of the linear-space lower bound for estimating the L_infty distance, and I am happy to report that I understood everything, or at least that is what I thought at the time. Jayram and Ravi also occasionally ventured into the land of upper bounds, covering algorithms for longest increasing subsequences and probabilistic data streams.
  • The scholarly Martin Strauss gave an overview of the algorithms for finding frequent elements, heavy hitters (sometimes on steroids) and their more recent versions used in compressed sensing.
  • I covered the basic upper bounds for L_p norm/frequency moment estimation, as well as algorithms for geometric data (clustering, MST, matching), notably those based on core-sets. The latter topic was originally supposed to be covered by Sariel Har-Peled; however, the dark forces (er, the highly enlightened and insightful geniuses of the INS [Sariel's corrections]) jeopardized his plans. I guess the force was not strong enough with this one...
We also had an open problem session. Some of the problems were copy-pasted from the "Kanpur list", but a few new problems were posed as well. The list will be available shortly on the school website, so sharpen your pencils, prepare your napkins, pour some coffee, and ... give all of this to your students!

The lecture slides are also available on-line. If you spot any typos, let the lecturers know.

Overall, I think the school was a success, perhaps with the notable exception of the weather: it started to rain promptly after the school began, and it stopped when the school ended. One has to admire the timing, though.

SOCG 2009 will be held in Aarhus. See you then!

(Ed: But what about the beer report ?)

Friday, August 24, 2007

Infinite trees

If you followed Andy Drucker's series of posts on infinite trees and were craving more, be prepared for more foliage:

Via Ars Mathematica, a pointer to Keith Devlin's column on a bizarre counterpoint to König's tree lemma: namely, if you have an uncountable tree in which every level is countable, it is not necessarily true that an uncountable path must exist.

Sunday, August 12, 2007

For the love of fonts....

Continuing my part-time obsession with fonts (Helvetica II: Arial strikes back !), allow me to present this delightful tale about how the right font can save lives:
What I saw, Pietrucha knew, was what we all may see soon enough as we rush along America’s 46,871 miles of Interstate highways. What I saw was Clearview, the typeface that is poised to replace Highway Gothic, the standard that has been used on signs across the country for more than a half-century. Looking at a sign in Clearview after reading one in Highway Gothic is like putting on a new pair of reading glasses: there’s a sudden lightness, a noticeable crispness to the letters.
It's a fascinating tale of 10 years of research and lobbying that went into replacing the fonts used on all the Federal highway signs in the U.S. I was driving along I-15 today and almost got into an accident trying to tell whether the local highway sign fonts had changed. Apart from the inside-baseball of how to design a font, always guaranteed to send a shiver through my spine, the story draws out the tale of how the team of designers got together, designed the font, and managed, after repeated lobbying, to convince the Department of Transportation to replace the original fonts with the new ones.

There was an amusing line in the article about American-style engineering:
The letter shapes of Highway Gothic weren’t ever tested, having never really been designed in the first place. “It’s very American in that way — just smash it together and get it up there,” says Tobias Frere-Jones, a typographer in New York City who came to the attention of the design world in the mid-1990s with his Interstate typeface inspired by the bemusing, awkward charm of Highway Gothic. “It’s brash and blunt, not so concerned with detail. It has a certain unvarnished honesty.”
If you remain unconvinced about the difference, look at the accompanying slideshow.

Thursday, August 09, 2007

SGERs being replaced ?

The new NSF news feeds are great. I get to hear all kinds of interesting tidbits about the NSF, including some recent chatter on encouraging "transformative research". One direct consequence of this chatter is this (emphasis mine):

[NSF Director] Bement yesterday proposed a three-pronged strategy before the task force on transformative research. It was unanimously adopted by the task force on Tuesday and then unanimously adopted by the Board on Wednesday. Bement's plan will:

1. Infuse support of potentially transformative research throughout NSF and all of its programs;

2. Learn how to facilitate potentially transformative research; and

3. Lead the community through opportunities for potentially transformative research proposal submissions.

[...]

To lead the community, Bement will embark on a three-year trial, during which NSF will replace small grants for exploratory research with a two-tiered "early-concept" award mechanism. Tier I will call for limited funding grants that are internally reviewed. Tier II will entail larger grants requiring additional levels of review. Further, NSF will create a working group to recommend implementation details; a mechanism to monitor and track impact and lessons-learned; and a method to advertise the new approach to the community.

Monday, August 06, 2007

Things that make you pull your hair out in despair

I was recently visiting AT&T for a few weeks, and I was lucky enough to catch a talk by Amit Chakrabarti on lower bounds for multi-player pointer jumping. A complexity class that figured prominently in his talk was ACC0, which consists of constant-depth, unbounded-fanin, polynomial-sized circuits with AND, OR, NOT and MOD-m gates, taking the union over all m.

Suppose we make our life simple by fixing m to be a single prime, like 3. It turns out that the corresponding class AC0[m] is strictly contained in NC1: this follows from results of Razborov and Smolensky in the 80s. Now suppose that instead of picking a prime, we choose a number like 6, which is the product of distinct primes. We do not even know whether AC0[6] = NP or not !

Wednesday, August 01, 2007

Saving lives with exact algorithms

It's not often you get to say this in a paper:
We aim for exact algorithms [because] ... any loss of optimality could lead to unnecessary patient deaths.
Anyone who's gone through an algorithms class will at some point hear about stable marriage algorithms, and how the method is used to match hospitals and newly minted MDs looking for residencies.

Consider now the far more serious problem of matching kidney transplant candidates to potential donors. Because transplant lists are long, and cadaver donors are few, marketplaces matching healthy donors to recipients have sprung up in the US. For complicated ethical reasons (which are not without controversy), such exchanges are not made for money, and are viewed as gifts.

So what happens if a donor kidney doesn't match the potential recipient ? Ordinarily, nothing. Suppose, however, that there is another donor-recipient pair with a similar mismatch, such that if the two donors were swapped, both transplants could go through. What if a 3-way cycle of matches could be found ? This is the problem of 'clearing' the market, and it is the subject of a paper by Abraham, Blum and Sandholm to appear in EC.

I'll get into the problem statement in a second. What's far more important is that the results of this paper have already been used to find potential transplant matches where none existed ! The Alliance For Pair Donation has been using this method since last December for finding potential matches, and has already found matches that prior methods would have missed. This is an incredible achievement: working on a problem abstracted from a real life-or-death scenario, and actually taking the results back to the source and making a difference.

Technically, the problem is easily stated. Given a directed graph with weights on edges, and a parameter L, find a maximum-weight collection of vertex-disjoint cycles, each of length at most L. Vertices are agents with items (in this case, transplant recipients paired with donors). The weight on an edge represents the utility to the source of obtaining the sink's item (a donor transfer). The L-constraint reflects reality: all transplants in a cycle have to be performed simultaneously (so that no donor can back out after their partner has already received a kidney), and it's not feasible to perform more than a few (typically 3) of these transplants simultaneously.

The line I quoted at the beginning of the post comes as part of an explanation as to why they want exact algorithms for the problem (it's NP-hard for L >= 3). The technical contributions include finding a way to run an integer-linear program at scale for graphs with hundreds of thousands of nodes.
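To see what the cycle formulation looks like, here's a brute-force version on a toy instance: enumerate every directed cycle of length at most L, then pick a maximum-weight vertex-disjoint subset of them. This is my own sketch for illustration only; the paper's contribution is making essentially this formulation, as an integer program, scale to realistic graph sizes, which this toy code does not attempt.

```python
from itertools import permutations

def best_cycle_cover(weights, L):
    """Exact (brute-force) maximum-weight vertex-disjoint cycle cover with
    cycle length at most L. weights[u][v] is the utility of pair u
    receiving pair v's donor kidney (None if incompatible)."""
    n = len(weights)
    # enumerate simple directed cycles of length 2..L (up to rotation)
    cycles = []
    for length in range(2, L + 1):
        for perm in permutations(range(n), length):
            if perm[0] != min(perm):   # canonical rotation, to avoid duplicates
                continue
            edges = list(zip(perm, perm[1:] + (perm[0],)))
            if all(weights[u][v] is not None for u, v in edges):
                w = sum(weights[u][v] for u, v in edges)
                cycles.append((w, frozenset(perm)))

    # exact search over disjoint subsets of cycles
    def search(i, used):
        if i == len(cycles):
            return 0
        best = search(i + 1, used)                 # skip cycle i
        w, verts = cycles[i]
        if not (verts & used):                     # take cycle i if disjoint
            best = max(best, w + search(i + 1, used | verts))
        return best

    return search(0, frozenset())

# toy instance: 4 incompatible patient-donor pairs
X = None
weights = [[X, 1, X, 1],
           [1, X, 1, X],
           [X, 1, X, 1],
           [1, X, 1, X]]
print(best_cycle_cover(weights, L=3))   # two 2-way swaps: total weight 4
```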

(Via NSF Current)
