Friday, March 25, 2011

Permanent record of work

In our hurry to tar and feather the ACM, the IEEE and other LargePubs, I'm not sure we are quite ready to face the universe that will result, and the amount of work we'll need to do on our own.

Consider:
  • I was recently asked if I had presentation slides for my paper on universal MDS. I managed to ferret them out from my collection of talks, and sent them over. What I should have also done was add them to the paper page as well, but I've been busy and haven't got around to it (I have other talks that I need to upload as well).
  • This CS professor asks on reddit: "Where should I host code for the paper I just wrote ?". Good answers are provided, with github.com being the most popular choice.
Researchers are peripatetic: join company, leave company, join university, leave university for other university, leave university for company, rinse and repeat.  The obvious way to keep a single fixed repository of your work is to maintain your own website with your own domain, and many researchers now do that.

But it is a pain in the neck. Granted, it's a necessary pain in the neck, but there was something to be said for being able to write a paper, ship it off somewhere, and have it be maintained by someone else.

The arxiv is doing a pretty good job of this for papers, as long as it can continue to receive funding, but what about talks, code fragments, and other related material that goes into research ?

Saturday, March 19, 2011

Lego lectures

I've been pondering the matter of lecture notes of late, and Luca's announcement of his new set of notes prompted me to write this.

Here's what often happens to me. I'm looking to either hand out or prepare an outline for a lecture (or set of lectures) on topic X. Under the principle of (code) reuse, I go hunting for lecture notes that I can link to. I'll often find three or four examples of almost what I need, but either there'll be background information that I have to provide links for as well, or maybe the treatment isn't quite what I wanted.

It seems to me that what one needs are Lego lectures (this is the term coined by my colleague Matt Might when I described my solution to him). My inspiration for this idea comes from reading Dexter Kozen's book on complexity theory.

So what are Lego lectures ?
  • One set of notes, a few pages or less, on a SINGLE topic. Very focused, and usually one main theorem. In Kozen's complexity notes, each lecture is (almost) one main result in complexity theory.
  • As little referencing of prior notes as possible, and notation declared when needed. 
The idea is that if you have a collection of lego lectures on all topics, you can cobble together a class on a topic relatively easily, and give it your own angle.

I've probably written only two lego lectures in my life: one on tail bounds, and one on the FFT. But they have turned out to be immensely useful, and get reused all the time. I think that from now on I'll model my lecture notes on the lego principle.

Sunday, March 13, 2011

SoCG 2011: New Venue and hotel recommendations

I had mentioned a few days ago that the SoCG 2011 venue was set to change. The new venue has been announced: it's at UICP, right near the Eiffel tower (which, I might add, looks like a snapshot of a negatively curved mesh). There are hotel recommendations as well, but it sounds as if even if you made reservations at a hotel near the previous location, it won't be too much trouble to get around. Aah, the joys of efficient public transport.

In any case, do those registrations now, and get those hotel rooms booked. As a former organizer, I know how nerve-wracking it can be while you wait for the registration count to climb towards respectability :).

Monday, February 28, 2011

NSF and government shutdown

Amidst all the other chaos that will ensue if the government shuts down, there will also be NSF problems. Heard on the grapevine:
If the Federal government does shut down on March 4, you will not have access to FastLane for the duration of the shutdown.  On the NSF side, no one will have access to NSF email, and we will not be permitted to conduct government business using alternate email.
For example, this could be rather annoying for anyone trying to submit a CPS proposal.

Saturday, February 26, 2011

Arch-blogging

The Utah blogging juggernaut (random fact: juggernaut is a corruption of Jagannath, from the festival of Lord Jagannath in Orissa in which people would throw themselves in front of the giant chariot carrying an image of the god) chugs on. Our newest member is Rajeev Balasubramonian, who might very well be the first computer architecture blogger. The blog itself is intended as a group blog of the Utah architecture group, whose cleverly-chosen logo is the Delicate Arch (get it ? Arch?). Read the blog manifesto here.

Idle thought: which academic department has the highest ratio of bloggers/faculty ? Or blogs/faculty ?

Phone interviews ?

Phone interviews are common in the industrial hiring scene. As a candidate, you go through a phone interview (or even two) and if you make the cut, you fly out for an actual interview.

I'm seeing phone interviews become more common in the academic world. You get a phone interview with some member of the recruiting committee, and the actual interview follows if you make the cut.

Has this always been like this (and am I just misremembering?) or is it a new trend ? Phone interviews can make a lot of sense if you want to weed out candidates who might look promising on paper, and it's a clear cost-saver (and time saver), but I haven't seen them be so common in academic settings. Maybe it's because there are more people chasing each slot and so these filters are more necessary now ?

Tuesday, February 22, 2011

If you're making plans for SoCG 2011...

Don't. Yet.
Important notice: in order to solve some logistical issues and to reduce the costs, the exact location of the conference is likely to change within the next two weeks. The final location (still downtown Paris) will be specified here by March 8th at the latest. Please accept our apologies for the inconvenience, and feel free to contact us if you have any questions or concerns.
Although if you don't need to find a hotel right next to the conference venue, this will not affect you.

Saturday, February 19, 2011

Agnostic/Non-agnostic Learning and Nets/Samples of Range Spaces

There are deep connections between geometry and learning that go back to the origins of VC dimension and include concepts like support vector machines, lifting maps, kernel functions, and iterative reweighting. What is often difficult is the translation problem of going between machine learning concepts and the corresponding (often identical) geometric concepts. One example of this that I've talked about is the two different viewpoints on characterizing isometric embeddability in an $\ell_2$ space.

I strongly believe that with a better "dictionary", we might find a lot more cross-fertilization between geometry and learning, and this would be beneficial for both communities. The 2009 summer school on geometry and learning was a good example of this: alas, it almost completely coincided with SoCG 2009. This summer, there is again a summer school on geometry and learning, held just before and in conjunction with SoCG 2011 (shameless plug: I'll be lecturing there), and I look forward to seeing more such events.

In this post, my postdoc Jeff Phillips and student Avishek Saha explore a fascinating connection between nets and samples in range spaces on the one hand, and agnostic/non-agnostic learning on the other. Avishek is a machine learning specialist, and Jeff is of course an expert in approximate geometry, sampling and uncertainty. So it is fitting that their joint conversation led to this post.




There is a deep connection between some concepts in combinatorial geometry and learning. It turns out that $\varepsilon$-approximations and $\varepsilon$-nets of range spaces correspond to sample complexity in agnostic learning and non-agnostic learning, respectively. We noticed this last fall while reading both a cute paper by Gandhi, Suri, and Welzl and a book by Anthony and Bartlett.

The geometry:
Given a data set $P$ and a family of subsets $\mathcal{A} \subset 2^P$, the pair $(P,\mathcal{A})$ is a range space. An $\varepsilon$-approximation of $(P,\mathcal{A})$ is a subset $Q \subset P$ such that
$\max_{R \in \mathcal{A}} \left| \frac{|R|}{|P|} - \frac{|R \cap Q|}{|Q|} \right| \leq \varepsilon.$
An $\varepsilon$-net of $(P,\mathcal{A})$ is a subset $N \subset P$ such that for all $R \in \mathcal{A}$ with $|R| \geq \varepsilon |P|$, we have $|R \cap N| \geq 1$. Informally, the $\varepsilon$-approximation approximately preserves the density of every range, and the $\varepsilon$-net hits every large enough range.

The VC-dimension $v$ of a range space $(P,\mathcal{A})$ informally describes the description complexity of a range $R \in \mathcal{A}$. For instance, halfspaces in $\mathbb{R}^d$ have VC-dimension $d+1$. A random sample $Q \subset P$ of size $O(v/\varepsilon^2)$ is an $\varepsilon$-approximation of $(P,\mathcal{A})$ with constant probability, and a random sample $N \subset P$ of size $O((v/\varepsilon) \log (1/\varepsilon))$ is an $\varepsilon$-net of $(P,\mathcal{A})$ with constant probability.
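Here's a toy Python sketch of the geometric statement, for the range space of intervals on the line (which has VC-dimension 2): draw a random sample of size roughly $v/\varepsilon^2$ (constants dropped) and check the $\varepsilon$-approximation condition directly.

import random
from bisect import bisect_left, bisect_right

def interval_discrepancy(P, Q):
    """Worst difference, over intervals [a, b] with endpoints in P, between
    the fraction of P and the fraction of Q falling inside the interval."""
    Ps, Qs = sorted(P), sorted(Q)
    worst = 0.0
    for i, a in enumerate(Ps):
        for b in Ps[i:]:
            frac_P = (bisect_right(Ps, b) - bisect_left(Ps, a)) / len(Ps)
            frac_Q = (bisect_right(Qs, b) - bisect_left(Qs, a)) / len(Qs)
            worst = max(worst, abs(frac_P - frac_Q))
    return worst

random.seed(0)
P = [random.random() for _ in range(1000)]   # the ground set
eps, v = 0.1, 2                              # intervals on a line: VC-dimension 2
Q = random.sample(P, int(v / eps ** 2))      # a sample of size ~ v / eps^2
print(interval_discrepancy(P, Q))            # usually comes out below eps (the O() hides constants)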

Learning theory:
Agnostic and non-agnostic learning can also be described with respect to a range space $(P,\mathcal{A})$. We are given a data set $P$ where each element $p \in P$ is labeled either $+$ or $-$. We want to find a classifier $R \in \mathcal{A}$ such that all elements $p \in P$ labeled $+$ are in $R$ and all elements $p \in P$ labeled $-$ are not in $R$. Non-agnostic learning, also referred to as realizable learning, is the setting where we assume that there exists such a classifier $R^*$ in the "concept class" $\mathcal{A}$ with zero error, i.e., it perfectly separates all $+$ points from $-$ points. Agnostic learning is the setting without this assumption of a zero-error classifier in $\mathcal{A}$, so the aim is to find a classifier $\hat R \in \mathcal{A}$ that misclassifies the fewest points. This weakened assumption on the target function can alternatively be viewed as traditional PAC learning that accommodates arbitrary noise.

The connection:
Now here is where the concepts start looking similar: In non-agnostic learning, if we are given a random sample $N \subset P$ of size $O((v/\varepsilon) \log (1/\varepsilon))$, and run an algorithm to perfectly classify $+$ and $-$ points from $N$, then the same classifier will misclassify at most $\varepsilon |P|$ points on $P$, with constant probability.
In agnostic learning, if we are given a random sample $Q \subset P$ of size $O(v/\varepsilon^2)$, and run an algorithm to find a classifier on $Q$ that misclassifies the fewest points, then the same classifier will misclassify at most an $\varepsilon$ larger fraction of the points of $P$ than the best classifier in $\mathcal{A}$, with constant probability.
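As a toy illustration of both statements (threshold classifiers on the line have VC-dimension 1; the sample sizes below just plug in the bounds with the constants dropped):

import random, math

def erm_threshold(sample):
    """Return the threshold t minimizing misclassifications on the sample,
    where the classifier predicts '+' iff x >= t."""
    xs = sorted(x for x, _ in sample)
    candidates = [-float("inf")] + xs + [float("inf")]
    def errors(t):
        return sum((x >= t) != label for x, label in sample)
    return min(candidates, key=errors)

def error_on(P, t):
    return sum((x >= t) != label for x, label in P) / len(P)

random.seed(1)
eps, noise, v = 0.1, 0.05, 1                 # thresholds on a line: VC-dimension 1

# Realizable (non-agnostic): labels exactly match a hidden threshold at 0.6.
P_clean = [(x, x >= 0.6) for x in (random.random() for _ in range(5000))]
m_net = int((v / eps) * math.log(1 / eps))   # ~ eps-net size
t_hat = erm_threshold(random.sample(P_clean, m_net))
print("realizable error on P:", error_on(P_clean, t_hat))   # typically on the order of eps

# Agnostic: 5% of the labels are flipped, so no zero-error threshold exists.
P_noisy = [(x, (not y) if random.random() < noise else y) for x, y in P_clean]
m_approx = int(v / eps ** 2)                 # ~ eps-approximation size
t_hat = erm_threshold(random.sample(P_noisy, m_approx))
print("agnostic error on P:", error_on(P_noisy, t_hat))     # typically within ~eps of the ~5% noise floor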

Why are those bounds the same?

Let's first look at $\varepsilon$-nets and non-agnostic learning. The key is to construct a range space $(P,\mathcal{S})$ where each $T \in \mathcal{S}$ is the symmetric difference of two ranges $R^*, R^\prime \in \mathcal{A}$; that is, it contains all points contained in exactly one of the two ranges. If $(P,\mathcal{A})$ has VC-dimension $v$, then $(P,\mathcal{S})$ has VC-dimension $O(v)$.
Now consider the optimal classifier $R^* \in \mathcal{A}$ on $P$, and any other classifier $R^\prime \in \mathcal{A}$ that misclassifies at most $\varepsilon |P|$ points of $P$. The symmetric difference of $R^*$ and $R^\prime$ then has at most $\varepsilon |P|$ points, and corresponds to some range $T \in \mathcal{S}$. Conversely, if $N \subset P$ is an $\varepsilon$-net of $(P,\mathcal{S})$, then any classifier $R^\prime \in \mathcal{A}$ learned perfectly on $(N,\mathcal{A})$ cannot misclassify more than $\varepsilon |P|$ points of $P$: otherwise its symmetric difference with the perfect classifier $R^*$ would have more than $\varepsilon |P|$ points, $N$ would contain at least one of them, and $R^\prime$ would have misclassified that point of $N$.

Now to understand $\varepsilon$-approximations and agnostic learning, we need to see why this same approach won't work. Namely, because there is, in general, no perfect classifier in $\mathcal{A}$, even restricted to a subset $Q \subset P$. Thus when we look at the symmetric difference between $\hat R$ and $R^\prime$, we cannot argue that $R^\prime$ misclassifies no points of $Q$ when the difference is less than $\varepsilon |P|$, and so we can't apply the $\varepsilon$-net result.

But we can apply the stronger $\varepsilon$-approximation result in this agnostic setting. The difference between any two ranges $R, R^\prime \in \mathcal{A}$ on $P$ that contain the same points of $Q$ is at most $\varepsilon |P|$. Thus the difference in the fraction of misclassified points on $P$ for the optimal range $\hat R$ found on $Q$ is no more than $\varepsilon$, since $\hat R$ only differs by an $\varepsilon$-fraction in the number of points it contains in $Q$ versus $P$.

An intuitive way to see this connection is to think of agnostic and non-agnostic learning in the context of the presence and absence of noise. Non-agnostic learning is the ideal setting where we have no noise, and standard VC-dimension based learning bounds suggest that we need $O((v/\varepsilon)\log(1/\varepsilon))$ samples (an $\varepsilon$-net) to learn a target function with $\varepsilon$ classification error. On the other hand, agnostic learning accommodates noise by allowing points from either class to spill over the decision boundary. This implies that some of the randomly sampled points might be noisy and hence misleading for the purposes of learning the target classifier. To nullify the effect of noisy sampled points we need to oversample, which increases the sample complexity by roughly a $(1/\varepsilon)$ factor (yielding an $\varepsilon$-approximation).

Wednesday, February 09, 2011

My talks at ITA and at the college of engineering in Montana State

This is the abstract for the talk I'm giving in brief at the ITA Workshop and more expanded at a College of Engineering colloquium at Montana State. Thanks to the ITA program committee and Brendan Mumey (at Montana State) for the invitations.

Dimensionality reduction for distributions: the good and the bad

In many application areas, the natural representation of data is in the form of an empirical probability distribution. Documents are represented as normalized frequency counts of words, images are represented by color (or other feature) histograms, and speech signals can be represented by spectral histograms.

There are many natural and meaningful ways of measuring similarity (or distance) between such distributions, and these define different geometries in which we might wish to analyze collections of distributions for interesting patterns. However, a practical bottleneck is the high dimensionality of these representations: for example, a 256 x 256 image might be represented by a vector of over 1000 features, and a document might be represented as a sparse vector with hundreds of attributes.

Thus, a key problem is: can we reduce the dimensionality of collections of distributions to make data analysis tractable while preserving the distances in the collection ?
In this talk, I'll discuss a collection of recent results centered around this theme that provide both good news (and bad) for dimensionality reduction on distributions in theory and in practice.
The above draws mostly on this paper with Arvind Agarwal and Jeff Phillips, and this work-in-progress with Rasmus Kyng and Jeff Phillips.

Tuesday, February 08, 2011

SoCG Videos Call

You drew all those snazzy diagrams using your fancy Mac drawing program. Now it's time to make some videos:
Video and multimedia presentations are sought for the 20th Annual Video and Multimedia Review of Computational Geometry, to accompany the 27th Annual Symposium on Computational Geometry in Paris, France.  This review showcases the use of visualization in computational geometry for exposition and education, for visual exploration of geometry in research, and as an interface and a debugging tool in software development.  Algorithm animations, visual explanations of structural theorems, descriptions of applications of computational geometry, and demonstrations of software systems are all  appropriate.

Three to five minutes is ideal for most presentations; eight minutes is the  upper limit.  Accepted video and multimedia presentations will have an abstract in the published conference proceedings; video/multimedia authors will have an opportunity to present their work at the Symposium during a dedicated video session.  Accepted presentations will be available online in various formats in a web proceedings.  See http://www.computational-geometry.org/ for examples of previous years' proceedings.

Send email to the Video/Multimedia committee chair, Jonathan Shewchuk, jrs@cs.berkeley.edu by 23:59 Pacific Standard Time, Friday, February 18, 2011, with the following information: the names and institutions of the authors, the email address of the corresponding author, and instructions for downloading the submission.

Monday, February 07, 2011

DIMACS Workshop on Parallelism: Call for Talks

The DIMACS Workshop on Parallelism is soliciting contributed talks (deadline Feb 9!). Sergei Vassilvitskii offers the following synopsis with information on submission.

Decades after its origins and its study in academia, parallel computing is finally becoming pervasive. Today's PCs have multiple cores, and some predict  1000-core machines in the near future. Google, Yahoo and others run MapReduce or Hadoop on thousands of machines to answer search queries, among other things. D. E. Shaw Research is building a massively parallel machine to simulate molecular dynamics. Climate scientists predict the evolution of the earth's climate on parallel machines. Amazon's EC2 enables users to run jobs on a "cloud" of PCs.

The evolution of parallel computing from primarily an academic subject in the '80s to its realization today is an exciting development. This DIMACS workshop will bring together some of the leading researchers and practitioners involved in parallel computing to describe their work. Attendees will discuss, for example:

  • how parallel computing in its various forms is used today;
  • what new uses and programming abstractions will arise by 2020;
  • what parallel computers will look like in 2020; and
  • how to model parallelism theoretically.

Those wishing to give a contributed talk must submit a one-page description of their work, with additional material (such as a paper) optional, to  howard@research.att.com by February 9, 2011. Should there be an excess of submissions, the organizers will select the contributed talks according to the summaries submitted.

Friday, February 04, 2011

POTD: Reproducing Kernel Banach Spaces with the ℓ1 Norm

Reproducing Kernel Banach Spaces with the ℓ1 Norm
Guohui Song, Haizhang Zhang, and Fred J. Hickernell

Abstract:
Targeting at sparse learning, we construct Banach spaces B of functions on an input space X with the properties that (1) B possesses an l1 norm in the sense that it is isometrically isomorphic to the Banach space of integrable functions on X with respect to the counting measure; (2) point evaluations are continuous linear functionals on B and are representable through a bilinear form with a kernel function; (3) regularized learning schemes on B satisfy the linear representer theorem. Examples of kernel functions admissible for the construction of such spaces are given. 

Notes:
This one probably requires some explanation, for the non-ML folks. Reproducing Kernel Hilbert spaces are the coin of the realm in machine learning and for good reason. They allow much of ML to be "ported" from linear classifiers to non-linear classifiers: the kernel mapping essentially linearizes (via lifting) the nonlinear classifiers so you can get the benefit of the nonlinearity while operating algorithmically in a linear world. Even though the induced Hilbert space is typically a function space and is therefore infinite-dimensional, the representer theorem allows us in most cases to operate in a finite dimensional space (where the dimension is bounded by the number of samples). From a metric embedding perspective, kernels completely characterize the class of metrics isometrically embeddable in Euclidean space.
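To make the representer theorem concrete, here's a toy numpy sketch (a Gaussian kernel and kernel ridge regression, with made-up data): the optimization over an infinite-dimensional function space collapses to solving an $n \times n$ linear system, and the learned function is just a weighted sum of kernel evaluations at the samples.

import numpy as np

def gaussian_kernel(X, Y, sigma=0.5):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), computed for all pairs."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def fit_kernel_ridge(X, y, lam=1e-2, sigma=0.5):
    """Representer theorem: the regularized-risk minimizer in the RKHS has the
    form f(x) = sum_i alpha_i k(x_i, x); alpha solves (K + lam*n*I) alpha = y."""
    K = gaussian_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)
    return lambda Xnew: gaussian_kernel(Xnew, X, sigma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)   # a nonlinear target plus noise
f = fit_kernel_ridge(X, y)
print(f(np.array([[0.0], [1.5]])))                 # roughly sin(0) and sin(1.5)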

So RKHSs are great. So what's the deal with this paper ? What it tries to do is combine the power of RKHSs with the regularity and sparsity properties guaranteed by $\ell_1$ norms. Even though your typical Banach space doesn't admit an inner product (what you need for the kernel mapping), they show that you can define special Banach spaces in which kernels can be defined as before, and the representer theorem holds, but that you can get sparse bases for solutions because of the nice $\ell_1$ properties.

I promised SHORT summaries, so I'm not going to go further. But the takeaway message here is the ability to extend the nice properties of RKHSs to Banach spaces. For completeness I should mention that there are other approaches that have tried to do this, but using different mathematical constructs that are in some way less well suited.

Thursday, February 03, 2011

On the future of conferences

There's a superb article out today in the CACM by Jonathan Grudin (non-paywall version here, thanks to Alexandre Passos) on the future (and past) of our CS conference-driven model. I'm biased in favor because it takes a generally negative view of the current model, but that's not why I like it.

I like it because he systematically tracks back to the origins of the conference culture, discusses why it's so prevalent, and avoids the glib "let's do everything in journals" argument that even I've been guilty of in the past.

Some of the tl;dr points (but do read the whole thing):
  • Technology and a Professional Organization Drove the Shift to Conference Publication: not speed of development of the field, as is commonly stated. It was also a more American-centered phenomenon.
  • Formal archiving of conference proceedings made creating journal versions more difficult (because of the 30% rule and so on)
  • "When conferences became archival, it was natural to focus on quality and selectivity." so the focus of conferences became more gatekeeping and less community.
  • This in turn has an impact on community: when your papers are rejected, you don't go to the conference. For more on the impact of rejection, see Claire Mathieu's post.
  • A further consequence is that computer scientists do not develop the skills needed to navigate large, community-building conferences. This is so true ! As someone who frequents SODA, SoCG and occasionally FOCS/STOC, I often find myself gasping for breath at VLDB (600+ participants) or KDD (800+). It's overwhelming to keep track of everything. And it makes it harder for me to attend such conferences regularly, even though it's important for me to go. 
His analysis of where we should be heading is also sound. Much like the Blu-ray/HD-DVD wars of a few years ago, the whole journal vs conference argument seems like an argument between two dinosaurs as the meteor arrives. We have many different ways of disseminating, commenting on, and reviewing works of research now, and it's time to think beyond the current models. Some of his ideas:
  • Accept many more papers at conferences, but designate some to be the 'best in show'
  • Center attention on the work, rather than the conference, by keeping wikipedia-like entries for pieces of work as they evolve. This is similar to Dan Wallach's idea.
p.s. for those of you who want to complain about ACM's closed policy on the CACM, and how you'll never read an article in the CACM because of the paywall, consider your opinion expressed.

Wednesday, February 02, 2011

Sample Complexity for eps-approximations of Range Spaces

This is a guest post by Jeff Phillips, who among other things ponders the computation of $\epsilon$-samples for range spaces under uncertainty. Jeff is looking for permanent research positions right now, so uncertainty is a major part of his life !

At some point I had plans to write a blog post detailing a simpler proof of the number of random samples needed for an $\varepsilon$-approximation of a range space. This is a very powerful result that is simple to state (I'll define below) and simple to use. I view it as very geometric, but it originated in learning theory. I had used it in several papers, but until recently had not actually read the full proof, and even slightly less recently was unaware of the best bound. The original paper by Vapnik and Chervonenkis is hard to find, but I eventually found a proof in a learning-theory book by Anthony and Bartlett. The proof uses a series of tricks I was not aware of, and the notation is different from what I was used to.

After reading the 20+ page proof, I felt there was a simpler geometric way to think about the problem, and even sketched out what I thought was a simpler proof. I thought that would be a great thing to post on this blog! But slowly things began to break down. I prepared a pdf writeup that started at under one page. Then, realizing gaps in the reasoning, it grew to 2-3 pages, and then to 4 as I attempted to write the tighter result. In presenting this approach at the informal theory seminar here at Utah, I realized that one of the main geometric tricks I was hoping to use was insufficient for proving what was needed. Looking back at the Anthony and Bartlett proof, I finally understood why all of the tricks and machinery they use are really needed!

Anyways, I'd like to sketch a proof of the $\varepsilon$-approximation theorem and try to include some intuition about why the various techniques are needed.



Let $P$ be a set of objects (i.e. points) and $\mathcal{R}$ be a set of subsets of $P$ (i.e. ranges, like disks). Then $(P,\mathcal{R})$ is called a range space. We will concern ourselves with range spaces of bounded VC-dimension $d$, which implies that the total number of distinct ranges is at most $c |P|^d$ for some fixed constant $c$. Now a subset $Q$ of $P$ is an $\varepsilon$-approximation of $(P,\mathcal{R})$ (for $0 \le \varepsilon \le 1$) if for all ranges $A \in \mathcal{R}$
$\left| \frac{|P \cap A|}{|P|} - \frac{|Q \cap A|}{|Q|} \right| \leq \varepsilon. \hspace{.2in} (1)$

The main result is that:
Theorem. If $Q$ is a random sample from $P$ of size $h = O((1/\varepsilon^2)(d + \log(1/\delta)))$ then it is an $\varepsilon$-approximation of $(P,\mathcal{R})$ with probability at least $1-\delta$.

This was first proven by Talagrand in '94. The original result of Vapnik and Chervonenkis had a slightly worse bound of $h = O((1/\varepsilon^2)(d \log(d/\varepsilon) + \log(1/\delta)))$. The main reason this is a powerful tool is that $h$ is independent of $|P|$. As I found out, even this weaker result is difficult to prove.
I'll also note that I first became aware of this stronger result as a reduction from a more general result by Li, Long, and Srinivasan 01, as cited by Har-Peled and Sharir 07.



There is a nice progression to the full result through a series of weaker results.

Step 1.
We can show that a random sample of $r = (1/(2\varepsilon^2)) \ln(2/\delta)$ points is sufficient to prove (1) for any one fixed range, by a Chernoff-Hoeffding (C-H) bound. Then, since there are only $c |P|^d$ distinct ranges, by the union bound we only need $h_1 = (1/(2\varepsilon^2))(d \ln |P| + \ln(2c/\delta))$ samples for (1) to hold for all ranges.
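Spelled out (this is just a paraphrase of the standard calculation): for one fixed range $A$ and a random sample $Q$ of size $r$, the C-H bound gives
$\Pr\left[ \left| \frac{|Q \cap A|}{|Q|} - \frac{|A|}{|P|} \right| > \varepsilon \right] \leq 2 e^{-2\varepsilon^2 r} \leq \delta \quad \text{once } r \geq \frac{1}{2\varepsilon^2}\ln\frac{2}{\delta}.$
To make this hold simultaneously for all of the at most $c|P|^d$ distinct ranges, demand failure probability $\delta/(c|P|^d)$ per range; plugging this in gives
$h_1 \geq \frac{1}{2\varepsilon^2}\left(d \ln|P| + \ln\frac{2c}{\delta}\right).$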

Note that $h_1$ is a function of $|P|$; we want to get rid of this dependence.

This C-H + union bound argument will be used again. Here the C-H bound has $h_1$ events, one for each random sample; we will reduce this in Step 3. The dependence on $|P|$ comes through the union bound; we improve this in Step 2.


Step 2.
To get rid of the dependence on $|P|$ we need to apply the union bound to a smaller number of ranges. We use a technique called symmetrization, where we consider two random samples $S$ and $Q$ of $P$, both of size $h_2$. Then we compare them against each other, with the goal of showing that for all $A \in \mathcal{R}$
$\left| \frac{|Q \cap A|}{|Q|} - \frac{|S \cap A|}{|S|} \right| \leq \varepsilon/2. \hspace{.2in} (2)$
We can intuitively see that showing (2) holds with probability $\geq 1-\delta$ is sufficient to show (1): if $h_2$ is too small, it is likely that both $Q$ and $S$ have some range $A$ on which they are far from $P$, but since there are many ranges, they are likely to be far on different ranges; thus $Q$ and $S$ are likely to be far from each other on some range as well. Likewise, if $h_2$ is large enough that $Q$ and $S$ are close on all ranges, then it is likely they are close to $P$ on each of those ranges too.

Another technical trick is required to show that (2) holds. We pair each element $q_i \in Q$ with an element $s_i \in S$. Then it is sufficient (once we have sampled $Q$ and $S$) to show that (2) occurs with probability $\geq 1-\delta$ with respect to random swaps of the corresponding pairs, i.e., ($s_i \in Q$ and $q_i \in S$) vs. ($s_i \in S$ and $q_i \in Q$).

Since each event in the permutation has a bounded effect on each range, and there are only $c |Q \cup S|^d$ ranges we need to consider, we can apply the C-H + union bound as in Step 1, but now we only have $O((d/\varepsilon)^d)$ events (independent of $|P|$), and we get the original V-C bound of $h_2 = O((1/\varepsilon^2)(d \log(d/\varepsilon) + \log(1/\delta)))$.

Another neat effect of symmetrization is that it allows us to consider continuous sets $P$. The analysis in Step 1 required us to compare $Q$ to all ranges defined by $P$, which would have been infinite if $P$ were continuous. But now since we compare two random samples from $P$, even if it is continuous, the random samples are finite, and hence induce a finite number of ranges.


Step 3.
We now show how to get rid of the extra $\log(d/\varepsilon)$ term. We do this through a technique called chaining which allows us to decompose ranges and change the type of events we measure with the C-H bound.

Using symmetrization consider two subsets $Q$ and $S$ of $P$ and let $P^\prime = Q \cup S$. We define a series of range spaces $(P^\prime, \mathcal{T}_0)$, $(P^\prime, \mathcal{T}_1)$, $\ldots$ so that
  • (a) each range $A \in \mathcal{R}$ is a disjoint union $A = \bigcup_j B_j$ where each $B_j \in \mathcal{T}_j$.
  • (b) each $B_j \in \mathcal{T}_j$ is of size $|B_j| \leq |P^\prime|/2^j$.
  • (c) $Q$ is an $\varepsilon_j$-approximation of $(P^\prime, \mathcal{T}_j)$ with probability $\geq 1 - \delta_j$, where $\varepsilon_j = \sqrt{j/2^j}/12$ and $\delta_j = \delta/2^{j+2}$.
Fact (a) allows us to analyze $(P^\prime,\mathcal{R})$ (with $\varepsilon, \delta$) by considering each $(P^\prime,\mathcal{T}_j)$ separately (with $\varepsilon_j, \delta_j$) since by (c) $\varepsilon \leq \sum_j \varepsilon_j$ and $\delta \leq \sum_j \delta_j$.

From (c) the extra $\sqrt{j}$ term in $\varepsilon_j$ cancels out the $\log(1/\varepsilon)$ term in the technical analysis leading to $h = O((1/\varepsilon^2)(d + \log(1/\delta)))$ as desired.

To prove (c) we need to use (b). The key insight is that for large $j$, each range $B_j$ is small. We can then apply the C-H bound using a different type of event, of which there are fewer. For a range $B_j$, instead of measuring the event that $q \in Q$ is inside $B_j$ (as in Steps 1 and 2), we measure whether some $b \in B_j$ is in $Q$ -- thus there are only $|B_j| \leq |P'|/2^j$ events. Since $|B_j|$ decreases geometrically as $j$ increases, we can also decrease $\varepsilon_j$ and $\delta_j$ geometrically as $j$ increases, proving (c).

It only remains to show the existence of each $(P^\prime, \mathcal{T}_j)$.
  • We first consider a series of range spaces $(P^\prime, \mathcal{R}_j)$ where $\mathcal{R}_j$ corresponds to all ranges in $(Q_j, \mathcal{R})$, where $Q_j$ is a $(1/2^j)$-approximation of $(P^\prime,\mathcal{R})$. We know $|Q_j| = O((1/\varepsilon^2) d \log(d/\varepsilon))$ by Step 2.
  • Then $\mathcal{T}_j$ is defined inductively as follows. $\mathcal{T}_0 = \mathcal{R}_0$. Each $B \in \mathcal{T}_j$ corresponds to a range $A \in \mathcal{R}_j$. We know there exists another range $A^\prime \in \mathcal{R}_{j-1}$ such that $A$ and $A^\prime$ differ by at most $|P^\prime|/2^j$ points; let $B$ be those points.
Note: Most proofs construct $\mathcal{T}_j$ using a combinatorial result of Haussler 95, but this approach does not affect the asymptotics of the final solution and may actually improve the constants.

IMA Special Year on the mathematics of information

Dana Randall mentioned at SODA  that the IMA (the Institute for Mathematics and its Applications) is looking for increased participation from folks in computer science.

Anna Gilbert at the University of Michigan (and former AT&T colleague!) is one of the organizers of the current 'special year' at the IMA, and it's on a topic that I think many theoryCS folks will be interested in: the mathematics of information.
They're taking applications for postdocs (speaking of which) up until February 4 (two days from now!). So get those applications in.

And what's the special year about ? In Anna's words (edited for formatting):

The Mathematics of Information

During the academic year 2011-2012, the annual program at the Institute for Mathematics and its Applications (IMA) at the University of Minnesota will be in the Mathematics of Information.  Information, broadly defined, plays an ever increasingly large role in our scientific and technological endeavors, as well as our daily lives. Collectively, we search the web over billions of times per day, we keep our medical records in digital repositories, we share videos and photos over the web, and we listen to music on MP3 players. Our ability to collect, store, and generate vast quantities of data far outstrips our ability to analyze, process, and understand these data. Many scientific, engineering, and medical applications depend on our ability to handle enormous amounts of complex information. Network engineering involves massive datasets at high speeds, medical imaging generates large data with intricate geometric relationships, and astronomical observations include data at different wavelengths or spectral bands.  Our current technology and scientific tools for information lag far behind our ability to collect enormous amounts of data, our need to analyze that data as efficiently as possible, and, finally, our ability to use that analysis to make sound, timely decisions. 

The special year will include seven workshops with a thematic focus on the mathematics of information and will include topics in mathematics, algorithms, theoretical computer science, machine learning, network analysis, and applications in biology, include medical informatics and group testing for high throughput screening.  Each workshop is designed to be truly interdisciplinary, involving researchers from mathematical sciences, computer sciences, engineering, and other fields.  Interspersed will be Hot Topics workshops, IMA public lectures, and tutorial and/or short courses.  Faculty, graduate students and postdoctoral fellows are invited to participate as long-term visitors, New Direction Visiting Professors, or postdoctoral fellows (either at the IMA or jointly with industry).

You can find more information about the individual workshops at the annual program webpage.



9/26-30/11    Workshop: High Dimensional Phenomena
10/24-28/11    Workshop: Large Graphs: Modeling, Algorithms and Applications
11/14-18/11    Workshop: Large Data Sets in Medical Informatics
2/13-17/12    Workshop: Group Testing Designs, Algorithms, and Applications to Biology
2/27-3/2/12    Workshop: Network Links: Connecting Social, Communication and Biological Network Analysis
3/26-30/12    Workshop: Machine Learning: Theory and Computation
5/7-11/12    Workshop: User-Centered Modeling


Postdoctoral Fellows

The Postdoctoral Fellowship application deadline was January 7, 2011, but applications will be considered until February 4, 2011.

These IMA Postdoctoral Fellowships run one to two years, at the option of the holder, starting August 31, 2011. In the second year of the appointment there are a variety of options to enhance career development, including continued work on the fellow's established research program, participation in the 2012-2013 academic year program, teaching, and working on an industrial project. Postdoctoral fellows receive a salary of $55,000 annually, and a travel allowance.

There is also an option for Industrial Postdoctoral Fellows.  In the second year of the appointment, Fellows can work on a project at an industrial partner. 

Visiting the IMA

You can contact the IMA if you would like to visit for a long period of time during the year or come for a short visit, especially associated to a workshop.  There's more information on general visits and short term visits.

There are sources of funding available for both types of visits and the IMA staff are excellent at making such visits possible.

CRA discussion on postdocs

The CRA commissioned a study on the growing prevalence of postdocs in the CS career path. Specifically,
The question of whether it is healthy for the field, and for the individuals, to move to a new steady state, in which prior to being hired in a tenure-track position post doctoral positions are expected, is an important one for the entire community. The question of whether support for such a program should come from funding agencies like NSF, perhaps at the expense of other funding, is also important. To encourage a discussion about the need for, and role of, postdocs within the computing research community, the CRA commissioned a committee to prepare a short white paper that reports the statistics associated with academic and industry hiring, articulates the relevant issues about postdocs in the context of the many stakeholders, and specifically solicits input from the community.
They want feedback: 
We would like to encourage departments and laboratories to provide input -- to review the white paper, to discuss the issue, and to post your views (collectively or individually) on the companion website (http://cra.org/postdocs). The CRA hopes to obtain a sense of the community by March 15. Following review of the comments that are received, the committee will prepare a revised version of the white paper articulating the community's broad view (and consensus, if any) on this issue.
So chime in if you have thoughts on this matter.

Personally, I think that this is a dangerous trend, for the following reasons:
  • Quickly, doing a postdoc will become the norm, rather than an option, when looking for academic jobs. I think this is unnecessary from a training perspective for everyone (though it might be appropriate for some).
  • One of the things that has kept CS viable academically is that people can leave after a Ph.D, go to industry, and still make it back into academia. This no longer seems to be true in places like the natural sciences, with long postdocs. I wouldn't want their career path.
  • Postdocs are glorified free labor for PIs. Salaries are minuscule, and competition is fierce. And again, it's not entirely clear that fresh Ph.Ds are so incompetent that they need 5-year postdocs to be ready for a faculty job.
  • Ph.D training suffers, because "you can fix it in the postdoc". I don't think that's healthy either.
The increase in postdocing is a market driven correction. I know this has happened before (1991-1992 was famous in that respect, in theory), and maybe it's just a cyclical thing. But I really hope it doesn't become permanent. I don't want to be in a field like bio/physics/chemistry, or even math, with indefinite multiple postdocs, and nomadic researchers hunting for ever elusive permanent jobs.

POTD: Continuous Local Search

Inspired by Oded Goldreich's "my choices" page, and my constant and futile attempts to read more, I'm going to attempt to write short (SHORT!) posts summarizing papers that I've been reading lately, with some thoughts. Each such post will be tagged 'potd' for 'paper of the day', in the spirit of John Baez which is to say, not a paper each day, but a paper on the day the post is made :). 

Continuous Local Search
C. Daskalakis and C. Papadimitriou
SODA 2011.

Abstract:
We introduce CLS, for continuous local search, a class of polynomial-time checkable total functions that lies at the intersection of PPAD and PLS, and captures a particularly benign kind of local optimization in which the domain is continuous, as opposed to combinatorial, and the functions involved are continuous. We show that this class contains several well known intriguing problems which were heretofore known to lie in the intersection of PLS and PPAD but were otherwise unclassifiable: Finding fixpoints of contraction maps, the linear complementarity problem for P matrices, finding a stationary point of a low-degree polynomial objective, the simple stochastic games of Shapley and Condon, and finding a mixed Nash equilibrium in congestion, implicit congestion, and network coordination games. The last four problems belong to CCLS, for convex CLS, another subclass of PPAD $\cap$ PLS seeking the componentwise local minimum of a componentwise convex function. It is open whether any or all of these problems are complete for the corresponding classes.
Notes
There are many iterative schemes that get used in practice for which time to convergence is unknown. The Weiszfeld algorithm for computing 1-medians is one such method. There are also alternating optimization schemes like k-means, EM and ICP that have resisted convergence analysis for a while - although we now have a pretty good understanding of the behavior of k-means. Classes like CLS appear to capture some numerical iterative schemes like gradient descent, and it might be interesting to establish connections (via reductions) between such iterative methods and other problems of a more game-theoretic nature that appear to crop up in this class.
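For instance, here is the Weiszfeld iteration in a few lines of numpy (a toy sketch; it ignores the well-known corner case of an iterate landing exactly on a data point), which is exactly the kind of scheme whose convergence behavior one would like such classes to capture.

import numpy as np

def weiszfeld(points, iters=100):
    """Iteratively re-weighted centroid converging to the geometric 1-median."""
    x = points.mean(axis=0)                 # start at the centroid
    for _ in range(iters):
        d = np.linalg.norm(points - x, axis=1)
        w = 1.0 / np.maximum(d, 1e-12)      # guard against division by zero
        x = (points * w[:, None]).sum(axis=0) / w.sum()
    return x

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(weiszfeld(pts))   # pulled toward the cluster near the origin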

One catch though is that while the reductions are poly time, the only requirement is that a solution to one problem map back to a solution for the other. So it's not clear that reductions in CLS or convex CLS preserve "time to convergence" - in fact it seems unlikely.

Friday, January 28, 2011

Absentmindedness

And now for a break from SODA updates. 

They say that professors are absentminded. I say that my mind is not absent: it's very very present - just somewhere else.

On that note, here are some levels of absentmindedness. I will refrain from any comment on how I came up with this list.

10. forgetting your keys are in your pocket
9. putting down your coffee while looking for keys and forgetting where you left it.
8. forgetting your glasses are on your head
7. forgetting that you're wearing your glasses
6. looking for your phone while holding it
5. missing your bus stop because you're day dreaming
4. missing your bus because you're day dreaming at the bus stop
3. taking your bus because you forgot you had the car that day
2. having to be reminded by your spouse not to take the bus because you took the car that day

I used to remember #1, but I've forgotten it.

SODA Day 2.

I seem to be the only person blogging about SODA (even though Jeff, Glencora and Sorelle were around as well - come on people !)

SODA Day 2 was the official day for bidimensional dimensionality reduction in spaces of bounded doubling dimension (ok that almost made sense). Here are some highlights:
  • I enjoyed Bidimensionality and EPTAS not because of the results themselves (which are interesting) but because the talk was great, and I learnt a lot about the basic framework of bidimensionality. The talk was low on technical details, but the main contribution of the paper is a stronger result connecting bidimensionality of a graph property to the existence of an (E)fficient PTAS for the problem. Earlier work needed a constant factor approximation algorithm to bootstrap the process, and they were able to relax that assumption.
  • In the same session, Known Algorithms on Graphs on Bounded Treewidth are Probably Optimal was also cool. They showed that any progress on improving the exponential dependence on treewidth for a number of parametrized problems (for example from 2^tw to (2-eps)^tw) would break the Strong Exponential Time Hypothesis (i.e that SAT cannot be solved in (2-delta)^n time). 
  • A paper that drew a lot of attention (as seen by the huge influx of people into the room) was the work by Daskalakis and Papadimitriou (Christos gave a really great talk) on Continuous Local Search. In brief, they constructed a complexity class that combines the approximate fixpoints of PPAD and the local search of PLS and contains some natural problems from algorithmic game theory as well as numerical analysis. Indeed, gradient search can be viewed as a special example of elements of this class.
  • There was a whole session on approximate geometry, and doubling dimension played a key role in constructing efficient spanners, and in getting below the JL limit. Sandwiched between these was a local JL embeddability result, the idea being that if you only care about preserving distances in your (k)-neighborhood, you can get dimensionality reduction in terms of k, rather than n. The proof itself uses a variant of the Nash embedding theorem.

Wednesday, January 26, 2011

SODA Day 1.

I'm slowly unloading my backlog of posts on SODA 2011. At this point, the purpose is less to be a live-stream of events, and more to be a reflection on things I found interesting. As always, there will be no attempt to be deep, insightful or comprehensive. If you think I've missed THE PAPER OF THE YEAR berate me in comments and then write a guest post :)

The Holiday Inn has two major problems. Firstly, the conference rooms were in the basement, which meant that 3G access was hard to come by. Secondly, the local wifi wasn't particularly strong, and didn't work too well in the rooms or in the higher-up hotel rooms. This unfortunately forced me to actually listen to talks rather than tweeting about them. Good for the conference, bad for the as many as TWO WHOLE people who couldn't attend the conference and were hanging on my every tweet waiting for updates.

  •  T. S. Jayram and David Woodruff had an interesting paper on JL transforms beyond the constant error rate. One of the standard methods in JL proofs is to prove a constant error bound for distortions, and then use multiple parallel copies to reduce the error down to 1/n^2 so that the union bound can kick in. The question is: is this the optimal thing to do ? Might some more clever scheme avoid the need for parallel copies ? Their answer, it turns out, is no. They show lower bounds that match the best upper bounds, and in the process develop a new technique that gives lower bounds for communication complexity that depend on the error probability as well as the approximation guarantee.
  • The next paper in the session also introduced a new tool for communication complexity lower bounds. Elad Verbin and Wei Yu proposed a generalization of Boolean Hidden Matching (which was used to separate quantum and classical communication complexity) and used it to show new streaming lower bounds for sorting by reversals, among other problems.
  • A talk I didn't attend, but should have (DABSH), was by Flajolet, Pelletier and Soria on Buffon machines. You can think of the Buffon needle process as a way of generating $2/\pi$. So the question they answer in this paper is: what kinds of "simple processes" can generate other complicated functions like exponentials, trigonometric functions and the like.
  • Continuing the 'analytic combinatorics' theme, Wojciech Szpankowski talked about his paper with Michael Drmota on discrete divide and conquer recurrences. The main results of the paper were quite neat: a master theorem like formulation to solve exactly recurrences that involve floors and ceilings, without resorting to domination arguments. The advantage is a more precise bound on the running time which also captures the herky-jerky behavior of such algorithms (because of uneven integer splits)
I didn't get as much out of Bruce Reed's talk as I would have liked, mostly because I made the mistake of sitting in the back and could only see half of each slide. The talk itself was rather technical, with less of the high level intuition that might be helpful to an outsider to this area like me. It is however a reasonable model for an invited talk.

If it's Sunday at SODA, it's NFL time. As usual, Kirk Pruhs wandered around wearing his Steelers shirt, and looking mighty pleased. David Johnson was alternately elated (Packers win !) and downcast (Jets lose !) and a number of us drifted towards the hotel bar by early evening to set up shop there in front of the big screen. For those of you sniffing disdainfully at my embrace of brutal American sports, I'll merely say that there are MANY football fans among the SODA community.

Postscript: I was feeling guilty about summarizing papers so briefly. I just found Oded Goldreich's page on papers he's interested in (via this cstheory question) and it appears to be a nice model with short comments on papers he likes.  I might try doing something like this either interspersed with other posts here, or on my web page, just to force me to read papers of interest.

Tuesday, January 25, 2011

ALENEX: Experiments with Johnson-Lindenstrauss

I'm three days behind on my postings, so I have the luxury of looking back and attempting a larger perspective.

As Dana Randall put it at the business meeting, this is the Johnson-Lindenstrauss SODA. And it seems apropos to start with our paper at ALENEX. This was work done by my student Qiushi Wang, who's applying for grad schools (Admit him ! or email me for more info!)

The Johnson-Lindenstrauss Lemma is one of the most powerful tools in the theory of metric embeddings and dimensionality reduction. Simply put, it says that given any set of $n$ points in a Euclidean space, there exists a linear mapping into a roughly $O(\log n)$ dimensional Euclidean space that preserves all distances approximately.
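(Formally: for any $0 < \epsilon < 1$ and any $n$ points $x_1, \ldots, x_n \in \mathbb{R}^d$, there is a linear map $f : \mathbb{R}^d \rightarrow \mathbb{R}^k$ with $k = O(\epsilon^{-2} \log n)$ such that $(1-\epsilon)\|x_i - x_j\|^2 \leq \|f(x_i) - f(x_j)\|^2 \leq (1+\epsilon)\|x_i - x_j\|^2$ for all pairs $i, j$.)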

There's a long series of proofs of this lemma: all of them yield essentially the same bound on the number of dimensions and the same dependence on the error term, and so the main efforts have focused on improving the running time of the mapping itself. If we're mapping from $d$ dimensions to $k$, a linear mapping can take time $O(kd)$ per point, and this is the time to beat.
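For reference, the baseline dense transform is almost a one-liner (a sketch with made-up sizes, not the code from our paper):

import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 1000, 5000, 0.2
k = int(np.log(n) / eps ** 2)               # target dimension ~ log(n)/eps^2
X = rng.normal(size=(n, d))                 # n points in d dimensions
G = rng.normal(size=(d, k)) / np.sqrt(k)    # dense Gaussian projection matrix
Y = X @ G                                   # O(kd) work per point
i, j = 0, 1                                 # distances agree up to roughly (1 +/- eps)
print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(Y[i] - Y[j]))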

There are two strands of research along these lines: the first family of methods tries to speed things up by sparsifying the projection matrix. You can make the matrix quite sparse this way, but there's a limit on what you can do, because if the input vector being projected is itself quite sparse, then the resulting vector is mostly zeros, destroying its norm (and any hope of preserving distances).

The trick, which leads to the second strand of research, is to "precondition" the input. The idea is quite elegant: if you apply what is essentially a random rotation to the vector, it becomes dense w.h.p, where density intuitively means that no one coordinate is very large (we assume unit norm vectors w.l.o.g). Once you do this, the resulting projection matrix can be made quite sparse.

There's a catch though: you're now using two matrices instead of one, so you end up spending $d^2$ time on the first part, which dominates the original $kd$ time. The second trick you need, then, is a special random rotation that can be applied very quickly: essentially, the Walsh-Hadamard transform. This is the core idea behind the 2006 paper by Ailon and Chazelle, and there's been much work since on improving the bounds and the preconditioner construction. A third line of work combines the two strands: it sparsifies (by subsampling) a special code matrix that has a "preconditioning" effect.
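Here's a rough numpy sketch of that pipeline (random sign flips, a fast Walsh-Hadamard rotation, then subsampling coordinates); it is only meant to convey the structure, not to reproduce any particular paper's construction or constants.

import numpy as np

def fwht(x):
    """In-place fast Walsh-Hadamard transform; len(x) must be a power of two."""
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

def preconditioned_project(x, k, rng):
    """Randomized 'rotation' (D then H) spreads out the mass of x,
    then sampling k coordinates takes O(d log d + k) time per point."""
    d = len(x)
    signs = rng.choice([-1.0, 1.0], size=d)       # the random diagonal D
    z = fwht(signs * x) / np.sqrt(d)              # orthonormal Hadamard rotation H
    idx = rng.integers(0, d, size=k)              # keep k random coordinates
    return z[idx] * np.sqrt(d / k)                # rescale to preserve the norm

rng = np.random.default_rng(0)
d, k = 1024, 64
x = np.zeros(d); x[3] = 1.0                       # a very sparse unit vector
y = preconditioned_project(x, k, rng)
print(np.linalg.norm(x), np.linalg.norm(y))       # both close to 1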

But in all of this, no one has really looked at the actual behavior of these algorithms in practice. There are a number of reasons to do this: first of all $O(\log n)$ dimensions isn't so hot if the constant is large. Secondly the algorithm is randomized, which tends to give practitioners the heebie-jeebies. And finally, the dizzying array of algorithm options available is just plain confusing.

Our paper contains most of the details, so I'll spare you the long exposition, and summarize some of the surprising and not-so-surprising conclusions thus far:
  • The constant in the dimension of the embedding is small: it's essentially $1 \cdot \log(P)/\epsilon^2$, where $P$ is the number of "norm probes" you require ($P = n^2$ for distances and $n$ for norms). This is good, because it means that there are no hidden large constants. 
  • The quality of all algorithms is basically the same, and is very consistent. In other words, the fact that JL is randomized (which often causes a lot of concern in practice) is not a problem for its use (unless you're working in a distributed environment and need to share randomness - pointed out to me by TS Jayram). 
  • The distortion error itself is very nicely concentrated (normally) around 1. Unless you have highly clustered data, in which case the distortion distribution looks like a superposition of shifted Gaussians, one for each cluster center. 
  • Since all algorithms behave essentially the same on quality, speed is the main differentiator. Here, the 'best in class' depends heavily on what you know about the data. For dense data, you can be pretty sparse (as predicted by some of the papers) and the embedding is fast. For sparse data, it turns out that at least in MATLAB, and for small dimensions, the dense methods work better (a little ironic considering that much of the recent work was designed to deal with the sparse case). This is because of MATLAB's heavy optimization for dense matrix multiplication. 
  • Of course, your dimensionality might be too high to store a dense matrix, or you might not even know what the data profile is like. In that case, preconditioning methods like the original Ailon/Chazelle method work great, and there are only small differences between the methods as $d$ increases. 
We're not even close to being done with our explorations: there are at least four or five new questions to explore based on feedback we got at SODA. But it's been an illuminating experience, and I've been heartened by all the interest the community has shown in this research, based on the feedback I got.

Monday, January 24, 2011

ALENEX/ANALCO II

Today, someone asked me to post something sensational just to stir up some controversy. It turns out that without realizing it, I already did it yesterday ! I was talking about the use of CPLEX to solve (very effectively) instances of 1-median over strings, and said this:
It's not the "sexiest" thing in the world to solve algorithms problems in practice by using large industrial strength packages. However, both CPLEX and  SAT solvers are examples of tools that can be used in practice to solve fairly intractable problems. It still takes a lot of engineering and skill to make the heuristics work well, but it's something that we should be using as a matter of course when designing heuristics before trying to invent an algorithm from scratch.
I should have known better than to bring down the fury of the entire field of OR on my head. Michael Trick, OR blogger extraordinaire, decided to round my variables for me: read what he had to say here.  As penance, I promise to download CPLEX and encode at least one problem on it in the next year :).
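While I'm at it, here is the flavor of 0/1 formulation one might hand to a solver for the closely related "closest string" objective (minimize the maximum Hamming distance to strings $s_1, \ldots, s_n$ of length $L$ over an alphabet $\Sigma$); this is my own toy sketch, and not necessarily the formulation in the ALENEX paper. Introduce binary variables $x_{j,c}$ meaning "the output string has character $c$ at position $j$", and solve
$\min z \quad \text{s.t.} \quad \sum_{c \in \Sigma} x_{j,c} = 1 \;\; \forall j, \qquad \sum_{j=1}^{L} \left(1 - x_{j, s_i[j]}\right) \leq z \;\; \forall i, \qquad x_{j,c} \in \{0,1\}.$
The inner sum is exactly the Hamming distance from the output string to $s_i$; replacing the max by a sum gives the 1-median version of the objective.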

I've seen Bob Sedgewick give talks a few times now, and I'm always inspired by them. This latest one was titled 'Algorithms for the masses' and was hard to summarize: it was part exhortation to do more analytic combinatorics, part discussion of a new intro CS course he and Kevin Wayne have designed, and part emphasis on using the scientific method properly to design good models for algorithm behavior and data characteristics.

The principle at the heart of this was a fitting one for this joint talk: we should do more scientific analysis of our algorithms to figure out exactly how they behave in practice, rather than relying on O() notation as a predictive and comparative tool (neither of which it is). This goes back to Dick Lipton's coinage of 'galactic' algorithms: Bob made the assertion (not wrong in my view) that most algorithms at STOC and FOCS are 'galactic', and much of the work at SODA is too.

While I agree that it's high time we stopped using O() notation as a cudgel, building good predictive models is harder than one might expect. Engineers can model the real world in various ways, and when they want to test their models, they can - well - run them on the real world. Even if I come up with a plausible model of how my algorithm works, and what the various cost functions are, I still need to hope that the data doesn't have weird characteristics that make all the results go wonky. Probably the way to see this is that even in "the real world", not knowing how a particular genetic mechanism works is as good (or bad) as not having an accurate model of the data that we're testing on.

The second invited talk, by James Demmel, was a little harder for me to follow, because it was a much more technical talk about the challenges of designing linear algebra routines for future architectures. He described a machine the DoE is proposing to build, and it's likely to have 1 billion cores ! But even with that many cores, the main bottleneck is going to be communication, and the goal going forward is to design algorithms that parallelize well with minimal communication.

Or as he ended his talk:
Don't communic...

Sunday, January 23, 2011

ALENEX/ANALCO

A few quick hits from ALENEX, or SODA day 0:

  • Moraru and Anderson used Bloom filters in a nifty way to implement exact pattern matching where you have a large set of patterns and an even larger text. The idea was to store all the patterns in a Bloom filter and do a first pass over the text; every window of text that matched the filter was stored in a second Bloom filter, and in a second pass, all the patterns were run against this second filter to take care of false positives. A final exact pass did the trick (at this point both sets are small enough to be manageable); see the sketch after this list. They have a companion paper at NSDI (which is a pretty good networking conference) on using this for malware detection, and that's a good example of pairing nice algorithms engineering with some interesting applications. 
  • Chimani, Woste, and Böcker were looking at the 1-median problem on a Hamming space, and showed that the simple integer programming formulation actually does great in practice when you throw CPLEX at it. This was surprising to me on two levels: first, that CPLEX is actually free for academic use (who knew!), and second, that such a simple approach is so effective.

    It's not the "sexiest" thing in the world to solve algorithms problems in practice by using large industrial strength packages. However, both CPLEX and  SAT solvers are examples of tools that can be used in practice to solve fairly intractable problems. It still takes a lot of engineering and skill to make the heuristics work well, but it's something that we should be using as a matter of course when designing heuristics before trying to invent an algorithm from scratch.
  • Stanton and Pinar had some experimental results (and some theory) on sampling from the space of graphs that have a prescribed joint degree distribution. While degree sequences are all the rage when trying to model various "naturally occurring" graphs like router graphs or social network graphs, there's a body of work noting that graphs with the same degree distribution can have very different properties, and that statistics on the number of edges connecting nodes of given degrees (i.e., higher-order statistics on degrees) are even more relevant. They propose a simple Markov chain that allows them to sample from the space of all graphs having a prescribed joint degree distribution, and while they don't yet appear to have theoretical results on the convergence of this chain, it converges quickly in practice.
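Since the Bloom filter idea in the first bullet is so clean, here's a toy sketch of it (my paraphrase, not the authors' code: it assumes all patterns have the same length and skips all the engineering that makes the real thing fast):

```python
# Toy sketch of two-pass exact pattern matching with Bloom filters.
# This paraphrases the idea described above, not the authors' implementation.
import hashlib

class Bloom:
    def __init__(self, m=1 << 20, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _hashes(self, s):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{s}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, s):
        for h in self._hashes(s):
            self.bits[h // 8] |= 1 << (h % 8)

    def __contains__(self, s):
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in self._hashes(s))

def match(patterns, text):
    L = len(next(iter(patterns)))           # assume equal-length patterns
    bf1 = Bloom()
    for p in patterns:
        bf1.add(p)
    # Pass 1 over the text: store every window that bf1 claims is a pattern.
    bf2 = Bloom()
    for i in range(len(text) - L + 1):
        window = text[i:i + L]
        if window in bf1:
            bf2.add(window)
    # Pass 2 over the patterns: keep only those that show up in bf2.
    survivors = [p for p in patterns if p in bf2]
    # Final exact pass: the surviving set is small, so a direct scan is cheap.
    return [p for p in survivors if p in text]

print(match({"needle", "thread"}, "a needle in a haystack"))
```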
Other notes: I'll be using the hashtag #soda2011 on twitter during the day. If you're tweeting from SODA (and don't want to let the NIPS tweeters show us up!), do use this hashtag as well. 

    Wednesday, January 12, 2011

    The 5+5 Commandments of a Ph.D.

    This article is written by the holy blogging trinity of the School of Computing at the University of Utah: Matt Might, John Regehr, and myself. If you don't read their blogs, you should, because you'll learn how to program robots using an iPhone interface in subzero weather.

    There have been a lot of Ph.D.-bashing articles lately. There have been some spirited defenses of a Ph.D. too. Most of these articles make good observations, but they're often about the larger Ph.D. ecosystem and therefore fail to provide actionable advice to (potential) Ph.D. students.

    We observe that most failures of the Ph.D. system -- including both failure to get the degree and failure to see a good return on time and money invested in obtaining the degree -- boil down to a small set of root causes. These causes are on both sides of the implicit contract between advisor and advisee. Here's our pragmatic view of the conditions that need to be met for a Ph.D. to make sense. (Please keep in mind that we're all computer science professors, though we've made an effort to avoid field-specificity.)

     The advisor shall...

    1. Advise the student: help find a thesis topic, teach how to do research, write papers, give talks, etc.

    2. Provide protection from and information about funding concerns (to the level of expectations of the field, which vary widely).

    3. Proactively provide realistic, honest advice about post-Ph.D. career prospects.

    4. Provide early and clear guidance about the time frames and conditions for graduation.

    5. Introduce the student to the academic community, through conference talks, invited talks, letters of recommendation, etc.

     The student shall...

    1. As early as possible, do due diligence in researching career prospects. It's not hard to get people to talk about this and there's also plenty of written advice, in books and on the web. Carefully filter what you read since the situations may be very different between engineering fields, science fields, and the humanities. There
    may also be significant differences between sub-fields such as theoretical computer science vs. operating systems. A new student should glance at job postings and NSF statistics to determine the ratio of new Ph.D.s to open tenure-track slots.

    2. As early as possible, determine if the actual career prospects are a reasonable match for their needs/expectations. Until the student makes her expectations clear, the advisor has no clue if she simply
    must have an academic job or whether she'll be perfectly happy getting a Ph.D. and then going to law school or being a stay-at-home parent.

    3. Not be deluded or blinded by catchphrases like "life of the mind." Indeed, this life does exist, but probably only during the ABD portion of a Ph.D. A professor would be extremely lucky to live the life of the mind 15 hours a week, leaving 60 hours of advising, teaching, reviewing, writing grant proposals, traveling, and sitting in meetings.

    4. Be a good investment in terms of time and money. In other words, work hard. Students who periodically disappear for long bouts of skiing, soul searching, or contract work tend to be put on the back burner by their advisor, making it much more difficult to get re-engaged later on. An easy litmus test: if acting a certain way
    would get you fired from a real job, then it's probably a bad idea to try that in a Ph.D. program too.

    5. Jump through the administrative hoops appropriately. The hurdles are important and generally not too burdensome: take some classes, do a qualifying exam, write a proposal, and so on. These are easy to
    ignore until they become a problem. Your advisor is not likely to remind you, or even remember that you need to do them.

    Since nothing is obvious on the internet, a disclaimer: These edicts might come across as cold and overly pragmatic, and might suggest that we are ignoring the joy of discovery, the thrill of learning and the excitement of doing cutting-edge research that goes along with doing a Ph.D. Far from it: we've chosen this life because we experience all of this and enjoy it. But the easiest way to crash and burn in what is a long, multi-year haul is to forget about the brass tacks and float in the clouds.

    Tuesday, January 11, 2011

    Are open tech report sites taking off in CS ?

    For a while now, the math and physics communities have amused themselves by wondering why the CS community is slow to adopt the arxiv. In the past year or so, I've noticed an uptick in postings on the arxiv (especially around conference deadlines).

    Prompted by David Eppstein's review of 2010 in cs.DS, I decided to get some stats on publication counts at the arxiv and ECCC for the past four years. My method:
    1. Go to arxiv.org/list/FIELD/YY (thanks, David)
    2. Read off the total number of papers listed
    For the ECCC, papers are numbered by YEAR-COUNT, so looking at the last paper published each year sufficed to get the count.

    I did this for cs.{CC, DS, CG, LG} (LG is machine learning/learning theory)

    Caveat: I ignored cross submissions, so there's some overcounting. I'm hoping that at least to determine trends this is not a major issue.
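    (For anyone who wants to redo or extend these counts, here's a rough sketch that automates the same procedure. I did mine by hand, and the regex assumes the listing pages report a line like "Total of N entries", so treat it as a starting point rather than gospel.)

```python
# Rough sketch for reproducing the arxiv counts above (I did mine by hand).
# Assumption: the per-year listing page contains text like "Total of 1234 entries";
# if the page format differs, the regex below needs adjusting.
import re
import urllib.request

def yearly_count(field, yy):
    url = f"http://arxiv.org/list/{field}/{yy}"
    with urllib.request.urlopen(url) as resp:
        page = resp.read().decode("utf-8", errors="ignore")
    m = re.search(r"Total of\s+(\d+)\s+entries", page)
    return int(m.group(1)) if m else None

for field in ["cs.CC", "cs.DS", "cs.CG", "cs.LG"]:
    print(field, {yy: yearly_count(field, yy) for yy in ["07", "08", "09", "10"]})
```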

    Here are the results:

    Overall, it's clear that arxiv submissions in theory CS are climbing (and rapidly in the case of cs.DS), which I'm quite pleased to see. The growth rates themselves seem quite steady, so it's not clear to me whether the fraction of papers going on the arxiv is itself increasing (there's good evidence that the total number of papers people are writing in general is increasing).

    Micro-polymath

    As usual, I've been discussing what topic to cover for this semester's research seminar with my students. We usually cover some advanced topic, or a topic that people should know that no one teaches in our department. Students give presentations, I try to foster discussion (and grumble about the presentation style), and hopefully people learn something.

    This semester we have decided to try something different. With my students and my postdoc Jeff Phillips (HIRE HIM! He's GREAT ! and needs a job !), the plan is to try a polymath-style enterprise. Specifically, we made up a list of problems that satisfy the following criteria:
    • The problem is interesting enough in core theory-land that a solution almost guarantees a paper without having to worry about motivation, marketing, etc etc. 
    • The problem has been around and is reasonably difficult, so it's not likely to yield a solution immediately
    • There's some new idea/paper/line of attack that has emerged (either because I thought of it, or someone else in our group did) that might be fruitful
    This last point is very handy to whittle down the set of problems, because there's no shortage of open problems but very few of them might be amenable to attack at this point without new ideas. 

    We then went over each problem and voted, picking a winner. Luckily the winning problem was a consensus winner, so everyone is hopefully motivated to work on it. 

    Of course you're waiting for the punchline: which problem did we pick ? Alas, I'm not going to give that out yet. Not because of paranoia on my part, but because I'd like the students to have a certain amount of mind-space to maneuver in without having to worry about the competition. I anticipate complaints over this :).

    What I ideally hope to report on a few months from now is an actual solution to the problem. Failing that I'll at least report on the progress made, and how this 'micropolymath' concept worked out.

    Sunday, January 09, 2011

    Awards from the joint math meetings, and other notes..

    It's the new year !! Apparently, being on twitter makes blogging frequency drop, because the throwaway information one might blog about just gets tweeted. This is not a bad thing in general.

    I've been away in India for much of the winter break, after dealing with the NSF proposal deadlines. On a side note, for anyone complaining about the SODA deadline being close to the July 4 weekend, the NSF Dec 17 deadline is MUCH worse. I've submitted proposals from a cruise ship in Hawaii and at 2:30 in the morning from my parents' place in Bangalore. argghhh

    The Joint Math meetings are wrapping up in New Orleans, and award news is beginning to trickle out. Muthu mentions the prizes for Assaf Naor and Ingrid Daubechies. Much to my surprise and great pleasure, the wonderful and entertaining overhang papers by Paterson, Peres, Thorup, Winkler, and Zwick were given the David Robbins Prize for
     a paper with the following characteristics: it shall report on novel research in algebra, combinatorics or discrete mathematics and shall have a significant experimental component; and it shall be on a topic which is broadly accessible and shall provide a simple statement of the problem and clear exposition of the work
    The citation for their work is:
    The Mathematical Association of America proudly awards the 2011 David P. Robbins Prize to Mike Paterson, Yuval Peres, Mikkel Thorup, Peter Winkler, and Uri Zwick for their innovative work on two papers: “Overhang,” American Mathematical Monthly 116, January 2009;
    “Maximum Overhang,” American Mathematical Monthly 116, December 2009.
    The two papers together solve, to within a constant factor, the classic problem of stacking blocks on a table to achieve the maximum possible overhang, i.e., reaching out the furthest horizontal distance from the edge of the table. The January paper was written by Paterson and Zwick, and the December paper was written by all five people named above. The January paper proves the surprising result that n blocks can be (cunningly) stacked using suitable counterbalancing to achieve an overhang proportional to $n^{1/3}$. (Many people have assumed that the overhang of about log n, given by the standard calculus exercise, is optimal.)
    The December paper gave a complementary argument showing that an overhang proportional to $n^{1/3}$ is, in fact, the largest possible for any balanced stack.
    The papers describe an impressive result in discrete mathematics; the problem is easily understood and the arguments, despite their depth, are easily accessible to any motivated undergraduate.
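    To put numbers on the contrast: measured in block lengths, the classical single stack (the "standard calculus exercise") gives an overhang of half the harmonic sum, while the two papers together pin down the true answer to $n^{1/3}$ up to constant factors:
    $$\mathrm{overhang}_{\mathrm{harmonic}}(n) \;=\; \frac{1}{2}\sum_{k=1}^{n}\frac{1}{k} \;\approx\; \frac{1}{2}\ln n, \qquad \mathrm{overhang}_{\max}(n) \;=\; \Theta(n^{1/3}).$$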
    In other news, cstheory is humming along after the winter break. We're closing in on 3000 users. I'm always hoping for more on-target questions, especially in areas like approximations, geometry and AGT: complexity theory seems over-represented (which isn't a problem, but diversity is great!). We're hoping to develop some formal links with SIGACT fairly soon: stay tuned.

    Tuesday, December 07, 2010

    ACM Fellows

    The ACM just announced the 2010 Fellows, and there are many theoreticians, and many people that I know - congratulations to all !!

    Theoreticians on the list:

    Also congratulations to Kathleen Fisher, Fernando Pereira, and Lydia Kavraki, ex-colleagues and collaborators.

    Tuesday, November 23, 2010

    Guest Post: Distinct distances, and the use of continuous mathematics in discrete geometry.

    Nabil Mustafa specializes in problems of discrete geometry, and has spent time on problems related to the distinct distances problem. He offers a broader perspective on the new distinct distances result and related breakthroughs in discrete geometry. For a detailed breakdown of the Guth/Katz paper, see Terry Tao's latest opus. As usual, I take responsibility for any mistakes introduced in the editing process.

    I think the days when the "Feynman method" was all that was needed to make progress on basic problems in Discrete Geometry are over. Recently there have been a slew of results which make progress on long-standing open problems in Discrete and Computational Geometry that use techniques from a variety of areas in mathematics: algebraic geometry, algebraic topology, complex analysis and so on.

    This is wonderful news for the field. Two things to consider, though:

    1. So far, apart from the kind of work going on in "Computational Topology", this has mostly been a one-way street. There have been fewer cases of discrete and computational geometers going into topology, algebraic geometry, etc. and making a similar impact there. Similarly, there are very few collaborations between mathematicians in other areas and discrete geometers. (ed: Mulmuley also argues that the GCT program will only come to fruition when this reverse direction starts happening)

    2. From my experience, current graduate students, by and large, still seem to be stuck with the general outlook that "If I'm clever enough, and think hard enough, I can solve any problem directly". Such optimism always cheers me up. I used to think that too, but the more I learn about other areas, and as the recent work reveals the power of these techniques, it has become clear to me that this is misguided. I would really advise students to take courses in algebraic topology, differential geometry and so on.

    Below I list some such recent breakthroughs.

    1. First-selection Lemma. 
     Given n points in $R^d$, can one find a point in "many" simplices spanned by these points ?
    This has been studied for more than 30 years, with several partial results. The current best result, published this year by M. Gromov, in fact proves a stronger topological theorem, with better bounds than the earlier, more restricted results. Matousek and Wagner have improved this bound slightly for the 3D case by improving a combinatorial part of Gromov's argument. Gromov's argument is a complicated topological one that I did not have the background to follow. J. Matousek gave a nice talk at the Bernoulli conference in September with the title "Also Sprach Gromov"!
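    (ed: for concreteness, the statement being improved is: for every set $P$ of $n$ points in $R^d$, there is a point of $R^d$ contained in at least $c_d \binom{n}{d+1}$ of the closed simplices spanned by points of $P$, where $c_d > 0$ depends only on $d$. The improvements are all about the constant $c_d$.)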

    2. Colored Tverberg Theorem. 
    Let $C_1,\ldots,C_{d+1}$ be disjoint subsets of $R^d$, called colors, each of cardinality at least $t$. A $(d+1)$-subset $S$ of $\bigcup_{i=1}^{d+1}C_i$ is said to be multicolored if $S\cap C_i\not=\emptyset$ for $i=1,\ldots,d+1$. Let $r$ be an integer, and let $T(r,d)$ denote the smallest value $t$ such that for every collection of colors $C_1,\ldots,C_{d+1}$ of size at least $t$ there exist $r$ disjoint multicolored sets $S_1,\ldots,S_r$ such that $\bigcap_{i=1}^{r}{\rm conv}(S_i)\not=\emptyset$.
    The conjecture is that $T(r,d) = r$, and this was proved recently via topological arguments (for all $r$ such that $r+1$ is prime) by  Blagojevic, Matschke, and Ziegler (see Gil Kalai's blog for a detailed post on this). Matousek, Tancer and Wagner have translated this argument to a geometric proof. As they state in the abstract, "The purpose of this de-topologization is to make the proof more concrete and intuitive, and accessible to a wider audience."

    3. Distinct Distances problem. 

    The Guth-Katz result dramatically improves the best known lower bound, via techniques from algebraic geometry. Terry Tao has more details on this.
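    (ed: to quantify "dramatically": the new lower bound is $\Omega(n/\log n)$ distinct distances for $n$ points in the plane, nearly matching the $O(n/\sqrt{\log n})$ upper bound achieved by the $\sqrt{n}\times\sqrt{n}$ integer grid.)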

    4. Joints problem. 
    A joint is a point formed by the intersection of three non-coplanar lines in 3D. What is the maximum number of joints achievable by a collection of $n$ lines in 3D ? 
    This was again solved by Guth and Katz via techniques from algebraic geometry. It was subsequently simplified wonderfully by Elekes, Kaplan and Sharir, and Kaplan, Sharir and Shustin.
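    (ed: for the record, the answer is $\Theta(n^{3/2})$: Guth and Katz prove an $O(n^{3/2})$ upper bound, and the axis-parallel lines of a $k \times k \times k$ grid ($n = 3k^2$ lines, $k^3$ joints) show this is tight up to constants.)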

    5. Lower-bounds for eps-nets. 

    It was very widely believed that for "natural" geometric set systems, eps-nets of linear size should exist. Shockingly, Alon showed that an easy application of the density Hales-Jewett theorem immediately gives a non-linear lower bound for a simple geometric set system in the plane. While DHJ is a combinatorial result, it was first "proved by Furstenberg and Katznelson in 1991 by means of a significant extension of the ergodic techniques that had been pioneered by Furstenberg in his proof of Szemeredi's theorem". As with earlier proofs, it is possible to "de-ergodicize" it (this was the subject of the first polymath project).

    6. Regression-depth partitioning conjecture. 

    (ed: see here for a description of regression depth - it's similar to halfspace depth)

    Partial results were shown by Amenta-Bern-Eppstein-Teng in 2000. The conjecture was recently almost proven by Karasev, using topological techniques that I do not understand.

    This is just a partial list; there are probably several other results that I have missed.

    Monday, November 22, 2010

    Workshop on Parallelism, and a "breakthrough" in combinatorial geometry

    In the blog world, you're either fast, or dead. I waited a few days to post a note about the upcoming DIMACS workshop on parallelism, and was beaten to the punch by Muthu and Dick Lipton.

    At this point, you probably know the major points. Phil Gibbons, Howard Karloff and Sergei Vassilvitskii are organizing a DIMACS workshop on a long-range vision for parallelism, with an emphasis on interaction between practitioners and theoreticians in the area. The workshop looks to be quite interesting, and similar in spirit to the one organized by Uzi Vishkin at STOC 2009 on multicore algorithms.

    Utah has a major presence in the world of high performance parallel computing, and we're hiring in this area this year ! From my vantage point, I get a fly-on-the-wall view of developments in the area, and it makes me wonder:
    Is the field now moving too fast for models to even make sense ? 
    A case in point: Google's MapReduce infrastructure has taken the world by storm, and you can even run MapReduce "in the cloud" on Amazon's servers. Howard, Sergei and Sid Suri had a very interesting paper on MapReduce at SODA last year, and there's a ton of academic work on algorithm design (in theory and practice) in this model. But if new reports are to be taken seriously, Google itself is moving on from MapReduce to other distributed data management systems.

    The most recent "alternative model of computing" that we've encountered is the stream model, and while it had strong links to practice, it's survived this long because of the incredibly interesting theoretical challenges it poses, as well as the rich arsenal of theory concepts that got deployed in the process. I'm curious to see if something similar can happen with these new models of parallelism and multicore computing (over and above PRAM and NC of course)

    In other news: readers of Gil Kalai's blog will have heard of the exciting new breakthrough in the realm of combinatorial geometry on the tantalizingly simple problem first posed by Erdos in 1946:
    What is the minimum number of distinct distances achievable by a set of n points in the plane ? 
    I hope to have a longer guest post up soon on this, so I won't say much right now. What's neat is that this result is the implementation of a larger program/line of attack laid out by Gyorgy Elekes, who sadly died in 2008 before seeing this come to fruition. Micha Sharir has a beautiful writeup with the history of this approach (it predates the new results though). Bill Gasarch has a nice page collecting together the major developments in this area, including links to the Kakeya problem.

    p.s. Note to other CG bloggers: please do not talk about this problem, otherwise Mihai will scold us all ;)

    Friday, November 12, 2010

    What can the ACM do for you ?

    I kid, of course. While the ACM is a popular punching bag in hallway conversations, they do help the CS community in many ways, not the least of which are the digital library, the conference management services, the lobbying efforts in Congress (where I think, though, that the CRA has them beat), and so on.

    But there's this constant running joke that the ACM is always trying to drag us kicking and screaming into the 1970s :). More importantly, while the DL has been spiffed up with comment feeds, all kinds of social sharing and what not (but no RSS for comments - COME ON !!), I think that the ACM could use its considerable weight to really help the research community. In both of the ideas I'm listing below, the ACM has particular advantages as an organization with brand recognition and an imprimatur of authority. They also have resources far beyond what other groups might be able to muster.

    A Pubmed for CS
    More than six years ago, I was complaining about the lack of a Pubmed-style single location to dump all CS papers (titles, abstracts and keywords, with links back to original source). While the arxiv is becoming a better and better place for people to post preprints, what we really need is a single point to view and browse published papers, across journals and conferences.

    The DL already maintains paper listings across a variety of publications, and many CS conferences are run by the ACM, so it shouldn't be that hard to do. The DL itself isn't quite there yet, mainly because of the lousy search features that it has, and because it's not quite complete.

    I don't know if Dan Wallach's proposal for a central hub for papers is going anywhere, but that's another model that the ACM could help make a reality. 

    A Mathematical Reviews clone
    This one is possibly more localized to theoryCS, and requires more resources. But it's getting ever more important. It's really hard to keep track of the flood of papers showing up in conferences, journals and on the arxiv, and a service that generated short reviews of papers would be great. Plus, as theoryCS gets more fractured and more diverse, this kind of thing becomes ever more important.

    It seems like the ACM is getting the 'social bug' if the DL redesign is any indication. I'd argue that these two items are probably the best kind of 'social web' that the ACM can contribute to.

    Friday, November 05, 2010

    Odds and Ends

    Glencora Borradaile makes her recommendations for FOCS videos to watch (can we have all our conferences do videos, please ? My 40 minutes on the bus will take on a whole new dimension). I like the idea of starring speakers who do a good job: might increase the competitive pressure. Come to think of it, here's another perk of having videos for conferences: sheer embarrassment will make sure that speakers improve their presentations.

    The cstheory Q&A site rolls on. Those of you at FOCS might have noticed the little sheets in your registration packets, but I don't know if they actually encouraged new people to visit. If you're one of them, do drop a note in the comments. Three points of interest:
    • We have a collaboratively edited article that we just submitted to SIGACT News (you can read the draft here). The article highlights some of the recent fascinating questions, answers and discussions on the site - do check it out
    • This is purely anecdotal, but I've been hearing both at FOCS and on the site that people are having to avoid visiting the site because it's so addictive ! Don't worry - it isn't that bad - as a matter of fact I only spent 6 hours last night ! down from.. (never mind)..
    • Another sign of catastrophic success: our first incident of undergrads in a theory class systematically trying to get their homework questions answered by posting on the site. Many alert users were able to close down the questions, but occasionally answers slip by. If you're an undergraduate and read this blog (HA ! blogs are for old fogies !), be warned...
    Finally, because I'd be remiss not to do so, some neat questions (and answers) to chew on:

    Thursday, November 04, 2010

    All FOCS talks are online

    All FOCS 2010 talks are now online, at ieee-focs.org and http://techtalks.tv/focs/2010. The player works fine on firefox (the picture is embedded in one corner of the slides) and I find the talks particularly pleasant to listen to if you have  a nice backing track :)
