The Geomblog: 05/01/2008

Wednesday, May 28, 2008

P vs NC IV: Linear PRAMs I

1. Prelude

Taken together, the P vs NC paper is quite intimidating in its size and scope. And yet, when it's cut down to bite-sized blogging chunks, the pieces seem to follow inexorably in a manner that leaves one asking, 'how else could it have been'. There's the key inspiration of using parametric complexity, the magic trick we'll see more of today. But the rest of it uses the kind of geometric arguments that are familiar to anyone who's read papers on the complexity of arrangements. Make no mistake: taking care of issues of bit complexity requires delicate arguments. But the main thrust of the approach is surprisingly easy to explain, ONCE we've accepted the magic trick.

In this post and the next, we'll begin to see glimmers of the WHY for the parametric argument. In brief, using parametric complexity allows us to navigate an affine subspace of the space of all possible inputs, and then we let geometry take over and guide us.

2. Linear PRAMs

We're close to seeing the main killer argument at the heart of the paper. But as with many arguments, it's easier if we look at a skeleton argument applied to a somewhat simpler case. We already saw a toy example of this argument, and now we'll move to a more nontrivial model, that of linear PRAMs.

In the first post, we defined a linear PRAM as a PRAM-without-bitops in which all multiplications are by a constant. We'll see a slightly stronger lower bound for computations in this model. Specifically, we will show that for a problem with parametric complexity $\phi(n, \beta(n))$ and bitsize $\beta(n)$ , there exists a large enough constant b such that the problem cannot be solved on a linear PRAM with p processors in time $\log \phi/(b \log p)$ . This generalizes the bound that will hold for the stronger model: namely, time $\sqrt{\log \phi}/b$ with $2^{\sqrt{\log \phi}/b}$ processors.

Substituting what we've already seen for the parametric complexity of max-flow, we recover a lower bound of $n/(b\log p)$ time using p processors, for an n-node network.

Notes:

For clarity, I'm skipping over the specification of bitlengths: these will fold in later, but throwing in all these numbers right now confuses the flow.
As before, the lower bound applies for approximations and randomization: how this happens will be discussed later on.

3. Bounding the shape and number of branchings

The high-level proof strategy first requires us to bound the number of possible branchings of an algorithm in this model, and then requires us to describe the "cells" of these branchings in some geometric manner. As we did earlier, we'll say that two inputs x, x' are t-equivalent if the machine executes the same set of instructions on x and x' for the first t steps of the computation.
This defines an equivalence relation, and our goal will be to estimate the number of equivalence classes in this relation (as a function of t).

Now let's introduce the parametrization. Suppose that each input x is a function of d parameters $z_1, \ldots z_d$ , where each particular input variable $x_i$ is expressed as a linear function of the parameters. We'll further assume that these linear forms are integral, and the parameters range over integer values with some bounded bitlength.

The partitioning of the inputs x into equivalence classes induces a partitioning of the parameters in the obvious way (two parameter vectors z, z' are t-equivalent if the corresponding x, x' are). Let $\sigma(t)$ denote the number of t-equivalent classes in the parameter space. The main result we can show now is that:

For all t, $\sigma(t)$ is bounded by $O(p^{dt})$

Here's the proof argument (and for those of you who've waded through parametric search, this will be familiar): First of all, remember that since all operations are linear (since one term of each multiplication is a constant), each arithmetic operation is some linear function of the input. From this, and the assumption that memory pointers are not functions of the input (this is the circuit view of the computation), it follows that the contents of each memory location are a linear function of the equivalence class the input lies in, and the time t.

Now imagine simulating the execution of the algorithm on an as-yet unspecified parameter vector z (as we do in parametric search). What this means is that arithmetic operations are linear forms defined over the symbolic value of the parameter. When a branch occurs, the branch outcome is determined by comparing two linear forms, which reduces to testing the sign of a linear expression. Let's take some z in a fixed t-equivalence class C. From the above remarks, each branch involves testing at most p linear forms on z, each of which can be viewed as a hyperplane in $R^d$ . If we now look at the arrangement of these hyperplanes, then each cell represents a fixed set of decisions for the branch outcomes. By standard arguments, the complexity of this arrangement is $O(p^d)$ .

Now how does this arrangement evolve at the next time step ? Clearly, any single piece of the equivalence class C can be split by the arrangement into smaller pieces, based on decisions that are made at the next time step. But from above, there are only $p^d$ such splits possible. Therefore, $\sigma(t+1) \le O(p^d)\sigma(t)$ , yielding the desired bound.

A corollary is that if we track any particular t-equivalence class, it is split by at most pt linear forms (one for each time step and each processor), and so it is sufficient to check the signs of these forms to determine whether any particular z belongs to C.

Finally, taking a union over all possible equivalence classes, we get the following result:

If a set of inputs parametrized by d parameters is accepted in the linear PRAM model with p processors in t time, then there is a way of partitioning $R^d$ using $O(p^{dt}pt)$ hyperplanes, so that each cell of the resulting arrangement can be labelled 0 or 1 so that an input z is accepted iff it lies in a cell labelled with a 1.

In the next part of this argument, we will show that a problem with high parametric complexity cannot be represented this way: that is, there must be some cell that contains inputs in the language, and inputs not in the language.

Tuesday, May 20, 2008

P vs NC III: Parametric complexity

(note: this is part III in the series on Mulmuley's separation of P and NC without bit ops. For all posts in this series, click on the tag 'p-vs-'nc')

Parametric complexity is the notion of 'intrinsic hardness' that will be used to show the main lower bound. In this post, we'll go over the notion of parametric complexity, and view s-t mincuts (and dually, max flows) in this light.

Consider the max-flow problem on n vertices. It is parametrized by the capacity of each edge. Suppose now that we replace each such parameter by a linear function of a parameter $\lambda$ , taking care that for some fixed range [p,q] all such capacities are non-negative. We get a parametrized instance in which the optimal flow f is a function of the parameter $\lambda$ . Moreover, since the capacities are linear functions of $\lambda$ , f itself is piecewise linear.

Let $\rho(f)$ be the number of breakpoints of f i.e the number of points at which the slope of f changes. To picture this, imagine that some edge is critical for f, in that increasing the capacity on this edge increases f. Clearly, as the parameter increases, the capacity increases linearly, and so does f. At some point though, the capacity becomes large enough that this edge no longer plays a key role in capping f. At this point, some other edge (with capacity changing at a different rate) starts playing this key role, and f starts changing linearly again, but at a different rate.

Now suppose we compute $\rho(f)$ for all possible parametrizations. The maximum value is called $\phi$, and is called the parametric complexity of f. Technically, $\phi$ is also a function of both n, the number of vertices in the graph, and $\beta(n)$ , the bitlength of the input. The main result of this paper is the following two-step dance:

Max-flow has high (exponential) parametric complexity
A problem with high parametric complexity has high parallel complexity

In the rest of this post, we'll look at the first step.

Before we go on, another simple observation. Parametric complexity starts looking a lot less mysterious if you think geometrically. Let's look at min-cut, since it's a bit easier to see what's going on. Any s-t cut is some set of edges, and its capacity is the sum of the corresponding edge capacities. Therefore, the capacity of a cut is a linear function of $\lambda$ . Graphically, we can think of this capacity as a line segment starting at $\lambda=p$ and ending at $\lambda=q$ .

We can do this for all cuts in the graph (and there are exponentially many of them). What we get is a familiar beast: the arrangement of the collection of line segments. Now what's the min-cut. For any value of $\lambda$ , it's the cut with the smallest value. In other words, the function describing the min-cut as a value of $\lambda$ is the lower envelope of this arrangement of line segments, and the function $\rho(f)$ is merely the complexity of this lower envelope (the number of pieces in it).

Now the lower envelope of a collection of lines (note that although these are line segments, they extend across the domain of interest and therefore function as lines for combinatorial purposes) in 2D is linear in the number of lines. Since there are exponentially many cuts, we can expect that the complexity of the lower envelope is exponential. The hard part is showing that all the dependencies (a single edge can participate in many cuts) don't prevent us from achieving this.

Some technical points:

We assume that non-numeric parts of an input remain fixed for all values of the parameter (i.e an edge doesn't disappear or reappear)
All linear functions come with rational coefficients.
The bitsize of the problem is defined as the maximum bitsize of any of the vertex descriptions and the coefficients of the linear functions. Note that although

$\lambda$ can vary continuously, interesting events happen at breakpoints that can be described rationally.
High parametric complexity bounds have already been proved for many problems of interest. Combinatorial linear programming with linear bitsizes has a parametric complexity of $2^{\Omega(n)}$ . Min-cost flow has similar complexity, and the same bound can be proved for max-flow with quadratic bitsizes. Mulmuley in fact reproves this last result with what he says is a simpler argument, and also shows that in contrast, the global min-cut problem has polynomial parametric complexity (which is a good thing, since there's a RNC algorithm for it !).

Let's take a closer look at the parametrized complexity of max-flow (actually its dual, min cut). The goal is to come up with a construction that exhibits a parametric complexity exponential in the input. So we need a graph, and a way to parametrize edge capacities.

The graph itself is quite simple. It has a source s, a sink t, and vertices from 1 through n (call this set A) and 1' through n' (call this set B). s is connected to all the interior vertices, as is t. Each of A and B forms a clique. Moreover, vertices i and j', and j and i', are connected (i and j are different). One possible cut of the graph consists of all vertices in A (along with s). Another consists of all vertices in B (along with s). It will turn out that by choosing weights appropriately, both of these will represent min cuts, as will any set formed by exchanging some vertices from A with their prime counterparts in B. Since there are $2^n$ ways of doing this, we'll get the desired bound.

This is where things start getting a little tricky, and I must confess that I don't have a clean way of explaining what happens next (which is a short way of saying that you might be better off reading the relevant construction !). In brief, the idea works like this: an edge from s to i (or i') is given a capacity $w_i T$ , where $T = 2^{n+1}$ is a constant, and the capacities are set up to be exponentially decaying: $w_i = (2nT)^{n-i})$ . An edge from i to t is given a capacity $w_i(T+\lambda)$ , and an edge from i' to t is given capacity $w_i(T-\lambda)$ . The parameter $\lambda$ is set to run from -T to T. Note that at these two extremes, either all edges from A to t have capacity zero, or all edges from B to t have capacity zero (making them good candidates for cut edges).

There are also cross edges to consider. An edge from i (or i') to the opposite side has the "default" capacity $w_i T$ . However, edges in the cliques (inside A and B) are weighted differently: the edge from i to j in A has capacity $w_i(T - 2^{n-j+1}$ , and this is the same for an edge from i' to j'.

What's the capacity of the cut that contains s and A on one side ? We add all edges from s to B ( $T \sum w_i$ ), and all edges from A to B ( $Tn(n-1)\sum w_i$ ) and A to t ( $T\sum w_i + n\lambda$ ). Clearly, this is minimized for $\lambda = -T$ , and it turns out that this is in fact a min cut. A similar argument works for a cut containing s and B, and $\lambda = T$ .

But what happens as the parameter increases ? Note that it makes no sense to have i and i' on the same side of a cut, because there's no edge between them and so it's always better to keep them separate. This means that the min cut is always "complementary" with respect to A and B, which is useful. It also means that the cut will change by having i and i' swap positions as the parameter increases. The question that remains is how many times this can happen.

It turns out that via a reasonably clever (but with a boring inductive proof) argument, it can be shown that the region spanned by the parameter can be divided into $2^n$ regions by a binary division, and when each region is encoded using a standard bitstring, the string encodes the min cut set (1s denote elements from A, and 0s denote elements from B, and so on). The proof works by arguing that for any region defined by the binary division, the min cut remains invariant, and can be read off from the binary expansion of the region. This is proved by induction on the length of the bitstring.

Finally, since each min cut defines a linear function of the parameter with a different slope, we obtain the desired result: namely that the number of distinct slope changes is exponential.

Monday, May 19, 2008

CCF restructuring

(note: this post is probably only of interest to readers in the US)

James Lee posts a note from the Visioning workshop held ahead of STOC 2008. There's some interesting news about the restructuring of CCF (the "theory" section within the NSF's Computer Science directorate). The previous CCF structure (run by Michael Foster) consisted of:

Theoretical Foundations (TF), which included all of theoryCS, as well as geometric modelling, some visualization, information theory, and signal processing.
Foundations of Computing processes and artifacts, which among other things included graphics and visualization, as well as some software, architecture, and programming languages
Emerging models, which had quantum, nano, and DNA computing.

The new CCF (which will be run by Sampath Kannan), looks like this:

Algorithmic Foundations
Communication and Information Foundations
Hardware and Software Foundations

The first one contains pretty much all of theoryCS as before, but from the sound of it will not have the other areas like information theory and signal processing (which I imagine will be in the second cluster). The algorithmic side of quantum computing (as opposed to the engineering side) will also be folded in.

One interesting comment was that geometry will be folded back into the main cluster. Currently, CG, geometric modelling, and symbolic methods have a separate pool that awards are made through, under the larger TF cluster. It sounds like this will change, and CG will "compete" in the main theory pool.

This will probably make life interesting in the short run. I don't think I'm overstating the case to say here that there's a not trivial amount of hostility between CGers on one hand, and the "regular" STOC/FOCS crowd on the other. If I had a dollar for each time someone vowed never to attend STOC/FOCS because of perceived slights to geometry, I'd never have to write grant proposals. On the other hand, I have heard (from various sources) of people who say things like 'Geometry is not real theory'.

It should be interesting to see how this all shakes out during panel review. I suspect/hope that things will happen reasonably, in that proposals will be binned and panelists assigned appropriately.

Saturday, May 17, 2008

The history of Alice and Bob

The Alice and Bob approach to cryptography is brilliant on so many levels:

It personalizes the discussion, rather than dealing with impersonal As and Bs
It creates a gender-neutral discussion framework
It's much more effective at capturing the imagination of the non-expert public (see point 1).

I was wondering about the origin of this coinage. It's attributed to the original RSA paper, and here's what Ron Rivest had to say about it:

RSA co-founder Rivest, who is a Massachusetts Institute of Technology (MIT) professor, says he came up with Alice and Bob to be able to use "A" and "B" for notation, and that by having one male and one female, the pronouns "he" and "she" could be used in descriptions. Rivest says it is possible that Alice came to mind because he is something of an Alice in Wonderland buff.

If you've wondered about the deeper personal lives of Alice and Bob, go no further than this speech. And for more on the extended Alice and Bob family, check out the wikipedia entry. (xkcd has more on the sad tale of the maligned other woman Eve).

Friday, May 16, 2008

The rectangle method in communication complexity

It's tempting to take way an impression that the P vs NC proof is a one-off kind of thing, with no real lessons for other kinds of lower bounds that one might want to prove. Only time will tell whether this is the case or not.

But I was thinking about communication complexity lower bounds recently. Consider the standard 'rectangle' method for proving a communication complexity lower bound. As usual, we have some (Boolean) function f(x,y) whose answer we wish to compute. Alice has x and Bob has y, both n-bit strings, and we want to exchange as few bits as possible in order to compute the answer (since f is Boolean, it doesn't really matter who ends up with the final answer: they can always pass a bit back to the other party).

A natural description of the communication is in the form of a $2^n \times 2^n$ matrix, where each row is labelled with one of Alice's inputs, and each column is labelled with one of Bob's possible inputs. Now let's say Alice sends the first bit. This bit depends only on x, and so we can partition the rows into two blocks, one containing all inputs for which Alice sends a 1, and the other containing all rows for which she sends a 0. As the communication proceeds, we can further partition the space $X \times Y$ into equivalence classes of sets of inputs that induce the same communication pattern.

So far, we're doing what most lower bound arguments tend to do: finding equivalence classes of inputs prior to doing some kind of counting (think of the sorting lower bound). But here comes a neat twist.

Each equivalence class thus defined forms a rectangle in the matrix.

The proof of this is not too hard, and is a consequence of the fact that Alice and Bob can only see their side of the $x \times y$ pair. Let's say one of these equivalence classes contains the pair of inputs $(x_1, y_1)$ and $(x_2, y_2)$ , which means that for both these input pairs, the conversation transcript is identical.

But now what happens if the input was instead $(x_1, y_2)$ ? Alice sees $x_1$ , and sends the same first bit out that she would have sent if she had seen $x_1$ (because of the equivalence). Bob sees this bit, and looks at $y_2$ , and has no way of telling that Alice's input was NOT $x_2$ , and so sends out the next bit in the $(x_2, y_2)$ transcript. Repeating this argument for the remainder of the conversation, it's easy to see that $(x_1, y_2)$ must also generate the same transcript, as must $(x_2, y_1)$ . Geometrically, this proves that every equivalence class must be a rectangle, and therefore, the matrix consists of rectangular patterns of 1s.

In other words, what we've done is come up with a geometric way of describing equivalent classes of inputs. This means directly that things like EQ(x,y) (is 1 if x=y) need n bits of communication, because the corresponding matrix has 1s along the diagonal only, and there's no way of covering these with fewer than $2^n$ monochromatic rectangles. Again, we're using the geometry induced by the computation to make arguments about the lower bound construction. This geometric structure is used algebraically for the more involved rank-based lower bound argument as well.

I'm not claiming that these are identical arguments to what we're seeing in P vs NC. It's that the idea of exploiting the geometry induced by a set of computations is not as unfamiliar as it might seem.

The Geomblog

Pages