## Monday, May 05, 2008

### P vs NC Part I: Preliminaries

We start our exploration of Mulmuley's result on P vs NC with some definitions, and an "easy" example of a lower bound.

(For the definition of P and NC, visit the Complexity Zoo)

1. P-completeness

NC is our proxy for "can be parallelized efficiently" just like P is our proxy for "efficient algorithm". P contains NC, and as with P vs NP, the interesting question is whether the containment is strict: that is, are there problems that are inherently sequential ?

A theorem that separates P and NC will show that there's some problem in P that cannot be parallelized. Obviously, we want to pick a hard problem in P. P-completeness captures the idea of 'hard-core' problems for P, and the equivalent of SAT for P-completeness is the circuit value problem: given a circuit, evaluate it. We'll actually be looking at a different P-complete problem here, but one that's equally familiar: min-cost flow on a network with n nodes.

Remember, this result DOES NOT show that P strictly contains NC. What it shows is that P strictly contains a fairly general (but still restricted) subclass of NC.

2. The PRAM model

A convenient computational model to work with when talking about NC algorithms is the P(arallel) RAM model. PRAMs were all the rage in the 80s when supercomputers were being built, fell out of favor in the 90s when parallel computing lost its sheen, and are possibly making a comeback today in the world of multicore systems, GPU clusters, and MapReduce.

As the name suggests, a PRAM consists of many processors, each able to run a general purpose program with branchings, indirections, and arithmetic operations +, - X. A problem is in NC if it can be implemented to run in polylogarithmic parallel steps on polynomially many machines (where in one parallel step, each machine might perform one step of a computation).

A bit operation on a PRAM is any operation that treats data as bit-strings, such as AND, OR, and other boolean operations on bitstrings, or even 'extract-bit': get a particular bit of a string. All such operations will be excluded in the models considered here.

It will be important to pay attention to input size. As always, the input size can either be described by n, the cardinality of the input (the quantity counting the number of "things"), or by N, the bitsize (the quantity measuring the number of bits needed to write everything down). When we talk about polylogarithmic time and polynomial number of processors, unless specified otherwise, we will be referring to polynomials of n AND N.

In the non-uniform setting, we'll assume that we're presented with a PRAM that for fixed n, N, runs in time t(n, N) with p(n, N) processors. Each input consists either of an integer, or a rational with integer numerator and denominator presented separately (so that we can do division more easily). Division of two rationals is encoded via multiplication, so that p/q is computed as p * 1/q, and 1/q is computed by reversing the numerator and denominator of q.

The cost model is unit-cost, which means that any operation takes constant time, regardless of the relative sizes of the inputs. It is very important to exclude the floor operation when we do this: if not, judicious use of the floor function along with unit cost operations would allow us to solve PSPACE-Complete problems in polynomial time ! Since we are only concerned about lower bounds, assuming unit-cost for operations can only make the lower bound weaker than it actually is, so it's a reasonable thing to do.

It'll actually be useful to consider four variants of the basic PRAM-without-bit operations model.
1. Arithmetic PRAM: In this model, the running time depends on the cardinality of the input alone, rather than the total bitlength.
2. Linear PRAM: Here, at least one of the operands in any multiplication operation must be a constant.
3. Arithmetic Linear PRAM: Combine the two above
4. PRAM: None of the above restrictions
We will also elide the difference between the various memory access protocols (EREW, CRCW etc) since they only affect the overall running time by a polylogarithmic factor through well-known simulation results.

3. Are these good models ?

The dilemma of lower bounds is the problem of finding models that are rich enough to describe actual algorithms, while being weak enough that we have any hope of proving lower bounds in the first place. Is the PRAM-without-bit-operations a reasonable model of parallel computation ?
Mulmuley argues that it is, by pointing out that this model captures "virtually all" known parallel algorithms for optimization and algebraic problems. Even the weaker models proposed have exemplars: for example, one problem that admits an arithmetic PRAM is solving linear systems over rationals.I am not aware of any algorithm that appears to need bit operations in any nontrivial way, and would welcome pointers to possible candidates.

4. The problem of bit extraction.

Since we've excluded bit operations from consideration, it's natural to wonder how expensive it really is to extract a bit (or in other words, simulate a step from a general PRAM). It turns that that a very simple version of the general proof argument can be used to show a lower bound here, and although it's not terribly relevant, it's useful as a warmup exercise.

Proposition: The lowest order bit of an n-bit operand cannot be extracted in time using processors in the PRAM-without-bitops model, for some large enough constant a.

Why on earth should this be true ? After all, to test for odd/even, all we have to do is divide the number by 2 and take the remainder ! But how do we do that ? the only operations permitted are on integers or rationals, and the operation "5/2" will merely return the rational (5,2). If we had a floor function, we could compute , but we don't have it.

On the other hand, is this tight ? Clearly, if the bitlength is log-bounded, you can guess it by doubling, and then extract bits starting with the most significant one. This can be done in O(l) time sequentially, if the bitlength is l. Generalizing, you can split an l-length string into pieces of size , and work on each piece, starting with the most significant one, for a constant number of parallel time units. On such a piece, you have processors guessing all possible values, and then check which one is correct. Subtracting, you can then proceed to the next piece. This is one way of understanding where the bounds in the proposition come from.

So if we're convinced that the 'form' of the proposition makes sense, let's see how we might go about proving it.

Proof outline:
Assume that there actually is a machine that can return the lowermost bit in the desired bounds , . Suppose we simulate the action of this machine on two inputs x, x' upto time t. If the machine, upto time t, behaves the same on both inputs, then we say that they're in the same t-equivalence class.

Can we bound the total number of such classes ? Note that if we can do that, then we might have a hope of using the pigeonhole principle to argue that at least some class might have too many elements in it: since all elements in an equivalence class will yield the same outcome, we might be able to use this to create a contradiction.

Consider the branch operations executed by the machines at the (t+1)^th step for two t-equivalent inputs x, x'. Since the computation is identical for the two inputs upto the previous step, any branching operations consist of evaluating the sign of some polynomial in x (because all the stored memory locations are polynomials in x). Each of these polynomials has degree at most , and therefore has at most real roots. Collecting all these roots together for all machines, we get a subdivision of the real line into intervals, and (this is the crucial part) in each of these intervals, the answers to sign queries for any of the polynomials does not change.

[Aside: this trick, of constructing an arrangement over a geometric space such that each cell contains a fixed answer to a problem, is fairly common in computational geometry].

What this means is that each t-equivalence class can blow up into only a certain number of (t+1)-equivalence classes (roughly ), and by induction, this means that over a total of t(n) time steps, there can be at most something like equivalence classes.

Now equivalence classes partition the space of possible inputs, but they don't really tell us how: any one partition might have inputs from all over the real line, and knowing the number of equivalence classes in and of itself doesn't tell us why estimating the lowest bit might be difficult. In order to make that leap, we have to further fracture these equivalence classes into actual segments on the real line, such that in each segment, all inputs are t(n)-equivalent.

This is not too hard to do, since we know that all polynomials evaluated in the process have degree at most . Since there are at most t(n)*p(n) polynomials to consider in any computation for any particular equivalence class, the total number of intervals induced by roots of these polynomials is roughly , which is still . Two inputs from any one of these intervals are treated in exactly the same way by the computation.

Now comes a neat application of the pigeonhole principle. Choose a large enough so that . If we label each of the possible integers with their correct output (the lowermost bit), we get an alternating sequence of zeros and ones, and clearly a correct algorithm must create at least intervals to ensure that no interval contains numbers labelled both one and zero. But if we choose a large enough, we can ensure by the pigeonhole principle that there will be at least one interval that is "too long" i.e must contain inputs labelled both one and zero. Since by construction, the algorithm must return the same output on ALL inputs in this interval, it will be wrong on at least one input, which yields a contradiction.

We will use many elements of this basic proof structure as we go along.

1. 2. 