## Monday, October 31, 2011

### Models for MapReduce

I've been listening to Jeff Phillips' comparison of different models for MapReduce (he's teaching a class on models for massive data). In what follows, I'll add the disclaimer IANACT (I am not a complexity theorist).

There's something that bothers me about the various models floating around that attempt to capture the MapReduce framework (specifically the MUD framework by Feldman et al., the MRC framework by Karloff, Suri and (co-blogger) Vassilvitskii, and the newer Goodrich-Sitchinava-Zhang framework).

For "seasoned" models of computation like the RAM, or PRAM, or even I/O and streaming, there's a sense of identifying some intrinsic feature of the computation that you wish to capture, building a model around it, and then identifying resources that you wish to expend little of. After the dust settled around the different TM/RAM models for effective computation, we now accept (for the most part) the RAM as a good model for sequential computation. The I/O model identifies disk access as the key operation to capture and builds a model of computation where disk access is a limited resource. Streaming identifies "amnesia" as the key limitation on a computation, and builds a model of computation centered around that.

In all these cases, it's possible to grumble that $O(n^5)$ isn't really "efficient", or that galactic algorithms shouldn't count, or that ignoring main memory computations is "cheating" in the I/O model, or even that allowing $n^{1-\epsilon}$ memory storage is useless for streaming. But those discussions are secondary to the model: in fact, history tells us that models that appear to allow for horribly inefficient computation inevitably facilitate algorithm design that is quite efficient!

Which is why I'm puzzled when I look at the different ways in which the MapReduce framework is being modelled. I see a number of choices being made: the number of processors should be $O(n^{2-\epsilon})$, or the total memory usage should be $O(n^\epsilon)$, or the number of 'rounds' of computation should be polylogarithmic or even constant.

At one level I understand these choices: for example, a memory bound of $n^\epsilon$ appears to lead to constant-round ($O(1/\epsilon)$) algorithms.
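One way to see that arithmetic, as a sketch (assuming each round aggregates partial results with fan-in equal to the per-machine memory):

```latex
% Sketch: why $n^{\epsilon}$ memory per machine suggests $O(1/\epsilon)$ rounds.
% With memory $s = n^{\epsilon}$, a machine can combine $n^{\epsilon}$ partial
% results per round, so an aggregation tree has fan-in $n^{\epsilon}$ and the
% number of inputs one machine can depend on after $r$ rounds is $(n^{\epsilon})^r$:
\[
  (n^{\epsilon})^{r} \;\ge\; n
  \quad\Longleftrightarrow\quad
  r \;\ge\; \frac{1}{\epsilon}.
\]
```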

But should the model be getting into decisions about what's efficient before even deciding what resources it is trying to capture?

Ultimately, this for me gets back to a question that I asked at the W8F workshop back in May:

> What is the intrinsic computational metaphor that we are trying to capture with these models?

It's not just parallelism. It's not just distributed computation. It's not even streaming (or more generally sublinear computations). Is it some delicate tradeoff between these three? Should we think of map-reduce models like the circuit classes that have bounds on both space and depth? Wouldn't this be an unholy mess?

I suspect that over time, if the MapReduce computational framework persists, then these issues will get shaken out. But there's a provisional (and exciting!) feel about the models we currently have.

1. I could not agree with you more. I think the current models fail at abstracting out what is fundamentally "going on" with MapReduce. A good model should not only reflect the object being studied but should also give insight into the nature of that object.

2. How about fixing memory/cpu requirement per machine, and modeling the number of machines required as a function of input size...is there a model that does this? Hard-disk storage is being gradually phased out in favor of SSDs nowadays, and that behaves a lot like RAM. Also, in a fabric-interconnected cluster, every machine can talk to every machine equally fast, so this could simplify modeling network resources.

3. Alex Lopez-Ortiz (11/01/2011 08:52 AM)

There is a tradeoff between the generality and the simplicity of a model. For example, you could make the same arguments against the standard Turing Machine model: why restrict ourselves to one tape instead of several?

The answer is: for our purposes the one-tape model as introduced by Turing is good enough. Of course, one should periodically revisit the choice of model and see if it still captures the behavior of the objects under study.

Given the size of the data sets over which we run map-reduce algorithms, $O(n^{2-\epsilon})$, if anything, looks overly generous. I made this point during Suri's presentation of the MRC framework at SODA. My take is that this was bringing unnecessary complication to allow for a case that would never happen. I would have capped the amount of resources at $O(n\,\mathrm{polylog}(n))$.

4. @Alex, but my point is that such restrictions, whether they be $n^2$ or $n\,\mathrm{polylog}(n)$, are judgements about efficiency that should not be introduced into the model unless they create large jumps in complexity or fundamentally address modelling issues. With streaming, the whole point of the model is to limit the ability to store the data, so $o(n)$ memory makes sense. I've yet to hear a similar argument for MRC (I'm not saying there isn't one, though).

5. @Yaroslav, my colleagues who work on petascale machines usually reject the claim that any machine can talk to any other machine equally fast.

Having said that, I do think that if we had to focus on a single bottleneck, communication is the one we should focus on. I also think that theoreticians shouldn't get too distracted by the complaints of 'efficiency' levelled by practitioners, since the needs of a model involve theoretical structures as well.

6. Alex Lopez-Ortiz (11/01/2011 12:02 PM)

> I've yet to hear a similar argument for MRC.

The subject of study is the processing of data collections larger than RAM on a single machine but small enough to fit in main memory across a distributed cluster or in NS.

Data sizes smaller than that put us in the realm of the RAM and I/O models; larger than that, and we are talking data streams. This means we are talking about a range of interest of $o(n^2)$ space/time, not unlike the $o(n)$ space assumption for data streams, and the question becomes: what can we compute in this model?

Having said that, I do agree with the deeper message that good models should not be overly specific.

I'm surprised you didn't take issue with the other overly specific restriction of some map-reduce frameworks and models, which is imposing orderly, discrete <key,value> passes on the data.

People have already pushed Map-Reduce beyond that to more general computations both in theory and in practice.

7. @Suresh: isn't practical applicability how you choose between different models of efficiency? Or is the idea to pick the notion of efficiency that is the most interesting from a mathematical point of view?

"Every machine talks to every machine equally fast" is the goal of a "fabric"-interconnected cluster. There are probably physics reasons that make this technically false, but that's how it seems in practice. I've seen some work on optics-based fabrics: essentially, every computer in the cluster can communicate directly with any other computer through a laser beam, which should further mitigate inter-cluster network congestion.

8. Again, IANACT, but there are many factors going into designing a model, and an important one is identifying what the critical resource is. Practical considerations make the model's predictions valuable, of course, but often it's good to ignore them and let the "magic of mathematics" do its thing.

9. I see.... P.S. The top result for "IANACT" gives "I am not a cow tipper".

10. @Yaroslav oops :)

11. Hi,
Jumping in a bit late here. A few comments:

> Given the size of the data sets over which we run map-reduce algorithms, $O(n^{2-\epsilon})$, if anything, looks overly generous. I made this point during Suri's presentation of the MRC framework at SODA. My take is that this was bringing unnecessary complication to allow for a case that would never happen. I would have capped the amount of resources at $O(n\,\mathrm{polylog}(n))$.

While I don't disagree that $n^{2-\epsilon}$ is a bit generous, I think a further restriction is a case of premature optimization. Really, this came out of three requirements:

(1) Restrict the input on a per machine basis to be sublinear.
(2) Restrict the amount of parallelism to be sublinear as well.
(3) Allow for repetition of the data (i.e. don't insist on a partition).

Now we did take a stance that sublinear should be substantial, i.e., beyond just shaving off a log, but I don't think this is where the complaint is coming from.

To the broader point of what is the limiting resource:

Maybe it's the amount of parallelism? On the one hand, you are forced to be parallel, just like in streaming you are forced to be sublinear in the input. On the other hand, you don't want to be overly parallel. That brings in additional coordination, communication, etc. So let's find algorithms that can exploit the middle ground.

Phrasing it slightly differently: If I force you to parallelize, what is the range of parameters where you can remain work efficient?

PRAMs give us many algorithms that are work efficient if you have $O(\log n)$ rounds. But what if I insist on fewer?

Here things appear to diverge. As an example, take two PRAM primitives: list ranking and prefix sums. Computing prefix sums in MR is easy: you can do it in $O(1)$ rounds. But list ranking seems to require $\log n$ rounds. Or does it? We have no proof...
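For concreteness, here is a minimal sketch of the standard constant-round prefix-sum idea, simulated sequentially in Python. The rounds are only mimicked (each "machine" is a block of size roughly $\sqrt{n}$, standing in for per-machine memory); the function name and block-size choice are illustrative, not part of any of the frameworks discussed above.

```python
# Sketch: O(1)-"round" MapReduce-style prefix sums, simulated sequentially.
# Round 1: each machine scans its own block and reports the block total.
# Round 2: one machine prefix-sums the (few) block totals; the resulting
# offsets are broadcast, and each machine shifts its local scan by its offset.
import math
from itertools import accumulate

def mr_prefix_sums(xs):
    n = len(xs)
    b = max(1, math.isqrt(n))  # block size ~ per-machine memory (assumption)
    blocks = [xs[i:i + b] for i in range(0, n, b)]
    # Round 1: local scans and block totals.
    local = [list(accumulate(blk)) for blk in blocks]
    totals = [blk[-1] for blk in local]
    # Round 2: prefix-sum the block totals to get each block's offset.
    offsets = [0] + list(accumulate(totals))[:-1]
    return [v + off for off, blk in zip(offsets, local) for v in blk]
```

The point of the exercise is that every step touches only one block (or the short list of block totals) at a time, which is what makes the two-round structure plausible on real clusters. No such local decomposition is known for list ranking, which is where the divergence above comes from.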

12. > As an example, take two PRAM primitives: list ranking and prefix sums. Computing prefix sums in MR is easy: you can do it in $O(1)$ rounds. But list ranking seems to require $\log n$ rounds. Or does it? We have no proof...

I think this is a good example of what I was asking for. In other words, MapReduce can motivate the question, but the model doesn't necessarily have to be about it explicitly.