Monday, October 31, 2011

Models for MapReduce

I've been listening to Jeff Phillips' comparison of different models for MapReduce (he's teaching a class on models for massive data). In what follows, I'll add the disclaimer IANACT (I am not a complexity theorist).

There's something that bothers me about the various models floating around that attempt to capture the MapReduce framework (specifically the MUD framework by Feldman et al, the MRC framework by Karloff, Suri and (co-blogger) Vassilvitskii, and the newer Goodrich-Sitchinava-Zhang framework).

For "seasoned" models of computation like the RAM, or PRAM, or even I/O and streaming, there's a sense of identifying some intrinsic feature of the computation that you wish to capture, building a model around it, and then identifying resources that you wish to expend little of. After the dust settled around the different TM/RAM models for effective computation, we now accept (for the most part) the RAM as a good model for sequential computation. The I/O model identifies disk access as the key operation to capture and builds a model of computation where disk access is a limited resource. Streaming identifies "amnesia" as the key limitation on a computation, and builds a model of computation centered around that.

In all these cases, it's possible to grumble that $O(n^5)$ isn't really "efficient" or that galactic algorithms shouldn't count, or that ignoring main memory computations is "cheating" in the I/O model, or even that allowing $n^{1-\epsilon}$ memory storage is useless for streaming. But those discussions are secondary to the model: and in fact history tells us that inevitably models that appear to allow for horribly inefficient computation facilitate algorithm design that is quite efficient !

Which is why I'm puzzled when I look at the different ways in which the MapReduce framework is being modelled. I see a number of choices being made:  number of processors should be $O(n^{2-\epsilon})$, or total memory usage should be $O(n^\epsilon)$ or number of 'rounds' of computation should be polylogarithmic or even constant.

At one level I understand these choices: for example, a memory bound of $n^\epsilon$ appears to lead to constant $(1/\epsilon)$ round algorithms.

But should the model be getting into decisions about what's efficient before even deciding what resources they are trying to capture ?

Ultimately, this for me gets back to a question that I asked at the W8F workshop back in May):
What is the intrinsic computational metaphor that we are trying to capture with these models ?
It's not just parallelism. It's not just distributed computation. It's not even streaming (or more generally sublinear computations). Is it some delicate tradeoff between these three ? Should we think of map-reduce models like the circuit classes that have bounds on both space and depth ? Wouldn't this be an unholy mess ?

I suspect that over time, if the MapReduce computational framework persists, then these issues will get shaken out. But there's a provisional (and exciting!) feel about the models we currently have.

Disqus for The Geomblog