## Tuesday, January 17, 2012

### The Shonan Meeting (Part 1): In the beginning, there was a disk...

What follows is a personal view of the evolution of large-data models. This is not necessarily chronological, or even reflective of reality, but it's a retroactive take on the field, inspired by listening to talks at the Shonan meeting.

Arguably, the first formal algorithmic engagement with large data  was the Aggarwal-Vitter external memory model from 1988. The idea was simple enough: accessing an arbitrary element of disk was orders of magnitude more expensive than accessing an element of main memory, so let's ignore main memory access and charge a single unit for accessing a block of disk.

The external memory model was (and is still) a very effective model of disk access. It wasn't just a good guide to thinking about algorithm design, it also encouraged design strategies that were borne out well in practice. One could prove that natural-sounding buffering strategies were in fact optimal, and that prioritizing sequential scans as far as possible (even to the extent of preparing data for sequential scans) was more efficient. Nothing earth-shattering, but a model that guides (and conforms to) proper practice is always a good one.

Two independent directions spawned off from the external memory model. One direction was to extend the hierarchy. Why stop at one main memory level when we have multilevel caches  ? A simple extension to handle caches is tricky, because the access time differential between caches and main memory isn't sufficient to justify the idealized "0-1" model that the EM model used. But throwing in another twist - I don't actually know the correct block size for transfer of data between hierarchy levels - led us to cache-obliviousness.

I can't say for sure whether the cache-oblivious model speaks to the practice of programming with caches as effectively as the EM model. Being aware of your cache can bring significant benefits in principle. But the design principles ("repeated divide and conquer, and emphasizing locality of access") are sound, and there's already at least one company (Tokutek, founded by Martin Farach-Colton, and Michael Bender and Bradley Kuszmaul) that is capitalizing on the performance yielded by cache-oblivious data structures.

The other direction was to weaken the power of the model. Since a sequential scan was so much more efficient than random access to disk, a natural question was to ask what you could do with just one scan. And thus was born the streaming model, which is by far the most successful model for large-data to date, with theoretical depth and immense practical value.

What we've been seeing over the past few years is the evolution of the streaming model to capture ever more complex data processing scenarios and communication frameworks.

It's quite useful to think of a stream algorithm as "communicating" a limited amount of information (the working memory) from the first half of the stream to the second. Indeed, this view is the basis for communication-complexity-based lower bounds for stream algorithms.

But if we think of this as an algorithmic principle, we then get into the realm of a distributed computation, where one player possesss the "first half" of the data, the other player has "the second half" and the goal is for them to exchange a small number of bits with each other in order to compute something (while streaming is one-way communication, a multi-pass streaming algorithm is a two-way communication).

Of course, there's nothing saying that we only have two players. This gets you to the $k$-player setup for a distributed computation, in which you wish to minimize the amount of communication exchanged as part of a computation. This model is of course not new at all ! It's exactly the distributed computing model pioneered in the 80s and 90s and has a natural home at PODC. What appears to be different in its new uses is that the questions being asked are not the old classics like leader election or byzantine agreement, but statistical estimation on large data sets. In other words, the reason to limit communication is because of the need to process a large data set, rather than the need to merely coordinate. It's a fine distinction, and I'm not sure I entirely believe it myself :)

There are many questions about how to compute various objects in a distributed setting: of course the current motivation is to do with distributed data centers, sensor networks, and even different cores on a computer. Because of the focus on data analysis, there are sometimes surprising results that you can prove. For example, a recent SIGMOD paper by Zengfeng Huang, Lu Wang, Ke Yi, and Yunhao Liu shows that if you want to do quantile estimation, you only need communication that's sublinear in the number of players ! The trick here is that you don't need to have very careful error bounds on the estimates at each player before sending up the summary to a coordinator.

It's also quite interesting to think about distributed learning problems, where the information being exchanged is specifically in order to build a good model for whatever task you're trying to learn. Some recent work that I have together with Jeff Phillips, Hal Daume and my student Avishek Saha explores the communication complexity of doing classification in such a setting.

An even more interesting twist on the distributed setting is the so-called 'continuous streaming' setting. Here, you don't just have a one-shot communication problem. Each player receives a stream of data, and now the challenge is not just to communicate a few bits of information to solve a problem, but to update the information appropriately as new input comes in. Think of this as streaming with windows, or a dynamic version of the basic distributed setting.

Here too, there are a number of interesting results, a beautiful new sampling trick that I'll talk about next, and some lower bounds.

I haven't even got to MapReduce yet, and how it fits in: while you're waiting, you might want to revisit this post.