Sunday, June 17, 2012

A Reflection On Big Data (Guest post)

Ed: Graham Cormode kindly agreed to send a missive from the recently concluded big data workshop at Duke.

There seems to be no shortage of Big Data events at the moment, reflecting some of the enthusiasm around this topic. Last week, there were meetings at NIST and at Duke, and there are upcoming events at Stanford and SAMSI, to name but a few. (Ed: also the new Simons Foundation program on big data)

I attended the Duke event and was hoping to blog about the meeting, but suffered a big data problem of my own: with 25 speakers, plus panels and breakout sessions, how to pull out any meaningful summary? So instead, I'll just pick on one issue that was the subject of much discussion: what exactly is new about big data? In particular, there has been much interest in what we called 'massive' data over the last few years (as captured by MADALGO, the center for massive data algorithmics, and the accompanying MASSIVE workshop). Is Big Data just a rebranding of Massive Data? Which is bigger: Big Data, Massive Data, or Very Large Data?

What came out of our discussions is that we believe Big Data poses challenges that are qualitatively different from those that have come before. The major change is the increasing heterogeneity of the data being produced. Previously, we might have considered data represented in a simple form, such as a very high-dimensional vector, over which we want to capture key properties. While Big Data does include such problems, it also includes many other types of data: database tables, text data, graph data, social network data, GPS trajectories, and more. Big Data problems can consist of a mixture of such data, not all of which are individually "big", but which together are a big problem to make sense of. Addressing these challenges requires not only algorithmics and systems, but also insights from machine learning, statistics and data mining.

This is by no means a new observation: the axes of Volume, Velocity and Variety have been widely discussed before. But it did remind me that we should not get too hung up on how big our big data is, since size is not the sole defining characteristic of handling big data (although it is a necessary characteristic).

Many other familiar topics came up in discussions, such as: How can big data sets be made more accessible to researchers? How can privacy concerns with big data be addressed? What are the right models for big data, for both the data and the computation? What are the key techniques for working with big data? Do we even know enough to teach a course on big data?

What struck me about these discussions is that not only were we far from reaching a consensus on any point, but also no one seemed to think they had any strong candidate solutions.  Consequently, I left assured that Big Data is much more than a rebranding of existing work.  It represents a significant research agenda that will challenge us for years to come.

1 comment:

  1. Hi Graham,

    Excellent summary. As far as the "what is new about big data" issue is concerned, I would add another point: the necessity of parallel and distributed approaches. 15-20 years ago, when the first "massive data wave" arrived, one could always count on computers getting faster in a year or two. This is no longer the case.


