Tuesday, August 27, 2013

On "a theory of big data"

+Moritz Hardt kicked off our Simons big data program with an excellent rumination on the role of theory in "big data". Some follow-up thoughts:

Moritz makes the point that theory, for better or for worse (mostly for better), made the choice to give problems primacy over data. I harp on this a lot in my algorithms class, and have also talked about it in the context of computational thinking: the idea of 'naming a problem' is a key intellectual contribution of theoryCS.

But to see what might happen if we allowed more flexibility in problem definition, we don't have to look far. Machine learning (a discipline that faces data head on) is characterized by the study of problems that are well defined in the broad sense, but have a lot of wiggle room in the specifics (and I'm now hoping +Sebastien Bubeck will respond and critique what follows).

Consider classification. In a broad sense, it's very well defined: given a collection of points labeled +1 or -1, find a decision rule that can be used to separate the +s from the -s. But in the specifics: what's the decision rule? What's the penalty for making a mistake? How much labeled data do we have? Does it cost us to get labels? And so on.

For each possible answer to these questions, you can construct a well-defined problem. And you could focus on solutions to that very well-defined problem. But that's not particularly important. Maybe I use hinge loss to capture errors in my classification, or maybe I use some other error function. I don't care about that as much as I care about a formulation that lets me solve different flavors of the problem using a single paradigm: possibly some form of gradient descent.
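To make that concrete, here's a minimal sketch in Python (the toy data and function names are mine, purely for illustration): the same stochastic (sub)gradient descent loop trains a linear classifier whether the error is measured by hinge loss or by logistic loss. Only the loss-gradient callback changes.

```python
import numpy as np

def hinge_loss_grad(w, x, y):
    # subgradient of max(0, 1 - y * <w, x>) with respect to w
    return -y * x if y * np.dot(w, x) < 1 else np.zeros_like(w)

def logistic_loss_grad(w, x, y):
    # gradient of log(1 + exp(-y * <w, x>)) with respect to w
    return -y * x / (1.0 + np.exp(y * np.dot(w, x)))

def sgd_classifier(X, y, loss_grad, step=0.1, epochs=100):
    """Stochastic (sub)gradient descent; the loss is a plug-in parameter."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            w -= step * loss_grad(w, xi, yi)
    return w

# Toy data with labels in {+1, -1}; either loss plugs into the same solver.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_hinge = sgd_classifier(X, y, hinge_loss_grad)
w_logistic = sgd_classifier(X, y, logistic_loss_grad)
```

The solver doesn't know or care which loss it's minimizing; that's the sense in which the formulation, not the specific error function, carries the weight.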

This theme shows up again and again. In clustering. In regression. In the different flavors of learning (active, semisupervised, transfer, multitask, ...). A good solution to a problem focuses not on the specific definition, but on a general framework that captures different variations on the problem and reduces them to solving some optimization that can then be engineered. This is also why optimization (and understanding heuristics for optimization) is such a focus in machine learning (Bubeck's lecture notes are a great resource on this, btw).

There is of course a downside. The downside is that you (could) miss out on connections between problems: the reductions that are the lifeblood of work in theoryCS. In fact, John Langford has looked into this issue specifically, with his machine learning reductions project.

So returning to the original question, what should a theory of big data look like? A big part of theoryCS research is the skillful management of resources (space, time, random bits, communication, what have you...). But an even more important part is the abstraction of computation as a first-order phenomenon, a "theory of the feasible", as it were.

Consider the example of privacy. Privacy is a concern that arises from access to data, and it is a core component of any "big data" discussion. What differential privacy achieved was to frame the problem computationally: as a definition of what's feasible to protect, as a set of mechanisms that guarantee protection, and as a delineation of what cannot be protected.
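To pick one standard mechanism as an illustration (a toy sketch of my own, not something from Moritz's talk): the Laplace mechanism releases a numeric query with noise calibrated to the query's sensitivity and the privacy parameter epsilon.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Release a numeric query answer with epsilon-differential privacy.

    sensitivity is the most the query's value can change when one
    individual's record is added or removed (1 for a counting query).
    """
    rng = rng or np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_answer + noise

# Example: privately release a count (sensitivity 1) at epsilon = 0.5.
private_count = laplace_mechanism(true_answer=1234, sensitivity=1.0, epsilon=0.5)
```

Even in this toy, the computational framing is visible: privacy becomes a property of the algorithm that releases the answer, not of the data itself.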

Similarly, I'm interested in other problems, arising out of our need to interact with data, that can be framed abstractly in "the language of the feasible". There's work on how to value published data, and how to price it. There's work on how to verify computations, and how to do computations securely. There's work on how to understand and interpret data analysis. And then of course there's the large body of work on how to manage and compute efficiently with data under brand new models of computation.

The "language of the feasible" is our not-so-secret weapon in defining abstractions: it's more than just resource allocation and management, and it's what gives theoryCS power to speak to other domains. 
