Tuesday, August 27, 2013

And now for something different...

People who know me through my blog probably assume I have no life at all outside blogging and big data.

oh wait...

But more seriously, my family is here with me during this sa-battle-cal and +Karen Ho is ably keeping up the non-work blogginess. If you're interested in our adventures outside big theories of data, check out her blog Simply Batty: you might even learn what "academic headcheese" is.

On "a theory of big data"

+Moritz Hardt kicked off our Simons big data program with an excellent rumination on the role of theory in "big data". Some followup thoughts:

Moritz makes the point that theory, for better or for worse (mostly for better) made the choice to give problems primacy over data. I harp on this a lot in my algorithms class, and also talked about this in the context of computational thinking: the idea of 'naming a problem' is a key intellectual contribution of theoryCS. 

But to look at what might happen if we allowed more flexibility in problem definition, we don't have to look too far. Machine learning (a discipline that faces data head on) is characterized by the study of problems that are well defined in the broad sense, but have a lot of wiggle room in the specific (and I'm now hoping +Sebastien Bubeck will respond and critique what follows)

Consider classification. In a broad sense, it's very well defined: given a collection of points labeled with (1,-1), find a decision rule that can be used to separate the +s from the -s. But in the specifics: what's the decision rule ? what's the penalty for making a mistake ? how much labeled data do we have ? does it cost us to get labels ? and so on.

For each possible answer to these questions, you can construct a well defined problem. And you could focus on solutions to that very well defined problem. But that's not particularly important. Maybe I use hinge loss to capture errors in my classification, or maybe I use some other error function. I don't care as much as I care about a formulation that allows me to solve different flavors of the problem using  a single paradigm: possibly some form of gradient descent. 

This theme shows up again and again. In clustering. In regression. In the different flavors of learning (active, semisupervised, transfer, multitask, ...). A good solution to a problem focuses not on the specific definition, but a general framework that captures different variations on the problem and reduces them to solving some optimization that can then be engineered. This is also why optimization (and understanding heuristics for optimization) is such a focus on machine learning (Bubeck's lecture notes are a great resource on this, btw)

There is of course a downside. The downside is that you (could) miss out on connections between problems: the reductions that are the lifeblood of work in theoryCS. In fact, John Langford has looked into this issue specifically, with his machine learning reductions project.

So returning to the original question, what should a theory of big data look like ? A big part of theoryCS research is the skillful management of resources (space, time, random bits, communication, what have you..). But an even more important part is the abstraction of computation as a first-order phenomenon, a "theory of the feasible", as it were. 

Consider the example of privacy. Privacy is a concern that arises from access to data, and is a core component of any "big data" discussion. What differential privacy achieved was to frame the problem computationally, as both a definition of what's feasible to protect, and as a set of mechanisms to guarantee protection, and guarantee what cannot be protected.

Similarly, I'm interested in other problems arising out of our need to interact with data that can be framed abstractly in "the language of the feasible". There's work on how to value published data, and how to price it. There's work on how to verify computations, and how to do computations securely. There's work on how to understand and interpret data analysis. And then of course there's the large body of work on how to manage and compute efficiently with data under brand new models of computation. 

The "language of the feasible" is our not-so-secret weapon in defining abstractions: it's more than just resource allocation and management, and it's what gives theoryCS power to speak to other domains. 

Friday, August 23, 2013

Simons Institute opening, with pictures.

Today was the official kickoff for the Simons Institute, as well as the start of the two special programs in big data and real analysis. For the last few days I've been busy getting my paperwork organized over at Calvin Hall, which is a beautifully redone circular building named after the Nobel Laureate and Berkeley professor (more on the building here).

The inside is very conducive to collaborative work. The offices are organized around the periphery of the building, and are functional, but not overly so. The idea is that people will congregate in the large interior open spaces that are covered with whiteboards and comfy chairs.

This is what the interior looks like: (apologies for the picture quality)

The second floor open lounge

Comfy chairs in front of a whiteboard

even more chairs and more whiteboards

Let us not forget the very fancy coffee machine (which makes nice espresso)

and of course you need your Simons Institute mug to drink it in.

This is the main auditorium for workshops, on the first floor.

The next plan is to pave the outer walls of the building at ground level with chalkboards, so people can think and chat outside as well as inside. Ah, Berkeley weather. I'm told the boards are already there, and just need to be installed, so I hope to have pictures soon. 

There are 93 visitors between the two programs (53/39 by my rough count) which includes 22 Fellows. Offices are either two or three-person, and there are some large cubicle spaces with 7+ inhabitants. The visitors are all on the second and third floors of the building, which both have large open areas (the pictures above are from the second floor). 

Wednesday, August 14, 2013

The Sa-battle-cal starts..

I'm off on sabbatical, and we (two cars, two adults, two children and one cat) just started the long drive to Berkeley from SLC. This has turned out to be more exciting than I anticipated...

Our original plan was to drive 4 hours each day, making up the 12 hour drive to Berkeley in three days. With two drivers for two cars, this seemed like the best option to prevent us from getting over-tired.

Right at Wendover (Las Vegas for poor people!), about halfway on our first leg, my car broke down. Thankfully, I was able to coast it to a mechanic's shop just off the highway as we entered town. I had no clue what the problem was, and the mechanic wouldn't be able to take a look at it for a few hours (this is the week of the Bonneville races on the salt flats).

So I called my mechanic back in Salt Lake, and described the problem to him. He diagnosed it on the spot as a faulty ignition coil, which is apparently an easy part to procure and replace.... if you're near a dealership.

Which I was not...

He also needed me to figure out which coil was failing, which needed a computer scanner, which needed a mechanic.


Here's what needed to happen, in short order. It was now 5pm. We (my wife and I) needed to

  • get the mechanic to at least scan the car to get the appropriate error codes
  • Call the car parts store (closing at 6) to see if they could procure the needed parts
  • Find a hotel in Wendover (did I mention this was the week of the Bonneville Races, and almost everything in town was booked?)
  • Change our reservations downstream because we were stuck in Wendover. 
Thankfully, through a series of small miracles and the generosity of many strangers and non-strangers, we managed to get all of this done. My mechanic even offered to to walk me through the installation myself once I did the appropriate Youtube self-study (MOOCs FTW !!)

Long story short, it's now day II of our trek. We doubled up the driving on day II to make up for lost time, and we're back on schedule (minus one set of car keys that I managed to lose in all the hurry). 

So far, this sabbatical is certainly not relaxing. 

p.s Shout out to Utah Imports of SLC, and S&R auto repair in Wendover. 

p.p.s I-80 through Nevada is one of the most mind-numbingly boring drives imaginable. 

Monday, August 05, 2013

Hi ho, hi ho, it's off to sabbatical we go...

It is now 7 days and counting before I head off on sabbatical, not to return till August 2014. I'll be heading to the Simons Institute to think big thoughts about the theory of data, (or was it big data about theory, or maybe just the theory of big data). After that, I'll be enjoying the fine life at Google for a semester, followed by a visit to MADALGO in Aarhus (because big data in Europe just tastes better, you know).

I've been using this summer to nurse my psychic wounds from six years of the grind, and gently slide into sabbatical mode. The rush of joy that comes everytime I delete a departmental email without reading beyond the subject tells me I'm ready :).

So far, advice I've received on how to conduct myself during a sabbatical includes:

  • Don't plan too much
  • Work very hard, but on something other than your current research
  • Have an adventure, and if work happens, don't blame yourself. 
  • Say NO repeatedly (this also applies to life post-tenure, apparently). I maybe took this advice a little too literally and managed to decline a review request in about 0.5 seconds, which surprised (and possibly annoyed) the editor who had made the request. 
  • Do something different (or do something that you've been meaning to do for years but never got a chance to). 
What else ? 

Disqus for The Geomblog