Wednesday, September 04, 2013

Statistics, geometry and computer science.

Geometry is the glue between statistics and computer science. -- Michael Jordan
That's why everyone gets stuck in it. 
-- Graham Cormode

Yesterday was the first day of the "Big Data boot camp" at the Simons Institute. The idea behind these boot camps is to get participants in a program "on the same page". In a program like ours, with statisticians, ML folks, algorithms people and optimizers all mingling together, getting "on the same page" with "the same language" is critical. For a more general overview of the first day activities, see Muthu's post.

Michael Jordan opened the proceedings with a talk titled "Big Data: The computation/statistics interface". From a CS perspective, it was an excellent talk that hit just the right level of detail for me to grasp some basic terminology in statistics, as well as understand some of the questions people are pondering right now.

We think of big data as a PROBLEM. More data, more scaling issues, O(n) becomes infeasible, and so on. In particular, we think of large data as a workload that taxes our resources: time, space, communication, and so on.

Michael presented a counterpoint. From a statistical perspective, more data is better. Estimators converge, error reduces, and the law of large numbers kicks in. Rather than treating data as something that we have to manage efficiently, can we think of data (quality) as a resource itself that can be managed efficiently ? In other words, can we trade off estimator accuracy against other resources ?

He proceeded to give a wonderful exposition of the basics of statistical decision theory, including the frequentist and Bayesian views of estimator risk. Along the way, he described a geometric interpretation of the James-Stein estimator which is too neat to skip.

The James-Stein estimator is really a very counterintuitive object. Consider a set of samples $X_i \sim N(\theta_i, 1), i \le n$, where the $\theta_i$ are unknown parameters. The goal is to estimate the $\theta_i$ from the samples $X_i$, with error measured by the loss function $\sum_i (\theta_i - \hat{\theta_i})^2$. The $\theta_i$ are assumed unrelated to each other.

The "obvious" estimator is $\hat{\theta_i} = X_i$. It's the MLE after all. But it turns out that this estimator is strictly dominated (in terms of expected loss aka risk) by the so-called shrinkage estimator:
\[ \hat{\theta_i} = \left(1 - \frac{c}{S^2}\right) X_i \]
where $S^2 = \sum_j X_j^2$ and $c$ is a suitable constant (the classical choice is $c = n-2$).

What's surprising about this is that the estimator for a single $\theta_i$ takes into account samples from other $X_i$, even though the $\theta_i$ are assumed to be unrelated.
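
Since this is so counterintuitive, it's worth simulating. Below is a small Python sketch (my own toy illustration, using the classical choice $c = n-2$ and arbitrary made-up $\theta_i$) that compares the empirical risk of the MLE against the shrinkage estimator.

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 10, 100000           # dimension and number of experiments
    theta = rng.normal(size=n)       # arbitrary "unrelated" true means

    X = theta + rng.normal(size=(trials, n))    # X_i ~ N(theta_i, 1)

    # MLE: just report the observations
    risk_mle = np.mean(np.sum((X - theta) ** 2, axis=1))

    # James-Stein: shrink towards the origin by a factor 1 - (n-2)/||X||^2
    S2 = np.sum(X ** 2, axis=1, keepdims=True)
    risk_js = np.mean(np.sum(((1 - (n - 2) / S2) * X - theta) ** 2, axis=1))

    print(risk_mle, risk_js)         # the second number comes out strictly smaller

The MLE risk comes out near $n$, and the shrinkage estimator beats it, even though no $\theta_i$ has anything to do with any other.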

It turns out that there's an elegant geometric interpretation of this estimator, first attributed to Galton by Stephen Stigler. Consider the pairs $(X_i, \theta_i)$ as points in the plane. We don't quite know where these points lie because we don't know what $\theta_i$ is. Because the $X_i$ are normally distributed around the $\theta_i$, each point is really a horizontal interval of interest.

Now the standard estimator $\hat{\theta_i} = X_i$ arises from trying to solve a regression problem of $X$ versus $\theta$, or more precisely solving the regression $\hat{X} = E(X\mid \theta)$. But really, what we're trying to do is solve the regression $\hat{\theta} = E(\theta \mid X)$. In other words, the regression of $y$ against $x$ is different from the regression of $x$ against $y$, once we have at least three points. And this is precisely what the JS estimator says.

Returning to the question he started the talk with, can we show a formal tradeoff between data estimation accuracy and sampling cost ? In a recent paper with Venkat Chandrasekaran of Caltech, they show a very nice (and counterintuitive) result: that using a cruder relaxation of an optimization problem can actually lead to more efficient estimation, as long as you have sufficient data. Note that this goes against the standard idea that a crude relaxation is "worse" in an approximation sense.

The idea is as follows. Consider the very simple problem of denoising where the observations $\mathbf{y}$ are generated from the input $\mathbf{x}$ by a noise perturbation:
\[ \mathbf{y} = \mathbf{x} + \sigma \mathbf{z} \]
where $\mathbf{z}$ is normally distributed. Let us assume that $\mathbf{x}$ is drawn from some set $S$ (for example, $S$ is the set of low-rank matrices, or the set of permutation matrices).

The simplest estimator for $\mathbf{x}$ is an $\ell_2$ projection: compute the sample mean $\overline{\mathbf{y}}$ and then find its projection onto $S$. But this involves a minimization over $S$, which might be intractable.

We can relax $S$ to some $S'$, where $S \subset S'$ and minimization becomes easier. But this would worsen the approximation guarantee of the optimization, by an amount that depends on how crude the relaxation is. And here's where the insight comes in.

Suppose you instead look at the statistical risk of this estimator. In other words, look at the expected difference between the true $\mathbf{x}^*$ and the estimated $\hat{\mathbf{x}}$. The main result they show in this paper (paraphrased) is
\[ \text{expected risk} \;\le\; \frac{C}{n} \cdot \bigl(\text{geometric complexity of } S \text{ at } \mathbf{x}^*\bigr) \]
where $n$ is the number of samples.

Suppose we fix the desired risk. Then an increase in $n$ can be used to "pay for" increased "geometric complexity". And here's where the final idea comes in. The "geometric complexity" used here is the Gaussian-squared complexity, which is defined as
\[ g(D) = E[ \sup_{x \in D} \langle x, z\rangle^2 ] \]
where $z$ is normally distributed.

In particular, the set $D$ used in the above expression is the set of directions from $\mathbf{x}^*$ to points in $S$. Suppose $S$ is very "pointed" at $\mathbf{x}^*$. Then the Gaussian-squared complexity is small and the number of samples needed is small. If instead we use a relaxation $S'$ that is "blunt", the Gaussian complexity goes up, but more samples keep the risk fixed. If the optimization over $S'$ is significantly easier than the corresponding problem over $S$, then we still win, even though $n$ has increased, and even though the classical "approximation ratio" might have become worse.
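
To get a feel for $g(D)$, here's a tiny Monte Carlo sketch (entirely my own toy illustration, not the paper's computation): a "pointed" set of a few unit directions versus a "blunt" one with many more, with the blunter set incurring a larger complexity that the risk bound then charges to $n$.

    import numpy as np

    def gaussian_squared_complexity(D, num_samples=2000, seed=0):
        """Monte Carlo estimate of g(D) = E[ sup_{x in D} <x, z>^2 ]
        for a finite set D of directions (one per row), z standard Gaussian."""
        rng = np.random.default_rng(seed)
        Z = rng.normal(size=(num_samples, D.shape[1]))
        return np.mean(np.max((Z @ D.T) ** 2, axis=1))

    d = 50
    rng = np.random.default_rng(1)
    pointed = rng.normal(size=(5, d))      # few directions: a "pointed" set
    blunt = rng.normal(size=(500, d))      # many directions: a "blunt" relaxation
    pointed /= np.linalg.norm(pointed, axis=1, keepdims=True)
    blunt /= np.linalg.norm(blunt, axis=1, keepdims=True)
    print(gaussian_squared_complexity(pointed), gaussian_squared_complexity(blunt))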

In the final part of his talk, Michael talked about his recent work on the "bag of little bootstraps", which Muthu also covered in his post. This post is already too long, so I'll defer this to another time.

Tuesday, August 27, 2013

And now for something different...

People who know me through my blog probably assume I have no life at all outside blogging and big data.

oh wait...

But more seriously, my family is here with me during this sa-battle-cal and +Karen Ho is ably keeping up the non-work blogginess. If you're interested in our adventures outside big theories of data, check out her blog Simply Batty: you might even learn what "academic headcheese" is.

On "a theory of big data"

+Moritz Hardt kicked off our Simons big data program with an excellent rumination on the role of theory in "big data". Some followup thoughts:

Moritz makes the point that theory, for better or for worse (mostly for better) made the choice to give problems primacy over data. I harp on this a lot in my algorithms class, and also talked about this in the context of computational thinking: the idea of 'naming a problem' is a key intellectual contribution of theoryCS. 

But to look at what might happen if we allowed more flexibility in problem definition, we don't have to look too far. Machine learning (a discipline that faces data head on) is characterized by the study of problems that are well defined in the broad sense, but have a lot of wiggle room in the specifics (and I'm now hoping +Sebastien Bubeck will respond and critique what follows).

Consider classification. In a broad sense, it's very well defined: given a collection of points labeled $+1$ or $-1$, find a decision rule that can be used to separate the +s from the -s. But in the specifics: what's the decision rule ? what's the penalty for making a mistake ? how much labeled data do we have ? does it cost us to get labels ? and so on.

For each possible answer to these questions, you can construct a well defined problem. And you could focus on solutions to that very well defined problem. But that's not particularly important. Maybe I use hinge loss to capture errors in my classification, or maybe I use some other error function. I don't care as much as I care about a formulation that allows me to solve different flavors of the problem using  a single paradigm: possibly some form of gradient descent. 
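
To make the "one paradigm, many formulations" point concrete, here's a hedged sketch (mine, not anyone's actual library): a linear classifier trained by plain stochastic (sub)gradient descent, where switching from hinge loss to logistic loss changes one plug-in function and nothing else.

    import numpy as np

    def hinge_grad(w, x, y):
        # subgradient of max(0, 1 - y <w, x>)
        return -y * x if y * np.dot(w, x) < 1 else np.zeros_like(w)

    def logistic_grad(w, x, y):
        # gradient of log(1 + exp(-y <w, x>))
        return -y * x / (1 + np.exp(y * np.dot(w, x)))

    def train(X, y, loss_grad, steps=2000, eta=0.5):
        """The single paradigm: stochastic (sub)gradient descent; the loss is a plug-in."""
        rng = np.random.default_rng(0)
        w = np.zeros(X.shape[1])
        for t in range(steps):
            i = rng.integers(len(y))
            w -= (eta / np.sqrt(t + 1)) * loss_grad(w, X[i], y[i])
        return w

    # toy data: two Gaussian blobs labeled +1 / -1
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(+1.0, 1.0, (100, 2)), rng.normal(-1.0, 1.0, (100, 2))])
    y = np.hstack([np.ones(100), -np.ones(100)])
    for grad in (hinge_grad, logistic_grad):
        w = train(X, y, grad)
        print(grad.__name__, np.mean(np.sign(X @ w) == y))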

This theme shows up again and again. In clustering. In regression. In the different flavors of learning (active, semisupervised, transfer, multitask, ...). A good solution to a problem focuses not on the specific definition, but on a general framework that captures different variations on the problem and reduces them to solving some optimization that can then be engineered. This is also why optimization (and understanding heuristics for optimization) is such a focus in machine learning (Bubeck's lecture notes are a great resource on this, btw).

There is of course a downside. The downside is that you (could) miss out on connections between problems: the reductions that are the lifeblood of work in theoryCS. In fact, John Langford has looked into this issue specifically, with his machine learning reductions project.

So returning to the original question, what should a theory of big data look like ? A big part of theoryCS research is the skillful management of resources (space, time, random bits, communication, what have you..). But an even more important part is the abstraction of computation as a first-order phenomenon, a "theory of the feasible", as it were. 

Consider the example of privacy. Privacy is a concern that arises from access to data, and is a core component of any "big data" discussion. What differential privacy achieved was to frame the problem computationally: both as a definition of what it is feasible to protect, and as a set of mechanisms that guarantee that protection (while delineating what cannot be protected).
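
As a concrete (and entirely standard, textbook) example of such a mechanism, here's a minimal sketch of the Laplace mechanism for a counting query; the $\epsilon$ and the query below are just illustrative choices.

    import numpy as np

    def private_count(data, predicate, epsilon, rng=np.random.default_rng()):
        """Epsilon-differentially private count: a counting query has
        sensitivity 1, so adding Laplace(1/epsilon) noise suffices."""
        true_count = sum(1 for row in data if predicate(row))
        return true_count + rng.laplace(scale=1.0 / epsilon)

    ages = [23, 35, 41, 29, 62, 57]
    print(private_count(ages, lambda a: a > 40, epsilon=0.5))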

Similarly, I'm interested in other problems arising out of our need to interact with data that can be framed abstractly in "the language of the feasible". There's work on how to value published data, and how to price it. There's work on how to verify computations, and how to do computations securely. There's work on how to understand and interpret data analysis. And then of course there's the large body of work on how to manage and compute efficiently with data under brand new models of computation. 

The "language of the feasible" is our not-so-secret weapon in defining abstractions: it's more than just resource allocation and management, and it's what gives theoryCS power to speak to other domains. 

Friday, August 23, 2013

Simons Institute opening, with pictures.

Today was the official kickoff for the Simons Institute, as well as the start of the two special programs in big data and real analysis. For the last few days I've been busy getting my paperwork organized over at Calvin Hall, which is a beautifully redone circular building named after the Nobel Laureate and Berkeley professor (more on the building here).

The inside is very conducive to collaborative work. The offices are organized around the periphery of the building, and are functional, but not overly so. The idea is that people will congregate in the large interior open spaces that are covered with whiteboards and comfy chairs.

This is what the interior looks like: (apologies for the picture quality)


The second floor open lounge


Comfy chairs in front of a whiteboard


even more chairs and more whiteboards


Let us not forget the very fancy coffee machine (which makes nice espresso)



and of course you need your Simons Institute mug to drink it in.


This is the main auditorium for workshops, on the first floor.

The next plan is to pave the outer walls of the building at ground level with chalkboards, so people can think and chat outside as well as inside. Ah, Berkeley weather. I'm told the boards are already there, and just need to be installed, so I hope to have pictures soon. 

There are 93 visitors between the two programs (53/39 by my rough count) which includes 22 Fellows. Offices are either two or three-person, and there are some large cubicle spaces with 7+ inhabitants. The visitors are all on the second and third floors of the building, which both have large open areas (the pictures above are from the second floor). 

Wednesday, August 14, 2013

The Sa-battle-cal starts..

I'm off on sabbatical, and we (two cars, two adults, two children and one cat) just started the long drive to Berkeley from SLC. This has turned out to be more exciting than I anticipated...

Our original plan was to drive 4 hours each day, making up the 12 hour drive to Berkeley in three days. With two drivers for two cars, this seemed like the best option to prevent us from getting over-tired.

Right at Wendover (Las Vegas for poor people!), about halfway on our first leg, my car broke down. Thankfully, I was able to coast it to a mechanic's shop just off the highway as we entered town. I had no clue what the problem was, and the mechanic wouldn't be able to take a look at it for a few hours (this is the week of the Bonneville races on the salt flats).

So I called my mechanic back in Salt Lake, and described the problem to him. He diagnosed it on the spot as a faulty ignition coil, which is apparently an easy part to procure and replace.... if you're near a dealership.

Which I was not...

He also needed me to figure out which coil was failing, which needed a computer scanner, which needed a mechanic.

So....

Here's what needed to happen, in short order. It was now 5pm. We (my wife and I) needed to

  • get the mechanic to at least scan the car to get the appropriate error codes
  • Call the car parts store (closing at 6) to see if they could procure the needed parts
  • Find a hotel in Wendover (did I mention this was the week of the Bonneville Races, and almost everything in town was booked?)
  • Change our reservations downstream because we were stuck in Wendover. 
Thankfully, through a series of small miracles and the generosity of many strangers and non-strangers, we managed to get all of this done. My mechanic even offered to walk me through doing the installation myself once I did the appropriate Youtube self-study (MOOCs FTW !!)

Long story short, it's now day II of our trek. We doubled up the driving on day II to make up for lost time, and we're back on schedule (minus one set of car keys that I managed to lose in all the hurry). 

So far, this sabbatical is certainly not relaxing. 

p.s Shout out to Utah Imports of SLC, and S&R auto repair in Wendover. 

p.p.s I-80 through Nevada is one of the most mind-numbingly boring drives imaginable. 

Monday, August 05, 2013

Hi ho, hi ho, it's off to sabbatical we go...

It is now 7 days and counting before I head off on sabbatical, not to return till August 2014. I'll be heading to the Simons Institute to think big thoughts about the theory of data, (or was it big data about theory, or maybe just the theory of big data). After that, I'll be enjoying the fine life at Google for a semester, followed by a visit to MADALGO in Aarhus (because big data in Europe just tastes better, you know).

I've been using this summer to nurse my psychic wounds from six years of the grind, and gently slide into sabbatical mode. The rush of joy that comes every time I delete a departmental email without reading beyond the subject tells me I'm ready :).

So far, advice I've received on how to conduct myself during a sabbatical includes:

  • Don't plan too much
  • Work very hard, but on something other than your current research
  • Have an adventure, and if work happens, don't blame yourself. 
  • Say NO repeatedly (this also applies to life post-tenure, apparently). I maybe took this advice a little too literally and managed to decline a review request in about 0.5 seconds, which surprised (and possibly annoyed) the editor who had made the request. 
  • Do something different (or do something that you've been meaning to do for years but never got a chance to). 
What else ? 

Friday, July 05, 2013

FSTTCS in December

At this point, you're probably wondering: exactly how much more coffee do I need to infuse into my system to get my 1/3/10  papers submitted to SODA before the deadline ? Do your stomach lining (and your adenosine receptors) a favor and consider submitting to FSTTCS: the abstracts deadline is Jul 8, and the submission deadline is July 15. You get to go to Guwahati in December and you might even get to stay here:



Wednesday, June 26, 2013

SODA 2014 abstract submission deadline approaching

+Chandra Chekuri informs us that the abstract submission deadline for SODA 2014 is approaching rapidly. Abstracts MUST be in by July 3 or else you cannot submit a full paper (by Jul 8).

For more information, visit the SODA site: http://siam.org/meetings/da14/submissions.php

Tuesday, June 04, 2013

Computational thinking in other disciplines.

While we wait for the (literal) flames to die down at the STOC business meeting (who'd have thought a two-tier PC would be so controversial), a thought on "computational thinking". 

I've been on PhD committees for students in other departments far removed from computer science. I'm there partly as an external person, and partly as someone who "understands computation" and can guide the student in a project that involves some nontrivial algorithmic thinking.

It's very striking to see how these students approach the notion of computing. They have strong engineering/math backgrounds, but they tend to view computational problems as black boxes for which the first idea that comes to mind is the correct answer. This is pernicious enough that when I try to nudge them to define the problem more generally, they have no idea what to do.

This is a common theme in disciplines that make use of computation. Everyone has their favorite biology story along these lines, but this article in Science magazine is a good example of the problem of scientists using code. In brief, they tend to use code as a blackbox without really understanding what the code does or why, or how to validate the results.

In my algorithms class, I emphasize the importance of naming problems. Especially when one is trying to model a fuzzy real-world problem, it's very important to recognize named problems that might be hidden inside. But this very fundamental idea - that the problem comes first, and that by naming it we can recognize and understand it - is foreign to people not trained to think computationally. Rather, they think in terms of solutions, which is dangerous until they really know what they want to solve.

We talk a lot about computational thinking and the intellectual content of computer science. For anyone without CS training who wishes to make use of computational tools, this is one of the most important lessons to learn: how to name a problem.




Wednesday, May 29, 2013

MOOCification and summer programs

I was walking across campus today and saw the usual crowds of high school and middle school students doing various summer camps (the U. does a GREAT - yes it's called GREAT - summer camp in robotics/graphics, for example).

In our MOOCified future where all classes are taught by "course assistants" who report to "quality assurance staff", will there still be a way for excited middle and high school students to come to a university campus and get an interactive hands-on education at a camp ? I don't doubt that Udacity (or Coursera, or edX) will be happy to spin off a junior varsity division for the high schoolers (and if you do, please note that I said it first). And that will be one more demographic that the university loses as our educational mission gets sliced and diced.

Sunday, May 19, 2013

21st century proverbs

Having nothing better to do, I decided to create a twitter meme: updating proverbs for the 21st century. Use the hashtag #21stcentproverbs and join in the fun below.

 

Coding, Complexity and Sparsity 2013 (SPARC) 2013.

Atri Rudra reminds us that the 3rd incarnation of SPARC is soon to be upon us (disclaimer: I'll be speaking at the event):
Efficient and effective transmission, storage, and retrieval of information on a large-scale are among the core technical problems in the modern digital revolution. The massive volume of data necessitates the quest for mathematical and algorithmic methods for efficiently describing, summarizing, synthesizing, and, increasingly more critical, deciding when and how to discard data before storing or transmitting it. Such methods have been developed in two areas: coding theory, and sparse approximation (SA) (and its variants called compressive sensing (CS) and streaming algorithms). 
Coding theory and computational complexity are both well established fields that enjoy fruitful interactions with one another. On the other hand, while significant progress on the SA/CS problem has been made, much of that progress is concentrated on the feasibility of the problems, including a number of algorithmic innovations that leverage coding theory techniques, but a systematic computational complexity treatment of these problems is sorely lacking. The workshop organizers aim to develop a general computational theory of SA and CS (as well as related areas such as group testing) and its relationship to coding theory. This goal can be achieved only by bringing together researchers from a variety of areas. 
The Coding, Complexity and Sparsity workshop (SPARC 13) will be held in Ann Arbor, MI on Aug 5-7. 
These will be hour-long lectures designed to give students an introduction to coding theory, complexity theory/pseudo-randomness, and compressive sensing/streaming algorithms. We will have a poster session during the workshop and everyone is welcome to bring a poster but graduate students and postdocs are especially encouraged to give a poster presentation. 
This is the third incarnation of the workshop and the previous two workshops were also held in Ann Arbor in August of 2011 and 2012. 
Confirmed speakers:
* Jin-Yi Cai (University of Wisconsin, Madison)
* Shafi Goldwasser (MIT)
* Piotr Indyk (MIT)
* Swastik Kopparty (Rutgers University)
* Dick Lipton (Georgia Tech)
* Andrew McGregor (University of Massachusetts, Amherst)
* Raghu Meka (IAS)
* Jelani Nelson (Harvard)
* Eric Price (MIT)
* Christopher Ré (University of Wisconsin, Madison)
* Shubhangi Saraf (Rutgers University)
* Suresh Venkatasubramanian (University of Utah)
* David Woodruff (IBM)
* Mary Wootters (Michigan)
* Shuheng Zhou (Michigan)

We have some funding for graduate students and postdocs with preference given to those who will be presenting posters. For registration and other details, please look at the workshop webpage: http://eecs.umich.edu/eecs/SPARC2013/

Friday, May 10, 2013

On GPU algorithms

Lance talks about GPU algorithms in his latest post:
The theory community hasn't seem to catch on yet. There should be some nice theoretical model that captures the vector and other operations of a GPGPU and then we should search for algorithms that make the best use of the hardware. The theory world writes algorithms for non-existent quantum computers but not for the machines that we currently use.
Well.

As someone who was doing GPU algorithms back when this meant flipping state bits on the fixed function pipeline (this is the GPU analog of "I wrote video games in assembly"), maybe a little perspective is in order.

Aside: I actually gave a talk on GPU algorithms at Chicago when Lance was there back in 2004. Clearly I wasn't interesting enough to get him to take note :).

Back in the early 2000s, I got quite interested in GPU algorithms. This came about from a summer project that Nabil Mustafa was doing with Shankar Krishnan and myself on real-time map simplification. Shankar had this cool idea to try and accelerate the Douglas-Peucker algorithm by implementing it on a GPU, which at the time meant trying to retrofit the algorithm to a very strange OpenGL rendering pipeline.

One thing led to another, and it got us thinking more abstractly about what kinds of computations could be done on the GPU. This was pre-CUDA, which meant that essentially we could abstract the GPU machine model as a massively parallel SIMD architecture in which each "processor" ran a simple straight line program (no loops, limited conditionals, and small memory - much like a streaming algorithm). The parallelism lent itself nicely to geometric problems, and we wrote papers on map simplification, depth contours, and general geometric optimization. We also designed an algorithm for a CSG rendering problem that had buried inside it a simple streaming model for proving lower bounds: in fact we were able to show an exponential gap (in the number of streaming passes needed) between deterministic and randomized median finding in this model (the conference wasn't interested in the lower bounds though :)).

In a bout of perfect timing, I stopped working on GPUs a year or so before CUDA. At the time, my reasons were simple. I was tired of the computational model changing on me every six months, and didn't have the resources to keep buying the latest and greatest graphics cards (although AT&T was very generous in getting me a state of the art card originally). It was also annoying that in order to really exploit the power of the hardware, I needed secret sauce that was only revealed if you had NVIDIA inside connections (or went to the NVIDIA U events).

Then CUDA came along and everyone went nuts. If you go to hgpu.org (originally gpgpu.org) you'll see the number of papers being published every year on GPU-accelerated methods. CUDA changed (quite radically) the way in which we thought about GPU algorithms: the memory model became more sophisticated, and the SIMD component was more powerful, although it was still a good idea to avoid excessive looping and conditionals.

Eventually I got dragged back into the GPU world. I collaborated on a paper on graph coloring which was very instructive in demonstrating the differences between a GPU and a straight-up parallel machine. I also did a series of lectures on GPU algorithms at a 2012 MADALGO summer school on new models of computing. Incidentally, folks at Utah spend a lot of time working on issues relating to GPUs: Mary Hall's group has some nifty autotuning tools for mapping algorithms to the GPU, and Ganesh Gopalakrishnan has been thinking about memory consistency issues for GPUs.

Is there a good "GPU model" that theoreticians can study ? I think this is the wrong question. I think the GPU is one point (multi-core CPUs are another) in a space of tradeoffs between local computation, memory latency, and communication. I've been saying for a while now that communication appears to be the key resource to understand theoretically when dealing with our new networked/distributed/multiprocessor world. GPU algorithms are interesting to study as one example of how communication affects a computation. But the "right" way to deal with this involves focusing on communication, and not the GPU model per se. The latter has many idiosyncrasies and what I'd consider distractions that take away from a clean formal understanding.

I should add that I'm not the only person from theory-land who's been thinking about GPU-derived models. There's a great body of work on multicore models (see Phil Gibbons' lecture notes at the same summer school), and I recall Michael Mitzenmacher having recent work on GPU-accelerated hashing. Uzi Vishkin had a STOC workshop a few years ago on multicore models as well. (Turing Award winner) Valiant's bridging model clearly needs to get a lot more attention as well, so that people don't think no one's paying attention to the models.

For more on this, also see my colleague Jeff Phillips' lecture notes on GPUs from his class on new models of computation.

Saturday, March 23, 2013

Free, Freemium, and Paid

There was a time when I'd bridle at the idea of having to pay for software or services. But I browse the iTunes app store now, and see people pleading to have the chance to pay for an app that they like, so that the authors won't stop updating it. The whole kerfuffle with Google Reader, Keep and Evernote is another example of how people have begun to prefer to pay for products, rather than rely on something free.

It feels like the end of an era where open source and free software (not the same thing, but often referred to in the same breath) were the default. Maybe we've come to the realization that nothing is really ever free, and that it's more realistic to get the costs out in the open rather than "being the product".

Friday, March 15, 2013

The SIGACT CG column

As you might have just discovered, I'm the second half of the two-headed monster that's taken over the SIGACT Geometry Column after Joe O'Rourke stepped down (Adrian Dumitrescu, who hopefully does not mind being referred to as the head of a monster, is the other half). My first column is up, and it talks about some recent fascinating developments in nonnegative matrix factorization.

My hope with the pieces I write is to cover areas of geometry that may not have had sufficient representation in the past (especially things closer to problems I'm interested in). My next column is due in August, and apart from doing a wrapup on SoCG, other things that come to mind include Laplacians and graph geometry, reproducing kernels, or even Bregman geometry.

But everything is now mobile, crowdsourced and social networked, so I'm looking for your suggestions on interesting topics to cover, new emerging areas, results that I'm not tracking, and so on. So post away here or on G+.

Monday, February 25, 2013

Data analysis, interpretation and explanations

There was a recent data-related kerfuffle between the New York Times and the makers of the Tesla electric car. If you haven't read the articles, (and the NYT public editor's post on this has good links), the crude summary is this:

  • NYT reporter takes Tesla on long test drive, and reports problems with running out of charge.
  • Tesla Motors CEO Elon Musk writes a long takedown of the reporter's review, complete with graphs generated from data the car recorded during the drive
  • NYT comes back with their own interpretation of data, rebutting Musk's claims.
  • Others attempt to reproduce the reporter's experience and fail, but arguably in different weather conditions that might or might not make a difference.
In an insightful meta-analysis of the dustup, Taylor Owen of the Tow Center for Digital Journalism discusses the implications for journalism in a data-driven world. He also references an article by David Brooks that makes the point:
People are really good at telling stories that weave together multiple causes and multiple contexts. Data analysis is pretty bad at narrative and emergent thinking...
I've felt for a while now that it's time to design mechanisms for providing context and narrative to data analysis*. Some of the research my student Parasaran does is on metaclustering: essentially the meta-analysis of different perspectives (clusterings) of data to draw out a larger meaning. We've also just submitted a paper on how to derive local markers of prediction validity in clustering (rather than just relying on global measures of quality). And my seminar this semester is about the larger problems of explanations and accounting in data mining.

I think as computer scientists, we have a lot to offer in the realm of data mining - not just in the design of tools for prediction, but in the design of tools to facilitate better understanding.


* None of this is surprising to experimental scientists, who will routinely attack a problem from multiple directions in order to acquire a larger understanding of a phenomenon rather than just the ability to predict. 


Friday, February 08, 2013

TCS+ hangout on the TSP polytope.

+Thomas Vidick  and the folks at the +TCS+ community have started a new experiment in G+ hangout talks. The first talk in this series was by Ronald de Wolf on the STOC 2012 paper that showed that there was no polynomial-sized lifted representation of the TSP polytope.

Overall, it was a very pleasant experience. I was able to reserve one of the 10 slots (a Hangout limit) for myself and my students at the University of Utah to attend and interact with the speaker, and from Thomas's post-review it seems that many more were signed on for "view-only" access to the stream.

There were very few technical hiccups, and +Oded Regev did a great job making sure that people were muted when not talking, and had a chance to ask questions in an orderly way. The chat sidechannel was a great help.

The talk itself was the best part: de Wolf did an excellent job conveying the main ideas of the proof without getting bogged down in details, and it felt as comfortable as listening to a talk live at a conference. Given the number of people listening in, this was already approaching medium-sized-workshop levels.

I'm looking forward to more of these events, and I'm glad that the +TCS+  folks are doing this. I also hope they can try more experiments with Google Hangout. For example, two ideas come to mind:

  • A reddit-style AMA ("Ask Me Anything"). One way to make this work is that the speaker would do a short presentation (maybe 5-10 minutes) and then would open up the floor for questions. To keep things manageable, people could write in questions on chat, and the moderator could filter them and ask the questions live. With sufficient preparation, some questions could be submitted ahead of time.
  • A live panel discussion with a moderator and a few participants, which again could have questions from the audience moderated by the moderator.

Monday, January 21, 2013

Accountability in Data Mining: A new seminar

Glencora Borradaile makes a number of interesting points about the implications of Aaron Swartz's life and work for us academic computer scientists. 
As computer science academics we are in a very powerful position. We are trusted with shaping the next generation that will make very important decisions that will have far-reaching social implications. Decisions like those over Facebook’s privacy defaults, motivating technology that enables autonomous private vehicles at the expense of the public interest, defining ownership of electronic media. We make those decisions ourselves in our research; what we research, how we allow our research to be used.
 Whether or not you agree with her particular policy preferences, the point remains that the technologies we develop can have lasting consequences for the "shape" of the world to come. And if we don't speak up (either through our work, or through our own advocacy), others will take the technologies we develop and find their own ways of using or misusing them.

All of this leads up to my current interests in data mining. I've been thinking about the problems of accountability in data mining for a while now: initially mostly in private, but now more and more in public (along with +Graham Cormode and +Andrew McGregor) as I see the importance of the issue.

What is accountability in data mining ? It means many things really, but for me, for now, it means this:

The fruits of data mining pervade every aspect of our life. We are recommended books and movies; given differential pricing for insurance; screened for potential terror threats; diagnosed with various diseases; and targeted for political advertising. The ability to sift through massive data sets with sophisticated algorithms has resulted in applications with impressive predictive power. 
Since the internals of a learning process can be complex and opaque, it is important to verify that the results satisfy the properties claimed by the algorithm. The importance of this goes beyond merely checking the algorithm itself. Validation mechanisms also provide a way for users to demand accountability from authorities who might use the results of mining to affect users in some way (by changing their insurance rates, or even putting them on a terrorism watchlist). As the results of data mining affect more and more of our lives, the more crucial it is that the user be able to validate decisions made on their behalf and that affect them. 
The above is an introduction to a seminar I'm running this semester on this topic. I'm a little nervous about it, because the topic is vast and unstructured, and almost anything I see nowadays on data mining appears to be "in scope". I encourage you to check out the outline, and comment on topics you think might be missing, or on other things worth covering. Given that it's a 1-credit seminar that meets once a week, I obviously can't cover everything I'd like, but I'd like to flesh out the readings with related work that people can peruse later.

It's entirely possible that we don't care about our privacy any more (as Facebook seems to think*). But we will care about the consequences of the use of our data. My perspective is that a better understanding of what is technically possible (and not possible) to demand and get accountability will be critical to informing larger policy discussions on this topic.

* In an earlier version of this post I wrongly attributed this sentiment to an individual. Apologies.

Wednesday, January 16, 2013

A sampling gem: sampling from $\ell_p$ balls.

A while back, I had written about uniform sampling from the $d$-simplex. In brief, sample exponentially distributed random variables, and normalize them. Note that the simplex is one piece of the $d$-dimensional $\ell_1$ unit sphere.

I had also mentioned a similar trick to sample from the ($\ell_2$) unit sphere: sample Gaussian random variables, and normalize them.

As I discovered today, these are really just special cases of a much more general result that includes both of these approaches, as well as providing a general way to sample from the $d$-dimensional unit ball (as well as sphere) under any $\ell_p$ norm.

The result is due to Barthe, Guedon, Mendelson and Naor, and is breathtakingly elegant. Here it is:

To sample from the unit ball under the $\ell_p$ norm in $d$ dimensions, randomly sample $d$ coordinates $x_1, \ldots, x_d$ i.i.d from a density proportional to $\exp(-|x|^p)$. Also sample $Y$ from the exponential distribution with parameter $\lambda = 1$. Then the desired sample is
$$\frac{(x_1, x_2, \ldots, x_d)}{\Bigl(\sum_i |x_i|^p + Y\Bigr)^{1/p}}$$
Notice that merely eliminating the $Y$ recovers the procedure for sampling from the unit sphere, and includes both of the above sampling procedures as special cases. It's far more efficient than either rejection sampling, or even the Metropolis-type sampling methods from the theory of sampling from convex bodies. Also, if you only want to sample from an $\ell_2$ ball you can do it without this hammer, using the sphere sampling technique and a nonuniform sampling of the length, but this seems so much more elegant.
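
Here's the recipe transcribed into a short Python sketch. One implementation detail is my own: to sample from the density proportional to $\exp(-|x|^p)$, I draw a Gamma$(1/p)$ variable, take its $p$-th root and attach a random sign, which gives exactly that density.

    import numpy as np

    def sample_lp_ball(d, p, rng=np.random.default_rng()):
        """One uniform sample from the unit l_p ball in R^d, following
        Barthe-Guedon-Mendelson-Naor: sample each coordinate from the
        density ~ exp(-|x|^p), sample Y ~ Exp(1), and normalize."""
        g = rng.gamma(shape=1.0 / p, scale=1.0, size=d)
        x = rng.choice([-1.0, 1.0], size=d) * g ** (1.0 / p)
        y = rng.exponential(scale=1.0)
        return x / (np.sum(np.abs(x) ** p) + y) ** (1.0 / p)

    # sanity check: every sample lands inside the l_p unit ball
    p = 1.5
    samples = np.array([sample_lp_ball(5, p) for _ in range(1000)])
    print(np.max(np.sum(np.abs(samples) ** p, axis=1) ** (1 / p)) <= 1.0)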

Thursday, January 10, 2013

LEGO and math teaching

I was discussing Paul Tough's book with my wife yesterday at breakfast, and we somehow got onto my pet peeve: the mechanical way in which math is taught in schools, and how math gets reduced to arithmetic and counting. Of course the definitive word on this is Paul Lockhart's A Mathematician's Lament.

I looked over the floor of the living room, strewn with random bits of deadly adult-foot-stabbing Lego, and had a minor epiphany. As anyone with small children knows, LEGO sells a lot more Hollywood tie-in kits now compared to the "bags of bricks" that it used to sell. You can buy the Millenium Falcon, an X-wing fighter, a spiderman action scene, or some kit related any cartoon on Nickelodeon.

Which is a pain. Lots of specialized parts, key pieces that if misplaced causes massive bouts of tears and anguished searching, and so on.

But how do kids play with them ? They put it together carefully from the instructions the first time. This lasts about a day or so. Then they get bored, take it apart and mix up all the pieces with the OTHER kits they have, and spend many happy days building all kinds of weird contraptions.

Here's the thing about the kits. They show you how to build fairly complicated devices and doohickeys. The kids pick up on this, and they mix and match the doohickeys in their own new constructions. Essentially they figure out the functions of the different parts, and realize how to build brand new things with them.

But suppose they were told that all they could do was repeatedly assemble the kits (and maybe even different kits) over and over again. It's unlikely they'd learn anything more than just how to assemble kits really quickly. They'd also get terribly bored. In fact, the popular toys in our house are multipurpose objects: any single-purpose toy gets discarded rather quickly.

To cut a long story short, it feels to me that a lot of math (and theoryCS) education at the school and college level involves looking at various "kits", and seeing how they get put together. You get rewarded for putting kits together correctly, but not for playing with the kits and creating bizarre new objects. But finding a way to let students (in an unstructured way) play with mathematical concepts is the key to having them really understand the concepts and their different facets.

Tuesday, January 08, 2013

ICERM and Simons postdocs

Two upcoming TCS postdoc deadlines:

ICERM Program on Network Science and Graph Algorithms

This is a program out of Brown organized by Kelner, Klein, Mathieu, Shmoys and Upfal. It sounds quite fascinating if you're doing anything with graph data and spectral analysis. The deadline for postdoc applications is Jan 14.

Simons special program on the theory of big data. 

As I've mentioned before, the Simons Institute is running a semester long program on the theory of big data. The deadline for applying for postdocs is soon (middle of January).

These two programs are coordinating, and if you're so inclined you might be able to spend one semester at Berkeley and another in Providence. Please let the organizers know if this is something you're interested in.

Monday, January 07, 2013

SODA 2013 4/n: Business

If you haven't been following my live-tweets at the SODA business meeting, here's a summary of the unusually quiet meeting:
  • Attendance at the conference was 311, which is quite low, but is a local high for New Orleans (third time in the last 10 years). 
  • 135 papers were accepted out of a net 459 submissions. This is an acceptance rate of nearly 30%, which should strike fear into the hearts of tenure-track assistant professors everywhere. 
  • PC members had to review "only" 42 papers each. Yes, I know this is insane. And no, we're not going to do anything about it. 
  • Shiri Chechik received one of two best student papers for her paper "New Additive Spanners". Bernhard Haeupler received the other for "Simple, fast and deterministic gossip and rumor spreading".
  • I've already mentioned the two best papers, on graph minors and dynamic connectivity
  • SODA 2014 is in Portland, land of beer, books and birds. Chandra Chekuri is chair. 
  • After Salt Lake City observers had to withdraw because of serious voting irregularities, the trumped up unfair winner of the SODA 2015 sweepstakes was San Diego. But since we never go with our first choice location (Honolulu, anyone?), San Francisco was listed as a second choice, with DC as a third choice. 
  • David Johnson is handing the baton over to Cliff Stein, after running SODA since before I knew it wasn't a fizzy drink. 

SODA 2013: Part 3/n

I just attended Valerie King's talk on her paper with Bruce Kapron and Ben Mountjoy that does dynamic connectivity in worst-case polylog time (randomized). This paper received a Best Paper award (along with the Grohe et al paper on graph minors I mentioned yesterday).

The problem is a classic: given a stream of updates (edge insertions/deletions) to a graph, maintain a data structure that can answer "Are u and v connected" efficiently. There's been a slew of work on the problem: if you want worst-case bounds for updates, the best till now was $O(\sqrt{n})$ by Eppstein, Galil, Italiano and Nissenzweig (from 1992). There are polylogarithmic amortized bounds, but they can incur worst-case update times of $\Theta(n)$.

In this paper, after 20 years, they knock down the worst-case update times to polylogarithmic: the algorithms are randomized (with 1-sided error: it might occasionally declare two nodes disconnected when they were connected).

The idea itself is very elegant, and it connects to techniques in streaming algorithms in a way that I think is neat. WARNING: IANAE in this area, so my rendition might be slightly naïve, and is drawn from my understanding of Valerie King's excellent presentation.

The basic problem is this: I can maintain connectivity by maintaining a collection of trees. Adding an edge can be done without too much difficulty (indeed, if all you want to do is insert things, then you're back to union-find). It's insertion and deletion together that makes things hard: imagine that you delete an edge and disconnect a tree: you need to quickly determine if there's some other non-tree edge that can be used to reconnect the pieces, or if the two sides are now really disconnected.

More generally, consider the cut-set maintenance problem. You have updates as before, but now a query is: given a set S, find a cut edge if one exists between S and V - S. It's not hard to see that this is the core routine you need for maintaining the trees.

The first elegant idea is this: suppose you represent each edge by a bit string consisting of the encoding of one endpoint followed by an encoding of the other. For example, the edge 2-3 might be written as 010011. Now for each vertex, compute the XOR of all the edges adjacent to it and call this its signature.

If you XOR all the signatures of vertices in a set, and if the set has exactly one cut edge, then what you get is the signature of that edge ! This is because each edge that's inside S will occur twice and therefore will zero out.

So this works if your cut set is of size 1. But suppose it's not ? Now you maintain $O(\log n)$ sets for each vertex. Each edge adjacent to the vertex is thrown into the $i^{\text{th}}$ set with probability $1/2^i$. With some appropriate duplication, you can show that if the cut set is of size $k$, then w.h.p. the $(\log k)$-th set will contain exactly one cut edge, and that's the element you can retrieve when you need to find a replacement edge.
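
Here's a toy rendering of just the XOR trick (my own illustration of the idea as described in the talk, not the paper's actual data structure): per-vertex signatures that recover the crossing edge whenever the cut has size exactly one. The $O(\log n)$ subsampled copies described above are what handle larger cuts.

    def edge_id(u, v, bits=8):
        """Encode an edge as the concatenation of its endpoint labels
        (assumes labels fit in `bits` bits), e.g. 2-3 -> 00000010 00000011."""
        u, v = min(u, v), max(u, v)
        return (u << bits) | v

    class XorSignatures:
        """Per-vertex XOR of the ids of incident edges; insertion and
        deletion are the same operation, since XOR is its own inverse."""
        def __init__(self, n):
            self.sig = [0] * n

        def toggle(self, u, v):              # insert or delete edge (u, v)
            self.sig[u] ^= edge_id(u, v)
            self.sig[v] ^= edge_id(u, v)

        def cut_edge(self, S, bits=8):
            """If exactly one edge crosses (S, V-S), return it: edges with
            both endpoints inside S appear twice in the XOR and cancel."""
            acc = 0
            for u in S:
                acc ^= self.sig[u]
            return acc >> bits, acc & ((1 << bits) - 1)

    g = XorSignatures(6)
    for u, v in [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)]:
        g.toggle(u, v)
    print(g.cut_edge({0, 1, 2}))             # the unique crossing edge: (2, 3)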

There's a lot more technical work to do to get the whole algorithm working, and I won't go into that. But those of you who are familiar with streaming will recognize something here.

In essence, they've created a sketch for the set of edges adjacent to the vertex, and they have a way to represent the set compactly and retrieve individual elements from it. The trick with exponentially decaying levels is standard in streaming count estimation, and the GF(2) trick for storing the sketch is very similar to the trick used by Ahn, Guha and McGregor in their SODA 2012 paper on graph sketching.

And that's what I thought was the best part: that ideas that really have power in streaming can be used to help solve a longstanding open problem in graph algorithms. I should point out that the connection is post-facto: this paper was developed independently of the streaming work. But one does have to wonder what other connections we might be able to establish between sketching tricks on graphs and basic graph algorithms.

SODA 2013: Part 2/n

My previous post is here.

Day 2 of SODA, and the tenth time I've been asked "are you chairing all the sessions" ? No, just that many of my PC colleagues didn't (or couldn't) show up :), so those of us who did are doing more lifting. In reward, we got a nice dinner in the French quarter, and I tasted boudin for the first time (and maybe the last). 

An interesting talk today morning by Dror Aiger on reporting near neighbors. They were able to show a not-super-exponential relation between the number of points at unit $\ell_\infty$ distance from a query, and the number of points at unit $\ell_2$ distance. This was wrapped into a fast algorithm for reporting Euclidean near neighbors in high dimensions that has some interesting (if preliminary) experimental backing as well in comparison with ANN, FLANN and LSH. 

Jan Vondrák gave the invited talk on submodular optimization. I mentioned Satoru Fujishige's talk at NIPS, and this was an excellent complement. Fujishige's talk focused on the geometric intuition behind submodular functions (especially the associated polymatroid). Vondrák's talk focused on the algorithmic implications of submodularity, and he gave very convincing arguments for why it can be viewed as discrete convexity OR discrete concavity, or even neither. He pointed out how the Lovász extension is useful for minimization and the multilinear extension is more useful for maximization, and gave a number of "recipes" for designing algorithms that optimize submodular functions. I hope the slides go online at some point: they were very clear and well-balanced. 

There was some discussion over whether next year's SODA should adopt the two-tier PC that STOC is currently experimenting with. The jury's still out on that, and since the STOC PC is not done with their work, we don't yet have formal feedback. I will admit to being a little frustrated with the level of conservativeness on display here: it's not as if EVERY OTHER COMMUNITY IN CS doesn't do this and doesn't have best practices that we can learn from, and given our reviewing loads, it's really crazy that we aren't desperately trying things to alleviate the problem. 

Sunday, January 06, 2013

SODA 2013, Part I/n

On twitter, it's common to post longer thoughts in 140 character bursts. If you don't know how long the thought will be, you label them as 1/n, 2/n, and so on, so as not to exceed an arbitrary limit, but also imply that there is a finite limit. 

So we're back in the Big Easy for SODA. This time the conference has merged ALENEX and ANALCO with the main program, so at least for today and tomorrow, we have four parallel tracks. Having just come back from NIPS, SODA feels like a small cozy cocktail party in comparison (ed: not that I know anything about cocktail parties: I have small children)

The highlight was Bob Sedgewick's tribute to Philippe Flajolet. I've talked about their work on analytic combinatorics before, but it was nice to hear some of it again. Flajolet's contributions to the field are immense and valuable. He did big data before it was cool: his $\ell_0$ estimation work with Nigel Martin is now a seminal paper in the world of streaming, and his HyperLogLog paper on counting the distinct elements in a stream (with Fusy, Gandouet and Meunier) is the best-in-class on this problem. Bob quoted one of Flajolet's students as saying, "Read any of his papers. You will learn something".

I chaired a geometry session in the morning, and one of the papers there caught my eye a while back. The Fréchet distance problem (commonly known as the dog-walking problem) is the problem of computing a monotone mapping between two curves that has minimum maximum length (you can think of this as a man on one curve walking a dog on another, and asking for the minimum leash length when both man and dog can walk forward at arbitrary speeds).

There's a relatively straightforward dynamic program that can compute the Fréchet distance between two polygonal chains in time $O(nm\log (nm))$ (if $n$ and $m$ are the number of links in the two chains). But it's been a big open problem to figure out whether this can be done in subquadratic time. The paper, (by Agarwal, Ben Avraham, Kaplan and Sharir) shows that for the discrete version of the problem (that Rinat Ben Avraham in her talk calls the frog hopping problem: the frogs hop from vertex to vertex) you can get a slightly subquadratic time algorithm. The main idea involves adapting "string-like" methods for the problem, which is hard because the "alphabet" is infinitely large.

It will be interesting to see if this spurs more progress. There's already a newer paper by Buchin et al that gets a slightly improved (but still super-quadratic) algorithm for the continuous Fréchet distance (i.e the original problem). If for no other reason, please click the link to see their awesome title.

Martin Grohe gave an award talk for his work with Kawarabayashi and Reed on an improved algorithm for graph minor decompositions. The problem is as follows: given a graph G and a graph H supposedly excluded by G, compute a decomposition of G promised by the graph minor theorem, or produce an H-minor.

The improvement reduces the dependency on the size of $G$ to quadratic (from cubic) and makes use of the wonderful and mysterious Courcelle's theorem. The dependence on the size of $H$ is equally mysterious (at least Martin declined to tell us), but is nowhere near as wonderful.

Please post papers you thought were interesting (and why) in the comments, and I'll add them as updates.

Tuesday, January 01, 2013

A SODA announcement, and a happy new year !

While you're all recovering from New Year's eve revelries, here's an important note from the ALENEX/ANALCO organizers regarding the debauchery in New Orleans otherwise known as SODA:


Dear SODA attendees,
We want to make sure that you are aware of a change in the format of SODA/ALENEX/ANALCO.  In recent years, ALENEX and ANALCO have been held on the Saturday before SODA and had separate registration.  This year we are trying a different format.  ANALCO and ALENEX will both be held in parallel with SODA; ANALCO on Sunday and ALENEX on Monday.  There is no longer an separate registration for ALENEX and ANALCO, everyone is automatically registered for all three.  That is, all SODA attendees are welcome to attend  ANALCO and  ALENEX talks.  We encourage you to look at the program and attend any talks that interest you.  We also welcome any feedback on this new format, after you have attended the conference. 
Regards,
Bob Sedgewick (ANALCO steering committee chair)
Cliff Stein (ALENEX steering committee chair)

Saturday, December 15, 2012

NIPS II: Deep Learning and the evolution of data models

(tl;dr: some rambles and musings on deep learning and data, as I attempt to sort out in my head what this all means)

Over the years, as we've engaged with "big data" more and more, the way we construct mental models of data has changed. And as I've argued before, understanding how we think about data, and what shape we give it, is key to the whole enterprise of finding patterns in data.

The model that one always starts with is Euclidean space. Data = points, features = dimensions, and so on. And as a first approximation of a data model, it isn't terrible.

There are many ways to modify this space. You can replace the $\ell_2$ norm by $\ell_1$. You can normalize the points (again with $\ell_2$ or $\ell_1$, sending you to the sphere or the simplex). You can weight the dimensions, or even do a wholesale scale-rotation.

But that's not all. Kernels take this to another level. You can encode weak nonlinearity in the data by assuming that it's flat once you lift it. In a sense, this is still an $\ell_2$ space, but a larger class of spaces that you can work with. The entire SVM enterprise was founded on this principle.

But that's not all either. The curse of dimensionality means that it's difficult to find patterns in such high dimensional data. Arguably, "real data" is in fact NOT high dimensional, or is not generated by a process with many parameters, and so sparsity-focused methods like compressed sensing start playing a role.

But it gets even more interesting. Maybe the data is low-dimensional, but doesn't actually lie in a subspace. This gets you into manifold learning and variants: the data lies on a low-dimensional curved sheet of some kind, and you need to learn on that space.

While the challenge for geometry (and algorithms) is to keep up with the new data models, the challenge for data analysts is to design data models that are realistic and workable.

So what does this have to do with deep learning ?

Deep learning networks "work" in that they appear to be able to identify interesting semantic structures in data that can be quite noisy. But to me it's not entirely clear why that is. So I've been mulling the question of what kind of structure in data might be "visible" to such a network. In the absence of formal results ("if your data can be separated in some RKHS, then an SVM will find it"), what follows is some speculation, based on talks I attended and conversations I had with NIPs attendees.

Observation 1: Both Stéphane Mallat's talk and a nice paper by Coates, Karpathy and Ng talk about the idea of "first-level" features that identify (low-dimensional) invariants and eliminate noise. In the Coates et al paper, they start with a very fine $k$-means clustering ($k$ very large), and attempt to "glue" the centers together into low-dimensional subspace pieces. These are then propagated up a network of higher-order feature processors.
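Here's a rough sketch of the flavor of that first observation (my own toy rendering, not the Coates et al. pipeline): run a very fine $k$-means, then re-describe each point by how strongly it activates each center, so that clusters of co-activated centers can later be glued into low-dimensional pieces.

    import numpy as np
    from sklearn.cluster import KMeans

    patches = np.random.randn(10000, 64)     # stand-in for image patches

    # a very fine clustering: each center is a candidate low-level feature
    km = KMeans(n_clusters=500, n_init=1).fit(patches)

    def encode(patch, centers):
        # soft activation: centers much closer than average "fire"
        d = np.linalg.norm(centers - patch, axis=1)
        return np.maximum(0.0, d.mean() - d)

    features = np.array([encode(p, km.cluster_centers_) for p in patches[:10]])
    # 'features' is what gets handed to the next layer of the network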

Observation 2: Yoshua Bengio's survey of deep learning (a very readable account) points to work by Geoff Hinton as part of the reinvigoration of the field. A central idea of this work is that deep belief networks can be trained "layer by layer", where each layer uses features identified from the previous layer.

If you stare at these things long enough, you begin to see a picture not of sparse data, or low-rank data, or even manifold data. What you see is a certain hierarchical collection of subspaces, where low-dimensional spaces interact in a low dimensional way to form higher level spaces, and so on. So you might have a low-level "lip" feature described by a collection of 2-3 dimensional noisy subspaces in an image space. These "lip" features in turn combine with "eye" features and so on.

This might seem rather far-fetched, and a strange way to think about data. But I can't claim originality even here. Back in June, Mauro Maggioni gave a talk at CGWeek in Chris Bishop's workshop on the connection between analysis and geometry. In this talk, he described a multi-resolution view of data that admits structure at different scales, and admits different structures at these scales.

The upshot of all of this: it is possible that deep learning is trying to capture hierarchies of low dimensional subspaces that interact in a low dimensional way. This would explain how one is able to avoid the curse of dimensionality, and might also explain why it sometimes can find structure that other methods (kernels, manifold methods, etc) might miss.

Problem is: I'm not sure how one tests this hypothesis. Almost any data set could be viewed this way if you allow enough features and enough "depth" in the hierarchy.

Monday, December 10, 2012

NIPS ruminations I

(tl;dr a bucket load of trends/papers that I found interesting at the conference)


I just returned from NIPS in Lake Tahoe. For those not in the know, NIPS is one of the premier machine learning conferences, and has been growing rapidly over the last few years. This year's incarnation (the first of at least 7 in Lake Tahoe) has over 1600 attendees, a gazillion workshops, 4 tutorials, and more papers each DAY than all of STOC.

The format of the conference is very interesting. There are keynotes each morning and afternoon, a few selected 20-minute talks, and a few more 5-minute "spotlight" talks, all in single session mode. All accepted papers get a poster slot on one of four days, and the poster session is a mammoth event from 7pm to midnight each night (yes, you read that correctly).

I'll say more about the format in a later post. For the next few posts, I'll highlight some of the papers and trends that were interesting to me. For more conference blogging, check out Anand Sarwate's sequence (I, II), and Hal Daumé's recap.

Perhaps the most obvious trend at the conference was on deep learning. While it's been in the news recently because of Google's unsupervised discovery of YouTube cats, the methods of deep learning (basically neural nets without lots of backpropagation) have been growing in popularity over a long while, and I was awash in autoencoders, pooling, and other jargon from the area. It was amusing to see speakers apologize for NOT talking about deep learning.

Stéphane Mallat's keynote on this topic was a thing of beauty. While I understood very little of the technical details, the central idea was that by using certain convolution-based encodings, and looking for "invariant substructures", you could essentially filter out irrelevant noise in the feature space, generating new features for the next level of the learning network. This simple idea is quite powerful: there was a paper by Coates, Karpathy and Ng on learning faces from videos that used a simple version of this idea, running $k$-means with lots of clusters to identify these low-dimensional subspaces and factor them out.

Another trend that's been around for a while, but was striking to me, was the detailed study of optimization methods.  Optimization is a major theme in machine learning. This is primarily because so many machine learning problems can be formulated as convex/nonconvex optimizations. Solving these problems takes a lot of ingenuity and engineering, and after you've been swimming in a pond of regularization, sparsity, mixed norms, stochastic gradient descent, and alternating optimization for a while, things can get crazy. There are at least two different workshops on optimization in machine learning (DISC and OPT), and numerous papers that very carefully examined the structure of optimizations to squeeze out empirical improvements.
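As a tiny, concrete example of the genre (a sketch only, with made-up parameters): stochastic gradient descent on an $\ell_1$-regularized least squares problem, where the regularizer is handled by a soft-thresholding (proximal) step.

    import numpy as np

    def sgd_lasso(X, y, lam=0.1, step=0.01, epochs=20):
        # minimize (1/2)(x_i . w - y_i)^2 + lam * ||w||_1, one sample at a time
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            for i in np.random.permutation(n):
                w -= step * (X[i] @ w - y[i]) * X[i]                      # gradient step
                w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # prox step
        return w

    X = np.random.randn(200, 10)
    w_true = np.zeros(10); w_true[:3] = [2.0, -1.0, 0.5]      # sparse ground truth
    y = X @ w_true + 0.1 * np.random.randn(200)
    print(sgd_lasso(X, y))    # should recover something close to w_true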

If I had to mount a critique of this large body of work, it would only be that a lot of it appears to be black magic (a criticism that seems even more true for deep learning). There are a number of tricks in the optimization toolbox, and you can always try any number of them, but it's not always clear which methods work best, and why.

Quick hits on papers that interested me:

Bregman divergences: A paper by Iyer and Bilmes shows how to extend Bregman divergences to the case when the generating function $\phi$ is not convex, but submodular. This is a little tricky, because in such cases there's no unique gradient, and so technically the Bregman divergence is a function of both the function and its subdifferential. One interesting variant is the Lovász-Bregman divergence, which comes from using the Lovász extension of a submodular function as the convex generator for a Bregman divergence. An application of these constructions comes in ranking, and it's interesting that things like the Hamming distance can be constructed.
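For reference, here's the classical (convex) construction that the paper generalizes, in a minimal sketch: given a convex $\phi$ and its gradient, $D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y\rangle$.

    import numpy as np

    def bregman(phi, grad_phi, x, y):
        # D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
        return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

    # phi(x) = ||x||^2 recovers the squared Euclidean distance
    sq = lambda x: np.dot(x, x)
    print(bregman(sq, lambda x: 2 * x, np.array([1., 2.]), np.array([0., 1.])))

    # negative entropy recovers the KL divergence (on probability vectors)
    negent = lambda x: np.sum(x * np.log(x))
    print(bregman(negent, lambda x: np.log(x) + 1,
                  np.array([.2, .8]), np.array([.5, .5])))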

On a separate note, Satoru Fujishige gave a wonderful invited lecture at DISC on submodular functions. I particularly liked the geometric interpretation of submodularity via its polymatroid and how a matroid can be viewed as the object you get when the submodular function is monotone. Of course this has been known for over 40 years, but I "got it" for the first time. If his book is anything like his talk, I really want to get it.

Kernel distances: Ever since I discovered the kernel distance (as in, found it, not invented it) I've been fascinated by how it behaves more or less like the earth mover distance, but is so much easier to compute. Scott Aaronson (at his NIPS invited talk) made this joke about how nature loves $\ell_2$. The kernel distance is "essentially" the $\ell_2$ variant of EMD (which makes so many things easier).

There's been a series of papers by Sriperumbudur et al. on this topic, showing that (a) the kernel distance captures the notion of "distance covariance" that has become popular in statistics as a way of testing independence of distributions, (b) as an estimator of distance between distributions, the kernel distance has more efficient estimators than (say) the EMD, because its estimator can be computed in closed form instead of needing an algorithm that solves a transportation problem, and (c) the kernel that optimizes the efficiency of the two-sample estimator can also be determined (the NIPS paper).
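A sketch of the closed-form two-sample estimator in (b), as I understand it (assuming an RBF kernel; the biased plug-in version): average the kernel within each sample, subtract twice the cross-term, and take a square root.

    import numpy as np

    def kernel_distance(X, Y, gamma=1.0):
        # sqrt( mean k(x,x') + mean k(y,y') - 2 * mean k(x,y) )
        def K(A, B):
            sq = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
                  - 2 * A @ B.T)
            return np.exp(-gamma * sq)
        return np.sqrt(K(X, X).mean() + K(Y, Y).mean() - 2 * K(X, Y).mean())

    X = np.random.randn(200, 2)
    Y = np.random.randn(200, 2) + [1.0, 0.0]   # a shifted distribution
    print(kernel_distance(X, Y), kernel_distance(X, X))   # the latter is 0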

Metrics for PSD: In my continuing quest to find strange geometries to design algorithms for, I had this paper some time ago on doing geometry on the Riemannian manifold of positive definite matrices. It's a tricky business, and life gets sad quickly when you can't describe the convex hull of three points finitely.

Suvrit Sra proposed a new metric for the space of $n \times n$ positive definite matrices. It's what you get if you apply the Jensen-Shannon construction to the Burg entropy on matrices. It has the property of being more easily computable than the standard Riemannian metric, and also shares with that metric a nice closed form for the midpoint of two matrices. What remains open is the kind of geometry this metric induces: a weird property is that the associated kernel $\exp(-\gamma d^2)$ is p.d. iff $\gamma$ is half-integral below $n-1$, and any real above it.
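Here is my (hedged) reading of the construction in code: the Jensen-Shannon symmetrization of the Burg/log-det divergence gives $d(A,B) = \sqrt{\log\det\frac{A+B}{2} - \frac12 \log\det(AB)}$, which is cheap because it only needs determinants rather than eigendecompositions along geodesics.

    import numpy as np

    def psd_metric(A, B):
        # d(A, B) = sqrt( logdet((A+B)/2) - 0.5*(logdet A + logdet B) )
        # (my reading of the construction; slogdet for numerical stability)
        ld = lambda M: np.linalg.slogdet(M)[1]
        return np.sqrt(ld((A + B) / 2) - 0.5 * (ld(A) + ld(B)))

    def random_spd(n):
        M = np.random.randn(n, n)
        return M @ M.T + n * np.eye(n)

    A, B = random_spd(5), random_spd(5)
    print(psd_metric(A, B), psd_metric(A, A))   # the latter should be 0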

Distance metric learning and MKL.

Our presentation at OPT was about multiple kernel learning, which is closely related to the general problem of distance metric learning (especially when the distance metric is defined by a kernel). There were a good number of distance metric learning papers at NIPS: different approaches that either did local learning of "multi-metrics", or learned a metric based on $k$-NN points, or even did Hamming distance metric learning.

Distributed learning.

Distributed learning (in a communication-restricted environment) seems to be a growing concern, which of course makes me happy. There were a few papers on different kinds of communication-limited learning, including one on doing this for probabilistic PCA (which didn't have formal bounds on the amount of communication though) and one on distributed "non-stochastic" experts.



Wednesday, November 07, 2012

Data, Dimensions and Geometry oh my !

The following is a summary of a talk I gave to undergraduates interested in going on to graduate school. It's targeted at the layperson, and tries to convey a sense of the interplay between data mining and geometry. I gave this talk partly because I realized that things we take utterly for granted in the rarefied world of high dimensional data mining are completely foreign to people who don't think about this for a living.

tl;dr: High dimensional geometry (and non standard geometry) is the unifying language that binds data mining problems together.

We're trying to find patterns all over the place, with very different kinds of data. A search engine is trying to find patterns in web pages that might indicate that they have similar content. Brain researchers are throwing MRIs of patients with diseases into an algorithm that attempts to infer patterns in brain scans that might yield clues about pathology and diagnosis. Genome-wide analysis takes what are essentially long strings of letters and tries to explain why certain populations might be susceptible to certain diseases.

Pandora, Shazam and other music sites analyze fragments of music to find related artists, or even just match a tune. While an infinite Gangnam Style video might be a frivolous use of data mining on video streams, video feeds are being analyzed by robots trying to drive unmanned cars or even just play soccer. Social networks are all the rage: Facebook, Twitter and Google are desperate to understand your circle of friends in order to sell things to you more effectively.

How are we able to find patterns, clusters and trends in such different kinds of data ? The key step in all of this is the idea of features. The trick is to describe each object we are studying as a sequence (or set) of features. For example, a feature set for a document could be the number of times each particular word appears. The feature set for an image could be the count of different colors. For a tune, it might be a collection of features identified by hiring artists to list out which of more than 700 characteristics a piece of music has.
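As a toy illustration of a feature list (nothing like the real pipelines behind these systems, and with a made-up five-word vocabulary): a document becomes the vector of counts of each vocabulary word.

    from collections import Counter

    vocabulary = ["data", "geometry", "mining", "music", "brain"]

    def bag_of_words(text, vocab=vocabulary):
        # one feature per vocabulary word: how often does it appear?
        counts = Counter(text.lower().split())
        return [counts[w] for w in vocab]

    doc = "data mining finds patterns in data and more data"
    print(bag_of_words(doc))    # [3, 0, 1, 0, 0]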

And so on, and so on. What this allows us to do is represent each object as a set of features, whether it's a web page, a brain scan, a video, or even a social graph. Once we do that, we no longer have to worry about the original data (well, kind of !), and different types of data are all on an equal footing.


But what now ? Here's where a cool trick that dates back to the 1600s comes in.

I doubt that René Descartes ever heard of the term "data mining". But in a strange way, he's the father of the field. One of Descartes' claims to fame was establishing a link between geometry and algebra. He said that if we wanted to represent points, lines and other shapes, we could do so not abstractly as Euclid and other classical geometers did, but using algebra. A "point" in the plane could be described by two coordinates $(x, y)$, and a line in the plane could be described by the equation $y = mx + c$.

This is a very simple idea - children learn it in middle school. And yet like most simple ideas, it was revolutionary, and completely changed our perspective on geometry. The unification of algebra and geometry is so powerful and so natural, we use it almost unconsciously today.

But what does Descartes have to do with data mining ?

Remember those features I was telling you about ? That every object can be represented by a sequence of numbers, each number describing some aspect of the object.

Let's do the opposite of what Descartes proposed ! In other words, let's pretend that the objects are actually points in a space. Now this space is not the plane (unless we had only two features). It's a high dimensional space, with one dimension for each feature.

This process is entirely artificial. There is no real reason why we must think of objects as points in a high dimensional space. But here's the cool thing that happens. We can now express all basic mining primitives as geometric concepts, by translating the language of the primitive to this high dimensional space.

A clustering of data becomes a grouping of points so that "nearby" points are grouped together. A classification of reviews into positive and negative statements becomes a way to separate "positive" points and "negative" points by a line. Finding trends in data becomes the problem of fitting a straight line to a collection of points.
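A toy sketch of that translation (made-up data, off-the-shelf numpy): grouping points by which of two centers is nearer, and finding a trend by fitting a line.

    import numpy as np

    # two blobs of points in the plane
    points = np.random.randn(100, 2) + np.random.choice([0, 5], size=(100, 1))

    # "clustering": group each point with the nearer center
    centers = np.array([[0.0, 0.0], [5.0, 5.0]])
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # "finding a trend": fit a straight line y = m*x + c by least squares
    m, c = np.polyfit(points[:, 0], points[:, 1], 1)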

It is hard to overstate how utterly bizarre this is. There is no underlying "geometry" in the problem we're solving. We're essentially creating a castle out of thin air in order to understand the problem. And yet it works, and is the bedrock of how we think about data mining.

But wait ! There's more. What exactly does it mean to say that points are "nearby" or that they are "separated" ? To answer this question, it's not enough to view objects as points in a high dimensional space. You have to give this space a shape - a geometry (and also a topology, but I'll skip that for now).

For example, if I have two feature lists, how do I measure the distance between them ? If they were points on a map, I could do the usual straight line distance. But does that really capture how far apart they are ? After all, if you're in downtown Salt Lake with its grids, a "crow flies" distance doesn't help estimate how far things are. If you're on the surface of the earth, you can't really tunnel through the center to get from one point to another.
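In code, the two notions of distance from the Salt Lake example (straight-line versus grid) are just two different norms:

    import numpy as np

    a = np.array([0.0, 0.0])    # one street corner
    b = np.array([3.0, 4.0])    # another: 3 blocks east, 4 blocks north

    crow_flies = np.linalg.norm(a - b, 2)   # 5.0  (Euclidean)
    city_blocks = np.linalg.norm(a - b, 1)  # 7.0  (Manhattan / taxicab)

    print(crow_flies, city_blocks)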

So we have to create the shape of the space by defining how far apart two points are. And this is one of the trickiest parts of data mining. Either we have to use some domain knowledge to estimate which features are significant and control more of the distance between objects, or we have to try and learn what seems like the right distance between objects based on user estimates or other information.

The good thing is that we have a huge collection of shapes to play with, and different shapes tend to work well for different classes of objects. Some are easy to interpret, others are easy to compute, and so on. So a good part of the "art" in data mining is understanding the data and estimating the right geometry for it, so that our tasks (expressed geometrically) give us meaningful answers.

Or as Dorothy famously said to her dog, "Toto, I don't think we're in the plane any more!"

Tuesday, November 06, 2012

On the elections, Nate Silver, and lessons for data mining

One interesting side story from this election has been the intense focus on Nate Silver's election predictions, and the matter of aggregate polling statistics. While there's certainly a partisan element to much of the discussion, there's also a larger sense of unease about what the predictions are actually saying.

There are lessons here for the greater goal of using data mining for prediction and modelling, and this will get more and more important the more predictive analytics intrude on our lives.

People don't quite understand probability, even though they understand odds. 
There's a lot of confusion about what it means for a one-time event to have a probability associated with it, even though people are quite comfortable with the idea of odds in (for example) sports. This to me reflects a deeper confusion between the frequentist and Bayesian views of probability: namely, is probability the long-run average of the frequency of an event, or an a priori expression of uncertainty in the likelihood of an event ?

I'm not going to say anything here that hasn't been said for more than a century, but from the point of view of interpreting the results of mining, this distinction will be important.

Aggregation is a GOOD THING. 
It is entirely likely that all the poll aggregators will have egg on their face tomorrow when the results come in. But it does seem that aggregation helps smooth out the natural irregularities and biases in individual polls. This is a theme that comes up time and again in data mining, and especially in clustering: rather than trusting a single algorithm, you should try to run algorithms that return diverse answers and aggregate them in some fashion.

It's not enough to predict: you must also explain. 
Among the good reasons to feel uneasy about the aggregate predictions is that they don't give much insight into why things are going the way they are. To be fair, Nate Silver laid out some economic markers back in the spring, and tied them to possible outcomes via regression. While this provides some insight, it's still pattern matching as opposed to a mechanism.

More generally, it is very difficult to convince people that an answer is pertinent or "correct" unless there's some way to explain the answer without listing a sequence of coefficients or writing down a collection of centers. Much of the popularity of decision trees comes from the ease of explanation they seem to provide.

Conclusion. 

Most of the controversy around data mining in the public sphere has centered around privacy issues. Indeed, privacy issues become a concern precisely because people worry that the mining algorithms are too accurate. But in fact we don't really understand the behavior of many algorithms that we use, and that is dangerous regardless of privacy concerns. The angst over the methods used to predict this election is an illustration of what will happen when the predictions we make start to matter, and matter to many people.

Monday, October 08, 2012

On why I'm excited about "big data"

I was in Aarhus recently for a MADALGO workshop on large-scale parallel and distributed models, where I did a sequence of lectures on GPU algorithms. I was briefly interviewed by a university reporter for an article, and did a little video on why I think big data/big iron problems are interesting.

At the risk of embarrassing myself even more than I usually do, here's the video. Note that this was recorded at a time of great crisis across the globe, when all hair styling products had mysteriously disappeared for a few days.


Wednesday, October 03, 2012

We're hiring FIVE (count 'em, FIVE) faculty this year.

We had an incredible hiring season two years ago, making seven offers and hiring seven new faculty. And now we're doing it again !

Our department is looking to hire five new faculty (at least four are at the assistant professor level). I'm particularly excited that we're hiring two people in the general area of big data (following up on our data mining and database hires from two years ago).

One slot is in what I'll call "big data meets big performance": I'll have more to say about this shortly, but the challenges of large data analysis are not just about managing the data, but about managing large numbers of machines to crunch this data (MapReduce is perhaps the most well known example of this). We're looking for people who can "speak extreme data and extreme processors" fluently - these could be on the data/systems management side, or on the analysis side, or the modelling side.

Utah has a strong presence in high performance computing (the Supercomputing conference is happening in Salt Lake, and Mary Hall is the general chair), and we're one of the few places that has a good understanding of both sides of the large data story (i.e. machines and bits).

The second slot is in machine learning, with ties to language. Text (and language) provide one of the best large data sources for scalable machine learning, and we're looking for people interested in the challenges of doing ML at scale, especially when dealing with NLP. If you're that person, you'll be coming into a department that has the entire range of data analysis folks from algorithms to analysis to systems to NLP (with many other faculty that are heavy users of ML technology).

Our plan, once we fill these slots, is to make Utah a one-stop shop for large-scale data analysis and visualization - in addition to the above slots, we're also looking to hire in information visualization to complement our strong viz presence.

In addition to the above slots, we are also hiring in computer security and HCI/user interfaces. While I'm equally excited about these positions, I know much less about the areas :). I will point out that we have a big systems group that covers many aspects of security (language security, verification, and network security) already. We've also had strong demand from students and industry for research in HCI, which will complement our info-viz efforts (and even our data mining work).

For more details on how to apply, see our ad. We'll start reviewing applications after Dec 1. Feel free to email me if you have questions about the slots (but don't send me your application material - send it in directly).

Disclaimer: the above views are my own personal views, and don't represent the views of the department or the hiring subcommittees.


Monday, October 01, 2012

Things a _____ should do at least once...

Without quite realizing it, I managed to create a (tiny) meme in the rarefied circles of TCS/math with my post "Things a TCSer should have done at least once".

Firstly, you should check out the G+ post for even more scathing commentary on my original post.

Next, you should see the followups (in case you haven't already):
Let me know if there are more - I'm still waiting for a quantum computing version. (Thanks, Pontiff!)

Wednesday, September 19, 2012

FOCS 2012 Student Support

I've been asked by Rafail Ostrovsky and David Shmoys to highlight that student support for travelling to FOCS 2012 is still available, and the deadline is this Friday.

They note that they can fund 24-25 people, so people in the US should apply EVEN IF they don't have a paper.

The NSF has been very generous of late in supporting student travel to conferences, and the best way to encourage them to continue is to show that we use the money they give us :). My students have taken advantage of these opportunities in the past, and it's been a great experience for them.

So if you're a student reading this (and I know there are many of you!) or if you're an advisor who may not have the funding to send your student(s) to FOCS, do take advantage of this chance.


Sunday, September 09, 2012

Things a TCSer should have done at least once

There are many discussions about what things a TCS grad student should know. I thought it might be useful instead to list out some things a theoretician should have done at some point early in their career.

Rules of the game:
  • You lose points if you do it as part of a class. If you decide to round an LP in a class on approximations, that's something you're being taught to do. But if you do it as part of solving a problem, then you've also done the difficult job of recognizing that an LP needs to be rounded. That latter skill demonstrates some mastery.
  • The goal is not complete coverage of all of TCS; rather, it's coverage of techniques that will come up fairly often, no matter what kinds of problems you look at.
  • This list is necessarily algorithms-biased. I doubt you'll need many of these if you're doing (say) structural complexity. 
  • A similar caveat applies to classical vs quantum computing. While it seems more and more important even for classical computations that one knows a little quantumness, I don't know enough about quantum algorithm design to add the right elements. Comments ? 
With those caveats out of the way, here's my list:
  • Show that a problem is NP-hard (for bonus points, from some flavor of SAT via gadgets)
  • Show a hardness of approximation result (bonus points for going straight from a PCP)
  • Prove a lower bound for a randomized algorithm
  • Prove a lower bound via communication complexity or even information theory
  • Round an LP (bonus points for not just doing the obvious rounding)
  • Show an integrality gap for an LP
  • Design a primal-dual algorithm
  • Use projective duality to solve a problem (bonus points for using convex duality)
  • Apply a Chernoff bound (bonus for using negative dependence, Janson's inequality, and an extra bonus for Talagrand's inequality)
  • Design an FPT algorithm (maybe using treewidth, bonus for using bidimensionality or kernelization)
  • Design a nontrivial exponential time algorithm (i.e. an algorithm that doesn't just enumerate, but does something clever)
  • Do an amortized analysis (for extra bonus get a $\log^* n$ bound)
  • Use an advanced data structure - something beyond van Emde Boas trees (extra bonus for exploiting word-size)
  • Invoke VC dimension to solve a problem
What else ? 

Tuesday, August 28, 2012

FOCS 2012 Call for Posters

Four times makes it an institution ? 

After the poster events at STOC 2011, STOC 2012 and SoCG 2012, FOCS 2012 makes it four times. Tim Roughgarden notes that the deadline for submitting posters to FOCS 2012 is in two weeks (Sep 11). So if you have interesting work you'd like to air out in the theory community, and a deep longing to visit New Brunswick, New Jersey, then submit your entry.

By this time, there's a good chance that you've already experienced the posters event either as presenter or audience at one of these venues. If not, I'll reiterate what I've said before: presenting a poster is a great way to disseminate your work with more personalized interaction than you often get with a paper presentation.
