Tuesday, November 06, 2012

On the elections, Nate Silver, and lessons for data mining

One interesting side story from this election has been the intense focus on Nate Silver's election predictions, and the matter of aggregate polling statistics. While there's certainly a partisan element to much of the discussion, there's also a larger sense of unease about what the predictions are actually saying.

There are lessons here for the greater goal of using data mining for prediction and modelling, and this will get more and more important the more predictive analytics intrude on our lives.

People don't quite understand probability, even though they understand odds. 
There's a lot of confusion about what it means for a one-time event to have a probability associated with it, even though people are quite comfortable with the idea of odds in (for example) sports. This to me reflects a deeper confusion between the frequentist and Bayesian view of probability: namely, is probability the long-run average of the frequency of an event, or an a priori expression of uncertainty in the likelihood of an event.

I'm not going to say anything here that hasn't been said for more than a century, but from the point of view of interpreting the results of mining, this distinction will be important.

Aggregation is a GOOD THING. 
It is entirely likely that all the poll aggregators will have egg on their face tomorrow when the results come in. But it does seem that aggregation helps smooth out the natural irregularities and biases in individual polls. This is a theme that comes up time and again in data mining, and especially in clustering: rather than trusting a single algorithm, you should try to run algorithms that return diverse answers and aggregate them in some fashion.

It's not enough to predict: you must also explain. 
Among the good reasons to feel uneasy about the aggregate predictions is that they don't give much insight into why things are going the way they are. To be fair, Nate Silver laid out some economic markers back in the spring, and tied them to possible outcomes via regression. While this provides some  insight, it's still pattern matching as opposed to a mechanism.

More generally, it is very difficult to convince people that an answer is pertinent or "correct" unless there's some way to explain the answer without listing a sequence of coefficients or writing down a collection of centers. Much of the popularity of decisions trees comes from the ease of explanation it seems to provide.

Conclusion. 

Most of the controversy around data mining in the public sphere has centered around privacy issues. Indeed, privacy issues become a concern precisely because people worry that the mining algorithms are too accurate. But in fact we don't really understand the behavior of many algorithms that we use, and that is dangerous regardless of privacy concerns. The angst over the methods used to predict this election are an illustration of what will happen when the predictions we make start to matter, and matter to many people.

No comments:

Post a Comment

Disqus for The Geomblog