Thursday, December 11, 2014

Accountability in data mining

For a while now, I've been thinking about the endgame in data mining. Or more precisely, about a world where everything is being mined for all kinds of patterns, and the contours of our life are shaped by the patterns learnt about us.

We're not too far from that world right now - especially if you use intelligent mail filtering, or get your news through social media. And when you live in such a world, the question is not "how do I find patterns in data", but rather
How can I trust the patterns being discovered ? 
When you unpack this question, all kinds of questions come to mind: How do you verify the results of a computation ? How do you explain the results of a learning algorithm ? How do you ensure that algorithms are doing not just what they claim to be doing, but what they should be doing ?

The problem with algorithms is that they can often be opaque: but opacity is not the same as fairness, and this is now a growing concern amongst people who haven't yet drunk the machine learning kool-aid.

All of this is a lead up to two things. Firstly, the NIPS 2014 workshop on Fairness, Accountability and Transparency in Machine Learning, organized by +Moritz Hardt and +Solon Barocas and written about by Moritz here.

But secondly, it's about work that +Sorelle Friedler+Carlos Scheidegger and I are doing on detecting when algorithms run afoul of a legal notion of bias called disparate impact. What we're trying to do is pull out from a legal notion of bias something more computational that we can use to test and certify whether algorithms are indulging in legally proscribed discriminatory behavior.

This is preliminary work, and we're grateful for the opportunity to air it out in a forum perfectly designed for it.

And here's the paper:
What does it mean for an algorithm to be biased?
In U.S. law, the notion of bias is typically encoded through the idea of disparate impact: namely, that a process (hiring, selection, etc) that on the surface seems completely neutral might still have widely different impacts on different groups. This legal determination expects an explicit understanding of the selection process. 
If the process is an algorithm though (as is common these days), the process of determining disparate impact (and hence bias) becomes trickier. First, it might not be possible to disclose the process. Second, even if the process is open, it might be too complex to ascertain how the algorithm is making its decisions. In effect, since we don't have access to the algorithm, we must make inferences based on the data it uses. 
We make three contributions to this problem. First, we link the legal notion of disparate impact to a measure of classification accuracy that while known, has not received as much attention as more traditional notions of accuracy. Second, we propose a test for the possibility of disparate impact based on analyzing the information leakage of protected information from the data. Finally, we describe methods by which data might be made "unbiased" in order to test an algorithm. Interestingly, our approach bears some resemblance to actual practices that have recently received legal scrutiny.
In one form or another, most of my recent research touches upon questions of accountability in data mining. It's a good thing that it's now a "trend for 2015". I'll have more to say about this in the next few months, and I hope that the issue of accountability stays relevant for more than a year.

Friday, December 05, 2014

Experiments with conference processes

NIPS is a premier conference in machine learning (arguably the best, or co-best with ICML). NIPS has also been a source of interesting and ongoing experiments with the process of reviewing.

For example, in 2010 Rich Zemel, who was a PC chair of NIPS at the time, experimented with a new system he and Laurent Charlin were developing that would determine the "fit" between a potential reviewer and a submitted paper. This system, called the Toronto Paper Matching System, is now being used regularly in the ML/vision communities.

This year, NIPS is trying another experiment. In brief,

10% of the papers submitted to NIPS were duplicated and reviewed by two independent groups of reviewers and Area Chairs.
And the goal is to determine how inconsistent the reviews are, as part of a larger effort to measure the variability in reviewing. There's even a prediction market set up to guess what the degree of inconsistency will be. Also see Neil Lawrence's fascinating blog describing the mechanics of constructing this year's PC and handling the review process.

I quite like the idea of 'data driven' experiments with format changes. It's a pity that we didn't have a way of measuring the effect of having a two-tier committee for STOC a few years ago, and instead had to rely on anecdotes about its effectiveness, and didn't even run the experiment long enough to collect enough data. I feel that every time there are proposals to change anything about the theory conference process, the discussion gets drowned out in a din of protests, irrelevant anecdotes, and opinions based entirely on ... nothing..., and nothing ever changes.

Maybe there's something about worst-case analysis (and thinking) that makes change hard :).

Disqus for The Geomblog