The Geomblog: 2014

Thursday, December 11, 2014

Accountability in data mining

For a while now, I've been thinking about the endgame in data mining. Or more precisely, about a world where everything is being mined for all kinds of patterns, and the contours of our life are shaped by the patterns learnt about us.

We're not too far from that world right now - especially if you use intelligent mail filtering, or get your news through social media. And when you live in such a world, the question is not "how do I find patterns in data", but rather

How can I trust the patterns being discovered ?

When you unpack this question, all kinds of questions come to mind: How do you verify the results of a computation ? How do you explain the results of a learning algorithm ? How do you ensure that algorithms are doing not just what they claim to be doing, but what they should be doing ?

The problem with algorithms is that they can often be opaque: but opacity is not the same as fairness, and this is now a growing concern amongst people who haven't yet drunk the machine learning kool-aid.

All of this is a lead up to two things. Firstly, the NIPS 2014 workshop on Fairness, Accountability and Transparency in Machine Learning, organized by +Moritz Hardt and +Solon Barocas and written about by Moritz here.

But secondly, it's about work that +Sorelle Friedler, +Carlos Scheidegger and I are doing on detecting when algorithms run afoul of a legal notion of bias called disparate impact. What we're trying to do is pull out from a legal notion of bias something more computational that we can use to test and certify whether algorithms are indulging in legally proscribed discriminatory behavior.

This is preliminary work, and we're grateful for the opportunity to air it out in a forum perfectly designed for it.

And here's the paper:

What does it mean for an algorithm to be biased?
In U.S. law, the notion of bias is typically encoded through the idea of disparate impact: namely, that a process (hiring, selection, etc) that on the surface seems completely neutral might still have widely different impacts on different groups. This legal determination expects an explicit understanding of the selection process.

If the process is an algorithm though (as is common these days), the process of determining disparate impact (and hence bias) becomes trickier. First, it might not be possible to disclose the process. Second, even if the process is open, it might be too complex to ascertain how the algorithm is making its decisions. In effect, since we don't have access to the algorithm, we must make inferences based on the data it uses.

We make three contributions to this problem. First, we link the legal notion of disparate impact to a measure of classification accuracy that while known, has not received as much attention as more traditional notions of accuracy. Second, we propose a test for the possibility of disparate impact based on analyzing the information leakage of protected information from the data. Finally, we describe methods by which data might be made "unbiased" in order to test an algorithm. Interestingly, our approach bears some resemblance to actual practices that have recently received legal scrutiny.
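To make "widely different impacts on different groups" concrete, here is a minimal sketch that compares selection rates between two groups using outcome data alone. The data, the group labels and the function names are hypothetical. The 0.8 threshold is the "four-fifths" rule of thumb from U.S. EEOC guidelines, brought in here as an assumption; the abstract does not say which measure the paper uses.

```python
from collections import Counter

# Assumption: the EEOC "four-fifths" rule of thumb, not something stated in the post.
FOUR_FIFTHS = 0.8


def selection_rates(outcomes):
    """Map each group to the fraction of its members that were selected.

    outcomes is an iterable of (group, was_selected) pairs.
    """
    totals, selected = Counter(), Counter()
    for group, was_selected in outcomes:
        totals[group] += 1
        if was_selected:
            selected[group] += 1
    return {group: selected[group] / totals[group] for group in totals}


def impact_ratio(outcomes, protected, reference):
    """Selection rate of the protected group divided by that of the reference group."""
    rates = selection_rates(outcomes)
    return rates[protected] / rates[reference]


# Hypothetical outcomes: 6 of 10 selected in group A, 3 of 10 in group B.
outcomes = ([("A", True)] * 6 + [("A", False)] * 4
            + [("B", True)] * 3 + [("B", False)] * 7)

ratio = impact_ratio(outcomes, protected="B", reference="A")
flagged = ratio < FOUR_FIFTHS
print(f"impact ratio = {ratio:.2f}, flagged = {flagged}")
```

Note that this only looks at outcomes; it says nothing about how the process arrived at them, which is exactly the situation described above when the algorithm itself cannot be inspected.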

In one form or another, most of my recent research touches upon questions of accountability in data mining. It's a good thing that it's now a "trend for 2015". I'll have more to say about this in the next few months, and I hope that the issue of accountability stays relevant for more than a year.

Friday, December 05, 2014

Experiments with conference processes

NIPS is a premier conference in machine learning (arguably the best, or co-best with ICML). NIPS has also been a source of interesting and ongoing experiments with the process of reviewing.

For example, in 2010 Rich Zemel, who was a PC chair of NIPS at the time, experimented with a new system he and Laurent Charlin were developing that would determine the "fit" between a potential reviewer and a submitted paper. This system, called the Toronto Paper Matching System, is now being used regularly in the ML/vision communities.

This year, NIPS is trying another experiment. In brief,

10% of the papers submitted to NIPS were duplicated and reviewed by two independent groups of reviewers and Area Chairs.

And the goal is to determine how inconsistent the reviews are, as part of a larger effort to measure the variability in reviewing. There's even a prediction market set up to guess what the degree of inconsistency will be. Also see Neil Lawrence's fascinating blog describing the mechanics of constructing this year's PC and handling the review process.

I quite like the idea of 'data driven' experiments with format changes. It's a pity that we didn't have a way of measuring the effect of having a two-tier committee for STOC a few years ago, and instead had to rely on anecdotes about its effectiveness, and didn't even run the experiment long enough to collect enough data. I feel that every time there are proposals to change anything about the theory conference process, the discussion gets drowned out in a din of protests, irrelevant anecdotes, and opinions based entirely on ... nothing..., and nothing ever changes.

Maybe there's something about worst-case analysis (and thinking) that makes change hard :).

Thursday, November 20, 2014

Open access, ACM and the Gates Foundation.

Matt Cutts, in an article on the new Gates Foundation open access policy (ht +Fernando Pereira) says that

while the ACM continues to drag its heels, the Gates Foundation has made a big move to encourage Open Access...

Which got me thinking. Why can't the ACM use this policy as a guideline to encourage open access ? Specifically,

Announce that from now on, it will subsidize/support the open access fees paid by ACM members
(partially) eat the cost of publication in ACM publications (journals/conferences/etc)
Use the resulting clout to negotiate cheaper open access rates with various publishers in exchange for supporting open access fees paid to those journals.

Of course this would put the membership side of ACM at odds with its publication side, which maybe points to another problem with ACM having these dual roles.

Monday, October 06, 2014

Algorithms is the new Algorithms...

Hal Daumé wrote a rather provocative post titled ‘Machine Learning is the new algorithms’, and has begged someone to throw his gauntlet back at him. Consider it now thrown !

His thesis is the following quote:

Everything that algorithms was to computer science 15 years ago, machine learning is today

And among the conclusions he draws is the following:

we should yank algorithms out as an UG requirement and replace it with machine learning

Having spent many years having coffees and writing papers with Hal, I know that he truly does understand algorithms and isn’t just trying to be a troll (at least I hope not). So I’m trying to figure out exactly what he’s trying to say. It will help if you read his article first before returning…

First off, I don’t understand the conclusion. Why not (say) replace architecture with ML, or databases with ML ? Or why replace anything at all ? The assumption is that ML is a core tool that students will need more than any other topic. Now I have no problem with adding ML to the list of “things a CS person ought to know”, but I don’t see why it’s not important for a CS person to understand how a database works, or how a compiler processes code, or even how we design an algorithm. This fake mutual exclusiveness appears to be without basis.

But maybe he’s saying that algorithms and ML are two flavors of the same object, and hence one can be replaced by the other. If so, what exactly is that object ? In his view, that object is:

given an input, what’s the best way to produce an (optimal) output ?

This seems to be an overly charitable view of ML. In ML, the goal is to, well, learn. Or to be more stodgy about it, ML provides tools for making systematic inferences and predictions from data.

Which suggests that the concerns are fundamentally orthogonal, not in opposition (and Sasho Nikolov makes this point very well in a comment on Hal’s post). As Hal correctly argues, the hard work in ML is front-loaded: modeling, feature extraction, and the like. The algorithm itself is mostly an afterthought.

But what’s ironic is that one of the most important trends in ML of late is the conversion of an ML problem to an optimization problem. The goal is to make good modeling choices that lead to an optimization problem that can be solved well. But wait: what do you need to know to solve an optimization problem ? Wait for it…… ALGORITHMS !!

The argument about stability in algorithms is an odd one, especially coming from someone who’s just written a book on ML. Yes, some core algorithms techniques haven’t changed much in the last many years, but did you see that 2014 paper on improvements in recurrence analysis ? Or the new sorting algorithm by Mike Goodrich ? or even the flurry of new results for Hal’s beloved flow problems ?

As for “everything’s in a library”, I’ll take your boost graph library and give you back WEKA, or libSVM, or even scikit-learn. I don’t need to know anything from Hal’s book to do some basic futzing around in ML using libraries: much like I could invoke some standard hashing subroutine without knowing much about universal hash families.

In one sense though, Hal is right: ML is indeed where algorithms was 15 years ago. Because 15 years ago (approximately) is when the streaming revolution started, and with it the new interest in sublinear algorithms, communication complexity, matrix approximations, distributed algorithms with minimal communication, and the whole “theory of big data” effort. And ML is now trying to catch up to all of this, with some help from algorithms folks :).

What is true is this though: it wouldn’t hurt us to revisit how we construct the core algorithms classes (undergrad and grad). I realize that CLRS is the canon, and it would cause serious heart palpitations to contemplate not stuffing every piece of paper of that book down students’ throats, but maybe we should be adapting algorithms courses to the new and exciting developments in algorithms itself. I bring in heavy doses of approximation and randomization in my grad algorithms class, and before we forked off a whole new class, I used to teach bits of streaming, bloom filters, min-wise hashing and the like as well. My geometry class used to be all about the core concepts from the 3 Marks book, but now I throw in lots of high dimensional geometry, approximate geometry, kernels and the like.

Ultimately, I think a claim of fake mutual exclusivity is lazy, and ignores the true synergy between ML and algorithms. I think algorithms research has a lot to learn from ML about robust solution design and the value of "noise-tolerance", and ML has plenty to learn from algorithms about how to organize problems and solutions, and how deep dives into the structure of a problem can yield insights that are resilient to changes in the problem specification.

Wednesday, October 01, 2014

On the master theorem vs Akra-Bazzi

Everyone knows the master theorem.

Or at least everyone reading this blog does.

And I'm almost certain that everyone reading this blog has heard of the generalization of the master theorem due to Akra and Bazzi. It's particularly useful when you have recurrences of the form

$$ T(n) = \sum_i a_i T(n/b_i) + g(n) $$

because like the master theorem it gives you a quick way to generate the desired answer (or at least a guess that you can plug in to the recurrence to check).

(And yes, I'm aware of the generalization of A/B due to Drmota and Szpankowski)

When I started teaching grad algorithms this fall, I was convinced that I wanted to teach the Akra-Bazzi method instead of the master theorem. But I didn't, and here's why.

Let's write down the standard formulation that the master theorem applies to

$$ T(n) = a T(n/b) + f(n) $$

This recurrence represents the "battle" between the two terms involved in a recursive algorithm: the effort involved in dividing (the $a T(n/b)$) and the effort involved in putting things back together (the $f(n)$).

And the solution mirrors this tension: we look at which term is "stronger" and therefore dominates the resulting running time, or what happens when they balance each other out. In fact this is essentially how the proof works as well.

I have found this to be a useful way to make the master theorem "come alive" as it were, and allow students to see what's likely to happen in a recurrence without actually trying to solve it. And this is very valuable, because it reinforces the point I'm constantly harping on: that the study of recurrences is a way to see how to design a recursive algorithm. That decimation as a strategy can be seen to work just by looking at the recurrence. And so on.
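The "battle" can be sketched in a few lines for the common special case where the combine cost is a polynomial, $f(n) = \Theta(n^d)$. That restriction, the function name and the case labels are assumptions made for illustration; the general theorem covers more driving functions than this.

```python
import math


def master_theorem(a, b, d):
    """Classify T(n) = a*T(n/b) + Theta(n^d), assuming a >= 1, b > 1, d >= 0.

    Returns (which term wins, exponent of n in the solution).
    """
    # The recursion tree has n^(log_b a) leaves: the strength of the a*T(n/b) term.
    critical = math.log(a) / math.log(b)
    if math.isclose(d, critical, abs_tol=1e-12):
        return ("balanced", d)                 # Theta(n^d log n)
    if d > critical:
        return ("combine dominates", d)        # Theta(n^d)
    return ("recursion dominates", critical)   # Theta(n^(log_b a))


print(master_theorem(2, 2, 1))  # merge sort: T(n) = 2T(n/2) + n
print(master_theorem(1, 2, 1))  # decimation: T(n) = T(n/2) + n
print(master_theorem(4, 2, 1))  # T(n) = 4T(n/2) + n
```

The decimation case shows the point about design: throwing away half the input at linear cost leaves the linear term in charge, so the whole algorithm is linear.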

But the Akra-Bazzi method, even though it's tremendously powerful, admits no such easy intuition. The bound comes from solving the equation

$$ \sum_i a_i b_i^{-p} = 1 $$ for $p$, and this is a much more cryptic expression to parse. And the proof doesn't help make it any less cryptic.

Which is not to say you can't see how it works with sufficient experience. But that's the point: with sufficient experience. From a purely pedagogical perspective, I'd much rather teach the master theorem so that students get a more intuitive feel for recurrences, and then tell them about A/B for cases (like in median finding) where the master theorem can only provide an intuitive answer and not a rigorous one.
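For readers who want to experiment, here is a rough sketch of finding the Akra-Bazzi exponent numerically. With the $T(n/b_i)$ convention of the recurrence above (each $b_i > 1$), the exponent $p$ satisfies $\sum_i a_i b_i^{-p} = 1$, and the left-hand side is strictly decreasing in $p$, so bisection works. The function name is made up, and the median-finding recurrence $T(n) = T(n/5) + T(7n/10) + O(n)$ is the standard median-of-medians one, not something spelled out in this post.

```python
def akra_bazzi_exponent(terms, tol=1e-12):
    """Solve sum(a_i * b_i**(-p)) = 1 for p by bisection.

    terms is a list of (a_i, b_i) pairs with a_i > 0 and b_i > 1,
    describing T(n) = sum_i a_i * T(n / b_i) + g(n).
    """
    def h(p):
        return sum(a * b ** (-p) for a, b in terms)

    # h is strictly decreasing in p, so widen the bracket until h(lo) >= 1 >= h(hi).
    lo, hi = -1.0, 1.0
    while h(lo) < 1.0:
        lo *= 2.0
    while h(hi) > 1.0:
        hi *= 2.0

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if h(mid) > 1.0:
            lo = mid   # the root lies to the right of mid
        else:
            hi = mid   # the root lies at or to the left of mid
    return (lo + hi) / 2.0


# Sanity check against the master theorem: T(n) = 2T(n/2) + n has p = 1.
p_mergesort = akra_bazzi_exponent([(2, 2)])

# Median of medians: T(n) = T(n/5) + T(7n/10) + O(n), i.e. b = 5 and b = 10/7.
p_median = akra_bazzi_exponent([(1, 5), (1, 10 / 7)])
print(p_mergesort, p_median)
```

For median finding the exponent comes out strictly below 1, so the linear $g(n)$ term dominates and the recurrence solves to $\Theta(n)$, which is the rigorous version of what the master-theorem intuition suggests.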

Friday, September 26, 2014

STOC 2015 Deadline: Nov 4, 2014

Via Ronitt Rubinfeld comes word that the STOC 2015 CFP is out.

Submission deadline: Tuesday, November 4, 2014, 3:59pm EST

Conference: June 14-17 2015, Portland, Oregon (part of FCRC)

Saturday, August 23, 2014

LaTeX is code...

I'm giving a talk on LaTeX this Monday as part of our new grad student "boot camp" series. It's really more of an interactive presentation: I'll use writelatex (or sharelatex) to demo examples, give students simple assignments, and use real-time chat to see how things are going. It should be quite interesting.

Here's the talk announcement:

Did you know that every time you use $..$ to italicize text, or use \\ to force a newline, Leslie Lamport cries out in agony and Don Knuth starts another fascicle of Volume IV of TAoCP ?

Come and learn about how to use LaTeX, and use it well. Your collaborators will love you, your advisors will love me, and you'll realize that the most awfully written drivel looks awesome when typeset well.

This will be interactive!! I'll be using a shared space for editing and viewing latex documents, and there will be class activities, so please do bring your laptops/tablets/other editing device so you can follow along and participate.

For this talk I solicited comments from colleagues as to what they'd like their students to learn. Probably the most useful comment I got was from +Robert Ricci and +Eric Eide: to wit,

LaTeX is code.

This might seem obvious, but once you internalize it, all kinds of other things become very natural. For example

You should really use some kind of IDE to write and build your documents
Version control is your friend
Sections should be separate files.
Text should be readable.
Use macros where convenient
Don't reinvent: use the many many built-in packages at ctan.org
Use tex.stackexchange.com to learn how to hack whatever you need in LaTeX.

A corollary: to see a theoretician editing LaTeX close to a STOC/FOCS/SODA deadline is to realize that theory folk are AWESOME programmers.

Tuesday, August 19, 2014

Long Live the Fall Workshop (guest post by Don Sheehy)

An announcement for the Fall Workshop in Computational Geometry, by Don Sheehy.

In all the conversation about SoCG leaving the ACM, there were many discussions about ownership, paywalls, and money. This leads naturally to questions of ideals. What can and ought a research community be like? What should it cost to realize this? Isn't it enough to bring together researchers in an unused lecture hall at some university somewhere, provide coffee (and wifi), and create a venue for sharing problems, solutions, and new research in an open and friendly atmosphere? There is a place for large conferences, with grand social events (Who will forget the boat cruise on the Seine at SoCG 2011?), but there is also a place for small meetings run on shoestring budgets that are the grassroots of a research community.

The Fall Workshop on Computational Geometry is such a meeting. It started in 1991, at SUNY Stony Brook and has been held annually every fall since. I first attended a Fall Workshop during my first year of graduate school, back in 2005. This year marks the 24th edition of the workshop, and this time, I will be hosting it at the University of Connecticut. It is organized as a labor of love, with no registration fees. There are no published proceedings and it is a great opportunity to discuss new work and fine-tune it in preparation for submission. It is perfectly timed to provide a forum for presenting and getting immediate feedback on your potential SoCG submissions. I cordially invite you to submit a short abstract to give a talk and I hope to see you there.

Important dates:

Submission deadline: Oct 3 midnight (anywhere on earth)

Conference: Oct 31-Nov 1, 2014.

Wednesday, August 13, 2014

Interdisciplinary research and the intellectual richness of data analysis

Slides on a brief introduction to themes in machine learning from an algorithms perspective, and some thoughts on the mathematical richness of the study of data.

A brief note on Fano's inequality

I've been bumping into Fano's inequality a lot lately, and have found the various explanations on the web somewhat lacking. Not because they aren't useful, but because their perspective is very different to the kind that I'd prefer as an algorithms person.

So after grumbling and mumbling and complaining, I decided the only solution was to write my own ! And here it is, as a raindrop.

Eh ? What's that you say ? And here we're just getting used to twitter ?

Raindrops are a web publishing form designed by the company run by our very own V. Vinay. When I was in India last I visited him in Bangalore, and he showed me the system. It's a nice way to make presentations or short lectures.

The raindrop I created is embedded directly in my blog, but can also be viewed directly at this link. I hope you like the medium, and the content !

Sunday, July 27, 2014

A response to Vint Cerf

A somewhat grumpy response to +Vint Cerf's request for feedback on how the ACM can do more for "professional programmers".

Dear Dr. Cerf
In your recent letter to the members of ACM, you write "I would like to ask readers how they satisfy their need to keep informed about computing practices and research results that may influence their own work". While I suspect your goal is to understand how ACM can serve the larger tech community and not the research community and I am a card-carrying member of the latter group, I thought I'd respond anyway.

First up, it's an ambitious (and brave!) idea to think that the ACM (or any single entity for that matter) can serve the needs of the vast technology enterprise. There was probably a time before the web when professional societies played an important role in collecting together people with shared interests and disseminating valuable information out to interested individuals. But now we have your current employer ! and online communities galore ! and Quora ! and the Stackexchange ecosystem ! and so many different ways for people to build communities, share information and learn about new ideas percolating through the world of tech.

It's a little funny though that you're worried about ACM's presence in the professional world. Many of us have long assumed that ACM spends most of its focus on that side of the computing community (the excellent revamp of the CACM under +Moshe Vardi being the exception that proved the rule). In fact, I'd go as far as to argue that the ACM would be much better served if it were instead to realize how it's driving itself into irrelevance in a research community that so desperately needs an institutional voice.

How do we satisfy our need to keep informed about results that might influence our work ? We (still) read papers and go to conferences. And how does the ACM help ? Well not very well.

Aggregating the deluge of information: anyone will tell you that the amount of research material to track and read has grown exponentially. But we still, to this day, have nothing like PUBMED/MEDLINE as a central clearinghouse for publications in CS-disciplines. The ACM DL is one step towards this, but it's a very poor imitation of what a 21st century repository of information should look like. It's not comprehensive, its bibliographic data is more erroneous than one expects, and the search mechanisms are just plain depressing (it's much easier to use Google)
Dealing with the changing nature of peer review and publication: Sadly, ACM, rather than acting like a society with its members' interests at heart, has been acting as a for-profit publisher with some window dressing to make it look less execrable. Many people have documented this far more effectively than I ever could.
Conference services: Among the services a national organization supposedly provides are the conference services that help keep communities running. But what exactly does the ACM do ? It sits back and nitpicks conference budgets, but provides little in the way of real institutional support. There's no infrastructure to help with conference review processes, no support for at-conference-time services like social networking, fostering online discussion and communities, and even modern web support. I only bring this up because all of these services exist, but piecemeal, and outside the ACM umbrella.

Underneath all of this is a slow but clear change in the overall CS research experience. The CRA has been doing yeoman service here: taking the temperature of the community every year with the Taulbee surveys, putting out a best practices document for postdocs after extensive community discussion, and even forming action groups to help gain more support for CS research from the government. Does the ACM do any of this ?

In many ways, this is a golden era for computer science, as the fruits of decades of work in our field seep out into the larger world under the guise of computational thinking, big data and learning. It's a perfect time for an organization that has deep connections in both the academic side of CS and the industrial side to help with the translation and tech transfer needed to maximize the impact of the amazing new technologies we're all developing, as well as reach out to similar institutions in other areas to bring more CS into their communities (as you rightly pointed out)

But there is no sign that ACM has ideas about how to do this or even wants to. And while it continues to chase a professional tech community that doesn't seem to care about it at all, the academics who would have cared are finding their own way.

Monday, June 09, 2014

A declaration of independence (kind of), and technology-induced disintermediation

Musings on the SoCG move to declare independence of the ACM.

It's a little odd to be sitting in Denmark while high quality rød pølse (with remoulade!) is being made across the world in Kyoto. There's an abysmal lack of blogging/tweeting/facebooking/instagramming/snapchatting/can-I-stop-now-ing from the conference.

Of course following in the lines of the complexity conference, we're attempting to declare our own independence from the ~~EEVVIIL SKELETOR AND HIS MINIONS~~ ACM. Jeff Erickson, our ~~esteeemed grand poobah~~ steering committee chair has put out a series of posts outlining the issues at hand, the history of the matter, and the current status of discussions with SIGACT and ACM. Please visit http://makingsocg.wordpress.com to read, discuss and/or heckle as you see fit.

It might be obvious, but this new wave of devolution is being driven largely by technology - most importantly the existence of LIPIcs. Having a platform for open-access publication with minimal costs exposes as false the claims of institutional providers that they alone can provide the support needed for conferences. In fact, there are a number of related claims that appear to be not true in practice (or at least are not as obviously true as once thought).

* We need the imprimatur of an institutional provider as a signalling mechanism for quality. While it might be too soon to tell if dropping affiliation with established institutions (IEEE or ACM) will affect how publications in a venue will be perceived, there's a lot of confidence in the communities that their long-established reputation will outweigh any loss of prestige.

* Institutional providers provide quality editorial and archival services necessary for a serious venue. I think juxtaposing 'quality' and 'editorial' together with Sheridan Printing might cause my two remaining readers to die of hysterical laughter. But the archival issue is a good one. LIPIcs appears to be funded solidly by the German government for a while, and will take a fixed fee from a conference for publication (capped at 750 €). But the arXiv was struggling for a while. Should we view the ACM and IEEE as "more likely to last" than any other entity ?

* Institutional providers provide the backing and clout needed for conference organization, hotel bookings and so on. This is another good example of a time-money tradeoff. Organizations like SIAM actually do take over the management of the conference: while the results are not always optimal, there is a clear reduction in hassle for the conference organizers. But organizations like the ACM don't take things over in the same way (and from what I can tell, neither does IEEE). I'm surprised though that there aren't yet lean-and-mean event planning providers that we can just pay money to and make our planning problems go away.

* Institutional providers have the financial wherewithal to manage cycles in revenue streams for a conference. This is another real issue. Conferences that have gone independent have eventually managed to maintain a steady income stream, but theory conferences are smaller and poorer: it remains to be seen whether we can generate the kind of endowment needed to insulate the community against the natural variation in revenue from year to year.

What's disappointing is that none of this had to play out this way.

Take LIPIcs for example: they clearly marked out their scope -- indexing, archiving functions and hosting -- while staying away from the more content-driven aspects of the process (editing, proofing etc). This makes a lot of sense, given that everyone who publishes knows how to use LaTeX and style files, but might still not be able to make a web page. Why couldn't the ACM have realized this and provided a slimmed-down publishing service ?
Why do we go to Microsoft (or EasyChair, or HotCRP, or Shai Halevi's software) for our conference submission servers ? If the ACM had provided a service of this kind (or even provided hosting for HotCRP/Shai's software), we'd be happily using it right now, and it could have then tied nicely into Regonline, which ACM already partners with.
A lot of the current angst seems to have tone as a root cause: a certain feeling about the provider's attitude towards the conference. This is again something that could have been recognized and addressed before things got to this stage.

While it's exciting to be part of the "academic spring" (?), people tend to forget that in all revolutions someone gets hurt, and often things don't get better for a long time. I'm intrigued by our attempt to move towards independence though, and the people involved have thought this through very carefully.

Tuesday, May 20, 2014

On beauty and truth in science.

Philip Ball writes a thought-provoking article in Aeon with the thesis that the kind of beauty that scientists describe does not necessarily match the aesthetic notions of art, and is not even consistent among scientists.

It was hard for me to get beyond the casual conflating of beauty in mathematics (the four-color theorem, the proof of Fermat's theorem, and proofs in general) and beauty in scientific theories (relativity, evolution, and so on). But if one goes beyond the artificial duality constructed by the author, the idea of beauty as a driver in science (and mathematics) is a rich one to explore.

A particular example: for a long time (and even codified in books) it was taught that there were five natural classes of approximation hardness: PTAS, constant factor-hard, log-hard, label-cover (superlogarithmic) hard, and near-linear hard. There were even canonical members of each class.

Of course, this nice classification no longer exists. There are even problems that are $\log^* n$-hard to approximate, and can also be approximated to that factor. And to be fair, I'm not sure how strong the belief was to begin with.

But it was such a beautiful idea.

At least in mathematics, the search for the beautiful result can be quite fruitful. It spurs us on to find better, simpler proofs, or even new ideas that connect many different proofs together. That notion of connection doesn't appear to be captured in the article above: that beauty can arise from the way a concept ties disparate areas together.

MADALGO Summer School on Learning At Scale

I'm pleased to announce that this year's MADALGO summer school (continuing a long line of summer programs on various topics in TCS) will be on algorithms and learning. The formal announcement is below, and registration information will be posted shortly.

Save the date ! Aug 11-14, 2014.

MADALGO Summer School on
LEARNING AT SCALE

August 11- 14, 2014, Aarhus University, Denmark

madalgo.au.dk/learningatscale2014

OVERVIEW AND GOAL
The MADALGO Summer School 2014 will introduce attendees to the latest developments in learning at scale. The topics will include high dimensional inference, algorithmic perspectives on learning and optimization, and challenges in learning with huge data.

LECTURES
The school will be taught by experts in learning:

Amr Ahmed (Google)

Mikhail Belkin (Ohio State)

Stefanie Jegelka (Berkeley)

Ankur Moitra (MIT)

PARTICIPATION
The summer school will take place on August 11-14, 2014 at the Center for Massive Data Algorithmics (MADALGO) at the Department of Computer Science, Aarhus University, Denmark. The school is targeted at graduate students, as well as researchers interested in an in-depth introduction to Learning. Registration will open soon at the school webpage. Registration is free on a first-come, first-served basis - handouts, coffee breaks, lunches and a dinner will be provided by MADALGO and Aarhus University.

ORGANIZING COMMITTEE

Suresh Venkatasubramanian (University of Utah)

Peyman Afshani (MADALGO, Aarhus University)

Lars Arge (MADALGO, Aarhus University)

Gerth S. Brodal (MADALGO, Aarhus University)

Kasper Green Larsen (MADALGO, Aarhus University)

LOCAL ARRANGEMENTS

Trine Ji Holmgaard (MADALGO, Aarhus University)

Katrine Østergaard Rasmussen (MADALGO, Aarhus University)

ABOUT MADALGO

Center for Massive Data Algorithmics is a major basic research center funded by the Danish National Research Foundation. The center is located at the Department of Computer Science, Aarhus University, Denmark, but also includes researchers at CSAIL, Massachusetts Institute of Technology in the US, and at the Max Planck Institute for Informatics and at Frankfurt University in Germany. The center covers all areas of the design, analysis and implementation of algorithms and data structures for processing massive data (interpreted broadly to cover computations where data is large compared to the computational resources), but with a main focus on I/O-efficient, cache-oblivious and data stream algorithms.

Thursday, May 01, 2014

The history of the vector space model

Gerard Salton is generally credited with the invention of the vector space model: the idea that we could represent a document as a vector of keywords and use things like cosine similarity and dimensionality reduction to compare documents and represent them.

But the path to this modern interpretation was a lot twistier than one might think. David Dubin wrote an article in 2004 titled 'The Most Influential Paper Gerard Salton Never Wrote'. In it, he points out that most citations that refer to the vector space model refer to a paper that doesn't actually exist (hence the title). Taking that as a starting point, he then traces the lineage of the ideas in Salton's work.

The discoveries he makes are quite interesting. Among them,

Salton's original conception of the vector space model was "operational" rather than mathematical. In other words, his earliest work really uses 'vector space' to describe a collection of tuples, each representing a document. In fact, the earliest terminology used was the 'vector processing model'.
In later papers, he did talk about things like orthogonality and independence, as well as taking cosines for similarity, but this was done in an intuitive, rather than formal manner.
It was only after a series of critiques in the mid 80s that researchers (Salton included) started being more direct in their use of the vector space model, with all its attendant algebraic properties.

Of course today the vector space model is one of the first things we learn when doing any kind of data analysis. But it's interesting to see that it didn't start as this obvious mathematical representation (that I've taken to calling the reverse Descartes trick).
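For concreteness, here is a minimal sketch of the modern reading of the model: documents become keyword-count vectors and are compared by cosine similarity. The toy documents, the naive whitespace tokenizer and the function names are illustrative assumptions, not anything taken from Salton's or Dubin's papers.

```python
import math
from collections import Counter

def term_vector(text):
    """Map a document to a sparse keyword-count vector (naive whitespace tokenizer)."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(count * v.get(term, 0) for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus.
docs = {
    "a": "vector space model for document retrieval",
    "b": "document retrieval with the vector processing model",
    "c": "sweeping orchestral music",
}
vectors = {name: term_vector(text) for name, text in docs.items()}
sim_ab = cosine_similarity(vectors["a"], vectors["b"])  # four shared keywords
sim_ac = cosine_similarity(vectors["a"], vectors["c"])  # no shared keywords
```

Documents that share keywords end up at a small angle to each other, and documents with no keywords in common are orthogonal. That geometric reading is exactly the part that, per Dubin, took years to become explicit.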

Wednesday, April 30, 2014

SIAM Data Mining 2014: On differential privacy

After my trip to Haverford, I attended the SIAM Data Mining (SDM) conference in Philly. For those who aren't that familiar with the data mining universe, SDM is the SIAM entrant in the data mining conference sweepstakes, along with ACM (KDD) and IEEE (ICDM). SDM is probably also the smallest of the three venues, which makes it comparable in feel to SODA (also because of SIAM organization). The conference attracts the usual data mining suspects, but also more of the applied math folks.

I was the tutorials chair this year, and there were a number of very well-attended tutorials ranging from applications to core mining to theory. In particular, +Moritz Hardt and +Aleksandar Nikolov did a very nice tutorial on differential privacy entitled 'Safer Data Mining'.

SDM is a good venue for theory folks wanting to "test the waters" with data mining: the papers are consistently more mathematically oriented and less "business-heavy", and it's a friendly crowd :).

Shameless plug: I'm the PC co-chair next year along with Jieping Ye and I'd encourage more algorithms folks to submit, and visit Vancouver in April.

In a future post I'll talk more about a panel I also ran at the conference titled 'Ethics in Data Mining'.

Tuesday, April 22, 2014

The Shape of Information

A brief synopsis of my general-audience talk at Haverford College.

I'm currently visiting Haverford College at the invitation of +Sorelle Friedler as part of Haverford's big data lecture series. Today's talk was a general audience talk about data mining, titled 'The Shape Of Information': (GDrive link)

The Shape Of Information

What makes data mining so powerful, and so ubiquitous? How can the same set of techniques identify patients at risk for a rare genetic disorder, consumers most likely to like Beyonce's latest album, or even a new star from a sky survey ?

The answer starts with an idea Descartes had nearly 400 years ago. He suggested expressing geometry in terms of numbers (coordinates). This turned out to be a powerful technique that led (among other things) to the development of the calculus. Data mining returns the favor. It starts with sets of numbers that describe a collection of objects. To find patterns in these objects, we create a geometry in which the numbers are coordinates. And just like that, objects become shapes, and the search for information becomes a quest for common structure in these shapes.

In this search, we are not limited by the geometry of our world: we can dream up ever more intricate geometries that capture the shape of the information that we seek to find in our data. In this sense, data mining is the best kind of science fiction come to life: we craft a world out of our imagination, and let the laws of this world lead us to fascinating discoveries about the data that inhabits it.

I had a great time visiting with Sorelle's students in their data mining class. Haverford College continues to impress me with the quality of their undergrads (and their faculty !)

Friday, April 18, 2014

danah boyd, Randall Munroe, and netizens.

danah boyd, author of 'It's Complicated', just gave a tech talk at Google. Her book has been in the news a lot lately, so I'll skip the details (although Facebook ought to be at least slightly worried).

But what I enjoyed the most about her talk was the feeling that I was listening to a true netizen: someone who lives and breathes on the internet, understands (and has helped build) modern technology extremely well (she is a computer scientist as well as an ethnographer), and is able to deliver a subtle and nuanced perspective on the role and use of technology amidst all the technobabble (I'm looking at you, BIG data) that inundates us.

And she delivers a message that's original and "nontrivial", both about how teens use and interact with social media, and about how we as a society process technological trends and put them in the context of our lives. Her discussion of context collapse was enlightening: apart from explaining why weddings are such fraught experiences (better with alcohol!) it helped me understand instances of cognitive frisson in my own interactions.

What she shares with Randall Munroe in my mind is the ability to speak unselfconsciously and natively in a way that rings true for those of us who inhabit the world of tech, and yet articulate things that we might have felt, but are unable to put into words ourselves. Of course they're wildly different in so many other ways, but in this respect they are like ambassadors of the new world we live in.

Thursday, April 17, 2014

STOC 2014 announcement.

Howard Karloff writes in to remind everyone that the STOC 2014 early registration deadline is coming up soon (Apr 30 !). Please make sure to register early and often (ok maybe not the last part). There will be tutorials ! workshops ! posters ! papers ! and an off-off-Broadway production of Let It Go, a tragicomic musical about Dick Lipton's doomed effort to stop working on proving P = NP.

At least a constant fraction of the above statements are true.

And if you are still unconvinced, here's a picture of Columbia University, where the workshops and tutorials will take place:

Monday, April 07, 2014

Directed isoperimetry and Bregman divergences.

A new cell probe lower bound for Bregman near neighbor search via directed hypercontractivity.

Amirali Abdullah and I have been pondering Bregman divergences for a while now. You can read all about them in my previous post on the topic. While they generalize Euclidean geometry in some nice ways, they are also quite nasty. The most important nastiness that they exhibit is asymmetry:
\[ D_\phi(x, y) \ne D_\phi(y, x)\]
What's worse is that this asymmetry can grow without bound. In particular, we can quantify the degree of asymmetry by the parameter $\mu$:
\[ \mu = \max_{x,y \in \Delta} \frac{D_\phi(x,y)}{D_\phi(y,x)} \]
where $\Delta$ is the domain of interest.
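To see the asymmetry numerically, here is a small sketch using the standard definition $D_\phi(x,y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y\rangle$ with $\phi$ the negative entropy, which gives the Kullback-Leibler divergence on the simplex. The choice of $\phi$, the test points and the function names are assumptions made for illustration; the ratio computed is for a single pair, whereas $\mu$ is the maximum over the whole domain.

```python
import math

def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner

def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def grad_neg_entropy(x):
    return [math.log(xi) + 1.0 for xi in x]

def kl(x, y):
    """Bregman divergence of negative entropy (KL divergence on the simplex)."""
    return bregman(neg_entropy, grad_neg_entropy, x, y)

def asymmetry(x, y):
    """Ratio D(x, y) / D(y, x) for one pair of points."""
    return kl(x, y) / kl(y, x)

uniform = [0.5, 0.5]
ratios = []
for eps in (1e-1, 1e-3, 1e-6):
    skewed = [1.0 - eps, eps]   # moves toward the boundary of the simplex
    ratios.append(asymmetry(uniform, skewed))
```

As `eps` shrinks, the ratio keeps growing: one direction of the divergence blows up like $\log(1/\epsilon)$ while the other stays near $\log 2$. So $\mu$ is only bounded if the domain $\Delta$ is kept away from the boundary.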

There's been a ton of work on clustering with Bregman divergences, as well as our own work on low-dimensional approximate nearest neighbor search. In almost all these results, $\mu$ shows up in the bounds on the various resources (space and/or time), and it seems hard to get rid of it without making other assumptions.

After our low-D ANN result, we started trying to work on the high-dimensional version, and our thoughts turned to locality-sensitive hashing. After struggling for a while to get something that might be useful, we started pondering lower bounds. In particular, it seemed clear that the worse $\mu$ was, the harder it became to design an algorithm. So could we actually come up with a lower bound that depended on $\mu$ ?

Our first result was via a reduction from the Hamming cube (and $\ell_1$ near neighbors). While this gave us a lower bound, the bound wasn't $\mu$-sensitive. So we went looking for something stronger.

And that's where isoperimetry made an entrance. There's been a long line of results that prove lower bounds for LSH-like data structures for near neighbor search. Roughly speaking, they work in the following manner:

Construct a gap distribution: a distribution over inputs and queries such that a query point is very close to its nearest neighbor and very far from any other point in the space. In the $\ell_1$ case, what you want is something like $\epsilon d$ distance to the nearest neighbor, and $\Omega(d)$ distance to the second nearest neighbor.
Imagine your data structure to be a function over the Hamming cube (assigning blocks of points to cells) and look at what happens when you perturb it (formally, by defining the function value to be its average over all nearby points)
Use isoperimetry (or hypercontractivity) to argue that the function "expands" a lot. What this implies is that points get scattered everywhere, and so any particular cell you probe doesn't have enough information to determine where the true nearest neighbor actually is. This last step can be proved in different ways, either using Fourier methods, or via the use of Poincaré-type inequalities.
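The perturbation step above can be made concrete on a tiny cube. The sketch below assumes the usual symmetric noise operator (each bit flipped independently with a fixed probability) and computes it by brute force; the dimension, the flip probability and the choice of "cell" are hypothetical, and this only illustrates the smoothing, not the lower-bound argument itself.

```python
from itertools import product

def noise_operator(f, n, flip_prob):
    """Return T f, where (T f)(x) averages f over points y obtained by
    flipping each bit of x independently with probability flip_prob."""
    cube = list(product((0, 1), repeat=n))
    def smoothed(x):
        total = 0.0
        for y in cube:
            flips = sum(xi != yi for xi, yi in zip(x, y))
            weight = flip_prob ** flips * (1 - flip_prob) ** (n - flips)
            total += weight * f(y)
        return total
    return smoothed

def mean_and_variance(f, n):
    """Mean and variance of f under the uniform measure on {0,1}^n."""
    values = [f(x) for x in product((0, 1), repeat=n)]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

n = 4

def cell(x):
    """Indicator of one 'cell': the points whose first two bits are 0."""
    return 1.0 if x[0] == 0 and x[1] == 0 else 0.0

smoothed_cell = noise_operator(cell, n, flip_prob=0.25)
mean_before, var_before = mean_and_variance(cell, n)
mean_after, var_after = mean_and_variance(smoothed_cell, n)
```

The smoothed indicator has the same mean as the original but a much smaller variance: the mass that sat entirely inside one cell is now spread over the whole cube. The isoperimetric and hypercontractive inequalities quantify how fast this spreading happens.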

Our initial hope was to use this framework to prove our bound, while incorporating terms relating to the asymmetry of Bregman divergences more directly.

This turned out to be hard. The asymmetry destroys many of the nice algebraic properties associated with the noise operator and related inner products. There's even a deeper problem: if you think of an asymmetric noise operator as pushing things "forward" on a cube with one probability and backwards with another, then it's not hard to imagine a scenario where applying it actually ends up concentrating the mass of the function in a smaller region (which would break hypercontractivity).

We needed two things: a way to get around this not-so-small issue that hypercontractivity isn't true in a directed setting, and a way to analyze the asymmetric noise operator. It turned out that we could circumvent the first problem: we could restrict the class of hash functions we considered in a way that didn't significantly change their entropy (only a constant). Alternatively, we could view hypercontractivity as being true "on average".

The second problem was a bit harder. The solution we came up with was to try and link the directed noise operator to a symmetric operator on a biased space. The intuition here was that the directed operator was creating a nonuniform distribution, so maybe starting off with one might give us what we need.

This actually worked ! There are versions of the Bonami-Beckner inequality in biased spaces that look almost just like their counterparts in uniform space (you merely set the probability of a bit being 1 to $p \ne 1/2$, and the Fourier coefficients are defined in terms of $p$).
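As a sanity check on what "defined in terms of $p$" means, here is a brute-force sketch of the standard $p$-biased Fourier basis on $\{0,1\}^n$, where the character for coordinate $i$ is $(x_i - p)/\sqrt{p(1-p)}$. The values of `n` and `p` and the test function `f` are arbitrary assumptions; the code only checks orthonormality and Parseval, not the Bonami-Beckner inequality.

```python
import math
from itertools import combinations, product

def biased_character(subset, x, p):
    """p-biased character: product over i in subset of (x_i - p) / sqrt(p (1 - p))."""
    sigma = math.sqrt(p * (1 - p))
    value = 1.0
    for i in subset:
        value *= (x[i] - p) / sigma
    return value

def biased_prob(x, p):
    """Probability of x when each bit is 1 independently with probability p."""
    ones = sum(x)
    return p ** ones * (1 - p) ** (len(x) - ones)

def inner_product(s, t, n, p):
    """E[chi_s(x) chi_t(x)] under the p-biased measure on {0,1}^n."""
    return sum(
        biased_prob(x, p) * biased_character(s, x, p) * biased_character(t, x, p)
        for x in product((0, 1), repeat=n)
    )

def biased_coefficient(f, subset, n, p):
    """Fourier coefficient of f on `subset` with respect to the p-biased measure."""
    return sum(
        biased_prob(x, p) * f(x) * biased_character(subset, x, p)
        for x in product((0, 1), repeat=n)
    )

n, p = 3, 0.3

def f(x):
    """Hypothetical test function: 1 if at least two bits are set."""
    return 1.0 if sum(x) >= 2 else 0.0

subsets = [s for k in range(n + 1) for s in combinations(range(n), k)]
energy = sum(biased_coefficient(f, s, n, p) ** 2 for s in subsets)
second_moment = sum(biased_prob(x, p) * f(x) ** 2 for x in product((0, 1), repeat=n))
```

Setting `p = 0.5` recovers the familiar $\pm 1$ characters, and for any other $p$ the characters are still orthonormal under the biased measure, so Parseval (`energy` equals `second_moment`) continues to hold.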

There was lots more to address even after the isoperimetry result, but you'll have to read the paper for that :).

I should note that in many ways, this was a "Simons baby": Amir visited during one of the workshops for the program on real analysis, and we had fruitful conversations with Elchanan Mossel and Nathan Keller in early stages of this work.

Thursday, March 20, 2014

Data Mining, machine learning and statistics.

How does one tell data mining, machine learning and statistics apart ?

If you spend enough time wandering the increasingly crowded landscape of Big Data-istan, you'll come across the warring tribes of Datamine, MachLearn and Stat, whose constant bickering will make you think fondly of the People's Front of Judea.

Cosma Shalizi has what I think is a useful delineation of the three tribes that isn't prejudicial to any of them ("Stats is just inefficient learning !", "MachLearn is just the reinvention of statistics!" "DataMine is a series of hacks!"). It goes something like this:

Data mining is the art of finding patterns in data.
Statistics is the mathematical science associated with drawing reliable inferences from noisy data.
Machine learning is [the branch of computer science] that develops technology for automated inference (his original characterization was as a branch of engineering).

I like this characterization because it emphasizes the different focus: data mining is driven by applications, machine learning by algorithms, and statistics by mathematical foundations.

This is not to say that the foci don't overlap: there's a lot of algorithm design in data mining and plenty of mathematics in ML. And of course applied stats work is an art as much as a science.

But the primary driving force is captured well.

Wednesday, March 19, 2014

Reproducibility in data mining/learning

Is reproducibility in data mining/learning different from reproducibility in systems research ?

+John Regehr and +Shriram Krishnamurthi have discussions (John, Shriram) arising from a new paper on reproducibility in computer systems research from people at the U. of Arizona. You should read their posts/G+ discussions for the specifics.

But it got me thinking: is there anything intrinsically different about reproducibility in my neck of the woods ? Many concerns are shared:

The need for a standard computing environment to build and test code: MATLAB and scikit-learn make that a lot easier for us: we don't need to worry too much about details of the computing environment/hardware just to get the code to run.
Discrepancies between what is stated in the paper and what the code actually does: that's a problem with any experimental work: John in fact mentions one such issue in one of his recent papers.
Code rot: otherwise known as 'graduation' :). I think if more people used public repos like GitHub from the beginning, some of these problems might go away. I would be remiss if I didn't also plug +Robert Ricci's new system AptLab for helping distribute research.

But here's one that may not be shared: data preparation.

It's no secret that data preparation takes most of the time in a data mining pipeline. This includes

data cleaning (removing errors, reformatting data)
feature engineering (deciding how to truncate, modify or retain certain features)
training (cross-validation levels, what data is used for model building and so on)

As with any of the above, a proper specification can ensure true reproducibility, but there are lots of things we might do to data without necessarily even realizing it (if there are bugs in the transformation pipeline) or without thinking we need to mention it (truncating ranges, defining null values away).

Feature engineering can also be a problem, especially with models that use many features (for example, deep learning systems have lots of "knobs" to tune, and they can be quite sensitive to the knobs).

So one thing I often look for when reviewing such papers is sensitivity: how well can the authors demonstrate robustness with respect to the parameter/algorithm choices. If they can, then I feel much more confident that the result is real and is not just an artifact of a random collection of knob settings combined with twirling around and around holding one's nose and scratching one's ear.
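One way to make that kind of sensitivity check routine is to sweep the knobs and report the spread of the score rather than a single best number. The sketch below is a hypothetical harness: `toy_accuracy`, its tiny dataset and the grid values are stand-ins for a real train-and-validate step, not anything from the papers discussed.

```python
from itertools import product
from statistics import mean, pstdev

def sensitivity_sweep(evaluate, grid):
    """Run evaluate(**params) for every combination in `grid` and
    summarize how much the score moves across settings."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, evaluate(**params)))
    scores = [score for _, score in results]
    summary = {
        "min": min(scores),
        "max": max(scores),
        "mean": mean(scores),
        "spread": pstdev(scores),
    }
    return results, summary

def toy_accuracy(threshold, clip):
    """Hypothetical stand-in for 'train and cross-validate with these knobs'."""
    data = [(0.1, 0), (0.35, 0), (0.4, 1), (0.6, 1), (0.9, 1), (5.0, 1)]
    correct = 0
    for value, label in data:
        value = min(value, clip)  # a "truncating ranges" preprocessing step
        predicted = 1 if value >= threshold else 0
        correct += predicted == label
    return correct / len(data)

grid = {"threshold": [0.3, 0.4, 0.5], "clip": [0.45, 1.0]}
results, summary = sensitivity_sweep(toy_accuracy, grid)
```

In this toy setup a result reported only at `threshold=0.4` looks perfect, while the sweep shows accuracy collapsing at `threshold=0.5` once values are clipped at 0.45. A large gap between `summary["min"]` and `summary["max"]` is the signal that the headline number depends on the knob settings.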

Tuesday, March 11, 2014

A short not-review of the first episode of Cosmos

If you've been living on a rogue planet wandering through the far reaches of the Local Group, you may not have heard of the Cosmos reboot on FOX with Neil deGrasse Tyson.

We finally got around to seeing the first episode yesterday. I was nervous because I loved the original so much and hoped that my kids would like it as well (as all parents know, you can't show fear or the sharks will smell it :)).

Things I liked:

The homage to Carl Sagan was done beautifully. If you're going to reboot a franchise, you don't have to pretend that nothing existed before, you know.
The mix of CG + real person and animation was a good way to tell the different stories. I was a little worried about the denouement of the Giordano Bruno tale because my six-year-old was watching, but it was done without being too graphic.
The graphics were non-cheesy: but then I wasn't ever worried about the quality of graphics in the original. I'm pretty sure that 20 years from now these graphics will look cheesy as well. The year-long calendar of the universe was a fantastic demonstration of the immensity of time.

Things I could live without:

Granted, it's the first episode, so there's no point in being too harsh. But

If I had any heartstrings ever, they've been pulled out of shape into long strands of spaghetti. The constant sweeping orchestral background made me feel like I was watching NBC's coverage of the Winter Olympics. I almost expected to see a personal profile of how the universe overcame the adversity of the Big Bang to become the star performer it is today.
As with all modern American shows purporting to be about science, I found the balance between sweeping rhetoric and actual facts to be disturbingly skewed towards the soaring fluffy. Watch some David Attenborough, people ! or even, I don't know, COSMOS itself. But see below...

Overall:

I liked it. No question that I'm watching it again. And the eight-year-old loved it ! Thought it was full of information. So maybe my assessment of graphics-to-information ratio is completely off.

The funniest thing: when it was over, this happened:

Kids: "Let's see the next episode now!"
Me: We'll have to wait a week for it.
Kids: "What do you mean, we have to wait a week ?"

Netflix should be proud. Binge watching on demand is now the default mode.

Tuesday, March 04, 2014

Two big data talks at Haverford College

I'm giving two talks at Haverford College just before SDM 2014 in Philadelphia. If you're in the area, stop by and say hello !

Thursday, February 06, 2014

On "reverse engineering" algorithms, contextual readings, and the invisible choices that add value to research.

tl;dr: Understanding the value in a piece of work (a paper, an algorithm, a system) is not just a matter of understanding the various parts and how they fit. It is also a matter of understanding why the problem is decomposed that way. This observation has implications for how we read, write and evaluate research.

Nick Seaver is a cultural anthropologist at UC Irvine. He just wrote an article at Medium on reverse engineering algorithms. In it he distinguishes the factual reconstruction of an algorithm ("how does it work") from a more probing examination of the way it was designed ("why it is broken down in a particular way").

He comes to a conclusion that is quite sound and not at all controversial:

I want to suggest here that, while reverse engineering might be a useful strategy for figuring out how an existing technology works, it is less useful for telling us how it came to work that way. Because reverse engineering starts from a finished technical object, it misses the accidents that happened along the way — the abandoned paths, the unusual stories behind features that made it to release, moments of interpretation, arbitrary choice, and failure. Decisions that seemed rather uncertain and subjective as they were being made come to appear necessary in retrospect. Engineering looks a lot different in reverse.

But along the way, he makes an insightful observation about the very idea of structuralism as applied to algorithm design: namely, the idea that by breaking down the parts of the algorithm and understanding the pieces and how they fit together we can "understand" the algorithm at some higher level.

When you break an object down into its parts and put it back together again, you have not simply copied it — you’ve made something new. A movie’s set of microtags, no matter how fine-grained, is not the same thing as the movie. It is, as Barthes writes, a “directed, interested simulacrum” of the movie, a re-creation made with particular goals in mind. If you had different goals — different ideas about what the significant parts of movies were, different imagined use-cases — you might decompose differently. There is more than one way to tear apart content. (emphasis added)

In other words, the value of a reductionist analysis of a piece of work is not just in understanding the parts and how they fit, but in understanding the choices that led to that particular decomposition.

I think there are important lessons here for anyone involved in the creation and evaluation of new work. In particular:

Reading: While this is mostly advice for students, it applies whenever you're reading unfamiliar material. The particular form of the paper -- how the proofs are structured, how the system is built, or how the algorithm components work -- is a choice made by the authors and should not be left unexamined or unquestioned. All too often a student will read a paper as a factual news report, rather than reading it as a specific interpretation of a problem/algorithm/theorem that could lend itself to multiple interpretations (just a few days ago I was discussing some JL results with a colleague and realized that we had totally distinct and valid interpretations of some recent work in the area).

Reviewing: It's very easy (and common) to read a paper, understand the proofs, and then be completely underwhelmed by it: the dreaded "too simple" feeling that leads papers to get rejected from unnamed theory conferences. This is especially true when you're not familiar with an area, and don't realize that it was the choices the author(s) made that made the paper look simple. And so your assessment has to factor that choice in, rather than taking it for granted.

Writing/Presenting: Of course all of this impacts how you as a creator choose to present your work. There is not one way to tell a story (in writing or in a talk). But you do make choices about how to tell it. And so it's important to make those choices visible (either by explaining why they are needed, or why different choices don't work, or how they get around certain roadblocks) so that your work can receive a fair evaluation.

This can be excruciating, especially when (like many good papers) your work admits multiple interpretations, and you have to gamble on the right one to please the fickle and time-stretched reviewer. But be mindful of these choices in your work.

p.s On a separate note, it's intriguing to me how so many tools from the study of literature (narrative, contextual analysis, critical reading) show up in other forms of writing far removed from the humanities. This is yet another reason why I'm saddened (even though I'm on the "right" team) by the relentless focus on STEM in opposition to the liberal arts. It's a particularly virulent form of divide-and-conquer (the British colonial technique, not the algorithmic paradigm).

p.p.s Has it really been three months since I wrote anything ? I must be working !!
