Saturday, December 12, 2015

White Elephant parties and fair division

It's holiday party season, which means that it's time for white elephant parties. For those who don't know what a white elephant party is, here's a description from Wikipedia:
Each participant supplies one wrapped gift. The gifts are placed in a central location, and participants determine in which order they will take turns selecting them. The first person opens a wrapped gift, and the turn ends. On subsequent turns, each person can open a new present or gets the choice to "steal" another person's gift. The gift cannot be stolen once the third participant touches the gift (i.e. - it is stolen for the 2nd time). When a person's gift is stolen, that person can either choose another wrapped gift to open or can steal from another player. The game is over when the last person goes and the first person goes again
I just finished my algorithms class with the "traditional" (n=2) discussion of fair division algorithms (aka cake cutting, but with actual cake). White elephant parties are also a form of fair division with indivisible goods.
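To play with the rules quoted above, here is a minimal simulation sketch. Everything about the players' behavior is an assumption of mine, not part of the Wikipedia description: each player has a made-up numeric value for every gift (`values`), steals greedily when the best stealable gift beats the average value of the still-wrapped ones, and otherwise opens a random wrapped gift. The sketch also leaves out the first player's final turn.

```python
import random


def simulate_white_elephant(values, max_steals=2, seed=0):
    """Simulate one exchange under a hypothetical greedy strategy.

    values[p][g] is how much player p values gift g (an assumed utility model).
    A gift is frozen once it has been stolen max_steals times.
    Returns (holding, steal_count): player -> gift, and steals per gift.
    """
    rng = random.Random(seed)
    n = len(values)
    wrapped = list(range(n))
    rng.shuffle(wrapped)
    holding = {}              # player -> gift currently held
    steal_count = [0] * n     # gift -> number of times it has been stolen

    for turn_player in range(n):
        current = turn_player
        while True:
            # Gifts that other players hold and that are not yet frozen.
            stealable = [
                (values[current][g], g, p)
                for p, g in holding.items()
                if steal_count[g] < max_steals
            ]
            expected_wrapped = sum(values[current][g] for g in wrapped) / len(wrapped)
            if stealable and max(stealable)[0] > expected_wrapped:
                _, gift, victim = max(stealable)
                del holding[victim]
                holding[current] = gift
                steal_count[gift] += 1
                current = victim          # the victim now has to act
            else:
                holding[current] = wrapped.pop()
                break                     # opening a gift ends the turn
    return holding, steal_count


if __name__ == "__main__":
    rng = random.Random(1)
    n = 6
    values = [[rng.random() for _ in range(n)] for _ in range(n)]
    holding, steals = simulate_white_elephant(values)
    for player in sorted(holding):
        gift = holding[player]
        print(player, gift, round(values[player][gift], 2))
```

Each steal uses up one of a gift's allowed steals, so every chain of steals ends with someone opening a wrapped gift. Swapping in other strategies or other definitions of "value" is an easy way to see how sensitive any fairness notion is to those choices.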

A few years ago, I started wondering about the theory behind white elephant exchanges and posted a question on cstheory. The answer I got (from usul) was excellent (tl;dr: worst-case definitions of fairness don't work). But it's also a few years old, and I wonder if there are new results on this topic.

Tuesday, December 01, 2015

Fairness and The Good Wife

The Good Wife is a long-running legal TV show with a Chicago politics backdrop. It's one of the most popular shows on network TV, and so it was particularly fascinating to see an episode devoted in part to issues of algorithmic fairness.

It's helpful to know that the show frequently features a Google-like company called Chumhum. In this episode, they're being sued because of a feature in their maps application called "Safe Filter" that marks regions of a city (Chicago) as safe or not safe. A restaurant owner claims that she went out of business because Chummy Maps (their horribly named maps application) marked the region around her restaurant as unsafe.

The writers of this episode must have been following some of the news coverage of fairness in algorithms over the past year: a number of news items were referenced in passing. What follows is an annotated list of references.

  • "Math is not racist": This is what Chumhum's COO says when he's first deposed, arguing that the Safe Filter uses only algorithms to determine which regions are safe. This is reminiscent of the overly uncritical article that NPR did about the use of machine learning in hiring: the CEO of Jobaline (one of the companies doing this) happily proclaimed that "math is blind".
  • "Tortious intent": Disparate impact is one vehicle that might be used to determine bias-by-algorithm. But in the episode, the lawyers argue a stronger claim, that of tortious intent, which is interesting because they then have to show deliberate racial bias. 
  • "Objective third party statistics and user generated content": The initial line of defense from Chumhum is that they use third party statistics like crime rates. The lawyers immediately point out that this could introduce bias itself. They also cite their use of user-generated content as a defense ("We're not racist: our users are"). This is then rebutted by the lawyers pointing out that the users of the maps app skew heavily Caucasian (bringing up another good point about how bias in training data can leach into the results).
  • "Full discovery": Chumhum wanted to hide behind its algorithm: the opposing lawyers made a successful argument for discovery of the algorithm. I doubt this could ever happen in real life, what with trade secrets and all. More on this later. 
  • "Home ownership rates as proxy for race": One subplot involved determining whether home-ownership rates were being used in the Safe Filter. The characters immediately realized that this could be a  proxy for race and could indicate bias. 
  • "The animal incident": This was a direct reference to the image-tagging fiasco of a few months ago when Google's photo app started labelling pictures of African-Americans as 'gorillas'. While at first this is a throw-away incident (including a line "Even Google did it!"), it comes back later to haunt the company when a lawyer looks at the code (ha!) and discovers a patch that merely removes the 'animal' tag (instead of fixing the underlying problem). This appears to also be what Google did to "solve" its problem.
  • "Differential ad placement": In a hat tip to the work by Latanya Sweeney and the CMU team, another plot point turned on the discovery that ads in the maps application were targeting the white lawyer with ads for skiing and the black lawyer with ads for soul food. This in and of itself was not a problem for the case, but it led to a much more fascinating argument: that Chumhum was drawing on user profile data from all its properties (search, email, etc.) to target ads, and so discovery could not be limited solely to maps-related data and code. This is in general the problem with asking for code to do an audit: if you don't know where the training data is coming from, the code is basically useless. Remember, an algorithm isn't always just a piece of code :)
  • "Bad training data/non-diverse workforce": One of the employee characters made the argument that the bad image tagging results were the result of "bad training data", which is an accurate statement and is part of the fairness concerns with algorithms. The lawyer also made the point that a less homogeneous workplace might have helped as well (which brings to mind the Al Jazeera panel I participated in a few months ago).
  • "Implicit bias": I was happy when this phrase was used correctly to argue for how even "non-racist" people can help perpetuate a racist system. I would have been happier if someone had said "Disparate impact" though :).
If you're wondering, the final resolution of the case did NOT turn on a determination of bias or not. It turned out that the restaurant had been losing money before the filter was even put into place. But it was interesting to see an example (albeit on TV) of how a court case on this might pan out. A lot of the sideshow involved trying to claim that programmers on the Maps app were racist (or had racist inclinations) to argue for why the code might be biased as well.

Sunday, November 01, 2015

Rock bands with CS names.

You all know (or should know) about Positive Eigenvalues, the band formed by Michael Jordan that has included (among others) Christos Papadimitriou. But I'm not talking about that kind of band.

I'm looking for bands with names that have a CS connection (accidentally or otherwise). I was playing a Spotify playlist called (coincidentally) Deep Focus and the first band on the list was called Random Forest.

So here's my list so far:

Any more ? 

Saturday, October 31, 2015

Data and Civil Rights II

I just got back from the one-day Data and Civil Rights conference organized by Data and Society. As I mentioned in my previous post, the conference operated under Chatham House Rules, which means I can't really reveal any of the specific discussions that went on in the main sessions or the breakout groups.

This was a conference full of civil rights activists, lawyers, policy types, and even police folk. It was a conference where people could get up and proclaim that they hate numbers, to cheers from the audience. Feeling like the odd one out is unfamiliar to me.

But it was full of passion and fire. And very depressing if you think of data analysis and ML as a GOOD THING. Because in this context, it is at best a blunt weapon that is being wielded carelessly and violently.

We've had a good run designing algorithms that tell people what to buy and what to watch. But when these same algorithms start deciding whether people can live their lives on their own terms, then as the cool kids are wont to say:

Friday, October 30, 2015

Eschew obfuscation: write clearly

There's an article in the Atlantic about the "needless complexity of academic writing". Apart from learning that there's a Plain Writing Act (who says Congress is gridlocked!), I wasn't too surprised by the points made. Yes, academic writing can be turgid and yes, part of this is because we want to "impress the reviewers", and no, academics can't be coerced into changing the way they do things - at least not easily.

Steven Pinker has proposed an alternate theory of why academic writing is so jargon-heavy. Paraphrasing from the Atlantic article:
Translation: Experts find it really hard to be simple and straightforward when writing about their expertise. He calls this the “curse of knowledge” and says academics aren’t aware they’re doing it or properly trained to identify their blindspots—when they know too much and struggle to ascertain what others don’t know. In other words, sometimes it’s simply more intellectually challenging to write clearly.
For me, blogging has always been a way out of this blind spot. First of all, I can be more conversational and less stilted. Secondly, even if I'm writing for a technical audience, I'm forced to pare down the jargon or go crazy trying to render it.

But I wonder how hard it really is for experts to write clearly about their work. I wonder this because these same experts who write prose that you can clobber an elephant with are remarkably colorful and vivid when describing their work in person, on a board, or at a conference (though not at a talk itself: that's another story).

While it's common to assume that the obfuscation is intentional (STOC papers need to be hard!), I think it's more a function of deadline-driven writing and last-minute proof (or experiment) wrangling.

I'm thinking about this because I'm planning to run a seminar next semester that I'm calling 'Reading with Purpose'. More on that in a bit...

Monday, October 26, 2015

Data and Civil Rights

I'm in DC right now for a one-day conference on Data and Civil Rights, run by the Data and Society Institute.

This is an annual event (this is the second such conference). Last year's conference was themed "Why Big Data is a civil rights issue", and this year's conference focuses on the very hot-button topic of big data and criminal justice.

Needless to say, issues of fairness and discrimination are front and center in an area like this, and so I'm hoping to learn a lot about the state of play (and maybe contribute as well).

This is more of a working meeting than a traditional conference: all material is private during the conference and we're expected not to talk about the discussions outside the event (a la Chatham House rules). Digested material from the different working groups will be posted in November.

JHU Workshop on Sublinear Algorithms

The latest in a long line of workshops on sublinear algorithms (streaming! sketching ! property testing ! all of the above !) will be held at JHU this year just before SODA 2016. The message from the organizers is below: do consider attending if you're planning to attend SODA. (Disclaimer: I'm giving one of the 20+ talks, but I will not promise that it's excellent). 

Dear colleagues,

We are organizing a Sublinear Algorithms workshop that will take place at Johns Hopkins University, January 7-9, 2016. The workshop will bring together researchers interested in sublinear algorithms, including sublinear-time algorithms (e.g., property testing and distribution testing), sublinear-space algorithms (e.g., sketching and streaming) and sublinear measurements (e.g., sparse recovery and compressive sensing).

The workshop will be held right before SODA’16, which starts on January 10 in Arlington, VA (about 50 miles from JHU).

Participation in this workshop is open to all, with free registration. In addition to 20+ excellent invited talks, the program will include short contributed talks by graduating students and postdocs, as well as a poster session. To participate in the contributed talk session and/or the poster session, apply by December 1.

For further details and registration, please visit .

Vladimir Braverman, Johns Hopkins University
Piotr Indyk, MIT
Robert Krauthgamer, Weizmann Institute of Science
Sofya Raskhodnikova, Pennsylvania State University

Friday, October 02, 2015

An algorithm isn't "just code"

I've been talking to many people about algorithmic fairness of late, and I've realized that at the core of pushback against algorithmic bias ("algorithms are just math! If the code is biased, just look at it and you can fix it !") is a deep misunderstanding of the nature of learning algorithms, and how they differ fundamentally from the traditional idea of an algorithm as "a finite set of well-defined elementary instructions that take an input and produce an output".

This misunderstanding is crucial, because it prevents people from realizing why algorithmic fairness is actually a real problem. And that prompted me to write a longer note that takes the "algorithm == recipe" analogy and turns it on its head to capture how machine learning algorithms work.

Tuesday, July 28, 2015

The 2nd Workshop on Fairness, Accuracy and Transparency in Machine Learning: A review

I was one of the organizers of the 2nd workshop on Fairness, Accuracy and Transparency in Machine Learning (FATML) at ICML 2015, and in my alternate career as moderator of data mining panels, I moderated the closing panel. The panelists were Fernando Diaz from MSR New York, Sorelle Friedler from Haverford College, Mykola Pechenizkiy from Eindhoven University of Technology and Hanna Wallach from UMass-Amherst and MSR.

While my original intent was to do a review of the panel, it became clear that the panel discussion touched on themes that were bubbling up throughout the day. So what follows is organized by panel questions, but weaves in discussion from outside the panel as well.

This year's workshop, unlike the one at NIPS 2014, had a bit more of a technical focus: we had some of the early researchers in fairness work from Europe and Japan give talks on their work.  So as a counterweight, I thought I'd ask the panel to look beyond the presentations for the first question:
Question 1 
What is one thing you think we are missing (or should be paying more attention to) in our current discussions of fairness and discrimination  in terms of interaction with the social/policy/legal world OUTSIDE CS ?

Some themes that emerged:

...on the difference between Europe and the US

Mykola made an interesting observation based on his experience in educational data mining. Governments around Europe are very concerned about the use of student data (including health records and academic information) for any kind of data driven tools, and have severely regulated use of such data. As a result, the nascent field of educational data mining has been crippled by lack of access to data.

This is almost the opposite of the situation in the US, where data driven policy is the newest buzzword in town, and those of us who are interested in issues of fairness and transparency feel like we're constantly on the outside looking in (though attention to these issues is increasing).

...connecting to other communities

It's been clear from the beginning that discourse on fairness and transparency in machine learning must draw on the corresponding discourses in society at large. Which means that before we can start solving problems, we have to understand what the problems really are. This came through very strongly in the discussions. To paraphrase one of the panelists,  "Computer science likes to solve problems, and that's the problem !" (also known as  "slap a metric on it and optimize").

So what are the different communities we should be connecting to, and how ?

a) Connecting with social science

A major concern is the "prediction vs understanding" problem. For the most part, machine learning is about prediction: you classify, label, cluster, regress, rank and so on. But in the study of society and the dynamics of human interactions, the goal is not just to predict how humans might behave, but to understand their behavior. Which is to say, data analysis (or even fairness-aware analysis) has to be the first step in a larger conversation, rather than the last one.

While I don't think this issue is specific to fairness and transparency, it plays a role in understanding the sources of inequality and discrimination. It's not enough to detect examples of bias: what must happen next is an investigation of why the bias is happening.

(ed: personally, while I understand this concern, I don't think it's necessarily something computer scientists need to prioritize. This is after all what the social sciences do, and it doesn't make sense for us as computer scientists to merely try to acquire those skills. I think we need to be aware of the deeper issues of understanding a domain, but we also have strengths that we bring to the table and I'll say more about that later)

"galaxies don't care how they are studied, but people do"

Another point that was made over and over is that  issues of fairness and bias are not abstract: they affect actual people. Keeping the human in focus is important for the ethical underpinning of what we do, and even how we might design experiments.

b) connecting with journalists

Nick Diakopoulos gave a talk on "algorithmic accountability" in journalism. In addition to talking about what made research on fairness newsworthy:

  • discriminatory/unfair practices
  • mistakes that deny a service
  • censorship
  • activities that break the law or social norms
  • false prediction

he made the strong argument that (government) legitimacy comes from transparency, and talked about what that might entail in the age of data driven policy, including transparency involving data collection, the algorithms used, the inferences generated, and the humans involved in the process.

(ed: I don't think that our demands on transparency should be limited to government entities: the sad fact is that at least in the US, much of what would be considered basic internet infrastructure is controlled by private corporations, and they should be held to similar standards: if not for legitimacy, at least for fairness)

c) connecting with the law

Our fearless leader Solon Barocas made a number of interesting observations on the connection between algorithmic fairness and the law, all the while disclaiming IANAL :). But his point (which he's made before) is worth repeating. One of the things that computer science can do well is make precise concepts that might be defined vaguely or only indirectly through case law. And then we can get to work teasing out the relationships between different concepts (both abstractly and computationally). Indeed, the idea of a "reduction" between concepts in fairness might be one of the most useful things that computer science can uniquely contribute.

It's clear we're in a "let a thousand definitions bloom" phase in fairness research. And it's interesting to see the different reactions to this: on the social science side, there appears to be some nervousness that we're "playing games with math", but from Solon's comments this doesn't seem like a bad thing as long as we're also trying to connect the definitions together.

 Question 2 
In your view, what’s the next most pressing question we should be asking (limited to INSIDE CS to distinguish from the previous question) ?

...better definitions

It was very clear from the discussion that we need broader definitions of F-A-T beyond what's mathematically plausible. One particular example that's reminiscent of the metrics for privacy: There's a notion of "utility": how much can we make the data or the task "fair" without changing the underlying results produced by the "unfair" data/algorithm. The problem is that utility itself is not very well defined. Firstly, you might be benefiting from discriminatory policies, so your perceived "utility" itself is a problem. Trying to maintain this defeats the purpose of fairness. Secondly, even if this is not the case, the framing of the question as a tradeoff implies that these two notions are necessarily in opposition. That shortchanges the moral imperative of fairness and is different from the parallel situation in privacy. Finally, we measure utility in terms of classifier accuracy. But that's a very poor notion of overall task effectiveness. For example, is there a Bayesian perspective to bring to this ?

At any rate, since we are good at understanding tradeoffs in computer science, we should understand the different dimensions of the space of fairness preserving methods, rather than limiting ourselves to a one-dimensional false dichotomy of "fairness vs utility".
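As a small illustration of why "fairness vs utility" has more than one axis, here is a sketch that reports classifier accuracy and a disparate impact ratio side by side. The ratio used is the familiar "80% rule" style measure (the positive-outcome rate for one group divided by the rate for another); the function names and the toy data are hypothetical and do not come from any of the workshop talks.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def positive_rate(y_pred, group, value):
    """Fraction of members of one group that receive the positive outcome."""
    selected = [p for p, g in zip(y_pred, group) if g == value]
    return sum(selected) / len(selected)


def disparate_impact_ratio(y_pred, group, protected, reference):
    """Positive rate of the protected group relative to the reference group.

    Assumes the reference group receives at least one positive outcome.
    """
    return positive_rate(y_pred, group, protected) / positive_rate(y_pred, group, reference)


if __name__ == "__main__":
    # Toy data (made up): the labels themselves are skewed between groups.
    group = ["a"] * 4 + ["b"] * 4
    y_true = [1, 0, 0, 0, 1, 1, 1, 0]
    y_pred = list(y_true)  # a "perfect" classifier with respect to the labels

    print("accuracy:", accuracy(y_true, y_pred))
    print("disparate impact ratio:", disparate_impact_ratio(y_pred, group, "a", "b"))
```

In this toy example the classifier reproduces the labels exactly and still gives group "a" the positive outcome at a third of the rate of group "b", which is the sense in which accuracy against possibly discriminatory labels is a poor stand-in for utility.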

...better usable artifacts

Nick asked us the following question at the panel:

When a CEO or an agency head comes to us and asks "what should we do about this fairness stuff?", what do we tell them ?

We didn't have a good response, and that was interesting. While we're beginning to explore the space of what's possible, we don't have clear examples of artifacts to hand over and say "use this".

As usual, the topic of benchmarking came up. I joke that when industry folks bring up the issue of benchmarking, I always ask "so where's the data" and they usually go very silent. But I do think there are useful data sets to be explored that come to us from the government. Some common data sets that get used are the US census data on salaries and a German data set on consumer credit. The entire data set from the Ricci court case is also available (even though it's tiny), and there are Bureau of Justice recidivism data sets to play with.

Of course this goes against the imperative coming from the social scientists to look at specific domains and ask meaningful questions in that domain. And I think we need to look more at the literature on fairness and bias over the decades and extract data that people have studied.

...better problems

For the most part, researchers have been considering binary classification as the suspect task. But of course there are much more general tasks that we could be considering: what about unsupervised learning ? what about structured prediction ? Is there a way to define fairness when you don't have a simple binary response variable and binary attributes ?

One final question I asked was this:
 Question 3 
do we have to solve the causality problem in order to talk about fairness ? 

This question was possibly not as well-posed as I would have liked, but it led to interesting discussions.

The law deals with intent, because the goal of the law is to assign responsibility. Algorithms are not agents and can't exhibit intent. Causality is a proxy for intent, in that if we can say that something caused something else, we can assign blame in a different way. In fact there were two talks at the workshop that talked about causality directly in the context of fairness.

But causality is a very hard problem. It's extremely subtle (if you doubt this, read through some of the examples Judea Pearl discusses in his book), and extremely controversial: different camps have their own view of how to mechanize causal inference, and the battles there make frequentists and Bayesians look like life-long friends.

In the discussion that followed, it became clear that there were really two ways of thinking about causality as it relates to fairness. The first way is to think about the underlying causal mechanisms that might lead to otherwise innocent features leading to biased outcomes: that is, how might zip code correlate with racial identity for example. The second way, which is closer to what I had in mind, is to think about the behavior of an algorithm causally: the use of these inputs or this algorithm *caused* a particular decision to be made. This second idea is not as far-fetched as it seems: some work in the database community has looked at trying to find which tuples "caused" a certain output to be generated from a query.

If you think it's not important to understand causality as it comes to automated methods, you might not want to drive a self-driving car or fly a plane. But as Solon suggested in the discussion, one way of getting around causality is to think about negligence with respect to algorithms: can we design reasonable best practices for predictive tools and argue that a failure to use these methods is negligence ? The legal ramifications of these ideas have been explored in the context of robotics (article, and response) but more work is yet to be done.

...back to narratives

Another comment by Nick D, again connecting to the journalism perspective: narratives and story telling are a powerful way to explain the results of data mining. I haven't talked much about interpretability, which is an important part of the larger discussion of transparency and accountability. But one way to communicate the results of (say) a fairness audit would be to provide a human-interpretable linkage between the problematic attributes being used for prediction and the protected attribute. For more on this, see Michael Nielsen's very timely new Quanta article on machine-generated explanations.

It's clear from all the discussion that there's a lot of work to be done and a small but active community of people interested in pushing these issues forward. Fairness, and algorithmic bias, are hot topics in the news nowadays, and it's a good time to take advantage of this burst of interest.

Friday, July 24, 2015

Racism/sexism in algorithms

For my sins, or more specifically because of my interest in fairness, I was asked to be on a panel discussing algorithmic racism/bias on The Stream, a web/tv show on Al Jazeera English (as opposed to Al Jazeera America, in case you're not already confused).

I've never done live TV before, so it was quite nerve-wracking. While my "role" was to be the tech explainer, I was worried about saying something I wouldn't be able to rephrase or take back, and I was worried about the 'shouting talking heads' nature of TV discussions.

I'm glad to say that none of this transpired. We had a very genteel discussion on the merits of the issue, and there was a lot of informed commentary on the topic. Personally I thought that the discussion leaned more heavily on dealing with bias via diversity in hiring rather than on structural bias in algorithm design, but it's a very important part of the bigger picture of fairness, and it's definitely a more accessible discussion than the arcana of algorithmic disparate impact (something that I hope to change:)).

Here's the show in its entirety. And irony of ironies, I went to twitter later to change my user name back to my full name, and Twitter wouldn't let me fill in my whole last name !

Saturday, July 04, 2015

On the different stages of learning and teaching (algorithms)

Descending a rabbit hole of links prompted by a MeFi discussion (thanks, +David Eppstein) of Steven Pinker's essay on the curse of knowledge (thanks, +Jeff Erickson), I came across an article by Alistair Cockburn on a learning framework inspired by aikido called 'Shu-Ha-Ri'.

In brief,

  • In the Shu stage, you're a beginning learner trying to find one way to solve a problem. It doesn't matter that there might be multiple ways. The goal is to learn one path, and learn it well. 
  • In the Ha stage, you understand one way well enough to realize its limits, and are ready to encounter many different strategies for reaching your goal. You might even begin to understand the pros and cons of these different approaches. In effect, you have detached from commitment to a single approach. 
  • In the Ri stage, you have "transcended" the individual strategies. You might use one, or another, or mix and match as needed. You'll create new paths as you need them, and move fluidly through the space of possibilities. 
Reading through this article while I ponder (yet again) my graduate algorithms class for the fall, I realize that this three-stage development process maps quite well to what we expect from undergraduates, masters students and Ph.D students learning about an area. 

The undergraduate is learning a tool for the first time (recurrence analysis say) and if they can understand the master theorem and apply it, that's pretty good. 

At the next level, they realize the limitations of the master theorem, and might learn about the Akra-Bazzi method, or annihilators, or even some probabilistic recurrence methods. 

Of course, once you're dealing with some thorny recurrence for the analysis in your next SODA submission, then the standard templates are helpful, but you'll often have to do something creative and nontrivial to wrestle the analysis into a form where it makes sense. 
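For concreteness, here is what the first two levels might look like on a pair of textbook recurrences (my choice of examples, worked as a sketch). The master theorem handles the balanced split directly, while the unbalanced split needs something like Akra-Bazzi:

```latex
% Level 1: master theorem, with a = 2, b = 2, f(n) = n.
% Since n^{\log_b a} = n and f(n) = \Theta(n), the balanced case applies.
T(n) = 2\,T(n/2) + n
  \quad\Longrightarrow\quad T(n) = \Theta(n \log n)

% Level 2: the split is unbalanced, so the master theorem does not apply.
% Akra-Bazzi: find p such that (1/3)^p + (2/3)^p = 1, which gives p = 1.
T(n) = T(n/3) + T(2n/3) + n
  \quad\Longrightarrow\quad
T(n) = \Theta\!\left( n^{p} \left( 1 + \int_{1}^{n} \frac{u}{u^{p+1}} \, du \right) \right)
     = \Theta\!\left( n \,(1 + \ln n) \right)
     = \Theta(n \log n)
```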

Pick your own topic if you don't like recurrences. 

Which also explains why it's hard to explain how to prove things. Beginning students expect a standard formula (which is why induction and proof by contradiction get taught over and over). But once you go beyond this, there aren't really good templates. In effect, there's no good second level with a set of proof techniques that you can throw at most problems, which explains why students taking a grad algorithms class tend to struggle with exactly this step. 

Sunday, June 21, 2015

On Pixar, creativity and advising

I'm in Bertinoro for the Algorithms and Data Structures workshop organized by Camil Demetrescu, Andrew Goldberg and Valerie King. I will try to post updates from the event, but with the density of talks, no promises :). I'm still waiting to hear more about the STOC theoryfest deliberations from the conference: come on, bloggers !

In the meantime, I wanted to point to an excerpt from Ed Catmull's book on the Pixar process.

I don't generally enjoy "behind the scenes" books about "genius companies" or "genius" individuals. I feel that the word "genius" gets bandied around far too often to be useful, and there's too much 'gee whiz smart people do smart things' that perpetuates the 'Great Man' (and yes, Man) theory of intellectual discovery.

But I enjoyed the excerpt, and am now interested in reading the book (Disclaimer: Ed Catmull is ONE OF US). Catmull doesn't appear to trot out trite recipes for success. If the excerpt is any indication, it's an extremely thoughtful and nuanced take on what worked at Pixar and why, and brings in many voices of the actual people doing the hard work to make movies. Here's a paragraph on leading a team as a director:
Andrew likens the director’s job to that of a ship captain, out in the middle of the ocean, with a crew that’s depending on him to make land. The director’s job is to say, “Land is that way.” Maybe land actually is that way and maybe it isn’t, but Andrew says that if you don’t have somebody choosing a course—pointing their finger toward that spot there, on the horizon—then the ship goes nowhere. It’s not a tragedy if the leader changes her mind later and says, “Okay, it’s actually not that way, it’s this way. I was wrong.” As long as you commit to a destination and drive toward it with all your might, people will accept when you correct course.
This reminds me very much of things I say to my students as an advisor. To wit, "It's important to have some direction you're heading towards with a problem, even if that direction turns out to be a bad one". First of all, it keeps you moving forward instead of around in circles. Second, (and this is particularly true in research), even a failed exploration of a direction teaches you more about a problem than going in circles without a plan. This manifests itself in a few different ways:

  • If you go this way, are you getting somewhere interesting (an interesting problem, an interesting structural insight, something that is novel regardless of success or failure) ?
  • Do you have some sense that this approach might work ? In other words, a fishing trip with a new tool is fine, but a fishing trip with a new tool that you have some good feelings about is even better. This reminds me of a comment that I've heard before but that took me a while to understand: "I don't try to solve a problem till I think I know how to solve it".
  • Can you test the direction quickly ? This is a corollary of the 'hurry up and fail' concept that startups like to preach, without the valorization of failure. That is, since we accept that a direction might not pan out, it would be best to find ways to test this quickly so we can move on. This doesn't mean that repeated failure is good, or as Gavin Belson would say, "FAILURE = PRE-SUCCESS".  

Tuesday, May 19, 2015

ITA, or a conference I really enjoy.

Continuing my thoughts on the STOC 2017 reboot, I went back to Boaz's original question:

What would make you more likely to go to STOC?

And thought I'd answer it by mentioning an event that I really enjoy attending. I didn't post it as a comment because it's a little out of scope for the blog post itself: it doesn't make concrete recommendations so much as relay anecdotal evidence. 

The Information Theory and Applications workshop is a workshop: it doesn't have printed proceedings, and it encourages people to present work that has been published (or is under review) elsewhere. Keep that caveat in mind: the structure here might not work for a peer-reviewed venue like STOC. 

Having said that, the ITA is a wonderful event to go to. 
  • It's in San Diego every year in February - what's not to like about that
  • It runs for 5 days, so is quite long. But the topics covered change over the course of the 5 days: the early days are heavy on information theory and signal processing, and the algorithms/ML/stats talks show up later in the week. 
  • There are multiple parallel sessions: usually 5. And lots of talks (no posters)
  • There are lots of fun activities. There's an irreverent streak running through the entire event, starting with the countdown clock to the invitations, the comedy show where professional comedians come and make fun of us :), various other goofy events interspersed with the workshop, and tee-shirts and mugs with your name and picture on them. 
The talks are very relaxed, probably precisely because there isn't a sense of "I must prove my worth because my paper got accepted here". Talk quality varies as always, but the average quality is surprisingly high, possibly also because it's by invitation. 

But the attendance is very high. I think the last time I attended there were well over 600 people, drawn from stats, math, CS, and EE. This had the classic feel of a 'destination workshop' that STOC wants to emulate. People came to share their work and listen to others, and there was lots of space for downtime discussions. 

My assertion is that the decoupling of presentation from publication (i.e., the classical workshop nature of ITA) makes for more fun talks, because people aren't trying to prove a theorem from the paper and feel the freedom to be more expansive in their talks (maybe covering related results, or giving some larger perspective). 

Obviously this would be hard to do at STOC. But I think the suggestions involving posters are one way of getting to this: namely, that you get a pat on the back for producing quality research via a CV bullet ("published at STOC") and an opportunity to share your work (the poster). But giving a talk is a privilege (you're occupying people's time for a slice of a day), not a right, and that has to be earned. 

I also think that a commenter (John) makes a good point when they ask "Who's the audience?". I'm at a point where I don't really enjoy 20 minutes of a dry technical talk and I prefer talks with intuition and connections (partly because I can fill in details myself, and partly because I know I'll read the details later if I really care). I don't know if my view is shared by everyone, especially grad students who have the stamina and the inclination to sit through hours of very technical presentations. 

Monday, May 18, 2015

STOC 2017 as a theory festival

Over on Windows on Theory, there's a solid discussion going on about possible changes to the format for STOC 2017 to make it more of a 'theory festival'. As Michael Mitzenmacher exhorts, please do go and comment there: this is a great chance to influence the form of our major conferences, and you can't make a change (or complain about the lack of change) if you're not willing to chime in.

I posted my two comments there, and you should go and read number one and number two. Two things that I wanted to pull out and post here are in the form of a 'meta-suggestion':
1. Promise to persist with the change for a few years. Any kind of change takes time to get used to, and every change feels weird and crazy till you get used to it, after which point it’s quite natural. 
Case in point: STOC experimented one year with a two-tier committee, but there was no commitment to stick to the change for a few years, and I’m not sure what we learned at all from one data point (insert joke about theorists not knowing how to run experiments). 
Another case in point: I’m really happy about the continued persistence with workshops/tutorials. It’s slowly becoming a standard part of STOC/FOCS, and that’s great. 
2. Make a concerted effort to collect data about the changes. Generate surveys, and get people to answer them (not as hard as one might think). Collect data over a few years, and then put it all together to see how the community feels. In any discussion (including this one right here), there are always a few people with strong opinions who speak up, and the vast silent majority doesn’t really chip in. But surveys will reach a larger crowd, especially people who might be uncomfortable engaging in public.

Friday, May 15, 2015

Higher order interactions and SDM 2015

This year I'm one of the PC Chairs for SIAM Data Mining (along with Jieping Ye), and so I've been spending time in decidedly-not-sunny Vancouver. Being a PC Chair, or even being on a PC, is an exercise in constant deja vu: I hear a talk and think "Where have I heard this before" before realizing that I've probably reviewed the paper, or looked at its reviews or meta-reviews.

Being the PC chair means though that I can float around the conference freely without feeling the pressure to attend talks, network or otherwise be social: I've earned my keep :). Following standard conference networking maxims though, I made an effort to meet at least one coffee shop that I've met before and introduce myself to at least one new coffee shop, and another.

But I am attending talks ! And I enjoyed listening to some work on tensor spectral clustering by Benson, Gleich and Leskovec. It got me thinking about the larger issue of modeling higher-order interactions and what appear to be many different ways of modeling the problem.

The problem.

Imagine that you have a number of interacting entities. These could be points in a space, or vertices in a graph, or even dimensions of a point (aka variables). The easiest way to model a collection of such entities is to assume they're independent of each other. For example, I might draw i.i.d. samples from a distribution. Or I might be looking at a collection of independent features describing an object, and so on. 

Independence assumptions are powerful. They allow us to "factorize" structure into combinations of individual elements, which either leads to simple summations directly, or eventually after a log transformation of a product. This means we can deal with entities independently, and the "inference complexity blowup" is linear in the number of entities. A good example of this is the Naive Bayes approach to learning, where assuming all entities are independent leads to a likelihood cost function that's just a sum of terms, one for each entity.
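To make the "product becomes a sum after a log" point concrete, here is a minimal sketch. The per-feature probabilities are made up for illustration, and the function name is my own, not part of any library.

```python
import math

def naive_log_likelihood(feature_probs):
    """Log-likelihood under independence: the log of a product is a sum of logs."""
    return sum(math.log(p) for p in feature_probs)

# Hypothetical per-feature probabilities P(x_i | class), invented for this example.
probs = [0.9, 0.5, 0.2]

joint = math.prod(probs)                  # joint probability under independence
log_joint = naive_log_likelihood(probs)   # one term per entity: linear cost
```

Each entity contributes exactly one term to the sum, which is the linear "blowup" described above.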

I'm necessarily being rather loose with my statements here. Making them precise is possible in specific contexts, but it gets unwieldy. 

But independence is too restrictive an assumption. It limits modeling power, and therefore will be unable to capture interactions that might make understanding structure a lot easier. For one thing, you'd never find correlations if you assumed that all features are independent.


The easiest form of interaction is a pairwise interaction. Modeling pairwise interactions gets us to a graph, and who doesn't love a graph ! More importantly for what follows,

a graph and its associated structures is a rich representation of a system of pairwise interactions

in that we have a colorful vocabulary and an arsenal of algorithms for talking about pairwise interactions and structures built on them.

Of course we've paid a price - in complexity. Instead of the linear cost incurred by independent entities, we now have quadratically many potential pairwise interactions to model. But (and here's the key), we can interpret a sparse graph as capturing weak interactions, and it's still rich enough to model many different phenomena.

Higher-order interactions.

But what happens if we want to model interactions that aren't just pairwise ? What is the correct higher-order structure to model such interactions as effectively as graphs ? It turns out that there are many different ways to do this, and they all can be reduced to a sentence (pace Saunders MacLane) of the form

A graph is just a very special kind of X
for different values of X. 

1. The graphical model view.

A graph is just a special kind of clique intersection structure, if you only have 2-cliques.

One way to manage a collection of higher order interactions is to factorize them in a more general way. This is the basis for the clique-tree idea in graphical models, where the interaction structure is viewed as a set of complete joint interactions (aka cliques) all connected together in a tree structure for easy inference. Another name for this would be the 'bounded treewidth' model, but it misses the fact that we are allowing higher-order interactions, but in a controlled way. 

The advantage of this representation is that in a true parametrized complexity way, it isolates where the true complexity is coming from (the size of each clique) from the overall complexity (the size of the graph). 

A spectral perspective.

When graphs arise from natural sources of data (social networks and the like), we have to deal with noise and spurious signals. And the simple language of connectivity and paths is no longer robust enough. For example, a graph might be connected, but only because there's one edge connecting two huge components. If this edge was spurious, we've just made a huge mistake in modeling this graph structure. 

Spectral methods are currently our best way of dealing with noisy interactions. By focusing not on the topological structure of connectivity but on the amount of connectivity measured via cuts, spectral analysis of graphs has become perhaps the best way of finding structures in large graphs.
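Here is a small sketch of the difference between "connected" and "how connected", using a toy graph I made up: two 4-cliques joined by a single bridging edge. The cut size computed below is the same number you get from the Laplacian quadratic form $x^T L x = \sum_{(u,v)} (x_u - x_v)^2$ when $x$ is the 0/1 indicator of one side.

```python
from itertools import combinations

def clique_edges(vertices):
    """All pairs of the given vertices, i.e. the edges of a clique."""
    return list(combinations(vertices, 2))

def cut_size(edges, side):
    """Number of edges with exactly one endpoint in `side`."""
    return sum(1 for u, v in edges if (u in side) != (v in side))

left = [0, 1, 2, 3]
right = [4, 5, 6, 7]
# Two dense groups held together by one (possibly spurious) edge.
edges = clique_edges(left) + clique_edges(right) + [(3, 4)]

bridge_cut = cut_size(edges, set(left))     # separates the two cliques
other_cut = cut_size(edges, {0, 1, 4, 5})   # slices through both cliques
```

The graph is connected, but the cut between the two cliques has size 1 while the other split has size 9. A measure based on cuts notices this; plain connectivity does not.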

The spectral lens sees a graph through random walks on the edges. This is great for modeling a collection of pairwise interactions between entities, but not for modeling interactions among sets of entities. We have to be careful here. Spectral methods are actually quite good at finding community structure in graphs (i.e., a partition into sets of vertices). What they can't do is find higher order partitionings in graphs (i.e., sets of triangles or sets of special 4-vertex structures). And that's where the next three higher-order methods enter the picture.

2. The algebraic topology view.

A graph is just the 1-skeleton of a simplicial complex. 

If we're looking to model higher-order interactions, we need a language for describing a collection of well-defined higher order structures. That's what a simplicial complex is. I'll skip the formal definition, but the basic idea is that if you have a simplex (an interacting group of entities), then all subsets must be simplices as well. But you declare the simplices first, which means that the simplex $\{a,b,c\}$ is different from the three simplices $\{a,b\}, \{b, c\}, \{c, a\}$, even though the first must contain the second. 
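A quick sketch of the "declare the simplices first" point, with helper names of my own choosing: close the declared simplices under taking subsets, then compare the hollow triangle (three edges) with the filled one (a single 2-simplex). The Euler characteristic is just one convenient invariant that tells them apart.

```python
from itertools import combinations

def closure(declared):
    """All nonempty subsets of the declared simplices (the downward closure)."""
    simplices = set()
    for s in declared:
        for k in range(1, len(s) + 1):
            simplices.update(frozenset(c) for c in combinations(sorted(s), k))
    return simplices

def euler_characteristic(simplices):
    """Alternating count of simplices by dimension (dimension = size - 1)."""
    return sum((-1) ** (len(s) - 1) for s in simplices)

hollow = closure([{"a", "b"}, {"b", "c"}, {"c", "a"}])  # three edges: a loop
filled = closure([{"a", "b", "c"}])                    # one 2-simplex

chi_hollow = euler_characteristic(hollow)
chi_filled = euler_characteristic(filled)
```

The filled triangle contains everything the hollow one does, plus the simplex $\{a,b,c\}$ itself, and that one extra simplex changes the topology (the loop gets filled in).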

A simplicial complex is a topological object. It generalizes a graph because a graph is what you get if you limit yourself to simplices of size at most 2. Because it's a discrete topological object, you can now play with it using all the tools of topology, and in particular very powerful tools like homology and homotopy that reveal all kinds of hidden structure not accessible via a graph metaphor.

While simplicial complexes allow you to express higher order interactions (tunnels! holes!), they don't remove the problem of noise: one edge/simplex can still change the structure in nontrivial ways. There are two approaches that researchers have taken to this problem: one spectral, and one not. 

The non-spectral approach is by far the most well-established one. It is based on the idea of persistence:  a way to determine from the homology groups of the simplicial complex what structures are interesting and which ones are not. Persistence is the dominant weapon in the arsenal of topological data analysis, and I'll say no more about it here, so as to keep focus on spectral methods. 

The spectral approach is less well developed, but is quite interesting. The idea is to generalize the notion of expansion from a graph to higher-order simplices, as well as generalizing the Laplacian operator to higher order simplices (or homology groups). Then a random walk on the simplicial complex can be linked to the Laplacian operator, and the eigenstructure of the operator can be linked to the existence (or nonexistence) of certain homology groups. Two places to start with this topic are one on capturing generalizations of edge expansion, and another on building random walks on simplicial complexes and connecting them to a combinatorial Laplacian. 

3. The differential geometry view.

A graph is just a discrete approximation of (scalar functions on) a manifold. 

The graph Laplacian is a discrete approximation of the Laplacian second order differential operator, and more generally the Laplace-Beltrami operator on a manifold. Indeed, one way to build intuition for what the graph Laplacian means is that it's capturing heat diffusion on an implicit manifold that the graph is merely approximating.

The Laplace-Beltrami operator is a "zeroth-order" operator in that it applies to the zero-dimensional entities on a manifold: namely, scalar fields over the points of the manifold. Suppose you were to build a vector field instead over the manifold, and wished to reason about it. Then the generalization of the L-B operator that you'd need is called the Laplace-de Rham operator which formally acts like a Laplacian on the higher order differential forms defined over the manifold (formally, on sections of the tangent bundle). Writing down the L-R operator is a little tricky: it involves a combination of the exterior derivative and its dual (via the Hodge * operator). But one useful observation is that the L-R operator on graphs amounts to a Laplacian on the set of edges, rather than vertices.

This means that you can now treat edges as first-class objects for grouping, rather than vertices. And this is useful for higher-order clustering. Whether this can be generalized even further remains to be seen.
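As a rough illustration, assume we treat the graph as a one-dimensional simplicial complex with no triangles, and write $B$ for its oriented vertex-by-edge incidence matrix. Then $B B^T$ is the usual vertex Laplacian and $B^T B$ is a Laplacian indexed by edges. This is a sketch under that assumption, not a full treatment of the Laplace-de Rham operator, and the helper names are mine.

```python
def incidence_matrix(num_vertices, edges):
    """Oriented incidence matrix: -1 at an edge's tail, +1 at its head."""
    B = [[0] * len(edges) for _ in range(num_vertices)]
    for j, (u, v) in enumerate(edges):
        B[u][j] = -1
        B[v][j] = 1
    return B

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(X, Y):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

edges = [(0, 1), (1, 2)]  # a path on three vertices; orientations are arbitrary
B = incidence_matrix(3, edges)

vertex_laplacian = matmul(B, transpose(B))  # 3 x 3, equals D - A
edge_laplacian = matmul(transpose(B), B)    # 2 x 2, one row per edge
```

The off-diagonal entries of the edge Laplacian are nonzero exactly when two edges share a vertex, so it describes how edges (rather than vertices) are adjacent.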

4. The linear algebraic view.

A graph (adjacency matrix) is just a special case of a tensor structure on the entities.

This is perhaps the most well-known of the different higher-order approaches to modeling interactions, and is making the most waves right now. The idea is very simple. If we think of a graph in terms of its adjacency matrix, then each entry encodes a relation between two basic entities. If we wished to encode relations between (say) three entities, then we need a "3D matrix", or more precisely, a tensor.

Of course a tensor is more than just a way to assemble a collection of triples into a box, just like a matrix is much more than just a grid of numbers. The most basic question in tensor modeling is factorization: just like we can use the SVD to write down a matrix as a linear combination of outer products of basis vectors, can we write a tensor as a linear combination of 3-way outer products of basis vectors ? If so, then we've been able to identify the key 3-way factors controlling an interaction. 
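To see what the factorization question is asking, here is a minimal sketch that goes in the easy direction: building a 3-way tensor as a sum of rank-one terms, each one an outer product of three vectors. The factor vectors are made up, and factorization is the (hard) problem of recovering such terms from the tensor alone.

```python
def outer3(a, b, c):
    """Rank-one 3-way tensor with entries T[i][j][k] = a[i] * b[j] * c[k]."""
    return [[[x * y * z for z in c] for y in b] for x in a]

def add_tensors(S, T):
    """Entrywise sum of two 3-way tensors of the same shape."""
    return [[[s + t for s, t in zip(rs, rt)]
             for rs, rt in zip(ms, mt)]
            for ms, mt in zip(S, T)]

# Two hypothetical 3-way factors, invented for this example.
term1 = outer3([1, 0], [1, 2], [3, 1])
term2 = outer3([0, 1], [1, 1], [2, 2])
T = add_tensors(term1, term2)   # a 2 x 2 x 2 tensor built from two rank-one terms
```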

Tensor factorization is a hard problem: unlike the SVD, most formulations of tensor factorization are NP-hard. I won't get into this very rich topic right now, and instead will point you to some slides by Ravi Kannan as well as an older paper of his.

But can we cluster using tensors ? Or more generally, is there a spectral theory of tensors, and maybe even the analog of Cheeger's inequality ? It turns out that it's possible to define eigenvalues and eigenvectors of tensors, and at least some of the standard spectral theory can be made to carry over - for more on this see these slides by Lek-Heng Lim. But we're still looking for a theory that can truly work on hypergraphs or tensors in general. In the meantime, tensor-based approaches to clustering boil down to factorization and PCA-like methods, of which there are many. 

5. Coda

Are these approaches in any sense equivalent ? It's hard to say in general, although for a particular problem there might be connections. Certainly, the de Rham cohomology yields a nice filtration over homology groups that could be wrestled into a persistence framework (research question !!!) thus allowing us to extract robust higher-order structures from both geometric and topological considerations. The tensor approach is closely linked to probabilistic models, not surprisingly. But whether the spectral perspective (and all its attendant intuition and value) will extend nicely across these different frameworks remains to be seen. 

Tuesday, May 12, 2015

Special Issue of Internet Mathematics on Evolving Networks

Aris Gionis and I are co-editing a special issue of Internet Mathematics on evolving networks.
The widespread adoption of digitization has opened the way to gathering large amounts of data that record detailed information for many systems of interest. Examples include telecommunication networks, online social-media platforms, and biological systems. Many of these systems are typically represented as networks, and graph-theoretic techniques are used to analyze the available data. Furthermore, as our data-gathering capacity has increased, it is now possible to collect data that record not only a static aggregate view of the underlying network, but a continuous stream of events that captures the full dynamic behavior of the network. Such events may take the form of structural changes, or they may encode different types of actions and interactions performed by the network entities. This view of time-evolving networks poses new challenges and opens new research directions. The objective is to develop the theoretical foundations and to design the algorithmic principles that will allow to efficiently manage and analyze such evolving networks.

Read more about it here.  Submissions are due Oct 15, 2015. 

Thursday, April 30, 2015

LaTeX, Word, ggplot, and n00b-ness

I've been forcing myself to learn enough R and ggplot to make nice looking plots. It's taking me painfully long to learn to do even the most basic things, and +Carlos Scheidegger has been a big help !

What makes learning ggplot hard is grasping its underlying design philosophy. There's a notion of layers in a plot that I'm still getting used to. Once you get the hang of it it's quite elegant, and makes modifying plots very easy -- once you get the hang of it.

All of which makes me a little more sympathetic to people who struggle with using LaTeX. LaTeX has many of the same unnatural design principles, starting with the lack of a WYSIWYG interface (and no, LyX doesn't really count). It's an incredibly powerful interface (like ggplot), but it's darned hard to do simple things.

In fact, I've been wishing for a "Word equivalent" for ggplot2. Just like is now trying to make LaTeX easier for the general public, I would love to see some kind of interactive interface to ggplot2 that can generate the code automatically for standard tasks.

Tuesday, April 28, 2015

Prof. V returns from Washington

I've had an intense and exciting day and a half here at "politics camp", otherwise known as LISPI. Many thanks to the CRA for organizing this event. It's been an eye-opener in many ways. I've come away from this meeting with a new appreciation for the kind of lobbying efforts CS folks undertake in DC to make sure we can continue to do research and live in our little bubbles. 

One of the "homework assignments" that we were given yesterday for this morning was to practice a 3-4 minute pitch to a staffer for a representative/Senator on a topic of our choosing (about the importance of NSF funding, about a specific issue we think is important, etc). Our "panel" consisted of former Hill staffers who've been on both sides of the pitch table, and each of us had to do a "Shark Tank"-like pitch to them.

It was surprisingly nerve-wracking. I had prepared a pitch on recidivism and the importance of fairness when using data-driven predictors, and having to reduce it down to a few minutes of intelligible speech took some effort. Luckily, the panelists were very constructive in their criticism.

Outside the presentations and discussions, there are some serious issues bubbling up in DC right now that kept cropping up during the workshop. 

Firstly, the 2007 America COMPETES Act is up for reauthorization, and the House Science committee is attempting to  take an axe to NSF funding for social sciences and geosciences. In a move that would have made the British Raj (and all computer scientists) proud, they dangled a classic divide-and-rule trick over the head of the NSF by increasing the budget of CISE, engineering and MPS as a "compensation". This level of detailed guidance for budgeting is apparently unprecedented and the NSF (and the CRA) is fighting it. 

Secondly, there's a big debate going on about backdoor encryption and whether it's technically possible and/or desirable to allow government backdoors into encryption on devices like iPhones etc. I'm not particularly competent to weigh in on these issues, but there were a lot of security folks at the workshop who brought this up as their major concern during our morning pitch session. 

In any case, it's away from the US and onto Canada, for SDM 2015 in Vancouver. 

Monday, April 27, 2015

Prof. Venkatasubramanian goes to Washington

I'm at the CRA-organized Leadership in Science Policy Institute, otherwise known as LISPI. After a day of intense talks, I feel like day three of a conference: which is to say, totally drained and overwhelmed by the amount of information coming my way. Who knew that science sausage was so hard to make !

Here are some things I've learnt so far:

  • Physics is our feared and admired enemy STEM partner. 
  • I'd be glad to have half as much energy as Ed Lazowska when I get to his level. 
  • The CRA continues to be truly awesome
  • I feel like I'm in an episode of the West Wing sometimes. 
  • The budget process seems far more structured and less crisis-ridden than the media makes it out to be. 
And for all other details of the meeting, I will refer you to the Chatham House Rule.

Monday, March 30, 2015

Revisiting the Misra-Gries estimator

If you've ever taken a class on streaming algorithms, or want to ace some tech company puzzle interview question, you've probably heard of this little gem:
Given n items, determine whether a majority element (one occurring strictly more than n/2 times) exists. You are allowed one pass over the data (which means you can read the elements in sequence, and that's it), and exactly TWO units of working storage.
This is a special case of the general Misra-Gries estimator for finding frequent items in a stream with small space. The MG estimator allows you to find (approximately) all items that occur more than $\epsilon n$ times using $O(1/\epsilon)$ space. Setting $\epsilon = 1/2$ yields the desired result.

This result has been proved and reproved a number of times, and there's a fairly general proof based on the fact that what you're really doing is approximating the true frequency to an additive error of $\epsilon n$ (and so any item that occurs more often than that will be retained).
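For reference, here is a minimal sketch of the general estimator as it's usually presented, with a hypothetical parameter `k` playing the role of $1/\epsilon$: keep at most `k - 1` counters, and when a new item arrives with no free counter, decrement everything. Each stored count undercounts the true frequency by at most $n/k$.

```python
def misra_gries(stream, k):
    """Keep at most k - 1 counters; any item occurring more than n/k times survives."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter, dropping any that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# A made-up stream of n = 11 items in which "a" occurs 6 times.
stream = list("abacabadaba")
summary = misra_gries(stream, k=3)
```

On this stream "a" is the only item occurring more than $11/3$ times, and it is retained with a count that is off by less than $11/3$. Other items may also be left in the summary; the guarantee is only that heavy items are never dropped.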

What's more interesting is how you might go about proving the basic statement from first principles if you don't know the proof. This came up at dinner, when I was talking to people who hadn't heard of the problem before and were able to reconstruct a proof on the fly.

So let's return to the basic statement, and let's be a little more precise. The precise guarantee required is as follows:

  • If a strict majority element exists, you MUST return it
  • If not, you can return any element you want (even the most infrequent one)
Because of the second statement above, we only have to worry about streams in which a strict majority element does exist. 

Observation 1: if you haven't seen a majority element in any prefix of the stream, you can throw that prefix away. 

Why: we know there's a majority element somewhere. If it didn't show up as a majority so far, it has to be at least as much of a majority in what's remaining. 

Observation 2: This prefix could be as short as two elements long. 

In other words, we see an element. Call it x. If we now see $y \ne x$, x is no longer a majority element so far, and we can chuck it, starting a new stream at $y$.

But if we do see $x$ again, then it could be a majority element. How do we keep track of whether it ever stops being a majority element ? 

We could keep track of the number of items seen so far. But that's an extra counter. Why not just pair instances of $x$ with instances of other elements seen so far by subtraction ? That is, keep a single counter for $x$, add one each time we see $x$, and subtract one each time we see anything else. If at any time we hit zero, we can invoke observation 1 and start again. 

If not, then we will end with a nonzero count for $x$, which must be the correct element. 

And this gives us the exact algorithm we need. 
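Putting the observations together, a minimal sketch looks like this (the function name is mine). The two units of storage are the current candidate and its counter.

```python
def majority_candidate(stream):
    """One pass, two units of storage: a candidate and a counter.

    If a strict majority element exists, it is returned. Otherwise the
    result is an arbitrary element of the stream (None if it is empty).
    """
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            # Observation 1: no majority in the prefix so far, so discard it.
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            # Pair this item off against one copy of the candidate.
            count -= 1
    return candidate

winner = majority_candidate(list("abaacaa"))   # "a" occurs 5 times out of 7
```

Note that this returns something even when there is no majority, which is exactly what the guarantee above allows. Confirming that the answer really is a majority takes a second pass.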

Thursday, March 19, 2015

The coming funding crunch

It's no secret we're in the middle of another boom in CS enrollment. Departments everywhere are struggling to keep up with the increased number of students wanting to get into our programs.

But there's a chain reaction with potentially nasty consequences coming our way. And while some of this will be obvious to academic types, it might not be obvious to the community at large, or even graduate students considering going into academia. 

  • As departments everywhere try to adjust to the new load of students, they have two options: hire a lot more teaching faculty to teach classes, or start negotiating with their universities for more tenure-track positions. The latter is clearly preferable (if you don't believe me, look at the plight of adjuncts in the humanities). But....
  • Universities can live with hiring more faculty, because the increased tuition from more students helps justify it, and more faculty means more research grants, and more hafta for the administration. But...
  • More faculty means more applications for research awards, to the various organizations that dole out money (NSF, NIH, DARPA, ONR, DoE, ...). But...
Have you seen science budgets lately ? They're basically flat. 

We've seen this Ponzi scheme before in biology, and the consequence of that is that the average age of a first time PI has crossed 40, coupled with increased time spent doing postdocs. It's now the norm rather than the exception in CS to see faculty candidates with at least one postdoc. 

And there's no easy way to de-escalate. All the possible dams we can build are bad, or difficult to execute on: 
  • Don't hire more faculty. Then we're stuck with ever increasing class sizes and lower quality of student education.
  • Hire more contingent faculty. This might be a short term solution, but it's a horrible way to treat new Ph.Ds, and frankly given the options out there in industry, you wouldn't get as many takers (at least if people think rationally about this)
  • De-link academic success from funding, or encourage the teaching mission more. This is a complete non-starter at most schools that rely on overhead. And for R1 universities, research dollars are not just a bottomline factor, but are a prestige element. 
And don't even get me started on how this is going to affect our already stretched-near-breaking-point conference review process. 

Sunday, February 22, 2015

STOC 2015 Call for Workshops and Tutorials

I've had two weeks from hell with proposal deadlines, paper deadlines, faculty candidates and ever-more demanding hungry cries of the baby chickadees  meetings with my students.

Thankfully, all that is now in the past and I can sally forth to the next deadline. Which is, you ask ?

STOC 2015 will have a workshops and tutorials day on Sunday June 14, the day before the conference starts. +Sanjeev Khanna and +Chandra Chekuri are running the event, and they want your proposals.
We invite groups of interested researchers to submit workshop or tutorial proposals. The goal of a workshop is to provide an informal forum for researchers to discuss important research questions, directions, and challenges. Connections between theoretical computer science and other areas, topics that are not well represented at STOC, and open problems are encouraged as workshop topics. Organizers are completely free to choose their workshop formats (invited speakers, panel discussions, etc.). The program for June 14th may also involve tutorials, each consisting of 1-2 survey talks on a particular area, and we welcome tutorial proposals as well.
Deadline for submissions is Mar 15, 2015. For more details on the format etc, visit the workshops/tutorials page

Monday, February 16, 2015

Research customs for the 21st century.

I've begun to notice a number of customs that seem unique to our modern process of doing collaborative research in the 21st century. Most of them are technology-driven, and most of them involve annoying debates about the multitude of choices available for collaborating.

  • Research Meetings: whether to do them over Skype, or Google Hangouts, or even with phones on a conference call. 
  • Audio/Video/Chat: if over Skype/G+, whether to do audio, or video, or chat. And then the obligatory pre-videoconference primping to look presentable (or the reliance on bandwidth failure to NOT go on video)
  • Coordination: careful calculations among multiple time zones and daylight savings times to plan meetings. The use of Doodle, Google calendar, or even random emails passed around haphazardly to plan said meetings.
  • Writing I: the protracted negotiations over whether to use git, svn, rcs, or cvs, or Dropbox, or random emails passed around haphazardly (and yes I've done all of this, sometimes at the same time)
  • Writing II: if using an actual modern VCS, then the equally protracted negotiations over who's hosting it, who needs account access, and where public keys need to be placed, and why CS researchers in the 21st century STILL need tutorials on ssh. 
  • Writing III: Dealing with comments on a writeup: as fixmes, as todonotes, as issues in the git repository hosting the document, or as random emails passed around haphazardly. 

And of course all the collaborator "digital" personalities that emerge irrationally and violently. Alice hates using bibtex, and Bob will only use personally crafted bibtex files. Charlie loves his own special fonts and will have raging debates over $\varepsilon$ vs $\epsilon$. Dan has never used version control and doesn't see the point. Erin handcoded her own version control system in a cutting-edge fragment of Haskell and refuses to use anything else. Frank hates online meetings, and Mallory only thinks online during a meeting. Oscar insists that Postscript is Turing-complete and is therefore sufficient for all drawing needs. Peggy insists that git rebase is Turing-complete and is therefore sufficient to fix all commit disasters... eventually.

Note to all my collaborators who I'm currently writing papers with: you are awesome and have NOTHING AT ALL to do with this list.

Wednesday, February 11, 2015

One Body Problems

It's hiring season, and we've all heard about two-body problems.

But you may not have heard of the one-body problem:
Is the place a single person takes a job in likely to have enough other single people around to facilitate searching for a partner ? 
I was 'spoken for' before I even finished my Ph.D, and never had to deal with this (rather, I had to work with the more traditional (ha!) two-body problem).

Friday, February 06, 2015

Friday chest-thumping.

  • My paper with Amirali Abdullah (or should I say Amir's paper with me) got into STOC ! One aspect not discussed in the linked blog post is the connection to Partial Match (a notoriously hard problem in the data structures literature).  In brief, our result allows for a "smooth-ish" interpolation between approximate near neighbor lower bounds in $\ell_1$ and general partial match lower bounds, reinforcing the intuition that Partial Match is an "extremely asymmetric" nearest neighbor problem. 
  • Another paper with 'the Streaming Kings' (said to the sound of a flamenco strum) of Cormode, Chakrabarti, McGregor and Thaler got into CCC (and this is my first paper at Complexity). I'll have more to say about this work in a separate post: in brief, we looked at streaming interactive proofs (of the kind first developed by Cormode, Thaler and Yi) where you have a prover and a streaming verifier and wish to verify a computation with constant rounds of communication. There's a 3-round nearest neighbor protocol in this paper, as well as a number of subtle results about the nature of multi-round communication protocols in the streaming setting. 
"Well Suresh, you have two new papers, what are you going to do next" ?

Wednesday, February 04, 2015

No one is in control...

The New Scientist has a cover this week on the way in which algorithms have taken over our lives. They interviewed some of the participants at the FATML workshop I participated in (thanks to Solon Barocas for sending the reporter our way), and Sorelle Friedler got some nice quotes in regarding the work she, Carlos Scheidegger and I have been doing.

The article is currently paywalled, but a friend of mine sent me the text. The relevant paragraph, lightly edited (the "she" here is Sorelle):

By understanding the biases inherent in the underlying data, she hopes to eliminate bias in the algorithm. Her system looks for correlations between arbitrary properties – like height or address – and demographic groupings like race or gender. If the correlation is expected to lead to unwanted bias, then it would make sense to normalise the data. It is essentially affirmative action for algorithms, she says.
I'm not sure I'd describe our work as affirmative action :), but that's a longer argument.

Monday, February 02, 2015

(the dangers of) Data Mining hits the Super Bowl (ads) and prime time comedy

I get strange looks from people in my work circles when I admit to watching American football, and stranger looks from people in my not-work circles when I admit to watching the Super Bowl for the game.

I didn't avoid the ads though. And two ads (both from Esurance) riffed off the idea of segmenting: that the business goal of data mining customer data is to segment people into little categories that can be targeted the same way. The ads are a little heavy-handed, but then I don't think most people actually know how data mining works to segment them. And getting Bryan Cranston to play a "sort-of" pharmacist was brilliant.

Esurance: "Sorta Pharmacy"


Esurance: "Sorta Mom"

And finally, an entire 30 minute show centered around the idea of a Facegoogle-like company invading our privacy by reading our texts/emails etc. If you don't watch Parks and Recreation, then it's too difficult to summarize six-odd seasons of shenanigans, but in brief: A facebook-google mashup-like company is setting up shop in a small town in Indiana, and is sending people free gifts via Amazon-like shopping drones based on analyzing their private communication, and then shenanigans ensue. If you can't stand the horror of watching network TV, then there's a short clip here:


 and the full episode is on Hulu (for now) here.

Thursday, January 29, 2015

More FOCS 2014-blogging

In the spirit of better late than never, some more updates from Amirali Abdullah from his sojourn at FOCS 2014. Previously, he had blogged about the higher-order Fourier analysis workshop at FOCS.

I'll discuss now the first official day of FOCS, with a quick digression into the food first: the reception was lovely, with some nice quality beverages, and delectable appetizers which I munched on to perhaps some slight excess. As for the lunches given to participants, I will think twice in future about selecting a kosher option under dietary restrictions. One hopes for a little better than a microwave instant meal at a catered lunch, with the clear plastic covering still awaiting being peeled off. In fairness to the organizers, once I decided to revert to the regular menu on the remaining days, the chicken and fish were perfectly tasty.

I will pick out a couple of the talks I was most interested in to summarize briefly. This is of course not necessarily a reflection of comparative quality or scientific value; just which talk titles caught my eye.

The first talk is "Discrepancy minimization for convex sets" by Thomas Rothvoss. The basic setup of a discrepancy problem is this: consider a universe of $n$ elements, $[n]$, and a set system of $m$ sets ($m$ may also be infinite), $S = \{S_1, S_2, \ldots, S_m \}$, where $S_i \subset [n]$. Then we want to find a $2$-coloring $\chi : [n] \to \{-1, +1 \}$ such that each set is as evenly colored as possible. The discrepancy then measures how unevenly colored some set $S_i \in S$ must be under the best possible coloring.
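To make the definition concrete, here is a minimal brute-force sketch. It is only an illustration of the definition, not anything from the paper: the function names and the toy set system are hypothetical, elements are 0-indexed, and the search over all $2^n$ colorings is only sensible for tiny $n$.

```python
from itertools import product

def discrepancy_of_coloring(sets, coloring):
    """Largest imbalance |sum of colors| over all sets, for one fixed coloring."""
    return max(abs(sum(coloring[j] for j in s)) for s in sets)

def min_discrepancy(n, sets):
    """Try all 2^n colorings and return (best discrepancy, a coloring achieving it)."""
    best = None
    for colors in product((-1, +1), repeat=n):
        d = discrepancy_of_coloring(sets, colors)
        if best is None or d < best[0]:
            best = (d, colors)
    return best

# Hypothetical toy set system on elements 0..3.
toy_sets = [{0, 1}, {1, 2}, {2, 3}, {0, 1, 2, 3}]
disc, chi = min_discrepancy(4, toy_sets)
```

On this toy instance an alternating coloring balances every set perfectly, whereas coloring everything $+1$ leaves the full set with imbalance 4.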

One fundamental result is that of Spencer, which shows there always exists a coloring of discrepancy $O(\sqrt{n})$. This shaves a logarithmic factor off of a simple random coloring, and the proof is non-constructive. This paper by Rothvoss gives the first algorithm that serves as a constructive proof of the theorem.

The first (well-known) step is that Spencer's theorem can be recast as a problem in convex geometry. Each set $S_i$ can be converted to a geometric constraint in $R^n$, namely the region $\{ x \in R^n : | \sum_{j \in S_i} x_j | \leq 100 \sqrt{n} \}$. The intersection of these constraints defines a polytope $K$, and $K$ contains a point of the hypercube $\{-1 , +1 \}^n$ iff there is a valid low-discrepancy coloring.

One can also of course do a partial coloring iteratively - if a constant fraction of the elements can be colored with low discrepancy, it suffices to repeat.

The algorithm is surprisingly simple and follows from the traditional idea of trying to solve a discrete problem from the relaxation. Take a point $y$ which is generated from the spherical $n$-dimensional Gaussian with variance 1. Now find the point $x$ closest to $y$ that lies in the intersection of the constraint set $K$ with the continuous hypercube $[-1, +1]^n$. (For example, by using the polynomial time ellipsoid method.) It turns out some constant fraction of the coordinates of $x$ are actually tight (i.e., integer valued in $\{-1, +1 \}$) and so $x$ turns out to be a good partial coloring.

To prove this, the paper shows that with high probability all subsets of $[-1, +1]^n$ with very few tight coordinates are far from the starting point $y$, whereas with high probability the intersection of $K$ with some set having many tight coordinates is close to $y$. This boils down to showing the latter has sufficiently large Gaussian measure, which can be shown by standard tools in convex analysis and probability theory. Or to rephrase, the proof works by arguing about the isoperimetry of the concerned sets.

The other talk I'm going to mention from the first day is by Karl Bringmann on the hardness of computing the Frechet distance between two curves. The Frechet distance is a measure of curve similarity, and is often popularly described as follows: "if a man and a dog each walk along two curves, each with a designated start and finish point, what is the shortest length leash required?"

The problem is solvable in $O(n^2)$ time by simple dynamic programming, and has since been improved to $O(n^2 / \log n)$ by Agarwal, Avraham, Kaplan and Sharir. It has long been conjectured that there is no strongly subquadratic algorithm for the Frechet distance. (A strongly subquadratic algorithm being defined as $O(n^{2 -\delta})$ complexity for some constant $\delta$, as opposed to say $O(n^2 / \text{polylog}(n))$.)
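For readers who haven't seen it, the quadratic dynamic program is easiest to state for the discrete version of the problem, where the man and the dog hop between vertices. The following is a minimal sketch of that textbook recurrence, assuming curves are given as lists of points; the function name is hypothetical, and this is not the improved algorithm mentioned above.

```python
from math import dist, inf

def discrete_frechet(P, Q):
    """Discrete Frechet distance between point sequences P and Q, in O(len(P) * len(Q)) time."""
    n, m = len(P), len(Q)
    # dp[i][j] = shortest leash that lets the walkers reach P[i] and Q[j]
    # from the start, each either staying put or advancing one vertex per step.
    dp = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(P[i], Q[j])
            if i == 0 and j == 0:
                dp[i][j] = d
                continue
            best_prev = min(
                dp[i - 1][j] if i > 0 else inf,
                dp[i][j - 1] if j > 0 else inf,
                dp[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            # The leash must cover the current pair and everything before it.
            dp[i][j] = max(best_prev, d)
    return dp[n - 1][m - 1]

# Two parallel horizontal segments at vertical distance 1.
parallel = discrete_frechet([(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)])
```

The table has one cell per pair of vertices, which is exactly where the $n^2$ comes from.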

The work by Bringmann shows this conjecture to be true, assuming SETH (the Strong Exponential Time Hypothesis), or more precisely that there is no $O^*((2- \delta)^N)$ algorithm for CNF-SAT. The hardness result holds for both the discrete and continuous versions of the Frechet distance, as well as for any $1.001$ approximation.

The proof works on a high level by directly reducing an instance of CNF-SAT to two curves where the Frechet distance is smaller than $1$ iff the instance is satisfiable. Logically, one can imagine the set of variables are split into two halves, and assigned to each curve. Each curve consists of a collection of "clause and assignment" gadgets, which encode whether all clauses are satisfied by a particular partial assignment. A different such gadget is created for each possible partial assignment, so that there are $O^*(2^{N/2})$ vertices in each curve. (This is why solving Frechet distance by a subquadratic algorithm would imply a violation of SETH.)

There are many technical and geometric details required in the gadgets which I won't go into here. I will note admiringly that the proof is surprisingly elementary. No involved machinery or complexity result is needed in the clever construction of the main result; mostly just explicit computations of the pairwise distances between the vertices of the gadgets.

I will have one more blog post in a few days about another couple of results I thought were interesting, and then comment on the Knuth Prize lecture by the distinguished Dick Lipton.

Tuesday, January 27, 2015

Streaming @ SODA: Part II

This is the second of two posts by Samira Daruki on the streaming sessions at SODA 2015. For the first post, see here. 

In the third paper from the streaming graph family in SODA15: "Parameterized Streaming: Maximal Matching and Vertex Cover", Chitnis, Cormode, Hajiaghayi and Monemizadeh introduce a new approach to handling graph streams called parameterized streaming algorithms. In addition to the insertion-only model, they consider the dynamic model of streaming graphs, in which the input is a sequence of insertions and deletions of edges.

This dynamic model of streaming graph processing is popular when the graph size is changing, and has recently received much attention due to breakthroughs by Ahn, Guha and McGregor (one, and two).  Over these two papers, they showed the first results for a number of graph problems over dynamic streams. This has provoked much interest in what can be computed over dynamic graph streams, although there is still not much work on solving graph optimization problems in this model. The challenge here is that when an edge is deleted, it sometimes requires substantial work to repair the solution, so we need to make sure that the algorithm has enough information to do so, while keeping only a bounded amount of working space. (ed: not surprisingly, some of these ideas are useful for non-streaming dynamic graph algorithms: witness the paper by Kapron, King and Mountjoy on dynamic connectivity in (randomized) worst-case polylog time from SODA a few years ago)

Returning to parametrized streaming, in this paper instead of solving exactly the optimization problem on the graph stream, the goal is to solve the “parametrized” version of the problem, where the parameter $k$ is given and we want to solve the following decision problem:
Is there a solution with size bounded by $k$? 
The motivation behind parametrizing the problem comes from real world applications in which the solution of the graph problem is small compared to the size of the input (i.e. sublinear in the size of input). In these cases, the interesting challenge is to solve the graph optimization problems in streaming fashion using space bounded by some function of the “solution size” instead of the “input size”.

One of the techniques used to solve parameterized problems is kernelization, which uses polynomial time preprocessing to map the input to another equivalent input of smaller size $f(k)$ (called a kernel) with a new parameter value $k' \le g(k)$, for a computable function $g$.

In this paper, by combining kernelization techniques with randomized sketch structures, the first streaming algorithms for the parameterized version of the Vertex Cover problem are obtained. The main idea here is to maintain a maximal matching of the underlying graph in a streaming fashion, and then run the well-known kernelization algorithm for Vertex Cover on the maintained maximal matching. The data structure that maintains the maximal matching uses the $k$-sample recovery sketching algorithm, a generalization of linear sketching for $\ell_0$-sampling, as its main tool, and applies it to the neighborhood (incident edges) of each vertex in the resulting matching. So as the edges are inserted or deleted, these sketches can be updated without needing knowledge of the full neighborhood of nodes. However, there are some challenges with deletion of edges: as the edges are deleted we need to have an intelligent mechanism to ensure the matching remains maximal using only limited stored information.
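To give a feel for what a Vertex Cover kernel looks like, here is a minimal offline sketch of the classic high-degree rule usually attributed to Buss. This is a swapped-in textbook kernelization, not the matching-based streaming construction from the paper, and the function name and return convention are hypothetical. It assumes a simple graph given as a list of vertex pairs.

```python
from collections import defaultdict

def buss_kernel(edges, k):
    """Return (forced_vertices, kernel_edges, remaining_k),
    or None if no vertex cover of size <= k can exist."""
    edges = {frozenset(e) for e in edges if e[0] != e[1]}
    forced = set()
    while True:
        if k < 0:
            return None
        degree = defaultdict(int)
        for e in edges:
            for v in e:
                degree[v] += 1
        high = [v for v, d in degree.items() if d > k]
        if not high:
            break
        # A vertex with more than k incident edges must be in any cover of size <= k.
        v = high[0]
        forced.add(v)
        edges = {e for e in edges if v not in e}
        k -= 1
    # Every remaining vertex covers at most k edges, so k vertices cover at most k*k.
    if len(edges) > k * k:
        return None
    return forced, edges, k

# Star with center 0 and five leaves, plus one extra edge, with budget k = 2.
star_plus_edge = [(0, i) for i in range(1, 6)] + [(6, 7)]
result = buss_kernel(star_plus_edge, 2)
```

The surviving instance has at most $k'^2$ edges, which is the sense in which the kernel size depends only on the parameter and not on the input.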

Another nice result here is a tight lower bound of $\Omega(k^2)$ (by reducing from the INDEX problem in communication complexity) on the space complexity of any (randomized) streaming algorithm for parameterized Vertex Cover, which holds even in the insertion-only model.

Besides the general insert-only and dynamic models, a more restricted dynamic model is also discussed, in which we are promised that at every time $i$, the size of the vertex cover of the underlying graph induced by the edges seen up to that point is at most $k$. With this promise, they develop a dynamic parameterized streaming algorithm whose space usage matches the proved lower bound.

It is interesting to think about other NP-hard problems in the framework of parameterized streaming and explore how kernelization can be helpful in this direction, or see whether we can find other powerful hammers to overcome the challenges which arise in designing algorithms for hard problems in the streaming setting.

Along with the three papers discussed above, there was another paper on streaming presented at SODA (by Naumovitz and Saks) which provides a deterministic polylogarithmic-space streaming algorithm for approximating distance to monotonicity for a sequence of $n$ numbers, compared to the corresponding randomized result presented at SODA two years ago.

While I won't discuss this work here so as to keep these posts  just about streaming graph algorithms, I encourage the interested readers to take a look at this paper as well, as the last one in the streaming family of SODA15.

Streaming @ SODA: Part I

This two part series is written by my student Samira Daruki

Modern graph data sets are too large to fit in memory. And so the streaming model is one of the more popular and attractive ones for analyzing massive graphs: in this model, for an input graph $G = (V, E)$ with $n$ vertices and $m$ edges, the edges arrive in an arbitrary order in a stream and the algorithm can only use $\tilde{O}(n)$ space. The algorithm is allowed to make several passes over the stream, but usually the ideal case is to have just one pass.

For many graph problems such as matching, spanning tree, graph sparsification, approximate distance and counting subgraphs there now exist streaming algorithms with space complexity $O(n \text{poly} (\log n))$. In these algorithms, usually we assume that the nodes of the graphs can be stored in memory but edges cannot. Notice that for constructing  matchings, spanners and sparsifiers, the output size is often $\Omega(n)$, so it forces the streaming algorithm to use $\Omega(n)$ space.

But if you don't need to report the output, then this can be avoided. For an example, see the work of Kapralov, Khanna and Sudan from SODA 2014 which approximates the matching size to a $\text{poly}(\log n)$ factor using $\text{poly}(\log n)$ space in a random stream (where edges appear in a random order rather than arbitrarily)

Thus, the question now is:
can we obtain a $\text{poly}\log$ space streaming algorithm for approximating the solution cost for other graph problems? 
Consider for instance MAX-CUT. There are several results on approximating the maximum cut in graphs in a non-streaming model; the trivial approach is to take a random cut. This selects half of the edges in expectation, resulting in a factor $1/2$-approximation.

This implies that in a streaming model we can just count the number of edges $m$ and output $\frac{m}{2}$, which results in an $O(\log n)$-space algorithm. By keeping a sample of the edge set we can get a different tradeoff: a $(1+\epsilon)$-approximation algorithm which uses $O(\frac{n}{\epsilon^2})$ space.
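The counting algorithm really is as simple as it sounds. Here is a minimal sketch, with a brute-force MAX-CUT on a toy graph added purely as a sanity check; the function names and the 5-cycle example are hypothetical, and the graph is assumed to be simple (no self-loops).

```python
from itertools import product

def maxcut_estimate_from_stream(edge_stream):
    """One pass and a single counter: output m/2, which is within a factor 2 of MAX-CUT."""
    m = 0
    for _ in edge_stream:
        m += 1
    return m / 2

def brute_force_maxcut(n, edges):
    """Exact MAX-CUT value by trying all 2^n bipartitions; toy sizes only."""
    return max(
        sum(1 for u, v in edges if side[u] != side[v])
        for side in product((0, 1), repeat=n)
    )

# 5-cycle: an odd cycle, so one edge is always left uncut.
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
estimate = maxcut_estimate_from_stream(iter(cycle))
exact = brute_force_maxcut(5, cycle)
```

The estimate always lands between half the true value and the true value, since the maximum cut has at least $m/2$ and at most $m$ edges.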

Can we get a streaming algorithm with better than factor-$2$ approximation using just $\text{poly}(\log n)$ space?

A paper by Kapralov, Khanna and Sudan in the streaming session of SODA15 this year answers this question. This is the latest in a line of results on streaming graph problems by Kapralov and others from SODA 2012, 2013 and 2014 (mentioned above)

Here is their main result
For any constant $\epsilon > 0$ a single pass streaming algorithm for approximating the value of MAX-CUT to a factor $2 - \epsilon$ requires $\Omega(\sqrt{n})$ space, even in the random order model.
This result rules out the possibility of $\text{poly}(\log n)$ space, but suggests that $\tilde{O}(\sqrt{n})$ space cost might be possible in some specific settings.

Another major result of this paper is as follows:

For any constant $\epsilon > 0$ a single pass streaming algorithm for approximating MAX-CUT value to factor $1 + \epsilon$ requires $n^{1-O(\epsilon)}$ space in adversarial streams.

The main fact they  use here is the connection between the MAX CUT value and (distance from) bipartiteness:
if a graph $G$ with $m$ edges is $\beta$-far from being bipartite, then the MAX-CUT value of $G$ is at most $(1-\beta)m$. 
The other simple observation is that any algorithm that computes a $\gamma$-approximation to MAX-CUT distinguishes between bipartite graphs and graphs that are $(1-\frac{1}{\gamma})$-far from being bipartite. Thus to show that no streaming algorithm using space $c$ can achieve a $\gamma$-approximation with failure probability at most $\delta$, it's enough to show that no streaming algorithm using space $c$ can distinguish between bipartite graphs and graphs which are $(1- \frac{1}{\gamma})$-far from being bipartite with probability at least $1- \delta$.
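To spell out why an approximation algorithm acts as a distinguisher, here is a short derivation. It assumes the convention that a $\gamma$-approximation outputs a value $v$ with $\mathrm{OPT}(G)/\gamma \le v \le \mathrm{OPT}(G)$, where $\mathrm{OPT}(G)$ is the MAX-CUT value; the symbols $v$ and $\mathrm{OPT}$ are notation introduced here for illustration, not the paper's.
$$ G \text{ bipartite} \;\Rightarrow\; \mathrm{OPT}(G) = m \;\Rightarrow\; v \ge \frac{m}{\gamma} $$
$$ G \text{ is } \left(1 - \tfrac{1}{\gamma}\right)\text{-far from bipartite} \;\Rightarrow\; \mathrm{OPT}(G) \le \left(1 - \left(1 - \tfrac{1}{\gamma}\right)\right) m = \frac{m}{\gamma} \;\Rightarrow\; v \le \frac{m}{\gamma} $$
So comparing $v$ against the threshold $m/\gamma$ (with $m$ counted in the same pass) separates the two cases, up to the boundary case of exact equality, which this sketch glosses over.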

Given these facts, the core idea in proving the main result is to exhibit a distribution over inputs on which a $(2-\epsilon)$-approximation to MAX-CUT requires $\Omega(\sqrt{n})$ space.

More precisely, the input graph instances are based on random Erdos-Renyi graphs, which are either bipartite in the YES case or non-bipartite in the NO case. In order to achieve a $(2-\epsilon)$-factor gap for MAX-CUT in this structure, we choose the expected degree of vertices to be $\Theta(\frac{1}{\epsilon^2})$.

This way, the input streaming graph can be partitioned and given to the algorithm in $\Omega(\frac{1}{\epsilon^2})$ phases, which can be simulated as a $\frac{1}{\epsilon^2}$-party one-way communication game.

Then, by giving a reduction from a variation of Boolean Hidden Matching (BHM) called Distributional Boolean Hidden Partition (D-BHP) to MAX-CUT on the input instance, and showing that $\Omega(\sqrt{n})$ space is necessary to differentiate between these two cases, the main streaming lower bound for approximating MAX-CUT follows.

There are many technical details in performing this reduction, but roughly speaking they show that any algorithm that solves MAX-CUT on the constructed instances must solve the two-party communication problem in at least one of the phases.

There are still some major open problems left by this paper:

  • One is whether breaking the $2$-approximation barrier in $\sqrt{n}$ space is possible if we are allowed to have $\text{poly}(\log n)$ passes over the input stream?
  • Also it is interesting to think about designing streaming algorithms for other graph problems using $o(n)$ space.

This brings us to another paper presented in this session. In Streaming Algorithms for Estimating the Matching Size in Planar Graphs and Beyond (by Esfandiari, Hajiaghayi, Liaghat, Monemizadeh and Onak), this latter question is answered about finding the maximum matching for planar graphs using $o(n)$ space.

Here is the main result:
If the underlying graph is planar, then there is a streaming algorithm which provides a $O(1)$-approximation solution to maximum matching with high probability using $O(n^{\frac{2}{3}})$ space.
The main idea for proving the result here is to use a known structural graph property:
If we partition the nodes of the input graph by degree into two groups, light (deg < 9) and heavy (deg > 9) vertices, and then define the shallow edges as the ones with two light endpoints, then we have the following nice property (assuming |maximum matching| = m, |shallow edges| = s and |heavy vertices| = h): 
$$ \frac{\max (s, h)}{12} \leq m \leq h + s$$

Using this structural fact as the main tool, the algorithm greedily constructs a small matching (of size bounded by $c n^{\frac{2}{3}}$) as it reads the edges, and estimates the number of shallow edges and heavy vertices in the subgraph induced by a sample of $c n^{\frac{2}{3}}$ vertices. Together these let us approximate the size of the maximum matching to within a constant factor.
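As a sanity check on the structural inequality, here is a minimal sketch that computes $h$, $s$ and $m$ exactly on a toy planar graph. This is not the streaming algorithm: everything is computed offline, the maximum matching is found by exponential brute force, and the function names and the example graph are hypothetical. The degree thresholds are taken literally from the text (light means deg < 9, heavy means deg > 9), and the example deliberately has no vertex of degree exactly 9.

```python
from collections import Counter

def max_matching_size(edges):
    """Exponential-time brute force over edge subsets; toy graphs only."""
    if not edges:
        return 0
    (u, v), rest = edges[0], edges[1:]
    skip = max_matching_size(rest)
    take = 1 + max_matching_size([e for e in rest if u not in e and v not in e])
    return max(skip, take)

def heavy_and_shallow_counts(edges):
    """Return (h, s): number of heavy vertices and number of shallow edges."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    h = sum(1 for d in deg.values() if d > 9)            # heavy vertices
    s = sum(1 for u, v in edges if deg[u] < 9 and deg[v] < 9)  # both endpoints light
    return h, s

# A tree (hence planar): star with center 0 and leaves 1..10, plus a path 10-11-12-13.
tree = [(0, i) for i in range(1, 11)] + [(10, 11), (11, 12), (12, 13)]
h, s = heavy_and_shallow_counts(tree)
m = max_matching_size(tree)
```

Here the single heavy vertex and the three shallow path edges sandwich the matching size as the inequality predicts.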

In addition to the planar case, they show that similar results for approximating maximum matching are achievable in other graph classes such as $d$-degenerate graphs and forests.

Coming up: parametrized streaming and VERTEX COVER.
