Sunday, January 07, 2018

Report on double blind reviewing in ALENEX 2018

+Rasmus Pagh and I chaired ALENEX 2018, and we decided to experiment with double blind review  for the conference. What follows is a report that we wrote on our experiences doing this. There are some useful notes about logistics, especially in the context of a theoretically-organized conference on experimental algorithms.

ALENEX 2018 Double Blind Review

For ALENEX 2018, we decided to experiment with a double blind review process i.e one in which authors and reviewers were unaware of each others’ identity. While double blind review is now almost standard in most computer science conferences, it is still relatively uncommon in conferences that focus on theoretical computer science and related topics.

The motivation

In the original argument we presented to the ALENEX Steering Committee, we presented the following reasons for why we wanted double blind review:
1. Eliminating bias.
Andrew Tomkins did an experiment for WSDM this year and wrote a report on it: https://arxiv.org/abs/1702.00502

One particular observation:

"Reviewers in the single-blind condition typically bid for 22% fewer papers, and preferentially bid for papers from top institutions. Once papers were allocated to reviewers, single-blind reviewers were significantly more likely than their double-blind counterparts to recommend for acceptance papers from famous authors and top institutions. The estimated odds multipliers are 1.66 for famous authors and 1.61 and 2.10 for top universities and companies respectively, so the result is tangible”

2. Common practice.

Virtually every CS community except theory is doing double blind review, including most of ML (NIPS, ICML), DB (VLDB, SIGMOD), Systems (NSDI), etc. While theory papers have their own idiosyncrasies, we argued that ALENEX is much closer in spirit and paper structure to more experimental venues like the ones listed.

3. Limited burden on authors and reviewers for an experiment

There was no real logistical burden. We were not blocking people from posting on the arXiv, or talking about their work. We’re merely requiring submissions be blinded (which is easy to do). For reviewers also, this is not a problem - typically you merely declare conflicts based on domains and that takes care of the problem of figuring out who’s conflicted with what paper (but more on this later).

4. Prototyping.

While theoryCS conferences in general do not make use of double blind review, ALENEX is a small but core venue where such an experiment might reveal useful insights about the viability of double blind overall. So we don’t have to advocate changes at SODA/STOC/FOCS straight up without first learning how it might work.

5. PC submissions.

We are allowing PC members to submit papers, and this has been done before at ALENEX. In this case double blind review is important to prevent even the appearance of conflict.

The process

Before submission: We provided a submission template for authors that suppressed author names. We also instructed authors on how to refer to prior work or other citations that might leak author identity - in brief, they were asked to treat these as any third-party reference. We also asked authors to declare conflicts with PC members.
After submission/before reviews: We recognized that authors might not be used to submitting articles in double blind mode and so went over each submission after it was submitted and before we opened up bidding to PC members to make sure that the submissions were properly blinded. In a few cases (less than 10/49) we had to ask authors to make modifications.
During review process: The next step was to handle requests for subreviewers. Since PC members could not determine CoIs (conflicts of interest) on their own, all such requests were processed through the PC chairs. A PC member would give us a list of names and we would pick one. (so more private information retrieval than a zero knowledge protocol!)

Issues

A number of issues came up that appear to be unique to the theory conference review process. We document them here along with suggested mitigations.
  1. Managing the CoI process: In theoryCS conferences, subreviewing happens outside the umbrella of the PC. PC members have the power to request any number of subreviewers for papers, and this process happens after the papers are submitted. In contrast, in other venues, subreviewers essentially function as members of the PC - they are invited to be reviewers ahead of time, and are listed when the author declare conflicts of interest. This means that under the process we used, PC members cannot determine for themselves whether a subreviewer has a CoI with a paper, whereas in the alternate process, this is taken care of automatically. One possible mitigation is to ask PC members to list potential reviewers ahead of time and have them registered in the system for authors to declare CoI with. While this might generate a long list of subreviewers for authors to review, this process is customarily handled by a) allowing authors to declare conflicts by affiliation (domain name) and then b) presenting them with a filtered set of reviewers to mark conflicts with. Domain-based filtering is probably the most effective method for handling conflicts based on current or recent affiliation: it allows for reviewers to be added after the fact, and systems like Microsoft’s CMT allow for it.
  2. The difficulty of hiding identity based on prior work: In experimental work, a group will often write a series of papers that builds on infrastructure that they’ve developed. The relative difficulty of building such infrastructure also means that groups become “known” for working in certain areas. This made it a little difficult for authors to honestly blind their identity, because their papers clearly built on software that wasn’t publicly available and therefore had to be part of their group. The solution of blinding that reference itself does not always work because then it is hard to evaluate the quality of the work described in the paper. We note that this problem occurs in other, more experimental parts of CS. The typical solution is to continue with the blinding effort in any case, and make an effort to release code publicly, so anyone could have used the code being built on. In our view, this is a less significant problem than the first point. To this end, here are some guidelines from CHI and CSCW (both ACM conferences).
  3. Is the paper provably double blind? A common complaint about double blind review is that it is not perfect -- that it’s possible with some effort to determine some subset of the authors with some degree of certainty. The response that we gave when asked, and that is usually given, is that the goal of double blind review is not to provide a zero knowledge protocol, but to prevent the immediate implicit bias that comes from directly seeing the author names prior to reading the paper. We note that this is a common complaint from people in the theory community: however our experience with double blind review in other venues has been that after a while, one gets accustomed to reviewing papers without attempting to first determine the authors and the process works as intended.

Feedback

We also solicited feedback from the program committee after the review process was complete. We asked them three questions:
  1. What did you like (and what worked) about the double blind review process instituted this year for ALENEX?
  2. What in your opinion did NOT work?
  3. Is there any other feedback you'd like to provide about the double blind review process?
The responses we got for question 1 were uniformly of the form of “I’m glad that we did it, and felt that papers got a fairer shake than they would have otherwise”. One PC member even said unequivocally that the double blind review process made it more likely that they would submit to ALENEX in the future.  
On question 2, PC members brought up the issues we raised above, recommending that we make it clearer to authors how they need to blind their submissions and also mentioning the difficulty of assigning subreviewers.  
On question 3, the feedback was uniformly positive in favor of continuing with double blind review, inspite of the issues raised.

Summary

On balance, our experience with double blind review was positive. While there were logistical issues, many of these can be resolved by the methods we describe above. Further, there is now a wealth of knowledge accumulated in other areas of computer science that we can learn from. SIGPLAN has built a comprehensive FAQ around this issue that answers many of the questions that arise.
We recommend continuing with double blind review for at least the next two years at ALENEX, firstly because this brings us in line with common practice in most other parts of computer science, and secondly because many of the logistical issues people face with DB review will go away with habituation. At that point, the potential inconvenience of the process reduces and will be outweighed by the benefits in transparency and lack of bias.


No comments:

Post a Comment

Disqus for The Geomblog