Tuesday, July 24, 2007

STOC/FOCS/SODA: The Cage Match (with data!)

(Ed: This post, and the attendant web page and data, was initiated as a joint effort of Piotr Indyk and myself. Many others have helped improve the data, presentation and conclusions.)

Inspired by Michael Mitzenmacher's flamebait post on SODA/STOC/FOCS, we decided to roll up our sleeves and resolve the biggest outstanding issue in Theoretical Computer Science, namely the great "STOC/FOCS vs. SODA" debate ("P vs. NP" is a distant second, sullied by all that money being offered to solve it). We have some interesting preliminary observations, and there are many interesting problems left open by our work ;)

The hypothesis:
There is a significant difference in citation patterns between STOC/FOCS and SODA
The plan:

First, we obtained the list of titles of conference papers appearing in STOC, FOCS and SODA in the last 10 years (1997-2006). We deliberately excluded 2007 because FOCS hasn't happened yet. We got this list from DBLP (Note: DBLP does not make any distinction between actual papers and tutorials/invited articles; we decided to keep all titles because there weren't that many tutorials/invited papers in any case).

For each title, we extracted the citation count from Google Scholar, using a process that we henceforth refer to as "The Extractor". Life is too short to describe what "The Extractor" is. Suffice it to say that its output, although not perfect, has been verified to be somewhat close to the true distribution (see below).
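
For the curious, here is a minimal sketch of what a process along these lines might look like (in Python, and emphatically not the actual Extractor): it queries Google Scholar for a quoted title and greps the first "Cited by N" link out of the result page. Google Scholar has no official API and throttles automated clients, so the URL, the parsing, and the pause below are all best-effort assumptions.

# NOT the actual Extractor: a bare-bones sketch that asks Google Scholar
# for a quoted title and pulls out the first "Cited by N" it sees.
# Scholar rate-limits and CAPTCHAs bots, so treat this as illustrative only.
import re
import time
import urllib.parse
import urllib.request

def citation_count(title):
    url = ("https://scholar.google.com/scholar?q="
           + urllib.parse.quote('"%s"' % title))
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req).read().decode("utf-8", "replace")
    match = re.search(r"Cited by (\d+)", html)
    return int(match.group(1)) if match else None

for title in ["Skip graphs", "The similarity metric"]:
    print(title, citation_count(title))
    time.sleep(10)  # be polite, or Scholar will block you even faster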

The results, and methodology, can be found at this link. The tables and graphs are quite self-explanatory. All source data used to generate the statistics are also available; you are welcome to download the data and make your own inferences. We'll be happy to post new results here and at the webpage.

OBSERVATIONS:

The main conclusion is that the hypothesis is valid: a systematic discrepancy between citation counts of SODA vs. STOC/FOCS does appear to exist. However, the discrepancy varies significantly over time, with the years 1999-2001 showing the highest variation. It is interesting to note that 1999 was the year when SODA introduced four parallel sessions as well as the short paper option.

Although most of the stats for STOC and FOCS are quite similar, there appears to be a discrepancy at the end of the tail. Specifically, the 5 highest per-year citation counts for STOC (attained in the years 1997-2001) all exceed the highest per-year citation count for FOCS (attained in 2001). (Note: the highest-cited STOC article in 2001 was Christos Papadimitriou's tutorial paper on algorithms and game theory.) The variation between SODA and STOC/FOCS in the 1999-2001 range shows up here too, between STOC and FOCS themselves. So maybe it's just something weird going on in those years. Who knows :)

Another interesting observation comes from separating the SODA cites into long and short paper groups (for the period 1999-2005). Plotting citations for short vs long papers separately indicates that the presence of short papers caused a net downward influence on SODA citation counts, but as fewer and fewer shorts were accepted, this influence decreased.
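
For anyone who wants to redo this split at home, the computation is just a per-year grouped median. A minimal sketch, assuming a hypothetical file soda.csv with columns year, type and cites, where type is "short" or "long":

# Sketch: per-year median citation counts for SODA short vs. long papers.
# Assumes a hypothetical soda.csv with columns: year,type,cites
import csv
from collections import defaultdict
from statistics import median

groups = defaultdict(list)  # (year, type) -> list of citation counts
with open("soda.csv") as f:
    for row in csv.DictReader(f):
        groups[(int(row["year"]), row["type"])].append(int(row["cites"]))

for year in sorted({y for y, _ in groups}):
    for kind in ("short", "long"):
        cites = groups.get((year, kind), [])
        print(year, kind, median(cites) if cites else "n/a")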

There are other observations we might make, especially in regard to what happens outside the peak citations, but for that we need more reliable data. Which brings us to the next point.

VALIDATION OF THE DATA:

To make sure that the output makes sense, we performed a few "checks and balances". In particular:
  • we sampled 10 random titles from each of FOCS, STOC and SODA for each of the 10 years, and for each title we checked the citation count by hand. Results: there were 7 mistakes in FOCS, 9 in STOC, and 11 in SODA, indicating a current error rate in the 9-10% range.
  • for each of FOCS, STOC, SODA, we verified (by hand) the values of the top 10 citation numbers, as reported by The Extractor.
  • we compared our stats for the year 2000 with the stats obtained by Michael. The results are pretty close:

              Median (Mike's/Ours)    Total (Mike's/Ours)
    FOCS      38/38                   3551/3315
    STOC      21/21                   3393/2975
    SODA      14/13                   2578/2520

A CALL FOR HELP:
Warning: the data displayed here is KNOWN to contain errors (our estimate is that around 10% of citation counts are incorrect). We would very much appreciate any efforts to reduce the error rate. If you would like to help:
  1. choose a "random" conference/year pair (e.g., STOC 1997)
  2. check if this pair has already been claimed on this blog; if yes, go to (1)
  3. post a short message claiming your pair (e.g., "CLAIMING STOC 1997") on the blog.
  4. follow the links to check the citations. For each incorrect citation, provide two lines: (1) paper title (2) a Google Scholar link to the correct paper
  5. Email the file to Suresh Venkatasubramanian.
(and yes, we know that this algorithm has halting problems). Many thanks in advance. Of course, feel free to send us individual corrections as well.

Citation guidelines: The world of automatic citation engines is obviously quite messy, and sometimes it is not immediately clear what is the "right" citation count of a paper. The common "difficult case" is when you see several (e.g., conference and journal) versions of the same paper. In this case, our suggestion is that you ADD all the citation counts that you see, and send us the links to ALL the papers that you accounted for.
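
To make the bookkeeping concrete, here is a toy sketch of the intended merging, with a deliberately naive title normalization (real matching of conference vs. journal versions still needs a human eye, which is why we ask for the links):

# Toy version of the "ADD the counts" rule: sum the citation counts of
# Scholar entries whose (naively) normalized titles coincide.
from collections import defaultdict

def normalize(title):
    kept = "".join(c for c in title.lower() if c.isalnum() or c.isspace())
    return " ".join(kept.split())

def merged_counts(hits):  # hits: list of (title, cites) pairs
    totals = defaultdict(int)
    for title, cites in hits:
        totals[normalize(title)] += cites
    return dict(totals)

# e.g., a conference entry and a journal entry of the same paper
# (the counts here are made up for illustration):
print(merged_counts([("Skip Graphs", 100), ("Skip graphs.", 25)]))
# -> {'skip graphs': 125}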


Acknowledgements: Adam Buchsbaum, for suggesting the idea of, and generating data for, the short-long variants of SODA. David Johnson, Graham Cormode, and Sudipto Guha, for bug fixes, useful suggestions, and ideas for further plots (which we can look into after the data is cleaned up).

44 comments:

  1. Claiming FOCS 97.

    Sudipto.

  2. CLAIMING STOC 1997

    Piotr

  3. Claiming STOC 01, STOC 00.

    Alex

  4. First, I am not surprised by the differences in the citation counts. Second, what exactly should we do based on this data? Give some ammunition to tenure committees that SODA papers are not as good as we thought them to be? I understand that data munging is fun though.

  5. Citation counts? The question I have is, how many of these papers have been uploaded to the authors' web pages?

  6. Chandra, actually the citation counts of the 10th through 20th most-cited papers in FOCS/STOC/SODA in '00-'02 seem to be quite similar - in fact, statements like this beg for cleaner data.

    And given that there are excellent papers which close off questions and therefore have low citation counts, what is the ammunition? Citation count is one (of many) indicators of activity and possibly of outreach, and in that context it is interesting.

    Sudipto

  7. This is true. In fact, there is other data suggesting that SODA also does well at capturing the 'mid-range papers' (i.e., the low-max but low-variance part of the spectrum), but again, we need cleaner data to make these kinds of assertions.

  8. > Second, what exactly should we do based on this data?

    Hi Chandra,

    I don't think that, based on this data, one should do anything in particular. I think that the main benefit of this study is a better (and quantitative) understanding of the citation patterns in algorithms conferences.

    As a bonus, there are some interesting findings. For example, what the heck happened during the years 1999-2001? I know only some partial answers to this question.

    As far as tenure committees go, if they indeed want to make decisions based on citation counts, I believe they are smart enough to collect the data for the particular person under consideration.

    Piotr

  9. Claiming FOCS 98.

    Sudipto

  10. Naturally, STOC and FOCS deadlines are during the period when students are around working hard on campus. On the other hand, the SODA deadline is in the summer. It's hot, it's hard to think, and there are no slaves around to do the real writing, thinking or research. It thus follows that SODA papers get fewer references, being inherently weaker.

    As such, I think the real question is the quality of the conference as a function of the date of the deadline. The same considerations go for the day of the week/month on which the deadline falls. I think that once you normalize SODA quality by the inauspicious index of its deadline, it would turn out that it is in fact a much better conference than STOC and FOCS.

    As a way to verify my correct and unified theory of this important topic, I suggest forking this universe into 10 parallel universes, moving back in time by 30-40 years, and this time choosing the deadlines for these conferences at random, up to the current time, in each of these universes. I am sure the new statistics would prove my point.

  11. Naturally, STOC and FOCS deadlines are during the period when students are around working hard on campus. On the other hand, the SODA deadline is in the summer. It's air-conditioned, there's no teaching or committee burden, and there are no students around to distract from the real writing, thinking or research. It thus follows that SODA papers answer harder questions, closing off entire lines of research, and therefore get fewer citations.

    Isn't this fun?

  12. Jeff Erickson said...

    It's air-conditioned...

    Well, while you imperialistic American pigs enjoy your AC campuses and ruin the environment for everybody else, in the rest of the world we have to do research in the sweltering heat of summer. Thus proving my point. QED.

  13. Now now: no need for incivility, or I'll have to invoke a Godwin's law variant. There are enough issues to chew on without bringing in global warming.

  14. I think comparing the median of SODA with STOC/FOCS should be corrected to account for the fact that the number of papers accepted to SODA is around twice that of STOC/FOCS. I think you should take the median of only the top 70-80 papers of SODA to get more accurate data (in my experience, if you double the size of your class, the average student quality goes down, since you don't get twice as many good students, but rather a large pile of well-below-average ones). The same goes for papers. If twice as many papers were accepted to FOCS/STOC, the median would be closer to that of SODA.

  15. You're probably right about that, and as we get better quality data, we can run more experiments along those lines.

    What I don't know is this: what is a statistically sound way of comparing medians that corrects for sample size?

  16. Claiming FOCS 2000 and SODA 2000 (we got data from Michael)

    Piotr

  17. @Jeff Erickson: 1 point!

  18. Here are some more statistics to fuel the fire. (Hopefully this will still be readable after HTMLification.)

    H-index (the largest integer h such that h papers each have at least h citations; a code sketch appears after the comments):

    year FOCS STOC SODA
    1997 33 34 29
    1998 34 34 32
    1999 34 32 31
    2000 35 29 29
    2001 27 30 28
    2002 25 31 27
    2003 25 28 24
    2004 20 22 19
    2005 13 19 17
    2006 6 10 11


    10th highest citation count:

    year FOCS STOC SODA
    1997 113 156 54
    1998 73 131 89
    1999 81 81 57
    2000 84 61 60
    2001 57 74 46
    2002 46 80 50
    2003 36 48 54
    2004 32 33 26
    2005 15 27 21
    2006 5 10 12

    25th highest citation count:

    year FOCS STOC SODA
    1997 42 58 32
    1998 41 49 40
    1999 45 42 38
    2000 45 39 31
    2001 30 33 29
    2002 25 36 30
    2003 25 29 24
    2004 17 21 17
    2005 9 15 15
    2006 3 7 8

  19. To me the biggest source of puzzlement is the gap between STOC and FOCS citation counts, as most of us consider those conferences functionally equivalent.

    Could it be due to the slightly larger number of papers accepted by STOC (about 20% more), thus echoing Mohammad R.'s comment on the SODA vs. STOC/FOCS gap?

    Or could it be due to the summer effect as suggested by anonymous 7/26/2007 12:19:00 AM?

    Personally, I'm inclined towards Mohammad's explanation: a larger conference will almost by necessity have somewhat weaker papers at the bottom, thus dragging the citation count down a bit. Jeff's counts of the H-index, top 10 and top 25 certainly confirm this.

  20. Thanks for the stats, Jeff!

    One factor that came up in the context of STOC vs FOCS comparison is that, in any given year, STOC occurs a few months before FOCS (and therefore, STOC papers have more time to accumulate citations). This influences the results, at least in the short term.

    Piotr

  21. I think the 'larger conference' argument breaks down a little when you look at the *total* number of citations to all papers per conference (over all years): SODA loses out to STOC on this measure.

  22. SODA loses out to STOC on this measure.

    True, but so does STOC to FOCS, even though STOC is larger.

    I really don't understand how STOC can have fewer total citations than FOCS, even though they seem similar in quality and STOC has the advantage in size... the mystery deepens.

  23. STOC might be competing with more conferences than FOCS. For example, the STOC and SoCG deadlines are identical, and the better geometry papers are usually submitted to SoCG. Indeed, FCRC includes STOC, implying that there are a good number of conferences with similar deadlines...

  24. Hi,

    Responding to the last 3 posts about the total number of citations: the actual numbers are:

    FOCS: 20427
    STOC: 26481
    SODA: 23517

    These numbers were not posted anywhere earlier. I just computed them for the data available on this blog.

    HOWEVER, it looks like we've got an unfortunate typo: in the table comparing our findings with Mike's, for the year 2000, there is a random phrase "over 10 years". This phrase SHOULD NOT HAVE BEEN THERE; it is a leftover from an old version of the post, and it does not make sense in this context. It will be deleted whenever Suresh comes online, or when I learn how to edit blog posts.

    Anyway, it seems like it created some confusion. Sincere apologies.

    Piotr

  25. Claiming STOC 1998

    Piotr

  26. Claiming FOCS 99.

    Sudipto

  27. What I don't know is this: what is a statistically sound way of comparing medians that corrects for sample size?


    How about the following algorithm:

    bestchoice = median;
    For (every other justifiable measure m)
    {
        Compare citation counts for FOCS, STOC, SODA using measure m.
        If (sodaperformance[m] better than sodaperformance[bestchoice]) { bestchoice = m }
    }


    Next forget everything done so far and proclaim that bestchoice is a good measure of typical quality. Then compare the three conferences with respect to bestchoice, and see what results you get.

  28. How about the following algorithm:

    Look, papers 1-70 in FOCS surely will beat papers 70-140 in SODA. SODA would have to be a conference with much higher prestige than FOCS for that not to be the case. This is something that not even the most ardent proponent of SODA has ever claimed. The claim, at least as relayed by Lance a while back, is that SODA has joined (not surpassed) STOC and FOCS as one of the big three.

  29. Anyway, it seems like it created some confusion. Sincere apologies.

    That explains it, good. It really made no sense otherwise. This goes to show the value of sanity-checking experimental data.

  30. What I don't know is this: what is a statistically sound way of comparing medians that corrects for sample size?

    As far as I know, this is typically done using bootstrapping (see the sketch after the comments).

  31. As someone said above, Jeff's H-index counts support the theory I mentioned earlier.

    I think the 'larger conference' argument breaks down a little when you look at the *total* number of citations to all papers per conference (over all years): SODA loses out to STOC on this measure.

    It depends how you define "break down". I think the total numbers of citations for all three are pretty close; SODA is about 12% behind STOC and 15% ahead of FOCS. These numbers aren't very significant IMHO. You could argue, though, that a more accurate comparison would take only the counts of the top 70 or so papers of each of the three. My guess is they won't be too far off either.

  32. When the journal version of a paper appears, people usually cite the journal version. So I am wondering if this has been taken into consideration ...

    Of course, a lot of papers never have journal versions, especially those that appeared in FOCS/STOC.

  33. Claiming STOC 1999

    Piotr

  34. About the journal vs conference versions:

    1) Google Scholar typically does collapse such papers into one entry. Thus, the Extractor did not take this into account.

    2) However, the results are being currently verified and corrected by humans (see CALL FOR HELP), who take this issue into account, to the best of their abilities.

    Piotr

  35. There are a number of factors in these numbers.

    One factor may be that the summer is when most people are not involved in courses, giving them more time for research, which then gets written up in time for STOC. (This is different from the "nobody is around to write things up in the summer" comment about SODA.)

    Some things are just 'random'. For example, Irit Dinur's paper on the new proof of the PCP theorem, which won the best paper award at STOC 2006 and will likely be heavily cited in the future, would have appeared at FOCS 2005 except for the fact that she was on the FOCS PC that year.

  36. Claiming SODA 1999

    Piotr

  37. Claiming SODA 2001

    Piotr

  38. it might also be interesting to look at how many citations come from papers that weren't themselves in STOC/FOCS/SODA ... how much bathwater drinking is going on? i know that at least for my own field(s), the answer is probably: a lot. you probably couldn't do this with things more recent than, say, the early 2000s, since percolation takes a while.

  39. Top ten papers by citation count in STOC 2003:

    145 A tight bound on approximating arbitrary metrics by tree metrics.
    95 Exponential algorithmic speedup by a quantum walk.
    65 Better streaming algorithms for clustering problems.
    63 Simpler and better approximation algorithms for network design.
    61 Cell-probe lower bounds for the partial match problem.
    54 Near-optimal network design with selfish agents.
    53 Pricing network edges for heterogeneous selfish users.
    52 A stochastic process on the hypercube with applications to peer-to-peer networks.
    48 Optimal oblivious routing in polynomial time.
    48 Hidden translation and orbit coset in quantum computing.

    Top ten papers in SODA 2003:

    189 Skip graphs.
    162 Data streams: algorithms and applications.
    101 The similarity metric.
    87 Comparing top k lists.
    73 Simultaneous optimization for concave costs: single sink aggregation or single source buy-at-bulk.
    64 Packing Steiner trees.
    63 An approximate truthful mechanism for combinatorial auctions with single parameter agents.
    61 Computing homotopic shortest paths in the plane.
    58 Dominating sets in planar graphs: branch-width and exponential speed-up.
    54 High-order entropy-compressed text indexes.

    From which we can clearly deduce that ... erh... SODA papers have shorter titles on the average. Any other conclusions that jump out?

  40. Just saw this today. Very interesting.

    I think all top theory conferences should have statistics like this.

    IMHO, with these data, we should really look at the "# of highly ranked papers in STOC/FOCS/SODA" instead of the "# of papers in STOC/FOCS/SODA".

    Suresh: try to do this for SoCG, I am sure you will get lots of help.

  41. I would need volunteers if I were to do this for SoCG; it was hard enough doing it for these conferences :).

    Also, I think SoCG is too small a community for the results to be meaningful in comparison to the "big three", so I'm not sure what the exercise would achieve.

  42. Hi,

    The "grand plan" would be to exploit the knowledge that we gained in the process, and construct a *fully automatic* extractor - then no human volunteers would be needed. Right now the Extractor utilizes a combination of automatic crawling and human help (and perhaps even some goblin magic, who knows), so it does not scale that well.

    A fully automatic extractor (with low error rate) seems doable. The only problem is that it would require a fair amount of perl hacking, and so it has been relegated to the "to do" list for now. But if anyone is interested in doing this, we would be happy to provide guidelines.

    Cheers,

    Piotr

  43. A fully automatic extractor would be a nice service to the community (very much like EasyChair).

    I guess this whole idea comes from the database area, i.e., they give a best paper award to the most heavily cited paper from the top SIGMOD (?) conference of exactly 10 years ago.

    With these citation statistics, maybe we can change the atmosphere in the theory community a bit. (Personally, I have heard too many complaints of the form "that guy got another so-so paper into STOC".) This kind of data will certainly be useful for tenure/promotion purposes, especially for promotion, as one needs time to accumulate citations. For instance, if one has only 2 FOCS/STOC/SODA papers at the time of promotion, under the current situation it might not be enough (people can easily say "he hasn't made enough of a contribution, as he hasn't got many FOCS/STOC/SODA papers"). But if both of the papers have high citation counts, then the candidate can easily point to this fact, e.g., "both of my 2 FOCS/STOC/SODA papers were cited over 70 times and are among the top-10 most cited papers at that year's FOCS/STOC/SODA".

    At the end of one's career, it is the 2-3 best papers that decide one's stature. Period.

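
A small postscript on Jeff's H-index tables (comment 18), for anyone reproducing the numbers from the raw per-paper counts; a minimal sketch of the computation:

# H-index: the largest h such that at least h papers have >= h citations.
def h_index(cites):
    cites = sorted(cites, reverse=True)
    h = 0
    while h < len(cites) and cites[h] >= h + 1:
        h += 1
    return h

assert h_index([10, 8, 5, 4, 3]) == 4  # four papers with >= 4 citations each
assert h_index([]) == 0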
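
And on the question of comparing medians across different sample sizes (comments 15 and 30): a sketch of the bootstrap approach, under the usual assumption that the citation counts can be resampled independently with replacement:

# Bootstrap sketch: percentile confidence interval for the difference of
# medians between two samples of (possibly) different sizes.
import random
from statistics import median

def median_diff_ci(a, b, reps=10000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    diffs = sorted(
        median(rng.choices(a, k=len(a))) - median(rng.choices(b, k=len(b)))
        for _ in range(reps)
    )
    return diffs[int(reps * alpha / 2)], diffs[int(reps * (1 - alpha / 2)) - 1]

# If the resulting interval excludes 0, the observed median gap is unlikely
# to be an artifact of the differing sample sizes alone.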
