## Monday, December 26, 2011

### On PC members submitting papers

Update: Michael Mitzenmacher's posts (one, two, and three, and the resulting comments) on implementing CoI at STOC are well worth reading (thanks, Michael). The comments there make me despair that *any* change will ever be implemented before the next century, but given that we've been able to make some changes already (electronic proceedings, contributed workshops, and so on), I remain hopeful.

For all but theory researchers, the reaction to the above statement is usually "don't they always?". In theoryCS, we pride ourselves on not having PC members submit papers to conferences. What ends up happening is:
• You can't have too many PC members on a committee because otherwise there won't be enough submissions
• The load on each PC member is much larger than reasonable (I'm managing 41 papers for STOC right now, and it's not uncommon to hit 60+ for SODA)
There's an ancillary effect that because of the first point, theory folks have fewer 'PC memberships' on their CV  which can cause problems for academic performance review, but this is a classic Goodhart's Law issue, so I won't worry about it.

The main principle at play here is: we don't want potentially messy conflicts or complex conflict management issues if we do have PC members submitting papers. However, it seems to me that the practice of how we review papers is far different from this principle.

Consider: I get an assignment of X papers to review if I'm on a conference PC. I then scramble around finding subreviewers for a good fraction of the papers I'm assigned (I used to do this less, but I eventually realized that a qualified subreviewer is FAR better than me in most subareas outside my own expertise, and is better for the paper).

Note (and this is important) my subreviewers have typically submitted papers to this conference (although I don't check) and I rely on them to declare any conflicts as per conference guidelines.

Subreviewers also get requests from different PC members, and some subreviewers might themselves review 3-4 papers.

Compare this to (say) a data mining conference: there are 30+ "area chairs" or "vice chairs", and over 200 PC members. PC members each review between 5-10 papers, and often don't even know who the other reviewers are (although they can see their reviews once they're done). The area/vice chairs manage 20-30 papers each, and their job is to study the reviews, encourage discussion as needed, and formulate the final consensus decision and 'meta-review'.

If you set "theory subreviewer = PC member" and "theory PC member = vice chair", you get systems that aren't significantly different. The main differences are:
• theory subreviewers don't typically get to see other reviews of the paper. So their score assignment is in a vacuum.
• theory PC members are expected to produce a review for a paper taking the subreviewer comments into account (as opposed to merely scrutinizing the reviews being provided)
• managing reviewer comments for 30 papers is quite different to generating 30 reviews yourself (even with subreviewer help)
• A downside of the two-tier PC system is also that there isn't the same global view of the entire pool that a theory PC gets. But this is more a convention than a rule: there's nothing stopping a PC for opening up discussions to all vice chairs.
• One advantage of area chairs is that at least all papers in a given area get one common (re)viewer. that's not necessarily the case in a theory PC without explicit coordination from the PC chair and the committee itself.
But the main claimed difference (that people submitting papers don't get to review them) is false. Even worse, when submitters do review papers, this is 'under the table' and so there isn't the same strict conflict management that happens with explicit PC membership.

We're dealing with problems of scale in all aspects of the paper review and evaluation process. This particular one though could be fixed quite easily.

## Friday, December 23, 2011

### Thoughts on ICDM II: Social networks

The other trend that caught my eye at ICDM is the dominance of social networking research. There was a trend line at the business meeting that bore this out, showing how topics loosely classified as social networking had a sharp rise among accepted papers in ICDM over the past few years.

There were at least three distinct threads of research that I encountered at the conference, and in each of them, there's something to interest theoreticians.
• The first strand is modelling: is there a way to describe social network graphs using abstract evolution models or random graph processes. I spent some time discussing this in a previous post, so I won't say more about it here. Suffice it to say that there's interesting work in random graph theory underpinning this strand, as well as a lot of what I'll call 'social network archaeology': scouring existing networks for interesting structures and patterns that could be the basis for a future model.
• The second strand is pattern discovery, and the key term here is 'community': is there a way to express natural communities in social networks in a graph-theoretic manner ? While modularity is one of the most popular ways of defining community, it's not the only one, and has deficiencies of its own. In particular, it's not clear how to handle "soft" or "overlapping" communities. More generally, there appears to be no easy way to capture the dynamic (or time-varying) nature of communities, something Tanya Berger-Wolf has spent a lot of energy thinking about. Again, while modelling is probably the biggest problem here, I think there's a lot of room for good theory, especially when trying to capture dynamic communities.
• The final strand is influence flow. After all, the goal of all social networking research is to monetize it (I kid, I kid). A central question here is: can you identify the key players who can make something go viral for cheap ? is the network topology a rich enough object to identify these players, and even if you do, how can you maximize flow (on a budget, efficiently).
There were many papers on all of these topics -- too many to summarize here. But the landscape is more or less as I laid it out. Social networking research is definitely in its bubble phase, which means it's possible to get lots of papers published without necessarily going deep into the problem space. This can be viewed as an invitation to jump in, or a warning to stay out, depending on your inclination. And of course, the definitive tome on this topic is the Kleinberg-Easley book

This concludes my ICDM wrap-up. Amazingly, it only took me a week after the conference concluded to write these up.

## Thursday, December 22, 2011

### CGWeek !!!

If you're on the compgeom mailing list, you probably know this already, but if not, read on:

There's an increased interest in expanding the nature and scope of events at SoCG beyond the conference proper. To this end, Joe Mitchell put together a committee titled "CG:APT" (CG: applications, practice and theory) to chalk out a strategy and solicit contributions.

The call for contributions is now out, and the main nugget is this:
Proposals are invited for workshops/minisymposia, or other types of events on topics related to all aspects of computational geometry and its applications. Typical events may feature some number of invited speakers and possibly some number of contributed presentations. Events may feature other forms of communications/presentations too, e.g., via software demos, panel discussions, industry forum, tutorials, posters, videos, implementation challenge, artwork, etc. CG:APT events will have no formal proceedings; optionally, the organizers may coordinate with journals to publish special issues or arrange for other dissemination (e.g., via arXiv, webpages, printed booklets, etc).
In other words, anything goes ! (this is an experiment, after all). Topics are essentially anything that might be of interest geometrically in a broad sense (i.e not limited to what might appear at the conference itself).

I'd strongly encourage people to consider putting together a proposal for an event. The procedure is really simple, and only needs a two page proposal containing:

1. Title/theme of the workshop/minisymposium/event
2. Organizer(s) (name, email)
3. Brief scientific summary and discussion of merits (to CG) of the proposed topic.
4. A description of the proposed format and agenda
5. Proposed duration: include both minimum and ideal; we anticipate durations of approximately a half day (afternoon), with the possibility that some meritorious events could extend across two half-days.
6. Procedures for selecting participants and presenters
7. Intended audience
8. Potential invited speakers/panelists
9. Plans for dissemination (e.g., journal special issues)
10. Past experience of the organizer(s) relevant to the event
Please note: EVERY COMMUNITY DOES THIS (and now, even theory). The deadline is Jan 13, 2012, and proposals should be emailed to Joe Mitchell (joseph.mitchell@stonybrook.edu). Note that the idea is for such events to be in the afternoon, after morning technical sessions of SoCG.

There are many people out there who grumble about how insular the CG community is. Now's your chance to walk into the lion's den (at Chapel Hill) and tell it off :).

### Thoughts on ICDM I: Negative results (part C)

This is the third of three posts (one, two) on negative results in data mining, inspired by thoughts and papers from ICDM 2011.

If you come up with a better way of doing classification (for now let's just consider classification, but these remarks apply to clustering and other tasks as well), you have to compare it to prior methods to see which works better. (note: this is a tricky problem in clustering that my student Parasaran Raman has been working on: more on that later.).

The obvious way to compare two classification methods is how well they do compared to some ground truth (i.e labelled data), but this is a one-parameter system, because by changing the threshold of the classifier (or if you like, translating the hyperplane around),you can change the false positive and false negative rates.

Now the more smug folks reading these are waiting with 'ROC' and "AUC" at the tip of their tongues, and they'd be right ! You can plot a curve of the false positive vs false negative rate and take the area under the curve (AUC) as a measure of the effectiveness of the classifier.

For example, if the y axis measured increase false negatives, and the x-axis measured increasing false positives, you'd want a curve that looked like an L with the apex at the origin, and a random classifier would look like the line x+y = 1. The AUC score would be zero for the good classifier and 0.5 for the bad one (there are ways of scaling this to be between 0 and 1).

The AUC is a popular way of comparing methods in order to balance the different error rates. It's also attractive because it's parameter-free and is objective: seemingly providing a neutral method for comparing classifiers independent of data sets, cost measures and so on.

But is it ?

There's a line of work culminating (so far) in the ICDM 2011 paper 'An Analysis of Performance Measures For Binary Classifiers' by Charles Parker of BigML.com (sorry, no link yet). The story is quite complicated, and I doubt I can do it justice in a blog post, so I'll try to summarize the highlights, with the caveat that there's nuance that I'm missing out on, and you'll have to read the papers to dig deeper.

The story starts with "Measuring classifier performance: a coherent alternative to the area under the ROC curve" ( link) by David Hand (copy of paper here). His central result is a very surprising one:
The AUC measure implicitly uses different misclassification costs when evaluating different classifiers, and thus is incoherent as an "objective" way of comparing classifiers.
To unpack this result a little, what's going on is this. Suppose you have a scenario where correct classification costs you nothing, and misclassification costs you a certain amount (that could be different for the two different kinds of misclassification). You can now write down an overall misclassification cost for any threshold used for a classifier, and further you can compute the optimal threshold (that minimizes this cost). If you don't actually know the costs (as is typical) you can then ask for the expected misclassification cost assuming some distribution over the costs.

If you run this computation through, what you end up with is a linear transformation of the AUC, where the distribution over the costs depends on the distribution of scores assigned by the classifier ! In other words, as Hand puts it,
It is as if one measured person A’s height using a ruler calibrated in inches and person B’s using one calibrated in centimetres, and decided who was the taller by merely comparing the numbers, ignoring the fact that different units of measurement had been used
This is a rather devastating critique of the use of the AUC. While there's been pushback (case in point is an ICML 2011 paper by Flach, Hernandez-Orallo and Ferri which is a very interesting read in its own right), the basic premise and argument is not contested (what's contested is the importance of finding the optimal threshold). Hand recommends a few alternatives, and in fact suggests that the distribution of costs should instead be made explicit, rather than being implicit (and subject to dependence on the data and classifiers)

What Parker does in his ICML paper is take this further. In the first part of his paper, he extends the Hand analysis to other measures akin to the AUC, showing that such measures are incoherent as well. In the second part of his paper, he unleashes an experimental tour de force of classifier comparisons under different quality measures, showing that
• nearly 50% of the time, measures disagree on which classifier in a pair is more effective. He breaks down the numbers in many different ways to show that if you come up with a new classification algorithm tomorrow, you'd probably be able to cherry pick a measure that showed you in a good light.
• It's the measures perceived to be more "objective" or parameter-less that had the most trouble reconciling comparisons between classifiers.
• It's also not the case that certain classifiers are more likely to cause disagreements: the problems are spread out fairly evenly.
• His experiments also reinforce Hand's point that it's actually better to define measures that explicitly use domain knowledge, rather than trying to achieve some objective measure of quality. Measures that were either point-based (not integrating over the entire range) or domain specific tended to work better.
I'm not even close to describing the level of detail in his experiments: it's a really well-executed empirical study that should be a case study for anyone doing experimental work in the field. It's especially impressive because from personal experience I've found it to be REALLY HARD to do quality methodological studies in this area (as opposed to the "define algorithm-find-toy-data-profit" model that most DM papers seem to follow").

At a deeper level, the pursuit of objective comparisons that can be reduced to a single number seems fundamentally misguided to me. First of all, we know that precise cost functions are often the wrong way to go when designing algorithms (because of modelling issues and uncertainty about the domain). Secondly, we know that individual methods have their own idiosyncracies - hence the need for 'meta' methods. And finally, we're seeing that even the meta-comparison measures have severe problems ! In some ways, we're pursuing 'the foolish hobgoblin of precision and objectivity' in an area where context is more important than we as mathematicians/engineers are used to.

## Tuesday, December 20, 2011

### Thoughts on ICDM I: Negative Results (part B)

Continuing where I left off on the idea of negative results in data mining, there was a beautiful paper at ICDM 2011 on the use of Stochastic Kronecker graphs to model social networks. And in this case, the key result of the paper came from theory, so stay tuned !

One of the problems that bedevils research in social networking is the lack of good graph models. Ideally, one would like a random graph model that evolves into structures that look like social networks. Having such a graph model is nice because
• you can target your algorithms to graphs that look like this, hopefully making them more efficient
• You can re-express an actual social network as a set of parameters to a graph model: it compacts the graph, and also gives you a better way of understanding different kinds of social networks: Twitter is a (0.8, 1, 2.5) and Facebook is a (1, 0.1, 0.5), and so on.
• If you're lucky, the model describes not just reality, but how it forms. In other words, the model captures the actual social processes that lead to the formation of a social network. This last one is of great interest to sociologists.
But there aren't even good graph models that capture known properties of social networks. For example, the classic Erdos-Renyi (ER) model of a random graph doesn't have the heavy-tailed degree distribution that's common in social networks. It also doesn't have a property that's common to large social networks: densification, or the fact that even as the network grows, the diameter stays small (implying that the network seems to get denser over time).

One approach to fixing this models a social network as a Stochastic Kronecker graph. You can read more about these graphs here: a simple way of imagining them is that you add an edge in the graph by a random process that does a (kind of) quad tree like descent down a partitioning of the adjacency matrix and places a 1 at a leaf. SKGs were proposed by Leskovec, Chakrabarti, Kleinberg and Faloutsos, and include ER graphs as a special case. They appear to capture heavy tailed degree distributions as well as densification, and have become a popular model used when testing algorithms on social networks. They're also used as the method to generate benchmark graphs for the HPC benchmark Graph500.

But a thorough understanding of the formal properties of SKGs has been lacking. In "An In-Depth Analysis of Stochastic Kronecker Graphs", Seshadhri, Pinar and Kolda show some rather stunning results. Firstly, they provide a complete analysis of the degree distribution of an SKG, and prove a beautiful result showing that it oscillates between having a lognormal and exponential tail. Their theorems are quite tight: plots of the actual degree distribution match their theorems almost perfectly, and convincingly display the weird oscillations in the degree frequencies (see Figure 2 in the paper).

Secondly, they also formally explain why a noisy variant of SKGs appears to have much more well-behaved degree distribution, proving that a slightly different generative process will indeed generate the desired distribution observed in practice.

Finally, they also show that the graphs generated by an SKG have many more isolated nodes than one might expect, sometimes upto 75% of the total number of vertices ! This has direct implications for the use of SKGs as benchmarks. Indeed, they mention that the Graph500 committee is considering changing their benchmarks based on this paper - now that's impact :)

What I like about this paper is that it proves definitive theoretical results about a popular graph model, and very clearly points out that it has significant problems. So any methodology that involves using SKGs for analysis will now have to be much more careful about the claims it makes.

p.s There's also more supporting evidence on the lack of value of SKGs from another metric (the clustering coefficient, that measures how many configurations uv, uw also have the third edge vw). Real social networks have a high CC, and SKGs don't.This was first mentioned by Sala, Cao, Wilson, Zablit, Zheng and Zhao, and Seshadhri/Pinar/Kolda have more empirical evidence for it as well. (Disclaimer: I was pointed to these two references by Seshadhri: my opinions are my own though :))

## Sunday, December 18, 2011

### Thoughts on ICDM I: Negative Results (part A)

I just got back from ICDM (the IEEE conference on Data Mining). Data mining conferences are quite different from theory conferences (and much more similar to ML or DB conferences): there are numerous satellite events (workshops, tutorials and panels in this case), many more people (551 for ICDM, and that's on the smaller side), and a wide variety of papers that range from SODA-ish results to user studies and industrial case studies.

While your typical data mining paper is still a string of techniques cobbled together without rhyme or reason (anyone for spectral manifold-based correlation clustering with outliers using MapReduce?), there are some general themes that might be of interest to an outside viewer. What I'd like to highlight here is a trend (that I hope grows) in negative results.

It's not particularly hard to invent a new method for doing data mining. It's much harder to show why certain methods will fail, or why certain models don't make sense. But in my view, the latter is exactly what the field needs in order to give it a strong inferential foundation to build on (I'll note here that I'm talking specifically about data mining, NOT machine learning - the difference between the two is left for another post).

In the interest of brevity, I'll break this trend down in three posts. The first result I want to highlight isn't even quite a result, and isn't even from ICDM ! It's actually from KDD 2011 (back in August this year). The paper is Leakage in Data Mining: Formulation, Detection, and Avoidance, by Shachar Kaufman, Saharon Rosset, and Claudia Perlich, and got the Best Paper Award at KDD this year.

The problem they examine is "leakage", or the phenomenon that even in a 'train model and test model" framework, it is possible for valuable information to "leak" from the test data or even from other sources to the training system, making a learned model look surprisingly effective and even give it predictive powers beyond what it really can do. Obviously, the problem is that when such models are then applied to new data, their performance is worse than expected, compared to a model that wasn't "cheating" in this way.

They cite a number of examples, including many that come from the data challenges that have become all the rage. The examples highlight different kinds of leakage, including "seeing the future", cross contamination of data from other data sets, and even leakage by omission, where a well-intentioned process of anonymization actually leaks data about the trend being analyzed.

While there are no results of any kind (experimental or theoretical), the authors lay out a good taxonomy of common ways in which leakage can happen, and describe ways of avoiding leakage (when you have control over the data) and detecting it (when you don't). What makes their paper really strong is that they illustrate this with specific examples from recent data challenges, explaining how the leakage occurs, and how the winners took advantage of this leakage explicitly or implicitly.

There are no quick and dirty fixes for these problems: ultimately, leakage can even happen through bad modelling, and sometimes modellers fail to remove leaky data because they're trying to encourage competitors to build good predictive models. Ironically, it is this encouragement that can lead to less predictive models on truly novel data. But the paper makes a strong argument that the way we collect data and use it to analyze our algorithms is fundamentally flawed, and this is especially true for the more sophisticated (and mysterious) algorithms that might be learning models through complex exploitation of the trained data.

It remains to be seen whether the recommendations of this paper will be picked up, or if there will be more followup work along these lines. I hope it does.

### A rant about school science fair projects

It's science fair time at my son's school. He's in the first grade, so admittedly there's not  a lot that he can do without a reasonable amount of *cough* parental*cough* help. But why do we not have a 'mathematics' fair or a 'programming fair ?

The science fair project format is very confining. You have to propose a hypothesis, tabulate a bunch of results, do some analysis, and discuss conclusions, with nice charts/graphs and other such science cultism. Even if you're interested in something more 'mathematical', there's no real way of shoehorning it into the format. A year or so ago, a colleague of mine was asking me about origami-related projects (because his daughter loves paper folding) but apart from experimenting with knapsack-style algorithms to determine how to fold a ruler into a specified length, we couldn't figure out something that fit into the 'hypothesis-experiment' setting.

Granted, it's a science fair. But at this age level, I assume the whole point of participation in science fairs is about learning something about science, rather than conducting rigorous analysis. You could equally well learn about something mathematical and demonstrate that knowledge. But there's no forum to do that in.

## Friday, December 16, 2011

### Simons Fellowship for theory grad students.

The Simons Foundation has been a great boon for theoretical computer science, supporting postdocs galore, and even running a "sooper-seekrit" search for a new TCS institute.

Their latest initiative is at the grad student level. They're offering a 2-year fellowship to grad students "with an outstanding track record of research accomplishments". I think the idea is to support students who've established a good body of work, and could use this to coast towards their graduation and ramp up their research even more.

The support is considerable:
Each award will provide annual support for the following:
• Fellowship stipend as set by the student’s institution.
• Tuition and fees at the student's institution, to the extent previously covered by other outside funding.
• $3,000 in additional support for the student to use at his/her discretion. •$5,000 per year toward the student’s health insurance and other fringe benefits.
• $5,000 per year for travel, books, a computer and other supplies. •$5,000 in institutional overhead allowance.
Fellowships will start September 2012 and end August 2014.

How do you apply ?
Applicants may apply through proposalCENTRAL (http://proposalcentral.altum.com/default.asp?GMID=50) beginning December 7, 2011. The deadline to apply is May 1, 2012. Please coordinate submission of the proposal with the appropriate officials in accordance with institution/university policies. Please see the Application Instructions for further information.
Application Requirements:
• Research Statement (two page limit): A statement summarizing the applicant’s research contributions, research plans for the immediate future, and career goals. References do not need to be included within the page limit, but should not exceed an additional page.
• A curriculum vitae (two page limit), which includes institution, advisor, and a list of publications.
• A letter of support from the Department Chair.
• A letter of support from the student’s advisor.
• A letter from a reference outside the student’s institution. This letter must be submitted directly via proposalCENTRAL by the reference. Please see the Application Instructions for more information.
• Thesis topic.

(HT Sampath Kannan for pointing this out)

## Thursday, December 08, 2011

### ACM Fellows, 2011

Many theoreticians on this year's list of ACM Fellows:
• Serge Abiteboul
• Guy Blelloch
• David Eppstein
• Howard Karloff
• Joe Mitchell
• Janos Pach
• Diane Souvaine
Congratulations to them, and to all the Fellows this year (especially my collaborator Divesh Srivastava)

### SoCG and ACM: The Results Show

From Mark de Berg:
The bottom row of the table [below] gives the number of votes for each of the three options

A.    I prefer to stay with ACM.

B.    If involvement of ACM can be restricted to publishing the proceedings, at low cost for SoCG, then I prefer to stay with ACM; otherwise I prefer to leave ACM.

C.     I prefer to leave ACM, and organize SoCG as an independent conference with proceedings published in LIPIcs.

and it also gives a breakdown of the votes by the number of SoCG’s that the voter attended in the last 10 years.
 A: stay B: proceedings only C: leave total A: 0 4 3 3 10 B: 1-2 6 16 19 41 C: 3-5 11 15 16 42 D: >5 8 14 9 31 total 29 48 47 124

Based on the result of the Poll, the Steering Committee decided to start negotiating with ACM to see if they can offer SoCG an arrangement in which involvement of ACM is limited primarily to publishing the proceedings, with the possible option for greater involvement if the local organizers request/require it.
It will be interesting to see how this plays out.