## Thursday, November 14, 2013

### The many stages of writing a paper, and how to close the deal.

[tl;dr] Producing a piece of research for publication has many stages, and each stage has different needs, requiring different ways of operating. Learning these stages is a key developmental step for a graduate student. In what follows, I describe this process.

Caveat: this is obviously customized to more theoretical-style work, although I'll talk about how experimental evaluation fits in, in the context of my work. So YMMV.

From my conversations with students (mine and others), I think this accurately describes how students think a paper gets written:

1. Advisor produces problem miraculously from thin air.
2. Come up with solution.
3. Write down solution.
4. Advisor makes annoying and mystifying edit requests on irrelevant introductory stuff, while throwing out long complicated proofs (or experiments) student has spent many hours sweating over.
5. Make final edits and submit paper.

Most students figure out how to do 2), and eventually learn how to do 3) (which is itself a topic for another post). 5) is probably the first thing students learn how to do: fix typos, edit LaTeX, and generally do yak-shaving.

But step 4) is perhaps the most mysterious part of the writing process for a new researcher, and the least structured. I call it "closing the deal" and it's really about going from a bag of results to an actual submittable paper.

Let me elaborate.

1) Coming up with a problem.
Of course coming up with a problem is the essence of the research process ("it's about the questions, not the answers", he shrieks). This takes experience and vision, and can often be changed by things you do in stage 4. I'll say no more about it here.

2) Solving a problem.
This is the stage that everyone knows about. That's what we do, after all - solve problems ! This is where we drink lots of coffee, live out "Eye of the Tiger" montages, get inspiration in our sleep, and so on.

It often happens that you don't exactly solve the problem you set out to attack, but you make many dents in it, solving special cases and variants. It's important to be flexible here, rather than banging your head against the same wall. At any rate, you exit this stage of the project either completely stuck, with a complete solution, or, most typically, with a collection of results, ideas and conjectures surrounding the problem.

3) Writing it all down.
Again, I could spend hours talking about this, and many people better than I have. It's a skill to learn in and of itself, and depends tremendously on the community you're in.

4) Closing, or getting to a submission.
But this is the part that's often the most critical, and the least understood. I call it "closing the deal": getting from 80% to 100% of a submission, and it requires a different kind of skill. The overarching message is this:
A paper tells a story, and you have to shape your results - their ordering, presentation, and even what you keep and what you leave out - in order to tell a consistent and clear story.
(Before people start howling: I'm not talking about leaving out results that contradict the story; that would be dishonest. I'm talking about selecting which story to tell from among the many that might present themselves.)

So you have a bag of results centering around a problem you're trying to solve. If the story that emerges is: "here's a problem that's been open for 20 years and we solved it", then your story is relatively easy to tell. All you have to do is explain how, and using what tools.

But in general, life isn't that easy. Your results probably give some insights into the hard core of the problem: what parts are trivial, what directions might be blocked off, and so on.

Now you need to find/discover the story of your paper. You can't do this too early in the research process: you need to explore the landscape of the problem and prove some results first. But you shouldn't wait too long either: this stage can take time, especially if the story changes.

And the story will change. One way of thinking about a conference submission is as a relatively tight, compelling and interesting story. While the loose ends and unexplored directions are probably the things most interesting to you and your research, they are best left to a conclusions section rather than the main body. What the body should contain is a well-thought-out march through what you have discovered and what it says about the problem you're solving. In doing so, you will find yourself making decisions about what to keep, what to leave out, and how to order what you keep.

And so, speculations need to be made into concrete claims or triaged. Experiments need to be run till they tell a definite story. Introductions need to be made coherent with the rest of the paper. There's also an element of bolt-tightening: making the bounds as tight as possible, stating claims as generally as makes sense for the overarching story (if your story is about points in the plane, then stating some claims in a general metric space might not always make sense).

And all of this has to be done to serve the overarching story that will make the most compelling paper possible. The story can change as new results come in, or expand, or sometimes even die (though this last is rare). But there is a constant drumbeat of "am I getting closer to a submission with a nice story with each step ?"

Telling a good story is important. For someone to appreciate your paper, cite it, or even talk about it (whether it's accepted, or on the arxiv) they have to be willing to read it and retain its results. And they'll be able to do that if it tells a clear story, which is not just a union of results.

## Tuesday, November 05, 2013

### SODA 2014 travel awards and other news

Via Cliff Stein and Kirsten Wilden:

The SIAM Symposium on Discrete Algorithms (SODA14) will be held January 5-7, 2014 at the Hilton Portland & Executive Tower in Portland, Oregon, USA. Algorithm Engineering and Experiments (ALENEX14) and Analytic Algorithmics and Combinatorics (ANALCO14) will be co-located with the conference on January 5 and 6, respectively.
The deadline for students to apply for travel support is approaching!   The NSF, Google, IBM Research, and Microsoft Corporation have provided generous travel support for students interested in attending SODA14 or its associated meetings, ALENEX14 and ANALCO14. Application materials and a letter of recommendation from your advisor are due by November 15. Please visit http://www.siam.org/meetings/da14/tsupport.php for detailed funding information and application procedures.

Pre-registration for all three meetings is also available at http://www.siam.org/meetings/da14/reginfo.php.

Additional conference information is available at the following websites:

## Sunday, October 27, 2013

### FOCS Reception sing-along

Tomorrow night at 7:30 (before the FOCS business meeting at 9), the Positive Eigenvalues will be playing a live set for the entertainment of  FOCSers.

Who are the Positive Eigenvalues, you might ask ? Well, you'll have to come and hear us to find out :). Two of the songs we're playing are compositions by our resident philosopher/graphic novelist/keyboardist/composer/complexity theorist, Christos Papadimitriou.

The first is a riff off the Dylan song "Man gave names to all the animals", with the same tune but different lyrics. If you're so inspired, you're encouraged to sing along to the chorus (which repeats after each verse). It goes like this:
Chorus:
Theorists gave names to all the classes
in the beginning, in the beginning
Theorists gave names to all the classes
in the beginning long time ago

Verse 1:
They saw a woman slim and tall
she went by fast but that's no crime
uh I think I'll call her PTIME

Verse 2:
They saw a kid who soldiered on
he got no rest from dusk to dawn
he kept sweating in the same place
uh I think I'll call him PSPACE

Verse 3:
They saw a blind man comb his shack
to find his walkstick in the black
but once he found it he could see
uh I think I'll call him NP

Verse 4:
They saw a boy who walked and walked all night
never veered to the left or right
was kind of slow but not that bad
uh I think I'll call him PPAD

Verse 5:
There was a beast and on its back
it carried Heisenberg and Dirac
it was as weird as a beast can be
uh...
For more on the line that trails off at the end, see the Wikipedia entry.

The second song is an original composition by Christos in rock ballad/power rock mode. The lyrics speak poignantly about the life of a theoretician:
In Theory
Verse
Think of a carpenter of sorts
he has the flu he has the blues
He burns his oil every night
all night frustrated and confused

Bridge
And worst of all he can't complain
In theory his job is plain  (twice)

Chorus
Build something sturdy and beautiful and useful
He's always short in one goal
He juggles true with elegant and useful
and he keeps dropping one ball

Verse 2
His buddies like him alright
except they love to criticize
They go and take his stuff apart
with clever words with clever eyes

Verse 3
Some nights a girl comes to his shop
to sip his tea learn something raw
his foggy mind his tangled soul
want her to stay but let her go

## Thursday, October 17, 2013

### Searching randomly for needles.

Suppose you're given a large set of objects $X$, and you know that some subset $I$ is "interesting". A particular example that hits close to (my regular) home is in bug-testing: $X$ is the set of possible inputs to a program, and $I$ is the set of inputs that generate bugs (this is the downside of talking to +John Regehr too much). We'll assume that given a candidate object, you can easily check whether it's interesting or not (for example, by running the program).

You'd like to find the interesting items, so you consult an expert (in our running example, maybe it's a method to generate inputs that test certain kinds of bugs). The expert produces items that it thinks are interesting. But experts have biases: maybe your expert only cares about certain kinds of interesting items.

So you ask multiple experts, in the hope that their biases are different, and that together they can cover the space of objects. But you don't know anything about their biases except what you can learn from repeatedly asking them for candidate objects.

What's your strategy for asking questions so that at any stage (say after asking $N$ questions), you've found as many interesting items as possible ?

This was the topic of a talk by +Sebastien Bubeck at the AMPLab in Berkeley on Tuesday, based on this paper.  A key idea in the algorithm design is to make use of estimators of "things I haven't seen yet", or the famous Good-Turing estimator (which your humble blogger wrote about many æons ago). Here's how it works.

Formally, let us assume that each "expert" $i$ is a distribution $P_i$ over $X$. At each step, the algorithm will determine some $i$, and ask it for a random draw from $P_i$. Suppose we knew the fraction of items that $i$ had not seen yet. Then a simple greedy strategy would be to pick the $i$ that had the largest value of this "not-yet-seen" quantity. That's all well and good, as long as we know the fraction of items not yet seen.

Here's where the G-T estimator comes in. What it says is that if we're given samples from a distribution, and count the number of items in the sample that occurred exactly once, then this "frequency of frequency", divided by the number of samples, is a good estimate for the mass of items not yet seen. Moreover, it can be shown that this estimator has good concentration properties around the true answer.
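As a quick sketch (the function name is my own, not from the paper), the Good-Turing estimate of the unseen mass is just the fraction of samples that appear exactly once:

```python
from collections import Counter

def good_turing_unseen_mass(samples):
    """Good-Turing estimate of the probability mass of items not yet seen:
    the number of items observed exactly once, divided by the sample size."""
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(samples)

# "b" and "c" each occurred once: 2 singletons out of 4 samples.
print(good_turing_unseen_mass(["a", "a", "b", "c"]))  # 0.5
```

Intuitively, singletons are the items you barely managed to see, so their total frequency is a proxy for the mass of items you haven't seen at all.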

So that's what the algorithm does. It maintains estimates (for each expert) of the mass not yet seen, and in each round picks the expert that maximizes this term, corrected by an adjustment coming from the tail of the concentration bound.
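Here's a toy simulation of the greedy strategy under the disjoint-support assumption. This is my own illustrative code, not the paper's algorithm: the experts are simply uniform distributions over made-up lists, and I omit the confidence-bound correction term entirely.

```python
import random
from collections import Counter

def unseen_mass(samples):
    # Good-Turing: fraction of samples that occurred exactly once.
    if not samples:
        return 1.0  # nothing drawn yet, so assume all mass is unseen
    counts = Counter(samples)
    return sum(1 for c in counts.values() if c == 1) / len(samples)

def greedy_discovery(experts, rounds, seed=0):
    """experts: list of item lists; querying expert i returns a uniform
    draw from experts[i]. Each round, query the expert whose estimated
    not-yet-seen mass is largest."""
    rng = random.Random(seed)
    history = [[] for _ in experts]   # draws observed from each expert
    discovered = set()
    for _ in range(rounds):
        i = max(range(len(experts)), key=lambda j: unseen_mass(history[j]))
        item = rng.choice(experts[i])
        history[i].append(item)
        discovered.add(item)
    return discovered

# Two "experts" with disjoint supports of very different sizes:
experts = [list(range(100)), list(range(100, 110))]
found = greedy_discovery(experts, rounds=200)
```

Note how the small-support expert stops being queried once its draws start repeating and its Good-Turing estimate collapses, which is exactly the behavior you want from the greedy rule.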

The algorithm is really simple, and elegant. The question is, how well does it do ? And now things get a little harder. The above ideal greedy strategy is optimal as long as the supports of the experts are disjoint. Under the same assumption (and some other technical ones), it can be shown that the expected difference between the number of items found by the algorithm and the number found by the ideal greedy algorithm is $O(\sqrt{Kt \log t})$ where $K$ is the number of experts and $t$ is the current time step.

It's not clear how to extend these bounds to the case when supports can overlap (or rather, it's not clear to me :)), but the authors show that the algorithm works quite well in practice even when this assumption is broken, which is encouraging.

## Wednesday, October 16, 2013

### Progressively complex representations for graphs

After NIPS last year, I had posted a note on the evolution of data models and how we think about the "shape" of data. In brief, we have increasingly complex ways of thinking about the (geometric) shape of data that carefully balance the need for expressivity and the ability to determine the parameters of the representation efficiently.

There's a basic fact of data analysis that you come up against once you start playing with data. Either you endow the space with a geometry, or you endow it with a graph structure (and of course, nowadays you might throw in a spherical cow or two). And there are only a few ways to mix the representations (spectral methods being one such approach).

But as far as I know, there are no standard ways to create a layered representation of graphs to model increasing complexity and expressivity in a similarly balanced manner. There are of course an infinite number of ways to parametrize graphs by quantities that capture increasing complexity (treewidth, connectivity, expansion, induced metric structure, and so on). But I don't know of any established and standard ways to model families of graphs with increasing complexity that capture data patterns of interest.

One of the things that I've been tasked with (along with Peter Richtarik and Han Liu) is to draft out a report on the activities at the big data program at Simons this semester. We're looking also at challenges in the theory of big data, and in my mind coming up with good models for graph structures that capture the intricacies of real data is one of them.

## Wednesday, October 02, 2013

### Call for tutorials at SIAM Data Mining

I'm the tutorials chair for the 2014 SIAM conference on data mining (SDM) (for readers not aware of the data mining landscape, SDM is a SODA-style data mining conference run by SIAM - i.e. it's a CS-style venue with peer-reviewed papers, rather than a math-style conference like SIAM Discrete Math). SDM is one of the major venues for data mining research, especially the more statistically focused kind. It will be held in Philadelphia, April 24-26, 2014.

SDM runs 4-5 tutorials each year on various aspects of data mining (ranging from theory/algorithms to specific application areas). I'm personally very interested in encouraging submissions from people in the theory community working with data who might want to share their expertise about new methods/techniques from theoryCS land with the larger data analysis community.

The deadline is Oct 13, and all you need is a few pages describing the tutorial content, target audience, and some sample material (more details here). If you have an idea and are not sure whether it will fit, feel free to email me directly as well.

## Monday, September 16, 2013

### SIGACT CG Column: Moving heaven and earth

My new column for SIGACT News is up (non pay-walled version here). In it, I discuss two different ways of defining distances between distributions.

The earthmover distance is a familiar measure to TCS folk, and the kernel distance (or the maximum mean discrepancy; it's also closely related to the distance covariance) is a familiar character in machine learning and statistics.

It turns out that while statistically, these are not as different as one might think, they are different when it comes to computing them, and understanding their geometry.

The column describes both measures and their properties, and does a compare-contrast between them. More generally, I try to describe how one should think about the process of constructing a distance function between two distributions.
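To make the contrast concrete, here's a minimal numerical sketch (my own toy example, not from the column) comparing the two on one-dimensional samples. In 1D, the earthmover distance between equal-size point sets just matches sorted points to sorted points; the kernel distance averages pairwise kernel evaluations within and across the two samples.

```python
import math

def emd_1d(xs, ys):
    """Earthmover distance between two equal-size 1-D samples:
    optimal transport in 1D pairs up the sorted points."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def kernel_distance(xs, ys, sigma=1.0):
    """Kernel (MMD) distance with a Gaussian kernel:
    sqrt( E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)] ), using the
    simple biased plug-in estimate over all pairs."""
    k = lambda a, b: math.exp(-(a - b) ** 2 / (2 * sigma ** 2))
    avg = lambda us, vs: sum(k(u, v) for u in us for v in vs) / (len(us) * len(vs))
    sq = avg(xs, xs) + avg(ys, ys) - 2 * avg(xs, ys)
    return math.sqrt(max(sq, 0.0))

xs = [0.0, 1.0, 2.0]
ys = [0.5, 1.5, 2.5]
print(emd_1d(xs, ys))            # each point moves 0.5, so EMD = 0.5
print(kernel_distance(xs, ys))
```

The computational asymmetry already shows up here: the 1D earthmover distance needs a sort, while the kernel distance needs all pairwise kernel evaluations (and in higher dimensions the gap between the two computations widens dramatically).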

## Monday, September 09, 2013

### Life at Simons

There have been many posts on the technical talks happening at Simons (see +Moritz Hardt's latest on gradient descent if you haven't yet). Not as many yet on the less-formed discussions that are happening in and around the building, but maybe it's too soon for that.

In the meantime, I thought it might be fun to describe "a day at Calvin Hall".

I usually bike in after 9 am. Biking in Berkeley is not for the (literally) weak-kneed, and I'm discovering how weak my knees actually are. I haven't yet found a good site to plot elevation gains and losses for an arbitrary route, but this one is pretty decent.

The Institute lends out bikes if you don't have your own. It's a great way to get around if you're staying near campus, because parking is difficult and expensive.

Of course, the next step is:

It's a great machine. Only downside is that it shuts down at 5pm when the building does. But there are plenty of cafes around for later.

Then of course the decision: which talks should I feel guilty about missing today ? There are A LOT OF TALKS going on at the various workshops/boot camps. I feel embarrassed complaining about this, because we have such a stellar crowd of speakers and such fascinating talks. But there are almost too many talks: I have to shut down the feelings of guilt in order to find big chunks of time to think (or write this blog post <ahem>).

Thankfully, there's the live stream. I've already mentioned the Simons livestream on G+, but it's worth mentioning again. If you're late to the start of any talk you can watch it on the high quality live feed. If you can't watch it live you can watch it as soon as it's done on the Ustream site for the event. And a few days after that the talks are archived on the Simons site directly. There's also this cute feature where the live stream plays the most recent talk during coffee breaks. I know people outside the Institute are watching talks live because someone pinged me on twitter to ask the speaker to repeat questions they get from the audience.

So I've either attended talks or I've spent the morning brainstorming on some of the projects I have going. It's now time for lunch: thankfully, we're walking distance from both Northside and Southside Berkeley, which means a huge variety of eating places all within 5-10 minutes walking. All we need to do is avoid the huge class let-out at 12:20 or so. On Tuesdays we have a lunch seminar (organized by +Moritz Hardt) as part of the big-data program.

Once we're back, it's time for this:

Did I mention that the coffee machine is awesome ?

Usually, discussions start picking up in the afternoon, and you'll often find me here:

3:30 is time for the daily tea, at which we get all kinds of nice cookies, and the occasional cake/ice cream (on Thursdays). This is much like the Dagstuhl tea time, but without the fancy tortes (hint hint!). Of course, no daily tea is complete without the addition of this:

## Tuesday, January 01, 2013

### A SODA announcement, and a happy new year !

While you're all recovering from New Year's eve revelries, here's an important note from the ALENEX/ANALCO organizers regarding the debauchery in New Orleans otherwise known as SODA:

Dear SODA attendees,
We want to make sure that you are aware of a change in the format of SODA/ALENEX/ANALCO. In recent years, ALENEX and ANALCO have been held on the Saturday before SODA and had separate registration. This year we are trying a different format. ANALCO and ALENEX will both be held in parallel with SODA; ANALCO on Sunday and ALENEX on Monday. There is no longer a separate registration for ALENEX and ANALCO; everyone is automatically registered for all three. That is, all SODA attendees are welcome to attend ANALCO and ALENEX talks. We encourage you to look at the program and attend any talks that interest you. We also welcome any feedback on this new format, after you have attended the conference.
Regards,
Bob Sedgewick (ANALCO steering committee chair)
Cliff Stein (ALENEX steering committee chair)