Monday, January 21, 2013

Accountability in Data Mining: A new seminar

Glencora Borradaile makes a number of interesting points about the implications of Aaron Swartz's life and work for us academic computer scientists. 
As computer science academics we are in a very powerful position. We are trusted with shaping the next generation that will make very important decisions that will have far-reaching social implications. Decisions like those over Facebook’s privacy defaults, motivating technology that enables autonomous private vehicles at the expense of the public interest, defining ownership of electronic media. We make those decisions ourselves in our research; what we research, how we allow our research to be used.
 Whether or not you agree with her particular policy preferences, the point remains that the technologies we develop can have lasting consequences for the "shape" of the world to come. And if we don't speak up (either through our work, or through our own advocacy), others will take the technologies we develop and find their own ways of using or misusing them.

All of this leads up to my current interests in data mining. I've been thinking about the problems of accountability in data mining for a while now: initially mostly in private, but now more and more in public (along with +Graham Cormode and +Andrew McGregor) as I see the importance of the issue.

What is accountability in data mining ? It means many things really, but for me, for now, it means this:

The fruits of data mining pervade every aspect of our life. We are recommended books and movies; given differential pricing for insurance; screened for potential terror threats; diagnosed with various diseases; and targeted for political advertising. The ability to sift through massive data sets with sophisticated algorithms has resulted in applications with impressive predictive power. 
Since the internals of a learning process can be complex and opaque, it is important to verify that the results satisfy the properties claimed by the algorithm. The importance of this goes beyond merely checking the algorithm itself. Validation mechanisms also provide a way for users to demand accountability from authorities who might use the results of mining to affect users in some way (by changing their insurance rates, or even putting them on a terrorism watchlist). As the results of data mining affect more and more of our lives, the more crucial it is that the user be able to validate decisions made on their behalf and that affect them. 
The above is an introduction to a seminar I'm running this semester on this topic. I'm a little nervous about it, because the topic is vast and unstructured, and almost anything I see nowadays on data mining appears to be "in scope". I encourage you to check out the outline, and comment on topics you think might be missing, or on other things worth covering. Given that it's a 1-credit seminar that meets once a week, I obviously can't cover everything I'd like, but I'd like to flesh out the readings with related work that people can peruse later.

It's entirely possible that we don't care about our privacy any more (as Facebook seems to think*). But we will care about the consequences of the use of our data. My perspective is that a better understanding of what is technically possible (and not possible) to demand and get accountability will be critical to informing larger policy discussions on this topic.

* In an earlier version of this post I wrongly attributed this sentiment to an individual. Apologies.

Wednesday, January 16, 2013

A sampling gem: sampling from $\ell_p$ balls.

A while back, I had written about uniform sampling from the $d$-simplex. In brief, sample exponentially distributed random variables, and normalize them. Note that the simplex is one piece of the $d$-dimensional $\ell_1$ unit sphere.

I had also mentioned a similar trick to sample from the ($\ell_2$) unit sphere: sample Gaussian random variables, and normalize them.

As I discovered today, these are really just special cases of a much more general result that includes both of these approaches, as well as providing a general way to sample from the $d$-dimensional unit ball (as well as sphere) under any $\ell_p$ norm.

The result is due to Barthe, Guedon, Mendelson and Naor, and is breathtakingly elegant. Here it is:

To sample from the unit ball under the $\ell_p$ norm in $d$ dimensions, randomly sample $d$ coordinates $x_1, \ldots, x_d$ i.i.d from a density proportional to $\exp(-|x|^p)$. Also sample $Y$ from the exponential distribution with parameter $\lambda = 1$. Then the desired sample is
$$\frac{(x_1, x_2, \ldots, x_d)}{\Bigl(\sum_i |x_i|^p + Y\Bigr)^{1/p}}$$
Notice that merely eliminating the $Y$ recovers the procedure for sampling from the unit sphere and includes both of the above sampling procedures as a special case. It's far more efficient than either rejection sampling, or even the metropolis-sampling method from the theory of sampling from convex bodies. Also, if you only want to sample from an $\ell_2$ ball you can do it without this hammer using the sphere sampling technique and a nonuniform sampling of the length, but this seems so much more elegant.

Thursday, January 10, 2013

LEGO and math teaching

I was discussing Paul Tough's book with my wife yesterday at breakfast, and we somehow got onto my pet peeve: the mechanical way in which math is taught in schools, and how math gets reduced to arithmetic and counting. Of course the definitive word on this is Paul Lockhart's A Mathematician's Lament.

I looked over the floor of the living room, strewn with random bits of deadly adult-foot-stabbing Lego, and had a minor epiphany. As anyone with small children knows, LEGO sells a lot more Hollywood tie-in kits now compared to the "bags of bricks" that it used to sell. You can buy the Millenium Falcon, an X-wing fighter, a spiderman action scene, or some kit related any cartoon on Nickelodeon.

Which is a pain. Lots of specialized parts, key pieces that if misplaced causes massive bouts of tears and anguished searching, and so on.

But how do kids play with them ? They put it together carefully from the instructions the first time. This lasts about a day or so. Then they get bored, take it apart and mix up all the pieces with the OTHER kits they have, and spend many happy days building all kinds of weird contraptions.

Here's the thing about the kits. They show you how to build fairly complicated devices and doohickeys. The kids pick up on this, and they mix and match the doohickeys in their own new constructions. Essentially they figure out the functions of the different parts, and realize how to build brand new things with them.

But suppose they were told that all they could do was repeatedly assemble the kits (and maybe even different kits) over and over again. It's unlikely they'd learn anything more than just how to assemble kits really quickly. They'd also get terribly bored. In fact, the popular toys in our house are multipurpose objects: any single-purpose toy gets discarded rather quickly.

To cut a long story short, it feels to me that a lot of math (and theoryCS) education at the school and college level involves looking at various "kits", and seeing how they get put together. You get rewarded for putting kits together correctly, but not for playing with the kits and creating bizarre new objects. But finding a way to let students (in an unstructured way) play with mathematical concepts is the key to having them really understand the concepts and their different facets.

Tuesday, January 08, 2013

ICERM and Simons postdocs

Two upcoming TCS postdoc deadlines:

ICERM Program on Network Science and Graph Algorithms

This is a program out of Brown organized by Kelner, Klein, Mathieu, Shmoys and Upfal. It sounds quite fascinating if you're doing anything with graph data and spectral analysis. Deadlines for postdoc applications is Jan 14.

Simons special program on the theory of big data. 

As I've mentioned before, the Simons Institute is running a semester long program on the theory of big data. The deadline for applying for postdocs is soon (middle of January)

These two programs are coordinating, and if you're so inclined you might be able to spend one semester at Berkeley and another in Providence. Please let the organizers know if this is something you're interested in.

Monday, January 07, 2013

SODA 2013 4/n: Business

If you haven't been following my live-tweets at the SODA business meeting, here's a summary of the unusually quiet meeting:
  • Attendance at the conference was 311, which is quite low, but is a local high for New Orleans (third time in the last 10 years). 
  • 135 papers were accepted out of a net 459 submissions. This is an acceptance rate of nearly 30%, which should strike fear into the hearts of tenure-track assistant professors everywhere. 
  • PC members had to review "only" 42 papers each. Yes, I know this is insane. And no, we're not going to do anything about it. 
  • Shiri Chechik received one of two best student papers for her paper "New Additive Spanners". Bernhard Haeupler received the other for "Simple, fast and deterministic gossip and rumor spreading".
  • I've already mentioned the two best papers, on graph minors and dynamic connectivity
  • SODA 2014 is in Portland, land of beer, books and birds. Chandra Chekuri is chair. 
  • After Salt Lake City observers had to withdraw because of serious voting irregularities, the trumped up unfair winner of the SODA 2015 sweepstakes was San Diego. But since we never go with our first choice location (Honolulu, anyone?), San Francisco was listed as a second choice, with DC as a third choice. 
  • David Johnson is handing the baton over to Cliff Stein, after running SODA since before I knew it wasn't a fizzy drink. 

SODA 2013: Part 3/n

I just attended Valerie King's talk on her paper with Bruce Kapron and Ben Mountjoy that does dynamic connectivity in worst-case polylog time (randomized). This paper received a Best Paper award (along with the Grohe et al paper on graph minors I mentioned yesterday).

The problem is a classic: given a stream of updates (edge insertions/deletions) to a graph, maintain a data structure that can answer "Are u and v connected" efficiently. There's been a slew of work on the problem: if you want worst-case bounds for updates, the best till now was $O(\sqrt{n})$ by Eppstein, Galil, Italiano and Nissenzweig (from 1992). There are polylogarithmic amortized bounds, but they can incur worst-case update times of $\Theta(n)$.

In this paper, after 20 years, they knock down the worst-case update times to polylogarithmic: the algorithms are randomized (with 1-sided error: it might occasionally declare two nodes disconnected when they were connected).

The idea itself is very elegant, and it connects to techniques in streaming algorithms in a way that I think is neat. WARNING: IANAE in this area, so my rendition might be slightly naïve, and is drawn from my understanding of Valerie King's excellent presentation.

The basic problem is this: I can maintain connectivity by maintaining a collection of trees. Adding an edge can be done without too much difficulty (indeed, if all you want to do is insert things, then you're back to union-find). It's insertion and deletion together that makes things hard: imagine that you delete an edge and disconnect a tree: you need to quickly determine if there's some other non-tree edge that can be used to reconnect the pieces, or if the two sides are now really disconnected.

More generally, consider the cutset maintainence problem. You have updates as before, but now a query is: given a set S, find a cut edge if one exists between S and V - S. It's not hard to see that this is the core routine you need for the tree.

The first elegant idea is this: suppose you represent each edge by a bit string consisting of the encoding of one endpoint followed by an encoding of the other. For example, the edge 2-3 might be written as 010011. Now for each vertex, compute the XOR of all the edges adjacent to it and call this its signature.

If you XOR all the signatures of vertices in a set, and if the set has exactly one cut edge, then what you get is the signature of that edge ! This is because each edge that's inside S will occur twice and therefore will zero out.

So this works if your cut set is of size 1. But suppose it's not ? Now you maintain $O(\log n)$ sets for each vertex. Each edge adjacent to the vertex is thrown into the $i^{\text{th}}$ set with probability $1/2^i$. With some appropriate duplication, you can show that if the cut set is of size $k$, then w.h.p the $\log k$th cut set will have exactly one element in it, and that's the element you can retrieve when you need to find a replacement edge.

There's a lot more technical work to do to get the whole algorithm working, and I won't go into that. But those of you who are familiar with streaming will recognize something here.

In essence, they've created a sketch for the set of edges adjacent to the vertex, and they have a way to represent the set compactly and retrieve individual elements from it. The trick with exponentially decaying levels is standard in streaming count estimation, and the GF(2) trick for storing the sketch is very similar to the trick used by Ahn, Guha and McGregor in their SODA 2012 paper on graph sketching.

And that's what I thought was the best part: that ideas that really have power in streaming can be used to help solve a longstanding open problem in graph algorithms. I should point out that the connection is post-facto: this paper was developed independently of the streaming work. But one does have to wonder what other connections we might be able to establish between sketching tricks on graphs and basic graph algorithms.

SODA 2013: Part 2/n

My previous post is here.

Day 2 of SODA, and the tenth time I've been asked "are you chairing all the sessions" ? No, just that many of my PC colleagues didn't (or couldn't) show up :), so those of us who did are doing more lifting. In reward, we got a nice dinner in the French quarter, and I tasted boudin for the first time (and maybe the last). 

An interesting talk today morning by Dror Aiger on reporting near neighbors. They were able to show a not-super-exponential relation between the number of points at unit $\ell_\infty$ distance from a query, and the number of points at unit $\ell_2$ distance. This was wrapped into a fast algorithm for reporting Euclidean near neighbors in high dimensions that has some interesting (if preliminary) experimental backing as well in comparison with ANN, FLANN and LSH. 

Jan Vondrák gave the invited talk on submodular optimization. I mentioned Satoru Fujishige's talk at NIPS, and this was an excellent complement. Fujishige's talk focused on the geometric intuition behind submodular functions (especially the associated polymatroid). Vondrák's talk focused on the algorithmic implications of submodularity, and he gave very convincing arguments for why it can be viewed as discrete convexity OR discrete concavity, or even neither. He pointed out how the Lovasz extension is useful for minimization and the multilinear extension is more useful for maximization, and gave a number of "recipes" for designing algorithms that optimize submodular functions. I hope the slides go online at some point: they were very clear and well-balanced. 

There was some discussion over whether next year's SODA should adopt the two-tier PC that STOC is currently experimenting. The jury's still out on that, and since the STOC PC is not done with their work, we don't yet have formal feedback. I will admit to being a little frustrated with the level of conservativeness on display here: it's not as if EVERY OTHER COMMUNITY IN CS doesn't do this and doesn't have best practices that we can learn from, and given our reviewing loads, it's really crazy that we aren't desperately trying things to alleviate the problem. 

Sunday, January 06, 2013

SODA 2013, Part I/n

On twitter, it's common to post longer thoughts in 140 character bursts. If you don't know how long the thought will be, you label them as 1/n, 2/n, and so on, so as not to exceed an arbitrary limit, but also imply that there is a finite limit. 

So we're back in the Big Easy for SODA. This time the conference has merged ALENEX and ANALCO with the main program, so at least for today and tomorrow, we have four parallel tracks. Having just come back from NIPS, SODA feels like a small cozy cocktail party in comparison (ed: not that I know anything about cocktail parties: I have small children)

The highlight was Bob Sedgewick's tribute to Philippe Flajolet. I've talked about their work on analytic combinatorics before, but it was nice to hear some of it again. Flajolet's contributions to the field are immense and valuable. He did big data before it was cool: his $\ell_0$ estimation work with Nigel Martin is now a seminal paper in the world of streaming, and his HyperLogLog paper on counting a stream (with Fusy, Gandouet and Meunier) is the best-in-class on this problem. Bob quoted one of Flajolet's students as saying, "Read any of his papers. You will learn something".

I chaired a geometry session in the morning, and one of the papers there caught my eye a while back. The Fréchet distance problem (commonly known as the dog-walking problem) is the problem of computing a monotone mapping between two curves that has minimim maximum length (you can think of this as a man on one curve walking a dog on another, and asking for the minimum leash length when both man and dog can walk forward at arbitrary speeds).

There's a relatively straightforward dynamic program that can compute the Fréchet distance between two polygonal chains in time $O(nm\log (nm))$ (if $n$ and $m$ are the number of links in the two chains). But it's been a big open problem to figure out whether this can be done in subquadratic time. The paper, (by Agarwal, Ben Avraham, Kaplan and Sharir) shows that for the discrete version of the problem (that Rinat Ben Avraham in her talk calls the frog hopping problem: the frogs hop from vertex to vertex) you can get a slightly subquadratic time algorithm. The main idea involves adapting "string-like" methods for the problem, which is hard because the "alphabet" is infinitely large.

It will be interesting to see if this spurs more progress. There's already a newer paper by Buchin et al that gets a slightly improved (but still super-quadratic) algorithm for the continuous Fréchet distance (i.e the original problem). If for no other reason, please click the link to see their awesome title.

Martin Grohe gave an award talk for his work with Kawarabayashi and Reed on an improved algorithm for graph minor decompositions. The problem is as follows: given a graph G and a graph H supposedly excluded by G, compute a decomposition of G promised by the graph minor theorem, or produce an H-minor.

The improvement reduces the dependency on the size of $G$ to quadratic (from cubic) and makes use of the wonderful and mysterious Courcelle's theorem. The dependence on the size of $H$ is equally mysterious (at least Martin declined to tell us), but is nowhere near as wonderful.

Please post papers you thought were interesting (and why) in the comments, and I'll add them as updates.

Tuesday, January 01, 2013

A SODA announcement, and a happy new year !

While you're all recovering from New Year's eve revelries, here's an important note from the ALENEX/ANALCO organizers regarding the debauchery in New Orleans otherwise known as SODA:

Dear SODA attendees,
We want to make sure that you are aware of a change in the format of SODA/ALENEX/ANALCO.  In recent years, ALENEX and ANALCO have been held on the Saturday before SODA and had separate registration.  This year we are trying a different format.  ANALCO and ALENEX will both be held in parallel with SODA; ANALCO on Sunday and ALENEX on Monday.  There is no longer an separate registration for ALENEX and ANALCO, everyone is automatically registered for all three.  That is, all SODA attendees are welcome to attend  ANALCO and  ALENEX talks.  We encourage you to look at the program and attend any talks that interest you.  We also welcome any feedback on this new format, after you have attended the conference. 
Bob Sedgewick (ANALCO steering committee chair)
Cliff Stein (ALENEX steering committee chair)

Disqus for The Geomblog