Friday, May 15, 2015

Higher order interactions and SDM 2015

This year I'm one of the PC Chairs for SIAM Data Mining (along with Jieping Ye), and so I've been spending time in decidedly-not-sunny Vancouver. Being a PC Chair, or even being on a PC, is an exercise in constant deja vu: I hear a talk and think "Where have I heard this before?" before realizing that I've probably reviewed the paper, or looked at its reviews or meta-reviews.

Being the PC chair also means that I can float around the conference freely without feeling the pressure to attend talks, network or otherwise be social: I've earned my keep :). Still, in keeping with standard conference networking maxims, I made an effort to revisit at least one coffee shop I've met before and to introduce myself to at least one new one.

But I am attending talks ! And I enjoyed listening to some work on tensor spectral clustering by Benson, Gleich and Leskovec. It got me thinking about the larger issue of modeling higher-order interactions and what appear to be many different ways of modeling the problem.

The problem.


Imagine that you have a number of interacting entities. These could be points in a space, or vertices in a graph, or even dimensions of a point (aka variables). The easiest way to model a collection of such entities is to assume they're independent of each other. For example, I might draw i.i.d samples from a distribution. Or I might be looking at a collection of independent features describing an object, and so on. 

Independence assumptions are powerful. They allow us to "factorize" structure into combinations of individual elements, which leads to simple summations, either directly or after a log transformation of a product. This means we can deal with entities independently, and the "inference complexity blowup" is linear in the number of entities. A good example of this is the Naive Bayes approach to learning, where assuming all entities are independent leads to a likelihood cost function that's just a sum of terms, one for each entity.
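
To make that factorization concrete, here's a tiny sketch (my own toy illustration, with made-up function and parameter names, not tied to any particular library): under a naive independence assumption, scoring a class is just a sum of per-feature terms.

def nb_log_score(x, log_prior, feature_log_probs):
    # x: a list of discrete feature values
    # feature_log_probs[i][v]: log P(feature i = v | class), assumed precomputed
    # Independence turns the joint likelihood into a plain sum: one term per entity.
    return log_prior + sum(feature_log_probs[i][v] for i, v in enumerate(x))

The point is just the shape of the computation: one term per entity, and no cross terms.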

I'm necessarily being rather loose with my statements here. Making them precise is possible in specific contexts, but it gets unwieldy. 

But independence is too restrictive an assumption. It limits modeling power, and therefore will be unable to capture interactions that might make understanding structure a lot easier. For one thing, you'd never find correlations if you assumed that all features are independent.

Graphs. 


The easiest form of interaction is a pairwise interaction. Modeling pairwise interactions gets us to a graph, and who doesn't love a graph ! More importantly for what follows,

a graph and its associated structures form a rich representation of a system of pairwise interactions

in that we have a colorful vocabulary and an arsenal of algorithms for talking about pairwise interactions and structures built on them.

Of course we've paid a price - in complexity. Instead of the linear cost incurred by independent entities, we now have quadratically many potential pairwise interactions to model. But (and here's the key), we can interpret a sparse graph as capturing weak interactions, and it remains rich enough to model many different phenomena.

Higher-order interactions.


But what happens if we want to model interactions that aren't just pairwise ? What is the correct higher-order structure to model such interactions as effectively as graphs ? It turns out that there are many different ways to do this, and they all can be reduced to a sentence (pace Saunders MacLane) of the form

A graph is just a very special kind of X
for different values of X. 

1. The graphical model view.


A graph is just a special kind of clique intersection structure, if you only have 2-cliques.

One way to manage a collection of higher order interactions is to factorize them in a more general way. This is the basis for the clique-tree idea in graphical models, where the interaction structure is viewed as a set of complete joint interactions (aka cliques) all connected together in a tree structure for easy inference. Another name for this would be the 'bounded treewidth' model, though that name obscures the fact that we are still allowing higher-order interactions, just in a controlled way.

The advantage of this representation is that, in true parametrized complexity fashion, it separates the source of the real complexity (the size of each clique) from the overall complexity (the size of the graph).

A spectral perspective.


When graphs arise from natural sources of data (social networks and the like), we have to deal with noise and spurious signals. And the simple language of connectivity and paths is no longer robust enough. For example a graph might be connected, but only because there's one edge connecting two huge components. If that edge were spurious, we'd have made a huge mistake in modeling this graph's structure.

Spectral methods are currently our best way of dealing with noisy interactions. By focusing not on the topological structure of connectivity but on the amount of connectivity measured via cuts, spectral analysis of graphs has become perhaps the best way of finding structures in large graphs.

The spectral lens sees a graph through random walks on the edges. This is great for modeling a collection of pairwise interactions between entities, but not for modeling interactions among sets of entities. We have to be careful here. Spectral methods are actually quite good at finding community structure in graphs (i.e. a partition into sets of vertices). What they can't do is find higher order partitionings in graphs (i.e. sets of triangles or sets of special 4-vertex structures). And that's where the next three higher-order methods enter the picture.
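
To make the spectral lens concrete, here's a minimal sketch of two-way spectral partitioning: build the unnormalized Laplacian and split on the sign of the Fiedler vector. This is a generic textbook heuristic (dense matrices, no normalization), not the method from the tensor spectral clustering paper.

import numpy as np

def fiedler_partition(A):
    # A: symmetric 0/1 adjacency matrix of a (small) undirected graph
    d = A.sum(axis=1)
    L = np.diag(d) - A                 # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    fiedler = vecs[:, 1]               # eigenvector for the second-smallest eigenvalue
    return fiedler >= 0                # boolean cluster labels for the two sides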

2. The algebraic topology view.


A graph is just the 1-skeleton of a simplicial complex. 

If we're looking to model higher-order interactions, we need a language for describing a collection of well-defined higher order structures. That's what a simplicial complex is. I'll skip the formal definition, but the basic idea is that if you have a simplex (an interacting group of entities), then all of its subsets must be simplices as well. But you declare the simplices first, which means that the simplex $\{a,b,c\}$ is different from the three simplices $\{a,b\}, \{b, c\}, \{c, a\}$, even though the first must contain the latter three.

A simplicial complex is a topological object. It generalizes a graph because a graph is what you get if you limit yourself to simplices of size at most 2. Because it's a discrete topological object, you can now play with it using all the tools of topology, and in particular very powerful tools like homology and homotopy that reveal all kinds of hidden structure not accessible via a graph metaphor.

While simplicial complexes allow you to express higher order interactions (tunnels! holes!), they don't remove the problem of noise: one edge/simplex can still change the structure in nontrivial ways. There are two approaches that researchers have taken to this problem: one spectral, and one not. 

The non-spectral approach is by far the more well-established one. It is based on the idea of persistence: a way to determine, from the homology groups of the simplicial complex, which structures are interesting and which ones are not. Persistence is the dominant weapon in the arsenal of topological data analysis, and I'll say no more about it here, so as to keep the focus on spectral methods.

The spectral approach is less well developed, but is quite interesting. The idea is to generalize the notion of expansion from a graph to higher-order simplices, as well as generalizing the Laplacian operator to higher order simplices (or homology groups). Then a random walk on the simplicial complex can be linked to the Laplacian operator, and the eigenstructure of the operator can be linked to the existence (or nonexistence) of certain homology groups. Two places to start with this topic are one on capturing generalizations of edge expansion, and another on building random walks on simplicial complexes and connecting them to a combinatorial Laplacian. 
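
For the curious, here's what the combinatorial (Hodge) Laplacian looks like on a tiny example. The complex and the orientation conventions here are my own toy choices, meant only to show the shape of the object these papers work with: $L_k = B_k^T B_k + B_{k+1} B_{k+1}^T$, where $B_k$ is the oriented boundary matrix from $k$-simplices to $(k-1)$-simplices.

import numpy as np

# The filled triangle on vertices {0,1,2}: edges 01, 02, 12 and one 2-simplex 012.
B1 = np.array([[-1, -1,  0],   # rows: vertices 0,1,2; columns: edges 01, 02, 12
               [ 1,  0, -1],
               [ 0,  1,  1]])
B2 = np.array([[ 1],           # boundary of [0,1,2] = [1,2] - [0,2] + [0,1]
               [-1],
               [ 1]])

L0 = B1 @ B1.T                 # the usual graph Laplacian on vertices
L1 = B1.T @ B1 + B2 @ B2.T     # the Laplacian on edges; the dimension of its kernel
                               # counts 1-dimensional holes (zero here, one if B2 is dropped)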

3. The differential geometry view.


A graph is just a discrete approximation of (scalar functions on) a manifold. 

The graph Laplacian is a discrete approximation of the Laplacian second order differential operator, and more generally the Laplace-Beltrami operator on a manifold. Indeed, one way to build intuition for what the graph Laplacian means is that it's capturing heat diffusion on an implicit manifold that the graph is merely approximating.
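
A quick toy sketch of that intuition (hypothetical step sizes, explicit Euler, nothing tuned): simulate $dx/dt = -Lx$ on a graph and watch a signal on the vertices diffuse.

import numpy as np

def heat_diffusion(A, x0, steps=50, dt=0.01):
    # A: adjacency matrix; x0: initial "temperature" at each vertex
    L = np.diag(A.sum(axis=1)) - A
    x = x0.astype(float)
    for _ in range(steps):
        x -= dt * (L @ x)      # heat flows from hotter vertices to cooler neighbors
    return x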

The Laplace-Beltrami operator is a "zeroth-order" operator in that it applies to the zero-dimensional entities on a manifold: namely, scalar fields over the points of the manifold. Suppose you were to build a vector field over the manifold instead, and wished to reason about it. Then the generalization of the L-B operator that you'd need is called the Laplace-de Rham operator, which acts like a Laplacian on the higher order differential forms defined over the manifold (formally, on sections of exterior powers of the cotangent bundle). Writing down the L-R operator is a little tricky: it involves a combination of the exterior derivative and its dual (via the Hodge * operator). But one useful observation is that the L-R operator on graphs amounts to a Laplacian on the set of edges, rather than vertices.

This means that you can now treat edges as first-class objects for grouping, rather than vertices. And this is useful for higher-order clustering. Whether this can be generalized even further remains to be seen.

4. The linear algebraic view.


A graph (adjacency matrix) is just a special case of a tensor structure on the entities.

This is perhaps the most well-known of the different higher-order approaches to modeling interactions, and is making the most waves right now. The idea is very simple. If we think of a graph in terms of its adjacency matrix, then each entry encodes a relation between two basic entities. If we wished to encode relations between (say) three entities, then we need a "3D matrix", or more precisely, a tensor.

Of course a tensor is more than just a way to assemble a collection of triples into a box, just like a matrix is much more than just a grid of numbers. The most basic question in tensor modeling is factorization: just as we can use the SVD to write down a matrix as a linear combination of outer products of basis vectors, can we write a tensor as a sum of 3-way outer products of basis vectors ? If so, then we've been able to identify the key 3-way factors controlling an interaction.
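
Here's a rough sketch of the workhorse heuristic for this kind of factorization: CP (CANDECOMP/PARAFAC) decomposition via alternating least squares. It's a bare-bones illustration with no convergence checks or regularization, and the einsum-based updates are just one convenient way to organize the computation.

import numpy as np

def cp_als(T, rank, n_iters=100, seed=0):
    # Approximate T[i,j,k] by sum_r A[i,r] * B[j,r] * C[k,r].
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    for _ in range(n_iters):
        # Update each factor in turn, holding the other two fixed.
        A = np.einsum('ijk,jr,kr->ir', T, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', T, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', T, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

A, B, C = cp_als(np.random.rand(5, 4, 3), rank=2)
T_hat = np.einsum('ir,jr,kr->ijk', A, B, C)   # the rank-2 reconstruction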

Tensor factorization is a hard problem: unlike the SVD, most formulations of tensor factorization are NP-hard. I won't get into this very rich topic right now, and instead will point you to some slides by Ravi Kannan as well as an older paper of his.

But can we cluster using tensors ? Or more generally, is there a spectral theory of tensors, and maybe even an analog of Cheeger's inequality ? It turns out that it's possible to define eigenvalues and eigenvectors of tensors, and at least some of the standard spectral theory can be made to carry over - for more on this see these slides by Lek-Heng Lim. But we're still looking for a theory that can truly work on hypergraphs or tensors in general. In the meantime, tensor-based approaches to clustering boil down to factorization and PCA-like methods, of which there are many.


5. Coda


Are these approaches in any sense equivalent ? It's hard to say in general, although for a particular problem there might be connections. Certainly, the de Rham cohomology yields a nice filtration over homology groups that could be wrestled into a persistence framework (research question !!!) thus allowing us to extract robust higher-order structures from both geometric and topological considerations. The tensor approach is closely linked to probabilistic models, not surprisingly. But whether the spectral perspective (and all its attendant intuition and value) will extend nicely across these different frameworks remains to be seen. 




Tuesday, May 12, 2015

Special Issue of Internet Mathematics on Evolving Networks

Aris Gionis and I are co-editing a special issue of Internet Mathematics on evolving networks.
The widespread adoption of digitization has opened the way to gathering large amounts of data that record detailed information for many systems of interest. Examples include telecommunication networks, online social-media platforms, and biological systems. Many of these systems are typically represented as networks, and graph-theoretic techniques are used to analyze the available data. Furthermore, as our data-gathering capacity has increased, it is now possible to collect data that record not only a static aggregate view of the underlying network, but a continuous stream of events that captures the full dynamic behavior of the network. Such events may take the form of structural changes, or they may encode different types of actions and interactions performed by the network entities. This view of time-evolving networks poses new challenges and opens new research directions. The objective is to develop the theoretical foundations and to design the algorithmic principles that will allow us to efficiently manage and analyze such evolving networks.

Read more about it here.  Submissions are due Oct 15, 2015. 

Thursday, April 30, 2015

LaTeX, Word, ggplot, and n00b-ness

I've been forcing myself to learn enough R and ggplot to make nice looking plots. It's taking me painfully long to learn to do even the most basic things, and +Carlos Scheidegger has been a big help !

What makes learning ggplot hard is grasping its underlying design philosophy. There's a notion of layers in a plot that I'm still getting used to. Once you get the hang of it, it's quite elegant, and makes modifying plots very easy -- once you get the hang of it.

All of which makes me a little more sympathetic to people who struggle with using LaTeX. LaTeX has many of the same unnatural design principles, starting with the lack of a WYSIWYG interface (and no, LyX doesn't really count). It's an incredibly powerful system (like ggplot), but it's darned hard to do simple things.

In fact, I've been wishing for a "Word equivalent" for ggplot2. Just like overleaf.com is now trying to make LaTeX easier for the general public, I would love to see some kind of interactive interface to ggplot2 that can generate the code automatically for standard tasks.

Tuesday, April 28, 2015

Prof. V returns from Washington

I've had an intense and exciting day and a half here at "politics camp", otherwise known as LISPI. Many thanks to the CRA for organizing this event. It's been an eye-opener in many ways. I've come away from this meeting with a new appreciation for the kind of lobbying efforts CS folks undertake in DC to make sure we can continue to do research and live in our little bubbles. 

One of the "homework assignments" that we were given yesterday for today morning was to practice a 3-4 minute pitch to a staffer for a representative/Senator on a topic of our choosing (about the importance of NSF funding, about a specific issue we think is important, etc). Our "panel" consisted of former Hill staffers who've been on both sides of the pitch table, and each of us had to do a "Shark Tank"-like pitch to them.

It was surprisingly nerve-wracking. I had prepared a pitch on recidivism and the importance of fairness when using data-driven predictors, and having to reduce it down to a few minutes of intelligible speech took some effort. Luckily, the panelists were very constructive in their criticism.

Outside the presentations and discussions, there are some serious issues bubbling up in DC right now that kept cropping up during the workshop. 

Firstly, the 2007 America COMPETES Act is up for reauthorization, and the House Science committee is attempting to  take an axe to NSF funding for social sciences and geosciences. In a move that would have made the British Raj (and all computer scientists) proud, they dangled a classic divide-and-rule trick over the head of the NSF by increasing the budget of CISE, engineering and MPS as a "compensation". This level of detailed guidance for budgeting is apparently unprecedented and the NSF (and the CRA) is fighting it. 

Secondly, there's a big debate going on about backdoor encryption and whether it's technically possible and/or desirable to allow government backdoors into encryption on devices like iPhones etc. I'm not particularly competent to weigh in on these issues, but there were a lot of security folks at the workshop who brought this up as their major concern during our morning pitch session. 

In any case, it's away from the US and onto Canada, for SDM 2015 in Vancouver. 


Monday, April 27, 2015

Prof. Venkatasubramanian goes to Washington

I'm at the CRA-organized Leadership in Science Policy Institute, otherwise known as LISPI. After a day of intense talks, I feel like day three of a conference: which is to say, totally drained and overwhelmed by the amount of information coming my way. Who knew that science sausage was so hard to make !

Here are some things I've learnt so far:

  • Physics is our feared and admired enemy STEM partner. 
  • I'd be glad to have half as much energy as Ed Lazowska when I get to his level. 
  • The CRA continues to be truly awesome
  • I feel like I'm in an episode of the West Wing sometimes. 
  • The budget process seems far more structured and less crisis-ridden than the media makes it out to be. 
And for all other details of the meeting, I will refer you to the Chatham House Rule.

Monday, March 30, 2015

Revisiting the Misra-Gries estimator

If you've ever taken a class on streaming algorithms, or want to ace some tech company puzzle interview question, you've probably heard of this little gem:
Given n items, determine whether a majority element (one occurring strictly more than n/2 times) exists. You are allowed one pass over the data (which means you can read the elements in sequence, and that's it), and exactly TWO units of working storage.
This is a special case of the general Misra-Gries estimator for finding frequent items in a stream with small space. The MG estimator allows you to find (approximately) all items that occur more than $\epsilon n$ times using $O(1/\epsilon)$ space. Setting $\epsilon = 1/2$ yields the desired result.

This result has been proved and reproved a number of times, and there's a fairly general proof based on the fact that what you're really doing is approximating the true frequency to an additive error of $\epsilon n$ (and so any item that occurs more often than that will be retained).
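
For reference, here's a sketch of the general Misra-Gries summary with $k$ counters: every item occurring more than $n/(k+1)$ times survives, and each stored count undercounts by at most $n/(k+1)$, which is exactly the additive-error guarantee above ($k=1$ gives the two-units-of-storage puzzle).

def misra_gries(stream, k):
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # decrement everything; each decrement "pays for" one occurrence of x
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters    # a superset of all items with frequency > n/(k+1)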

What's more interesting is how you might go about proving the basic statement from first principles if you don't know the proof. This came up at dinner, when I was talking to people who hadn't heard of the problem before and were able to reconstruct a proof on the fly.

So let's return to the basic statement, and let's be a little more precise. The precise guarantee required is as follows:

  • If a strict majority element exists, you MUST return it
  • If not, you can return any element you want (even the most infrequent one)
Because of the second statement above, we only have to worry about streams in which a strict majority element does exist. 

Observation 1: if you haven't seen a majority element in any prefix of the stream, you can throw that prefix away. 

Why: we know there's a majority element somewhere. If it didn't show up so far, it has to be at least as much of a majority in what's remaining.

Observation 2: This prefix could be as short as two elements long. 

In other words, we see an element. Call it x. If we now see $y \ne x$, x is no longer a majority element so far, and we can chuck it, starting a new stream at $y$.

But if we do see $x$ again, then it could be a majority element. How do we keep track of whether it ever stops being a majority element ? 

We could keep track of the number of items seen so far. But that's an extra counter. Why not instead pair up instances of $x$ with instances of other elements seen so far, by subtraction ? If at any time we hit zero, we can invoke Observation 1 and start again.

If not, then we will end with a nonzero count for $x$, which must be the correct element. 

And this gives us the exact algorithm we need. 
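
In code, the algorithm we just derived looks like this (a sketch; the same idea is often credited to Boyer and Moore as the "majority vote" algorithm):

def majority_candidate(stream):
    # If a strict majority element exists it is returned; otherwise the
    # output can be arbitrary, which is all the guarantee asks for.
    candidate, count = None, 0
    for x in stream:
        if count == 0:            # Observation 1: discard the prefix, start afresh
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1            # pair this occurrence off against the candidate
    return candidate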


Thursday, March 19, 2015

The coming funding crunch

It's no secret we're in the middle of another boom in CS enrollment. Departments everywhere are struggling to keep up with the increased number of students wanting to get into our programs.

But there's a chain reaction with potentially nasty consequences coming our way. And while some of this will be obvious to academic types, it might not be obvious to the community at large, or even graduate students considering going into academia. 

Consider: 
  • As departments everywhere try to adjust to the new load of students, they have two options: hire a lot more teaching faculty to teach classes, or start negotiating with their universities for more tenure-track positions. The latter is clearly preferable (if you don't believe me, look at the plight of adjuncts in the humanities). But....
  • Universities can live with hiring more faculty, because the increased tuition from more students helps justify it, and more faculty means more research grants, and more hafta for the administration. But...
  • More faculty means more applications for research awards, to the various organizations that dole out money (NSF, NIH, DARPA, ONR, DoE, ...). But...
Have you seen science budgets lately ? They're basically flat. 

We've seen this Ponzi scheme before in biology, and the consequence of that is that the average age of a first time PI has crossed 40, coupled with increased time spent doing postdocs. It's now the norm rather than the exception in CS to see faculty candidates with at least one postdoc. 

And there's no easy way to de-escalate. All the possible dams we can build are bad, or difficult to execute on: 
  • Don't hire more faculty. Then we're stuck with ever increasing class sizes and lower quality of student education.
  • Hire more contingent faculty. This might be a short term solution, but it's a horrible way to treat new Ph.Ds, and frankly, given the options out there in industry, you wouldn't get as many takers (at least if people think rationally about this).
  • De-link academic success from funding, or encourage the teaching mission more. This is a complete non-starter at most schools that rely on overhead. And for R1 universities, research dollars are not just a bottomline factor, but are a prestige element. 
And don't even get me started on how this is going to affect our already stretched-near-breaking-point conference review process. 


Sunday, February 22, 2015

STOC 2015 Call for Workshops and Tutorials

I've had two weeks from hell with proposal deadlines, paper deadlines, faculty candidates and the ever-more demanding hungry cries of the baby chickadees, er, meetings with my students.

Thankfully, all that is now in the past and I can sally forth to the next deadline. Which is, you ask ?

STOC 2015 will have a workshops and tutorials day on Sunday June 14, the day before the conference starts. +Sanjeev Khanna and +Chandra Chekuri are running the event, and they want your proposals.
We invite groups of interested researchers to submit workshop or tutorial proposals. The goal of a workshop is to provide an informal forum for researchers to discuss important research questions, directions, and challenges. Connections between theoretical computer science and other areas, topics that are not well represented at STOC, and open problems are encouraged as workshop topics. Organizers are completely free to choose their workshop formats (invited speakers, panel discussions, etc.). The program for June 14th may also involve tutorials, each consisting of 1-2 survey talks on a particular area, and we welcome tutorial proposals as well.
Deadline for submissions is Mar 15, 2015. For more details on the format etc, visit the workshops/tutorials page.

Monday, February 16, 2015

Research customs for the 21st century.

I've begun to notice a number of customs that seem unique to our modern process of doing collaborative research in the 21st century. Most of them are technology-driven, and most of them involve annoying debates about the multitude of choices available for collaborating.

  • Research Meetings: whether to do them over Skype, or Google Hangouts, or http://mathim.com or awwapp.com, or even with phones on a conference call. 
  • Audio/Video/Chat: if over skype/G+, whether to do audio, or video, or chat. And then the obligatory pre-videoconference primping to look presentable (or the reliance on bandwidth failure to NOT go on video).
  • Coordination: careful calculations among multiple time zones and daylight savings times to plan meetings. The use of Doodle, Google calendar, or even random emails passed around haphazardly to plan said meetings.
  • Writing I: the protracted negotiations over whether to use git, svn, rcs, or cvs, or Dropbox, or random emails passed around haphazardly (and yes I've done all of this, sometimes at the same time)
  • Writing II: if using an actual modern VCS, then the equally protracted negotiations over who's hosting it, who needs account access, and where public keys need to be placed, and why CS researchers in the 21st century STILL need tutorials on ssh. 
  • Writing III: Dealing with comments on a writeup: as fixmes, as todonotes, as issues in the git repository hosting the document, or as random emails passed around haphazardly. 

And of course all the collaborator "digital" personalities that emerge irrationally and violently. Alice hates using bibtex, and Bob will only use personally crafted bibtex files. Charlie loves his own special fonts and will have raging debates over $\varepsilon$ vs $\epsilon$. Dan has never used version control and doesn't see the point. Erin handcoded her own version control system in a cutting-edge fragment of Haskell and refuses to use anything else. Frank hates online meetings, and Mallory only thinks online during a meeting. Oscar insists that Postscript is Turing-complete and is therefore sufficient for all drawing needs. Peggy insists that git rebase is Turing-complete and is therefore sufficient to fix all commit disasters... eventually.

Note to all my collaborators who I'm currently writing papers with: you are awesome and have NOTHING AT ALL to do with this list.


Wednesday, February 11, 2015

One Body Problems

It's hiring season, and we've all heard about two-body problems.

But you may not have heard of the one-body problem:
Is the place a single person takes a job in likely to have enough other single people around to facilitate searching for a partner ? 
I was 'spoken for' before I even finished my Ph.D, and never had to deal with this (rather, I had to work with the more traditional (ha!) two-body problem).
 


Friday, February 06, 2015

Friday chest-thumping.

  • My paper with Amirali Abdullah (or should I say Amir's paper with me) got into STOC ! One aspect not discussed in the linked blog post is the connection to Partial Match (a notoriously hard problem in the data structures literature).  In brief, our result allows for a "smooth-ish" interpolation between approximate near neighbor lower bounds in $\ell_1$ and general partial match lower bounds, reinforcing the intuition that Partial Match is an "extremely asymmetric" nearest neighbor problem. 
  • Another paper with 'the Streaming Kings' (said to the sound of a flamenco strum) of Cormode, Chakrabarti, McGregor and Thaler got into CCC (and this is my first paper at Complexity). I'll have more to say about this work in a separate post: in brief, we looked at streaming interactive proofs (of the kind first developed by Cormode, Thaler and Yi) where you have a prover and a streaming verifier and wish to verify a computation with constant rounds of communication. There's a 3-round nearest neighbor protocol in this paper, as well as a number of subtle results about the nature of multi-round communication protocols in the streaming setting. 
"Well Suresh, you have two new papers, what are you going to do next" ?


Wednesday, February 04, 2015

No one is in control...

The New Scientist has a cover this week on the way in which algorithms have taken over our lives. They interviewed some of the participants at the FATML workshop I participated in (thanks to Solon Barocas for sending the reporter our way), and Sorelle Friedler got some nice quotes in regarding the work she, Carlos Scheidegger and I have been doing.

The article is currently paywalled, but a friend of mine sent me the text. The relevant paragraph, lightly edited (the "she" here is Sorelle):

By understanding the biases inherent in the underlying data, she hopes to eliminate bias in the algorithm. Her system looks for correlations between arbitrary properties – like height or address – and demographic groupings like race or gender. If the correlation is expected to lead to unwanted bias, then it would make sense to normalise the data. It is essentially affirmative action for algorithms, she says.
I'm not sure I'd describe our work as affirmative action :), but that's a longer argument.

Monday, February 02, 2015

(the dangers of) Data Mining hits the Super Bowl (ads) and prime time comedy

I get strange looks from people in my work circles when I admit to watching American football, and stranger looks from people in my not-work circles when I admit to watching the Super Bowl for the game.

I didn't avoid the ads though. And two ads (both from Esurance) riffed off the idea of segmenting: that the business goal of data mining customer data is to segment people into little categories that can be targeted the same way. The ads are a little heavy-handed, but then I don't think most people actually know how data mining works to segment them. And getting Bryan Cranston to play a "sort-of" pharmacist was brilliant.

Esurance: "Sorta Pharmacy"

 

Esurance: "Sorta Mom"


And finally, an entire 30 minute show centered around the idea of a Facegoogle-like company invading our privacy by reading our texts/emails etc. If you don't watch Parks and Recreation, then it's too difficult to summarize six-odd seasons of shenanigans, but in brief: a facebook-google mashup-like company is setting up shop in a small town in Indiana, and is sending people free gifts via Amazon-like shopping drones based on analyzing their private communication, and then shenanigans ensue. If you can't stand the horror of watching network TV, then there's a short clip here:

   

 and the full episode is on Hulu (for now) here.

Thursday, January 29, 2015

More FOCS 2014-blogging

In the spirit of better late than never, some more updates from Amirali Abdullah from his sojourn at FOCS 2014. Previously, he had blogged about the higher-order Fourier analysis workshop at FOCS.

I'll discuss now the first official day of FOCS, with a quick digression into the food first: the reception was lovely, with some nice quality beverages and delectable appetizers, which I munched on to perhaps some slight excess. As for the lunches given to participants, I will think twice in the future about selecting a kosher option under dietary restrictions. One hopes for a little better than a microwave instant meal at a catered lunch, with the clear plastic covering still waiting to be peeled off. In fairness to the organizers, once I decided to revert to the regular menu on the remaining days, the chicken and fish were perfectly tasty.

I will pick out a couple of the talks I was most interested in to summarize briefly. This is of course not necessarily a reflection of comparative quality or scientific value; just which talk titles caught my eye.

The first talk is "Discrepancy minimization for convex sets" by Thomas Rothvoss. The basic setup of a discrepancy problem is this: consider a universe of $n$ elements, $[n]$, and a set system of $m$ sets ($m$ may also be infinite), $S = \{S_1, S_2, \ldots, S_m \}$, where $S_i \subset [n]$. We want to find a $2$-coloring $\chi : [n] \to \{-1, +1 \}$ such that each set is as evenly colored as possible. The discrepancy then measures how unevenly colored some set $S_i \in S$ must be under the best possible coloring.
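
As a sanity check on the definitions (random toy data, nothing from the talk itself): the discrepancy of a fixed coloring is just the worst imbalance over all the sets, and a uniformly random coloring already achieves $O(\sqrt{n \log m})$ with high probability.

import numpy as np

def discrepancy(sets, chi):
    # sets: list of index arrays S_i; chi: a coloring in {-1,+1}^n
    return max(abs(chi[S].sum()) for S in sets)

rng = np.random.default_rng(0)
n, m = 100, 50
sets = [np.flatnonzero(rng.random(n) < 0.5) for _ in range(m)]
chi = rng.choice([-1, 1], size=n)     # a uniformly random coloring
print(discrepancy(sets, chi))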

One fundamental result is that of Spencer, which shows there always exists a coloring of discrepancy $O(\sqrt{n})$. This shaves a logarithmic factor off of a simple random coloring, and the proof is non-constructive. This paper by Rothvoss gives the first algorithm that serves as a constructive proof of the theorem.

The first (well-known) step is that Spencer's theorem can be recast as a problem in convex geometry. Each set $S_i$ can be converted to a geometric constraint in $R^n$, namely the region $\{ x \in R^n : | \sum_{j \in S_i} x_j | \leq 100 \sqrt{n} \}$. The intersection of these constraints defines a polytope $K$, and $K$ contains a point of the hypercube $\{-1 , +1 \}^n$ if and only if a valid low discrepancy coloring exists.

One can also of course do a partial coloring iteratively - if a constant fraction of the elements can be colored with low discrepancy, it suffices to repeat.

The algorithm is surprisingly simple and follows from the traditional idea of trying to solve a discrete problem from its relaxation. Take a point $y$ generated from the spherical $n$-dimensional Gaussian with variance 1. Now find the point $x$ closest to $y$ that lies in the intersection of the constraint set $K$ with the continuous hypercube $[-1, +1]^n$ (for example, by using the polynomial time ellipsoid method). It turns out that some constant fraction of the coordinates of $x$ are actually tight (i.e., integer valued in $\{-1, +1 \}$), and so $x$ is a good partial coloring.

To prove this, the paper shows that with high probability all subsets of $[-1, +1]^n$ with very few tight coordinates are far from the starting point $y$, whereas with high probability the intersection of $K$ with some set having many tight coordinates is close to $y$. This boils down to showing that the latter has sufficiently large Gaussian measure, which follows from standard tools in convex analysis and probability theory. Or to rephrase: the proof works by arguing about the isoperimetry of the sets concerned.

The other talk I'm going to mention from the first day is by Karl Bringmann on the hardness of computing the Frechet distance between two curves. The Frechet distance is a measure of curve similarity, and is often popularly described as follows: "if a man and a dog each walk along two curves, each with a designated start and finish point, what is the shortest length leash required?"

The problem is solvable in $O(n^2)$ time by simple dynamic programming, and has since been improved to $O(n^2 / \log n)$ by Agarwal, Avraham, Kaplan and Sharir. It has long been conjectured that there is no strongly subquadratic algorithm for the Frechet distance. (A strongly subquadratic algorithm being defined as one with $O(n^{2 -\delta})$ complexity for some constant $\delta$, as opposed to, say, $O(n^2 / \text{polylog}(n))$.)
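
For orientation, here is the standard quadratic dynamic program, written for the discrete Frechet distance. This is a generic textbook sketch (not code from the paper), and it's the baseline the hardness result says you essentially cannot beat.

import numpy as np

def discrete_frechet(P, Q):
    # P, Q: numpy arrays of shape (n, d) and (m, d) listing the curves' vertices
    n, m = len(P), len(Q)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)   # pairwise distances
    F = np.full((n, m), np.inf)
    F[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(F[i - 1, j] if i > 0 else np.inf,
                       F[i, j - 1] if j > 0 else np.inf,
                       F[i - 1, j - 1] if (i > 0 and j > 0) else np.inf)
            F[i, j] = max(d[i, j], prev)   # the leash must also cover this pair
    return F[-1, -1]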

The work by Bringmann shows this conjecture to be true, assuming SETH (the Strong Exponential Time Hypothesis), or more precisely that there is no $O^*((2- \delta)^N)$ algorithm for CNF-SAT. The hardness result holds for both the discrete and continuous versions of the Frechet distance, as well as for any $1.001$-approximation.

The proof works on a high level by directly reducing an instance of CNF-SAT to two curves whose Frechet distance is smaller than $1$ iff the instance is satisfiable. Logically, one can imagine the set of variables split into two halves, one assigned to each curve. Each curve consists of a collection of "clause and assignment" gadgets, which encode whether all clauses are satisfied by a particular partial assignment. A different such gadget is created for each possible partial assignment, so that there are $O^*(2^{N/2})$ vertices in each curve. (This is why a strongly subquadratic algorithm for the Frechet distance would imply a violation of SETH.)

There are many technical and geometric details required in the gadgets which I won't go into here. I will note admiringly that the proof is surprisingly elementary. No involved machinery or complexity result is needed in the clever construction of the main result; mostly just explicit computations of the pairwise distances between the vertices of the gadgets.


I will have one more blog post in a few days about another couple of results I thought were interesting, and then comment on the Knuth Prize lecture by the distinguished Dick Lipton.

Tuesday, January 27, 2015

Streaming @ SODA: Part II

This is the second of two posts by Samira Daruki on the streaming sessions at SODA 2015. For the first post, see here. 

In the third paper from the streaming graph family at SODA15, "Parameterized Streaming: Maximal Matching and Vertex Cover", Chitnis, Cormode, Hajiaghayi and Monemizadeh introduce a new approach to handling graph streams called parameterized streaming algorithms. In addition to the insertion-only model, they also consider the dynamic model of streaming graphs, in which the input is a sequence of insertions and deletions of edges.

This dynamic model of streaming graph processing is popular when the graph is changing over time, and has recently received much attention due to breakthroughs by Ahn, Guha and McGregor (one, and two). Over these two papers, they showed the first results for a number of graph problems over dynamic streams. This has provoked much interest in what can be computed over dynamic graph streams, although there is still not much work on solving graph optimization problems in this model. The challenge here is that when an edge is deleted, it can take substantial work to repair the solution, so we need to make sure that the algorithm retains enough information to do so while keeping only a bounded amount of working space. (ed: not surprisingly, some of these ideas are useful for non-streaming dynamic graph algorithms: witness the paper by Kapron, King and Mountjoy on dynamic connectivity in (randomized) worst-case polylog time from SODA a few years ago)

Returning to parameterized streaming: in this paper, instead of solving the optimization problem on the graph stream exactly, the goal is to solve the “parameterized” version of the problem, where the parameter $k$ is given and we want to solve the following decision problem:
Is there a solution with size bounded by $k$? 
The motivation behind parameterizing the problem comes from real world applications in which the solution of the graph problem is small compared to the size of the input (i.e. sublinear in the size of the input). In these cases, the interesting challenge is to solve graph optimization problems in a streaming fashion using space bounded by some function of the “solution size” instead of the “input size”.

To solve parameterized problems, one standard technique is kernelization, which uses polynomial time preprocessing to map the input to another, equivalent input of smaller size $f(k)$ (called a kernel) with a new parameter value $k' \le g(k)$, for a computable function $g$.

In this paper, by combining kernelization techniques with randomized sketch structures, the first streaming algorithms for the parameterized version of the Vertex Cover problem are obtained. The main idea is to maintain a maximal matching of the underlying graph in a streaming fashion, and then run the well-known kernelization algorithm for Vertex Cover on the maintained maximal matching. The data structure that maintains the maximal matching uses a $k$-sample recovery sketch, a generalization of linear sketching for $\ell_0$-sampling, as its main tool, applying it to the neighborhood (incident edges) of each vertex in the matching. As edges are inserted or deleted, these sketches can be updated without needing knowledge of the full neighborhood of each node. However, deletions pose a challenge: as edges are deleted, we need an intelligent mechanism to ensure that the matching remains maximal using only the limited stored information.

Another nice result here is a tight lower bound of $\Omega(k^2)$ (by reducing from the INDEX problem in communication complexity) on the space complexity of any (randomized) streaming algorithm for parameterized Vertex Cover, which holds even in the insertion-only model.

Besides the general insertion-only and dynamic models, a restricted dynamic model is also discussed, in which we are promised that at each time $i$, the vertex cover of the graph induced by the edges seen so far has size at most $k$. With this promise, they develop a dynamic parameterized streaming algorithm whose space usage matches the proved lower bound.

It is interesting to think about other NP-hard problems in the framework of parameterized streaming, to explore how kernelization can be helpful in this direction, and to see whether we can find other powerful hammers to overcome the challenges that arise in designing algorithms for hard problems in the streaming setting.


Coda:
Along with the three papers discussed above, there was another paper on streaming presented at SODA (by Naumovitz and Saks) which provides a deterministic polylogarithmic-space streaming algorithm for approximating distance to monotonicity for a sequence of $n$ numbers, compared to the corresponding randomized result presented at SODA two years ago.

While I won't discuss this work here so as to keep these posts  just about streaming graph algorithms, I encourage the interested readers to take a look at this paper as well, as the last one in the streaming family of SODA15.

Streaming @ SODA: Part I

This two part series is written by my student Samira Daruki

Modern graph data sets are too large to fit in memory. And so the streaming model is one of the more popular and attractive ones for analyzing massive graphs: in this model, for an input graph $G = (V, E)$ with $n$ vertices and $m$ edges, the edges arrive in an arbitrary order in a stream and the algorithm can only use $\tilde{O}(n)$ space. The algorithm is allowed several passes over the stream, but usually the ideal is a single pass.

For many graph problems such as matching, spanning tree, graph sparsification, approximate distance and counting subgraphs there now exist streaming algorithms with space complexity $O(n \text{poly} (\log n))$. In these algorithms, usually we assume that the nodes of the graphs can be stored in memory but edges cannot. Notice that for constructing  matchings, spanners and sparsifiers, the output size is often $\Omega(n)$, so it forces the streaming algorithm to use $\Omega(n)$ space.


But if you don't need to report the output, then this can be avoided. For an example, see the work of Kapralov, Khanna and Sudan from SODA 2014, which approximates the matching size to a $\text{poly}(\log n)$ factor using $\text{poly}(\log n)$ space in a random stream (where edges appear in a random order rather than arbitrarily).

Thus, the question now is:
can we obtain a $\text{poly}\log$ space streaming algorithm for approximating the solution cost for other graph problems? 
Consider for instance MAX-CUT. There are several results on approximating the maximum cut in graphs in a non-streaming model; the trivial approach is to take a random cut. This selects half of the edges in expectation, resulting in a factor $1/2$-approximation.

This implies that in a streaming model we can just count the number of edges $m$ and output $\frac{m}{2}$, which results in an $O(\log n)$-space algorithm. By keeping a sample of the edge set we can get a different tradeoff: a $(1+\epsilon)$-approximation algorithm which uses $O(\frac{n}{\epsilon^2})$ space.

Can we get a streaming algorithm with better than factor-$2$ approximation using just $poly(\log n)$ space?

A paper by Kapralov, Khanna and Sudan in the streaming session of SODA15 this year answers this question. This is the latest in a line of results on streaming graph problems by Kapralov and others from SODA 2012, 2013 and 2014 (mentioned above).

Here is their main result:
For any constant $\epsilon > 0$ a single pass streaming algorithm for approximating the value of MAX-CUT  to a factor $2 − \epsilon$ requires $\Omega(\sqrt{n})$ space, even in the random order model.
This result rules out the possibility of $\text{poly}(\log n)$ space, but suggests that $\tilde{O}(\sqrt{n})$ space cost might be possible in some specific settings.

Another major result of this paper is as follows:

For any constant $\epsilon > 0$ a single pass streaming algorithm for approximating MAX-CUT value to factor $1 + \epsilon$ requires $n^{1−O(\epsilon)}$ space in adversarial streams.

The main fact they  use here is the connection between the MAX CUT value and (distance from) bipartiteness:
if a graph $G$ with $m$ edges is $\beta$-far from being bipartite, then maxcut value of $G$ is at most $(1-\beta)m$. 
The other simple observation is that any algorithm that computes a $\gamma$-approximation to MAX CUT distinguishes between bipartite graphs and graphs that are $1-\frac{1}{\gamma}$-far from being bipartite. Thus to show that no streaming algorithm using space $c$ can achieve a $\gamma$-approximation with failure probability at most $\delta$, it's enough to show that no streaming algorithm using space $c$ can distinguish between bipartite graphs and graphs which are $1- \frac{1}{\gamma}$-far from being bipartite with probability at least $1- \delta$.

Given these facts, now the core idea to prove the main result here is exhibiting a distribution over inputs where $(2-\epsilon)$ approximation to MAX-CUT requires $\Omega(\sqrt{n})$ space.

More precisely, the input graph instances are based on random Erdos-Renyi graphs, which are bipartite in the YES case and non-bipartite in the NO case. In order to achieve a $(2-\epsilon)$-factor gap for the MAX CUT in this structure, the expected degree of vertices is chosen to be $\Theta(\frac{1}{\epsilon^2})$.

This way, the input streaming graph can be partitioned and given to the algorithm in $\Omega(\frac{1}{\epsilon^2})$ phases, which can be simulated as a $\frac{1}{\epsilon^2}$-party one-way communication game.

Then, by giving a reduction from a variation of Boolean Hidden Matching (BHM) called Distributional Boolean Hidden Partition (D-BHP) to MAX-CUT on the input instances of the problem, and showing that $\Omega(\sqrt{n})$ space is necessary to differentiate between these two cases, the main streaming lower bound for approximating MAX-CUT follows.

There are many technical details in performing this reduction, but roughly speaking they show that any algorithm that solves MAX-CUT on the constructed instances must solve the two-party communication problem in at least one of the phases.

There are still some main open problems left in this paper:

  • One is whether breaking the $2$-approximation barrier in $\sqrt{n}$ space is possible if we are allowed $\text{poly}(\log n)$ passes over the input stream.
  • Also it is interesting to think about designing streaming algorithms for other graph problems using $o(n)$ space.

This brings us to another paper presented in this session. In Streaming Algorithms for Estimating the Matching Size in Planar Graphs and Beyond (by Esfandiari, Hajiaghayi, Liaghat, Monemizadeh and Onak), this latter question is answered for estimating the maximum matching in planar graphs using $o(n)$ space.

Here is the main result:
If the underlying graph is planar, then there is a streaming algorithm which provides a $O(1)$-approximation solution to maximum matching with high probability using $O(n^{\frac{2}{3}})$ space.
The main idea for proving the result here is to use a known structural graph property:
If we classify the nodes of the input graph by degree into two groups of light (deg < 9) and heavy (deg > 9) vertices, and define the shallow edges as the ones with two light endpoints, then (writing $m$ for the maximum matching size, $s$ for the number of shallow edges and $h$ for the number of heavy vertices) we have the following nice property:
$$ \frac{\max (s, h)}{12} \leq m \leq h + s$$

Then, using this structural fact as the main tool, we can approximate the size of the maximum matching to within a constant factor: construct a small matching (of size at most $c n^{\frac{2}{3}}$) greedily as we read the edges, and estimate the number of shallow edges and heavy vertices from the subgraph induced by a sample of $c n^{\frac{2}{3}}$ vertices.
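
To see how far the structural fact alone takes you, here is a non-streaming toy sketch (my simplification, with the thresholds copied from the statement above): compute $h$ and $s$ exactly and report the interval guaranteed to contain the maximum matching size. The actual streaming algorithm estimates these counts from samples within the stated space bound.

from collections import defaultdict

def matching_size_bounds(edges, heavy_threshold=9):
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    heavy = {v for v, d in deg.items() if d > heavy_threshold}
    s = sum(1 for u, v in edges if u not in heavy and v not in heavy)  # shallow edges
    h = len(heavy)
    return max(s, h) / 12.0, h + s   # the maximum matching size lies in this interval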

In addition to the planar case, they show that similar results for approximating maximum matching are achievable for other graph classes such as $d$-degenerate graphs and forests.

Coming up: parametrized streaming and VERTEX COVER.


Thursday, January 15, 2015

Brief Review of the post-SODA workshop on algorithmic challenges in machine learning.

My student +John Moeller attended the workshop on algorithmic challenges in machine learning organized by +Shachar Lovett and +Kamalika Chaudhuri after SODA. He wrote up a brief report of some papers that he liked.

Monday, January 12, 2015

FOCS Workshop on higher-order Fourier analysis: A Review

This is a guest post by my student Amirali Abdullah. Amirali attended FOCS 2014 and has a number of interesting reflections on the conference.

2014 was my first experience of attending a FOCS conference, and finally seeing the faces* attached to some of the cutting edge as well as classical results in theoretical computer science. Not only did I get to enjoy the gentle strolls between conference halls, and hearing about fields I'd never known existed, I had the pleasure of doing so while revisiting historic Philadelphia.

With the talk videos now available, I thought now would be a good time to revisit some of the talks and my thoughts on the event. The usual disclaimers apply - this will be by no means comprehensive, and I won't go into the technicalities in much depth.

I'll begin with the first day, where I chose to attend the workshop on Higher-Order Fourier analysis. The starting point is that in the study of a function $f$, it is standard to consider its correlations with the Fourier basis of exponential functions (i.e., functions of the form $e(x) = e^{2 \pi \iota x}$, also called linear phase functions). This basis is orthonormal and has many nice analytic properties.

A standard technique is to decompose a function $f$ into the heavy components of its Fourier basis, and then argue that the contribution of the lower weight components is negligible. This gives a decomposition $f= f_1+ f_2$, where $f_1$ has few non-zero Fourier coefficients, and $f_2$ has all Fourier coefficients close to 0. Another way to view this is that $f_1$ represents the structured part of $f$ and $f_2$ the pseudorandom part.

However for some applications, the analysis of correlation with quadratic (or even higher order) phase functions of the form $e(x) = e^{2 \pi \iota x^2}$ is more powerful and indeed required. (An example of such a problem is where given a function on the integers, one desires to study its behavior on arithmetic progressions of length four or more.)

A subtlety in the study of higher order Fourier analysis is that notions such as "tail" and "weight" now have to be redefined. For regular Fourier analysis (at least on the Hamming cube) the natural notion corresponds to the $\ell_2^2$ norm, or the linear correlation of a function with a linear basis element. In higher order Fourier analysis, however, one has to define norms known as the Gowers uniformity norms, which capture higher order correlations with these higher degree functions. This yields a decomposition $f = f_1 + f_2 + f_3$, where $f_1$ consists of few non-zero higher order phase functions, $f_2$ has small $\ell_2$ norm, and $f_3$ has small Gowers norm under the right notion.
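
To make the $U^2$ case concrete (standard definitions, reproduced from memory, so treat this as a sketch): for a bounded function $f : \mathbb{F}_p^n \to \mathbb{C}$,

$$\|f\|_{U^2}^4 = \mathbb{E}_{x, h_1, h_2}\; f(x)\,\overline{f(x+h_1)}\,\overline{f(x+h_2)}\,f(x+h_1+h_2) = \sum_{\alpha} |\hat{f}(\alpha)|^4,$$

so having large $U^2$ norm is (up to polynomial losses) the same as having a single large Fourier coefficient. The higher norms $\|f\|_{U^k}$ average over $k$-dimensional parallelepipeds instead, and via the inverse theorems a large $U^k$ norm corresponds to correlation with a degree-$(k-1)$ phase function.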

There were several talks discussing the various subfields of higher order Fourier analysis. Madhur Tulsiani discussed some of the algorithmic questions, including computing such a decomposition of a function into higher order Fourier terms in an analog of the Goldreich-Levin algorithm. Yuichi Yoshida discussed applications to algebraic property testing.

Abhishek Bhowmick discussed his very interesting paper with Lovett showing that the list-decoding radius of the Reed-Muller code over finite prime fields equals (approximately) the minimum distance of the code. The application of higher order Fourier analysis here is essentially to decompose the input space of the code by high-degree polynomials so that any random distribution is well-spread over these partition atoms.

I thought the workshop was an interesting exposure to some deep mathematics I had not previously seen, and gave some good pointers to the literature/work I can consult if I ever want a richer understanding of the toolset in the field.

Note: Thanks to Terry Tao's book on the subject for some useful background and context. All mistakes in this post are entirely mine. For a much more comprehensive, mathematically correct and broad view on the subject, do check out his blog.

* One can of course claim that 'seeing faces' could also include the digital images on faculty and student websites snapped in haste or misleading poses, but I choose to ignore this subtlety.

Privacy is dead, but not because Scott McNealy said so.

I decided to experiment with using Medium for a long-form not-quite technical post on the evolution of research in data privacy.
The idea of privacy has driven much of our concerns over data for the last few years, and has been the driver of extremely successful research efforts, most notably in differential privacy. What we’re seeing now though is a pivot away from the undifferentiated and problematic notion of privacy in data towards a more subtle and nuanced notion of how data is actually used, and how it should be used.

Here's the post: Medium allows this nifty way of commenting inline, so comment away there, or here.

Thursday, December 11, 2014

Accountability in data mining

For a while now, I've been thinking about the endgame in data mining. Or more precisely, about a world where everything is being mined for all kinds of patterns, and the contours of our life are shaped by the patterns learnt about us.

We're not too far from that world right now - especially if you use intelligent mail filtering, or get your news through social media. And when you live in such a world, the question is not "how do I find patterns in data", but rather
How can I trust the patterns being discovered ? 
When you unpack this question, all kinds of questions come to mind: How do you verify the results of a computation ? How do you explain the results of a learning algorithm ? How do you ensure that algorithms are doing not just what they claim to be doing, but what they should be doing ?

The problem with algorithms is that they can often be opaque: but opacity is not the same as fairness, and this is now a growing concern amongst people who haven't yet drunk the machine learning kool-aid.

All of this is a lead up to two things. Firstly, the NIPS 2014 workshop on Fairness, Accountability and Transparency in Machine Learning, organized by +Moritz Hardt and +Solon Barocas and written about by Moritz here.

But secondly, it's about work that +Sorelle Friedler, +Carlos Scheidegger and I are doing on detecting when algorithms run afoul of a legal notion of bias called disparate impact. What we're trying to do is pull out from the legal notion of bias something more computational that we can use to test and certify whether algorithms are indulging in legally proscribed discriminatory behavior.

This is preliminary work, and we're grateful for the opportunity to air it out in a forum perfectly designed for it.

And here's the paper:
What does it mean for an algorithm to be biased?
In U.S. law, the notion of bias is typically encoded through the idea of disparate impact: namely, that a process (hiring, selection, etc) that on the surface seems completely neutral might still have widely different impacts on different groups. This legal determination expects an explicit understanding of the selection process. 
If the process is an algorithm though (as is common these days), the process of determining disparate impact (and hence bias) becomes trickier. First, it might not be possible to disclose the process. Second, even if the process is open, it might be too complex to ascertain how the algorithm is making its decisions. In effect, since we don't have access to the algorithm, we must make inferences based on the data it uses. 
We make three contributions to this problem. First, we link the legal notion of disparate impact to a measure of classification accuracy that while known, has not received as much attention as more traditional notions of accuracy. Second, we propose a test for the possibility of disparate impact based on analyzing the information leakage of protected information from the data. Finally, we describe methods by which data might be made "unbiased" in order to test an algorithm. Interestingly, our approach bears some resemblance to actual practices that have recently received legal scrutiny.
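To make the legal threshold concrete: disparate impact is commonly operationalized via the EEOC's "four-fifths rule", which compares selection rates across groups. Here's a minimal sketch of that ratio computation (with made-up data; this illustrates the legal rule of thumb, not the certification test from the paper):

    # A minimal sketch of the "four-fifths rule" often used to operationalize
    # disparate impact: compare the rate of favorable outcomes for the protected
    # group against the rate for everyone else. Made-up data; illustration only.

    def disparate_impact_ratio(outcomes, protected):
        """outcomes: 0/1 decisions (1 = favorable); protected: 0/1 group flags."""
        prot = [o for o, p in zip(outcomes, protected) if p == 1]
        rest = [o for o, p in zip(outcomes, protected) if p == 0]
        if not prot or not rest:
            raise ValueError("need members of both groups")
        rate_prot = sum(prot) / len(prot)   # selection rate, protected group
        rate_rest = sum(rest) / len(rest)   # selection rate, everyone else
        return rate_prot / rate_rest

    # Hypothetical hiring decisions: 1 = hired, 0 = not hired.
    outcomes  = [1, 0, 0, 1, 1, 1, 0, 1, 0, 0]
    protected = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
    ratio = disparate_impact_ratio(outcomes, protected)
    print(round(ratio, 2), "flagged" if ratio < 0.8 else "ok under the four-fifths rule")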
In one form or another, most of my recent research touches upon questions of accountability in data mining. It's a good thing that it's now a "trend for 2015". I'll have more to say about this in the next few months, and I hope that the issue of accountability stays relevant for more than a year.

Friday, December 05, 2014

Experiments with conference processes

NIPS is a premier conference in machine learning (arguably the best, or co-best with ICML). NIPS has also been a source of interesting and ongoing experiments with the process of reviewing.

For example, in 2010 Rich Zemel, who was a PC chair of NIPS at the time, experimented with a new system he and Laurent Charlin were developing that would determine the "fit" between a potential reviewer and a submitted paper. This system, called the Toronto Paper Matching System, is now being used regularly in the ML/vision communities.

This year, NIPS is trying another experiment. In brief,

10% of the papers submitted to NIPS were duplicated and reviewed by two independent groups of reviewers and Area Chairs.
And the goal is to determine how inconsistent the reviews are, as part of a larger effort to measure the variability in reviewing. There's even a prediction market set up to guess what the degree of inconsistency will be. Also see Neil Lawrence's fascinating blog describing the mechanics of constructing this year's PC and handling the review process.

I quite like the idea of 'data-driven' experiments with format changes. It's a pity that we didn't have a way of measuring the effect of the two-tier committee for STOC a few years ago: we had to rely on anecdotes about its effectiveness, and didn't even run the experiment long enough to collect sufficient data. I feel that every time there are proposals to change anything about the theory conference process, the discussion gets drowned out in a din of protests, irrelevant anecdotes, and opinions based entirely on ... nothing, and so nothing ever changes.

Maybe there's something about worst-case analysis (and thinking) that makes change hard :).

Thursday, November 20, 2014

Open access, ACM and the Gates Foundation.

Matt Cutts, in an article on the new Gates Foundation open access policy (ht +Fernando Pereira) says that
while the ACM continues to drag its heels, the Gates Foundation has made a big move to encourage Open Access...
Which got me thinking. Why can't the ACM use this policy as a guideline to encourage open access ? Specifically,

  • Announce that from now on, it will subsidize/support the open access fees paid by ACM members
  • (partially) eat the cost of publication in ACM publications (journals/conferences/etc)
  • Use the resulting clout to negotiate cheaper open access rates with various publishers in exchange for supporting open access fees paid to those journals. 
Of course this would put the membership side of ACM at odds with its publication side, which maybe points to another problem with ACM having these dual roles.

Monday, October 06, 2014

Algorithms is the new Algorithms...

Hal Daumé wrote a rather provocative post titled ‘Machine Learning is the new algorithms’, and has begged someone to throw his gauntlet back at him. Consider it now thrown !

His thesis is the following quote:
Everything that algorithms was to computer science 15 years ago, machine learning is today
And among the conclusions he draws is the following:
we should yank algorithms out as an UG requirement and replace it with machine learning
Having spent many years having coffees and writing papers with Hal, I know that he truly does understand algorithms and isn’t just trying to be a troll (at least I hope not). So I’m trying to figure out exactly what he’s trying to say. It will help if you read his article first before returning…

First off, I don’t understand the conclusion. Why not (say) replace architecture with ML, or databases with ML ? Or why replace anything at all ? The assumption seems to be that ML is a core tool that students will need more than any other topic. Now I have no problem with adding ML to the list of “things a CS person ought to know”, but I don’t see why it’s not important for a CS person to understand how a database works, or how a compiler processes code, or even how we design an algorithm. This fake mutual exclusiveness appears to be without basis.

But maybe he’s saying that algorithms and ML are two flavors of the same object, and hence one can be replaced by the other. If so, what exactly is that object ? In his view, that object is:
given an input, what’s the best way to produce an (optimal) output ? 
This seems to be an overly charitable view of ML. In ML, the goal is to, well, learn. Or to be more stodgy about it, ML provides tools for making systematic inferences and predictions from data.

Which suggests that the concerns are fundamentally orthogonal, not in opposition (and Sasho Nikolov makes this point very well in a comment on Hal’s post). As Hal correctly argues, the hard work in ML is front-loaded: modeling, feature extraction, and the like. The algorithm itself is mostly an afterthought.

But what’s ironic is that one of the most important trends in ML of late is the conversion of an ML problem to an optimization problem. The goal is to make good modeling choices that lead to an optimization problem that can be solved well. But wait: what do you need to know to solve an optimization problem ? Wait for it…… ALGORITHMS !!
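To make that pipeline concrete, here's a minimal sketch (toy one-dimensional data, plain gradient descent on the logistic loss, step size picked by hand) of a learning problem posed as an optimization problem and then handed off to an algorithmic workhorse:

    # A minimal sketch of "ML problem -> optimization problem -> algorithm":
    # logistic regression on toy data, minimized by plain gradient descent.
    # All data and step sizes here are made up for illustration.
    import math

    xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]   # toy 1-d inputs
    ys = [0, 0, 0, 1, 1, 1]                  # labels

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    # Model: P(y = 1 | x) = sigmoid(w*x + b); minimize the average logistic loss.
    w, b, step = 0.0, 0.0, 0.5
    for _ in range(2000):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y     # per-point gradient factor
            gw += err * x
            gb += err
        w -= step * gw / len(xs)             # the "algorithms" part: gradient descent
        b -= step * gb / len(xs)

    print([round(sigmoid(w * x + b), 2) for x in xs])  # predictions track the labels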

The argument about stability in algorithms is an odd one, especially coming from someone who’s just written a book on ML. Yes, some core algorithms techniques haven’t changed much in the last many years, but did you see that 2014 paper on improvements in recurrence analysis ? Or the new sorting algorithm by Mike Goodrich ? or even the flurry of new results for Hal’s beloved flow problems ?

As for “everything’s in a library”, I’ll take your boost graph library and give you back WEKA, or libSVM, or even scikit-learn. I don’t need to know anything from Hal’s book to do some basic futzing around in ML using libraries: much like I could invoke some standard hashing subroutine without knowing much about universal hash families.

In one sense though, Hal is right: ML is indeed where algorithms was 15 years ago. Because 15 years ago (approximately) is when the streaming revolution started, and with it the new interest in sublinear algorithms, communication complexity, matrix approximations, distributed algorithms with minimal communication, and the whole “theory of big data” effort. And ML is now trying to catch up to all of this, with some help from algorithms folks :).

What is true is this though: it wouldn’t hurt us to revisit how we construct the core algorithms classes (undergrad and grad). I realize that CLRS is the canon, and it would cause serious heart palpitations to contemplate not stuffing every piece of paper of that book down students’ throats, but maybe we should be adapting algorithms courses to the new and exciting developments in algorithms itself. I bring in heavy doses of approximation and randomization in my grad algorithms class, and before we forked off a whole new class, I used to teach bits of streaming, bloom filters, min-wise hashing and the like as well. My geometry class used to be all about the core concepts from the 3 Marks book, but now I throw in lots of high dimensional geometry, approximate geometry, kernels and the like.

Ultimately, I think a claim of fake mutual exclusivity is lazy, and ignores the true synergy between ML and algorithms. I think algorithms research has a lot to learn from ML about robust solution design and the value of "noise-tolerance", and ML has plenty to learn from algorithms about how to organize problems and solutions, and how deep dives into the structure of a problem can yield insights that are resilient to changes in the problem specification.



Wednesday, October 01, 2014

On the master theorem vs Akra-Bazzi

Everyone knows the master theorem. 

Or at least everyone reading this blog does. 

And I'm almost certain that everyone reading this blog has heard of the generalization of the master theorem due to Akra and Bazzi. It's particularly useful when you have recurrences of the form 
$$ T(n) = \sum_i a_i T(n/b_i) + g(n) $$
because like the master theorem it gives you a quick way to generate the desired answer (or at least a guess that you can plug in to the recurrence to check). 

(And yes, I'm aware of the generalization of A/B due to Drmota and Szpankowski)

When I started teaching grad algorithms this fall, I was convinced that I wanted to teach the Akra-Bazzi method instead of the master theorem. But I didn't, and here's why.

Let's write down the standard formulation that the master theorem applies to
$$ T(n) = a T(n/b) + f(n) $$
This recurrence represents the "battle" between the two terms involved in a recursive algorithm: the effort involved in dividing (the $a T(n/b)$) and the effort involved in putting things back together (the $f(n)$). 

And the solution mirrors this tension: we look at which term is "stronger" and therefore dominates the resulting running time, or what happens when they balance each other out. In fact this is essentially how the proof works as well. 
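For reference, the standard statement of the three cases, writing $c = \log_b a$ for the exponent coming from the recursive term:
$$ T(n) = \begin{cases} \Theta(n^{c}) & \text{if } f(n) = O(n^{c-\epsilon}),\\ \Theta(n^{c}\log n) & \text{if } f(n) = \Theta(n^{c}),\\ \Theta(f(n)) & \text{if } f(n) = \Omega(n^{c+\epsilon}) \text{ and } a f(n/b) \le \delta f(n) \text{ for some } \delta < 1, \end{cases} $$
(for some constant $\epsilon > 0$ in the first and third cases). The first case is "division wins", the third is "merging wins", and the middle case is the balanced one.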

I have found this to be a useful way to make the master theorem "come alive" as it were, and allow students to see what's likely to happen in a recurrence without actually trying to solve it. And this is very valuable, because it reinforces the point I'm constantly harping on: that the study of recurrences is a way to see how to design a recursive algorithm. That decimation as  a strategy can be seen to work just by looking at the recurrence. And so on.

But the Akra-Bazzi method, even though it's tremendously powerful, admits no such easy intuition. The bound comes from solving the equation 
$$ \sum_i a_i b_i^{-p} = 1 $$ for $p$ (with the recurrence written as above, so that each $b_i > 1$), and this is a much more cryptic expression to parse. And the proof doesn't help make it any less cryptic. 

Which is not to say you can't see how it works with sufficient experience. But that's the point: with sufficient experience. From a purely pedagogical perspective, I'd much rather teach the master theorem so that students get a more intuitive feel for recurrences, and then tell them about A/B for cases (like in median finding) where the master theorem can only provide an intuitive answer and not a rigorous one. 
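That said, the Akra-Bazzi computation itself is completely mechanical. Here's a minimal sketch (assuming the selection recurrence $T(n) = T(n/5) + T(7n/10) + O(n)$) that solves $\sum_i a_i b_i^{-p} = 1$ for $p$ by bisection; the fact that $p < 1$ is what certifies that the linear work per level dominates, giving $T(n) = \Theta(n)$.

    # A minimal sketch: numerically solve the Akra-Bazzi characteristic equation
    # sum_i a_i * b_i**(-p) = 1 for a recurrence T(n) = sum_i a_i T(n/b_i) + g(n).
    # Illustration only; the full theorem then gives the closed form via an integral of g.

    def akra_bazzi_exponent(terms, lo=-10.0, hi=10.0, tol=1e-12):
        """terms: list of (a_i, b_i) pairs with a_i > 0 and b_i > 1.
        Returns p solving sum a_i * b_i**(-p) = 1 (the LHS is decreasing in p)."""
        def f(p):
            return sum(a * b ** (-p) for a, b in terms) - 1.0
        assert f(lo) > 0 > f(hi), "initial bracket does not contain the root"
        while hi - lo > tol:
            mid = (lo + hi) / 2.0
            if f(mid) > 0:
                lo = mid       # root lies above mid
            else:
                hi = mid       # root lies at or below mid
        return (lo + hi) / 2.0

    # Median-of-medians selection: T(n) = T(n/5) + T(7n/10) + O(n),
    # i.e. the terms are (a, b) = (1, 5) and (1, 10/7).
    p = akra_bazzi_exponent([(1, 5), (1, 10 / 7)])
    print(round(p, 4))  # ~0.84 < 1, so the O(n) term dominates and T(n) = Theta(n)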


Friday, September 26, 2014

STOC 2015 Deadline: Nov 4, 2014

Via Ronitt Rubinfeld comes word that the STOC 2015 CFP is out.

Submission deadline: Tuesday, November 4, 2014, 3:59pm EST

Conference: June 14-17 2015, Portland, Oregon (part of FCRC)

Saturday, August 23, 2014

LaTeX is code...

I'm giving a talk on LaTeX this Monday as part of our new grad student "boot camp" series. It's really more of an interactive presentation: I'll use writelatex (or sharelatex) to demo examples, give students simple assignments, and use real-time chat to see how things are going. It should be quite interesting.

Here's the talk announcement:
Did you know that every time you use $..$ to italicize text, or use \\ to force a newline, Leslie Lamport cries out in agony and Don Knuth starts another fascicle of Volume IV of TAoCP ?  
Come and learn about how to use LaTeX, and use it well. Your collaborators will love you, your advisors will love me, and you'll realize that the most awfully written drivel looks awesome when typeset well.  
This will be interactive!! I'll be using a shared space for editing and viewing latex documents, and there will be class activities, so please do bring your laptops/tablets/other editing device so you can follow along and participate. 

For this talk I solicited comments from colleagues as to what they'd like their students to learn. Probably the most useful comment I got was from +Robert Ricci and +Eric Eide: to wit,

LaTeX is code.

This might seem obvious, but once you internalize it, all kinds of other things become very natural. For example
  • You should really use some kind of IDE to write and build your documents
  • Version control is your friend
  • (Sub)sections should be separate files. 
  • Text should be readable. 
  • Use macros where convenient
  • Don't reinvent: use the many many built-in packages at ctan.org
  • Use tex.stackexchange.com to learn how to hack whatever you need in LaTeX. 

A corollary: to see a theoretician editing LaTeX close to a STOC/FOCS/SODA deadline is to realize that theory folk are AWESOME programmers.







Tuesday, August 19, 2014

Long Live the Fall Workshop (guest post by Don Sheehy)

An announcement for the Fall Workshop in Computational Geometry, by Don Sheehy


In all the conversation about SoCG leaving the ACM, there were many discussions about ownership, paywalls, and money.  This leads naturally to questions of ideals.  What can and ought a research community be like?  What should it cost to realize this?  Isn't it enough to bring together researchers in an unused lecture hall at some university somewhere, provide coffee (and wifi), and create a venue for sharing problems, solutions, and new research in an open and friendly atmosphere?  There is a place for large conferences, with grand social events (Who will forget the boat cruise on the Seine at SoCG 2011?), but there is also a place for small meetings run on shoestring budgets that are the grassroots of a research community.

The Fall Workshop on Computational Geometry is such a meeting.  It started in 1991 at SUNY Stony Brook and has been held every fall since.  I first attended a Fall Workshop during my first year of graduate school, back in 2005.  This year marks the 24th edition of the workshop, and this time, I will be hosting it at the University of Connecticut.  It is organized as a labor of love, with no registration fees.  There are no published proceedings, and it is a great opportunity to discuss new work and fine-tune it in preparation for submission.  It is perfectly timed to provide a forum for presenting and getting immediate feedback on your potential SoCG submissions.  I cordially invite you to submit a short abstract to give a talk, and I hope to see you there.

Important dates:
Submission deadline: Oct 3 midnight (anywhere on earth)
Conference: Oct 31-Nov 1, 2014. 


Wednesday, August 13, 2014

Interdisciplinary research and the intellectual richness of data analysis

Slides on a brief introduction to themes in machine learning from an algorithms perspective, and some thoughts on the mathematical richness of the study of data.

Wednesday, August 06, 2014

A brief note on Fano's inequality

I've been bumping into Fano's inequality a lot lately, and have found the various explanations on the web somewhat lacking. Not because they aren't useful, but because their perspective is very different to the kind that I'd prefer as an algorithms person.

So after grumbling and mumbling and complaining, I decided the only solution was to write my own ! And here it is, as a raindrop.

Eh ? What's that you say ? And here we're just getting used to twitter ?

Raindrops are a web publishing form designed by the company run by our very own V. Vinay. When I was in India last I visited him in Bangalore, and he showed me the system. It's a nice way to make presentations or short lectures.

The raindrop I created is embedded directly in my blog, but can also be viewed directly at this link. I hope you like the medium, and the content !

Sunday, July 27, 2014

A response to Vint Cerf

A somewhat grumpy response to +Vint cerf 's request for feedback on how the ACM can do more for "professional programmers".

Dear  Dr. Cerf
  In your recent letter to the members of ACM, you write "I would like to ask readers how they satisfy their need to keep informed about computing practices and research results that may influence their own work". I suspect your goal is to understand how ACM can serve the larger tech community rather than the research community, and I am a card-carrying member of the latter group, but I thought I'd respond anyway.

First up, it's an ambitious (and brave!) idea to think that the ACM (or any single entity for that matter) can serve the needs of the vast technology enterprise. There was probably a time before the web when professional societies played an important role in collecting together people with shared interests and disseminating valuable information out to interested individuals. But now we have your current employer ! and online communities galore ! and Quora ! and the Stackexchange ecosystem ! and so many different ways for people to build communities, share information and learn about new ideas percolating through the world of tech.

It's a little funny though that you're worried about ACM's presence in the professional world. Many of us have long assumed that ACM spends most of its focus on that side of the computing community (the excellent revamp of the CACM under +Moshe Vardi  being the exception that proved the rule). In fact, I'd go as far as to argue that the ACM would be much better served if it were instead to realize how it's driving itself into irrelevance in a research community that so desperately needs an institutional voice.

How do we satisfy our need to keep informed about results that might influence our work ? We (still) read papers and go to conferences. And how does the ACM help ? Well not very well.


  • Aggregating the deluge of information: anyone will tell you that the amount of research material to track and read has grown exponentially. But we still, to this day, have nothing like PUBMED/MEDLINE as a central clearinghouse for publications in CS disciplines. The ACM DL is one step towards this, but it's a very poor imitation of what a 21st century repository of information should look like. It's not comprehensive, its bibliographic data is more erroneous than one expects, and the search mechanisms are just plain depressing (it's much easier to use Google).
  • Dealing with the changing nature of peer review and publication: Sadly, ACM, rather than acting like a society with its members' interests at heart, has been acting as a for-profit publisher with some window dressing to make it look less execrable. Many people have documented this far more effectively than I ever could. 
  • Conference services: One of the things a national organization supposedly provides is the conference support that keeps communities running. But what exactly does the ACM do ? It sits back and nitpicks conference budgets, but provides little in the way of real institutional support. There's no infrastructure to help with conference review processes, no support for at-conference-time services like social networking, fostering online discussion and communities, or even modern web support. I only bring this up because all of these services exist, but piecemeal, and outside the ACM umbrella.

Underneath all of this is a slow but clear change in the overall CS research experience. The CRA has been doing yeoman service here: taking the temperature of the community every year with the Taulbee surveys, putting out a best practices document for postdocs after extensive community discussion, and even forming action groups to help gain more support for CS research from the government. Does the ACM do any of this ?

In many ways, this is a golden era for computer science, as the fruits of decades of work in our field seep out into the larger world under the guise of computational thinking, big data and learning. It's a perfect time for an organization that has deep connections in both the academic and industrial sides of CS to help with the translation and tech transfer needed to maximize the impact of the amazing new technologies we're all developing, as well as reach out to similar institutions in other areas to bring more CS into their communities (as you rightly pointed out).


But there is no sign that ACM has ideas about how to do this or even wants to. And while it continues to chase a professional tech community that doesn't seem to care about it at all, the academics who would have cared are finding their own way.

Monday, June 09, 2014

A declaration of independence (kind of), and technology-induced disintermediation

Musings on the SoCG move to declare independence of the ACM. 

It's a little odd to be sitting in Denmark while high quality rød pølse (with remoulade!) is being made across the world in Kyoto. There's an abysmal lack of blogging/tweeting/facebooking/instagramming/snapchatting/can-I-stop-now-ing from the conference.

Of course, following the lead of the complexity conference, we're attempting to declare our own independence from the EEVVIIL SKELETOR AND HIS MINIONS ACM. Jeff Erickson, our esteemed grand poobah steering committee chair, has put out a series of posts outlining the issues at hand, the history of the matter, and the current status of discussions with SIGACT and ACM. Please visit http://makingsocg.wordpress.com to read, discuss and/or heckle as you see fit.

It might be obvious, but this new wave of devolution is being driven largely by technology - most importantly the existence of LIPIcs. Having a platform for open-access publication with minimal costs exposes as false the claims of institutional providers that they alone can provide the support needed for conferences. In fact, there are a number of related claims that appear to be not true in practice (or at least are not as obviously true as once thought).

* We need the imprimatur of an institutional provider as a signalling mechanism for quality. While it might be too soon to tell if dropping affiliation with established institutions (IEEE or ACM) will affect how publications in a venue will be perceived, there's a lot of confidence in the communities that their long-established reputation will outweigh any loss of prestige.

* Institutional providers provide quality editorial and archival services necessary for a serious venue. I think juxtaposing 'quality' and 'editorial' together with Sheridan Printing might cause my two remaining readers to die of hysterical laughter. But the archival issue is a good one. LIPIcs appears to be funded solidly by the German government for a while, and will take a fixed fee from a conference for publication (capped at 750 €). But the Arxiv was struggling for a while. Should we view the ACM and IEEE as "more likely to last" than any other entity ?

* Institutional providers provide the backing and clout needed for conference organization, hotel bookings and so on. This is another good example of a time-money tradeoff. Organizations like SIAM actually do take over the management of the conference: while the results are not always optimal, there is a clear reduction in hassle for the conference organizers. But organizations like the ACM don't take things over in the same way (and from what I can tell, neither does IEEE). I'm surprised though that there aren't yet lean-and-mean event planning providers that we can just pay money to and make our planning problems go away.

* Institutional providers have the financial wherewithal to manage cycles in revenue streams for a conference. This is another real issue. Conferences that have gone independent have eventually managed to maintain a steady income stream, but theory conferences are smaller and poorer: it remains to be seen whether we can generate the kind of endowment needed to insulate the community against the natural variation in revenue from year to year.

What's disappointing is that none of this had to play out this way.

  • Take LIPIcs for example: they clearly marked out their scope -- indexing, archiving functions and hosting -- while staying away from the more content-driven aspects of the process (editing, proofing etc). This makes a lot of sense, given that everyone who publishes knows how to use LaTeX and style files, but might still not be able to make a web page. Why couldn't the ACM have realized this and provided a slimmed-down publishing service ? 
  • Why do we go to Microsoft (or Easychair, or hotCRP, or Shai Halevi's software) for our conference submission servers ? If the ACM had provided a service of this kind (or even provided hosting for hotCRP/Shai's software), we'd be happily using it right now, and it could have then tied nicely into Regonline, which ACM already partners with. 
  • A lot of the current angst seems to have tone as a root cause: a certain feeling about the provider's attitude towards the conference. This is again something that could have been recognized and addressed before things got to this stage. 
While it's exciting to be part of the "academic spring" (?), people tend to forget that in all revolutions someone gets hurt, and often things don't get better for a long time. I'm intrigued by our attempt to move towards independence though, and the people involved have thought this through very carefully. 


Tuesday, May 20, 2014

On beauty and truth in science.

Philip Ball writes a thought-provoking article in Aeon with the thesis that the kind of beauty that scientists describe does not necessarily match the aesthetic notions of art, and is not even consistent among scientists.

It was hard for me to get beyond the casual conflating of beauty in mathematics (the four-color theorem, the proof of Fermat's theorem, and proofs in general) and beauty in scientific theories (relativity, evolution, and so on). But if one goes beyond the artificial duality constructed by the author, the idea of beauty as a driver in science (and mathematics) is a rich one to explore.
A particular example: for a long time (and even codified in books) it was taught that there were five natural classes of approximation hardness: PTAS, constant factor-hard, log-hard, label-cover (superlogarithmic) hard, and near-linear hard. There were even canonical members of each class. 

Of course, this nice classification no longer exists. There are even problems that are $\log^* n$-hard to approximate, and can also be approximated to that factor. And to be fair, I'm not sure how strong the belief was to begin with. 

But it was such a beautiful idea. 

At least in mathematics, the search for the beautiful result can be quite fruitful. It spurs us on to find better, simpler proofs, or even new ideas that connect many different proofs together. That notion of connection doesn't appear to be captured in the article above: that beauty can arise from the way a concept ties disparate areas together. 

MADALGO Summer School on Learning At Scale

I'm pleased to announce that this year's MADALGO summer school (continuing a long line of summer programs on various topics in TCS) will be on algorithms and learning. The formal announcement is below, and registration information will be posted shortly. 

Save the date ! Aug 11-14, 2014.
MADALGO Summer School on 
LEARNING AT SCALE
   August 11-14, 2014, Aarhus University, Denmark

OVERVIEW AND GOAL
The MADALGO Summer School 2014 will introduce attendees to the latest developments in learning at scale. The topics will include high dimensional inference,  algorithmic perspectives on learning and optimization, and challenges in learning with huge data. 
LECTURES
The school will be taught by experts in learning:
  • Amr Ahmed (Google)
  • Mikhail Belkin (Ohio State)
  • Stefanie Jegelka (Berkeley)
  • Ankur Moitra (MIT)
PARTICIPATION
The summer school will take place on August 11-14, 2014 at the Center for Massive Data Algorithmics (MADALGO) at the Department of Computer Science, Aarhus University, Denmark. The school is targeted at graduate students, as well as researchers interested in an in-depth introduction to learning at scale. Registration will open soon at the school webpage. Registration is free on a first-come-first-served basis - handouts, coffee breaks, lunches and a dinner will be provided by MADALGO and Aarhus University. 
ORGANIZING COMMITTEE
  • Suresh Venkatasubramanian (University of Utah)
  • Peyman Afshani (MADALGO, Aarhus University)
  • Lars Arge (MADALGO, Aarhus University)
  • Gerth S. Brodal (MADALGO, Aarhus University)
  • Kasper Green Larsen (MADALGO, Aarhus University)
LOCAL ARRANGEMENTS
  • Trine Ji Holmgaard (MADALGO, Aarhus University)
  • Katrine Østergaard Rasmussen (MADALGO, Aarhus University)
ABOUT MADALGO 
Center for Massive Data Algorithmics is a major basic research center funded by the Danish National Research Foundation. The center is located at the Department of Computer Science, Aarhus University, Denmark, but also includes researchers at CSAIL, Massachusetts Institute of Technology in the US, and at the Max Planck Institute for Informatics and at Frankfurt University in Germany. The center covers all areas of the design, analysis and implementation of algorithms and data structures for processing massive data (interpreted broadly to cover computations where data is large compared to the computational resources), but with a main focus on I/O-efficient, cache-oblivious and data stream algorithms.

Thursday, May 01, 2014

The history of the vector space model

Gerard Salton is generally credited with the invention of the vector space model: the idea that we could represent a document as a vector of keywords and use things like cosine similarity and dimensionality reduction to compare documents and represent them.
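In its modern form the whole idea fits in a few lines. Here's a minimal sketch (toy documents, raw term counts rather than tf-idf weights) of representing documents as term vectors and comparing them by cosine similarity:

    # A minimal sketch of the vector space model: documents as term-count vectors,
    # compared by cosine similarity. Toy documents and raw counts; illustration only.
    from collections import Counter
    from math import sqrt

    def term_vector(doc):
        """Bag-of-words vector: term -> count."""
        return Counter(doc.lower().split())

    def cosine(u, v):
        dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
        norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
        return dot / norm if norm else 0.0

    d1 = term_vector("the quick brown fox jumps over the lazy dog")
    d2 = term_vector("the lazy dog watches the quick fox jump")
    d3 = term_vector("stochastic gradient descent minimizes a convex loss")

    print(round(cosine(d1, d2), 2))  # overlapping vocabulary: high similarity
    print(round(cosine(d1, d3), 2))  # disjoint vocabulary: similarity 0.0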

But the path to this modern interpretation was a lot twistier than one might think. David Dubin wrote an article in 2004 titled 'The Most Influential Paper Gerard Salton Never Wrote'. In it, he points out that most citations that refer to the vector space model refer to a paper that doesn't actually exist (hence the title). Taking that as a starting point, he then traces the lineage of the ideas in Salton's work.

The discoveries he makes are quite interesting. Among them,

  • Salton's original conception of the vector space model was "operational" rather than mathematical. In other words, his earliest work really uses 'vector space' to describe a collection of tuples, each representing a document. In fact, the earliest terminology used was the 'vector processing model'. 
  • In later papers, he did talk about things like orthogonality and independence, as well as taking cosines for similarity, but this was done in an intuitive, rather than formal manner. 
  • It was only after a series of critiques in the mid 80s that researchers (Salton included) started being more direct in their use of the vector space model, with all its attendant algebraic properties. 
Of course today the vector space model is one of the first things we learn when doing any kind of data analysis. But it's interesting to see that it didn't start as this obvious mathematical representation (that I've taken to calling the reverse Descartes trick). 

Wednesday, April 30, 2014

SIAM Data Mining 2014: On differential privacy

After my trip to Haverford, I attended the SIAM Data Mining (SDM) conference in Philly. For those who aren't that familiar with the data mining universe, SDM is the SIAM entrant in the data mining conference sweepstakes, along with ACM (KDD) and IEEE (ICDM). SDM is probably also the smallest of the three venues, which makes it comparable in feel to SODA (also because of SIAM organization). The conference attracts the usual data mining suspects, but also more of the applied math folks.

I was the tutorials chair this year, and there were a number of very well-attended tutorials ranging from applications to core mining to theory. In particular, +Moritz Hardt and +Aleksandar Nikolov did a very nice tutorial on differential privacy entitled 'Safer Data Mining'.


SDM is a good venue for theory folks wanting to "test the waters" with data mining: the papers are consistently more mathematically oriented and less "business-heavy", and it's a friendly crowd :).

Shameless plug: I'm the PC co-chair next year along with Jieping Ye and I'd encourage more algorithms folks to submit, and visit Vancouver in April.

In a future post, I'll talk more about a panel I ran at the conference, titled 'Ethics in Data Mining'.

Tuesday, April 22, 2014

The Shape of Information

A brief synopsis of my general-audience talk at Haverford College. 

I'm currently visiting Haverford College at the invitation of +Sorelle Friedler as part of Haverford's big data lecture series. Today's talk was a general audience talk about data mining, titled 'The Shape Of Information': (GDrive link)
The Shape Of Information 
What makes data mining so powerful, and so ubiquitous? How can the same set of techniques identify patients at risk for a rare genetic disorder, consumers most likely to like Beyonce's latest album, or even a new star from a sky survey ?  
The answer starts with an idea Descartes had nearly 400 years ago. He suggested expressing geometry in terms of numbers (coordinates). This turned out to be a powerful technique that led (among other things) to the development of the calculus. Data mining returns the favor. It starts with sets of numbers that describe a collection of objects. To find patterns in these objects, we create a geometry in which the numbers are coordinates. And just like that, objects become shapes, and the search for information becomes a quest for common structure in these shapes. 
In this search, we are not limited by the geometry of our world: we can dream up ever more intricate geometries that capture the shape of the information that we seek to find in our data. In this sense, data mining is the best kind of science fiction come to life: we craft a world out of our imagination, and let the laws of this world lead us to fascinating discoveries about the data that inhabits it.
I had a great time visiting with Sorelle's students in their data mining class. Haverford College continues to impress me with the quality of their undergrads (and their faculty !)

Friday, April 18, 2014

danah boyd, Randall Munro, and netizens.

danah boyd, author of 'It's Complicated', just gave a tech talk at Google. Her book has been in the news a lot lately, so I'll skip the details (although Facebook ought to be at least slightly worried).

But what I enjoyed the most about her talk was the feeling that I was listening to a true netizen: someone who lives and breathes on the internet, understands (and has helped build) modern technology extremely well (she is a computer scientist as well as an ethnographer), and is able to deliver a subtle and nuanced perspective on the role and use of technology amidst all the technobabble (I'm looking at you, BIG data) that inundates us.

And she delivers a message that's original and "nontrivial". Both about how teens use and interact with social media, and about how we as a society process technological trends and put them in context of our lives. Her discussion of context collapse was enlightening: apart from explaining why weddings are such fraught experiences (better with alcohol!) it helped me understand incidences of cognitive frisson in my own interactions.

What she shares with Randall Munro in my mind is the ability to speak unselfconsciously and natively in a way that rings true for those of us who inhabit the world of tech, and yet articulate things that we might have felt, but are unable to put into words ourselves. Of course they're wildly different in so many other ways, but in this respect they are like ambassadors of the new world we live in.
