Ruminations on computational geometry, algorithms, theoretical computer science and life
Saturday, December 15, 2012
NIPS II: Deep Learning and the evolution of data models
Over the years, as we've engaged with "big data" more and more, the way we construct mental models of data has changed. And as I've argued before, understanding how we think about data, and what shape we give it, is key to the whole enterprise of finding patterns in data.
The model that one always starts with is Euclidean space. Data = points, features = dimensions, and so on. And as a first approximation of a data model, it isn't terrible.
There are many ways to modify this space. You can replace the $\ell_2$ norm by $\ell_1$. You can normalize the points (again with $\ell_2$ or $\ell_1$, sending you to the sphere or the simplex). You can weight the dimensions, or even do a wholesale scale-rotation.
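In code, these are one-liners (a trivial numpy sketch, just to fix ideas):

```python
import numpy as np

X = np.random.default_rng(0).standard_normal((5, 3))    # 5 points, 3 features

X_sphere  = X / np.linalg.norm(X, axis=1, keepdims=True)         # l2-normalize: points on the sphere
X_simplex = np.abs(X) / np.abs(X).sum(axis=1, keepdims=True)     # l1-normalize (after taking abs): the simplex
w = np.array([1.0, 0.1, 10.0])
X_weighted = X * w                                               # reweight the dimensions

print(np.linalg.norm(X_sphere[0] - X_sphere[1]))     # l2 distance on the sphere
print(np.abs(X_simplex[0] - X_simplex[1]).sum())     # l1 distance on the simplex
```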
But that's not all. Kernels take this to another level. You can encode weak nonlinearity in the data by assuming that it's flat once you lift it. In a sense, this is still an $\ell_2$ space, but a larger class of spaces that you can work with. The entire SVM enterprise was founded on this principle.
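A quick illustration of the "flat once you lift it" idea (my toy example, using scikit-learn): concentric circles can't be separated by a line in the original two dimensions, but an RBF kernel implicitly lifts them into a bigger $\ell_2$ space where a hyperplane does the job.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

print(SVC(kernel="linear").fit(X, y).score(X, y))   # poor: no separating line exists in 2-d
print(SVC(kernel="rbf").fit(X, y).score(X, y))      # ~1.0: linearly separable after the implicit lift
```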
But that's not all either. The curse of dimensionality means that it's difficult to find patterns in such high dimensional data. Arguably, "real data" is in fact NOT high dimensional, or is not generated by a process with many parameters, and so sparsity-focused methods like compressed sensing start playing a role.
But it gets even more interesting. Maybe the data is low-dimensional, but doesn't actually lie in a subspace. This gets you into manifold learning and variants: the data lies on a low-dimensional curved sheet of some kind, and you need to learn on that space.
While the challenge for geometry (and algorithms) is to keep up with the new data models, the challenge for data analysts is to design data models that are realistic and workable.
So what does this have to do with deep learning ?
Deep learning networks "work" in that they appear to be able to identify interesting semantic structures in data that can be quite noisy. But to me it's not entirely clear why that is. So I've been mulling over the question of what kind of structure in data might be "visible" to such a network. In the absence of formal results ("if your data can be separated in some RKHS, then an SVM will find it"), what follows is some speculation, based on talks I attended and conversations I had with NIPS attendees.
Observation 1: Both Stéphane Mallat's talk and a nice paper by Coates, Karpathy and Ng explore the idea of "first-level" features that identify (low-dimensional) invariants and eliminate noise. In the Coates et al paper, they start with a very fine $k$-means clustering ($k$ very large), and attempt to "glue" the centers together into low-dimensional subspace pieces. These are then propagated up a network of higher-order feature processors.
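Here's a cartoon of that first step on synthetic data (my sketch, not their pipeline, and the "gluing into subspaces" is exactly the part I'm waving my hands at). The "triangle" encoding is borrowed from Coates and Ng's earlier single-layer feature learning work:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))       # stand-in for whitened image patches

k = 200                                   # "very fine": k is large relative to any plausible cluster count
km = KMeans(n_clusters=k, n_init=3, random_state=0).fit(X)

D = km.transform(X)                       # distance from each point to each of the k centers
F = np.maximum(0.0, D.mean(axis=1, keepdims=True) - D)   # "triangle" activations: sparse, nonnegative
print(F.shape)                            # (1000, 200): first-level features handed to the next layer
```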
Observation 2: Yoshua Bengio's survey of deep learning (a very readable account) points to work by Geoff Hinton as part of the reinvigoration of the field. A central idea of this work is that deep belief networks can be trained "layer by layer", where each layer uses features identified from the previous layer.
If you stare at these things long enough, you begin to see a picture not of sparse data, or low-rank data, or even manifold data. What you see is a certain hierarchical collection of subspaces, where low-dimensional spaces interact in a low dimensional way to form higher level spaces, and so on. So you might have a low-level "lip" feature described by a collection of 2-3 dimensional noisy subspaces in an image space. These "lip" features in turn combine with "eye" features and so on.
This might seem rather far-fetched, and a strange way to think about data. But I can't claim originality even here. Back in June, Mauro Maggioni gave a talk at CGWeek in Chris Bishop's workshop on the connection between analysis and geometry. In this talk, he described a multi-resolution view on data that admits structure at different scales, and admits different structures at these scales.
The upshot of all of this: it is possible that deep learning is trying to capture hierarchies of low dimensional subspaces that interact in a low dimensional way. This would explain how one is able to avoid the curse of dimensionality, and might also explain why it sometimes can find structure that other methods (kernels, manifold methods, etc) might miss.
Problem is: I'm not sure how one tests this hypothesis. Almost any data set could be viewed this way if you allow enough features and enough "depth" in the hierarchy.
Monday, December 10, 2012
NIPS ruminations I
I just returned from NIPS in Lake Tahoe. For those not in the know, NIPS is one of the premier machine learning conferences, and has been growing rapidly over the last few years. This year's incarnation (the first of at least 7 in Lake Tahoe) had over 1600 attendees, a gazillion workshops, 4 tutorials, and more papers each DAY than all of STOC.
The format of the conference is very interesting. There are keynotes each morning and afternoon, a few selected 20-minute talks, and a few more 5-minute "spotlight" talks, all in single-session mode. All accepted papers get a poster slot on one of four days, and the poster session is a mammoth event from 7pm to midnight each night (yes, you read that correctly).
I'll say more about the format in a later post. For the next few posts, I'll highlight some of the papers and trends that were interesting to me. For more conference blogging, check out Anand Sarwate's sequence (I, II), and Hal Daumé's recap.
Perhaps the most obvious trend at the conference was on deep learning. While it's been in the news recently because of Google's unsupervised discovery of YouTube cats, the methods of deep learning (basically neural nets without lots of back propagation) have been growing in popularity over a long while, and I was awash in autoencoders, pooling, and other jargon from the area. It was amusing to see speakers apologize for NOT talking about deep learning.
Stéphane Mallat's keynote on this topic was a thing of beauty. While I understood very little of the technical details, the central idea was that by using certain convolution-based encodings, and looking for "invariant substructures", you could essentially filter out irrelevant noise in the feature space, generating new features for the next level of the learning network. This simple idea is quite powerful: there was a paper by Coates, Karpathy and Ng on learning faces from videos that used a simple version of this idea, running k-means with lots of clusters to identify these low-D subspaces and factor them out.
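The one-line version of "convolve, take a modulus, pool" on a 1-d toy signal (my cartoon, nothing like the actual scattering construction): the pooled modulus is invariant to circular shifts, which is exactly the kind of nuisance variability you'd like filtered out before the next layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)      # a 1-d "signal"
w = rng.standard_normal(256)      # a fixed filter

def feature(sig, filt):
    conv = np.fft.ifft(np.fft.fft(sig) * np.fft.fft(filt))   # circular convolution
    return np.abs(conv).mean()                                # modulus, then average pooling

print(feature(x, w), feature(np.roll(x, 37), w))   # equal (up to floating point): shifts are invisible
```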
Another trend that's been around for a while, but was striking to me, was the detailed study of optimization methods. Optimization is a major theme in machine learning. This is primarily because so many machine learning problems can be formulated as convex/nonconvex optimizations. Solving these problems takes a lot of ingenuity and engineering, and after you've been swimming in a pond of regularization, sparsity, mixed norms, stochastic gradient descent, and alternating optimization for a while, things can get crazy. There are at least two different workshops on optimization in machine learning (DISC and OPT), and numerous papers that very carefully examined the structure of optimizations to squeeze out empirical improvements.
If I had to mount a critique of this large body of work, it would only be that a lot of this appears to be black magic (a criticism that seems even more true for deep learning). There are a number of tricks in the optimization toolbox, and you can always try any number of them, but it's not always clear which methods work best, and why.
Quick hits on papers that interested me:
Bregman divergences: A paper by Iyer and Bilmes shows how to extend Bregman divergences to the case when the generating function $\phi$ is not convex, but submodular. This is a little tricky, because in such cases there's no unique gradient, and so technically the Bregman divergence is a function of both the function and its subdifferential. One interesting variant is the Lovasz Bregman divergence, which comes from using the Lovasz extension of a submodular function as the convex generator for a Bregman divergence. An application of these constructions comes in ranking, and it's interesting that things like the Hamming distance can be constructed.
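For reference, the classical convex construction that this work relaxes is $D_\phi(x,y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y\rangle$. Two familiar instances (this is just the textbook definition, not the submodular extension):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

# phi = squared l2 norm  ->  D is the squared Euclidean distance
sq = lambda v: np.dot(v, v)
print(bregman(sq, lambda v: 2 * v, x, y), np.sum((x - y) ** 2))

# phi = negative entropy  ->  D is the (generalized) KL divergence
negent = lambda v: np.sum(v * np.log(v))
print(bregman(negent, lambda v: np.log(v) + 1, x, y),
      np.sum(x * np.log(x / y)) - x.sum() + y.sum())
```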
On a separate note, Satoru Fujishige gave a wonderful invited lecture at DISC on submodular functions. I particularly liked the geometric interpretation of submodularity via its polymatroid and how a matroid can be viewed as the object you get when the submodular function is monotone. Of course this has been known for over 40 years, but I "got it" for the first time. If his book is anything like his talk, I really want to get it.
Kernel distances: Ever since I discovered the kernel distance (as in, found it, not invented it) I've been fascinated by how it behaves more or less like the earth mover distance, but is so much easier to compute. Scott Aaronson (at his NIPS invited talk) made this joke about how nature loves $\ell_2$. The kernel distance is "essentially" the $\ell_2$ variant of EMD (which makes so many things easier).
There's been a series of papers by Sriperumbudur et al. on this topic, and in a series of works they have shown that (a) the kernel distance captures the notion of "distance covariance" that has become popular in statistics as a way of testing independence of distributions, (b) as an estimator of distance between distributions, the kernel distance has more efficient estimators than (say) the EMD because its estimator can be computed in closed form instead of needing an algorithm that solves a transportation problem, and (c) the kernel that optimizes the efficiency of the two-sample estimator can also be determined (the NIPS paper).
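Point (b) is easy to see in code: the (biased) plug-in estimator of the kernel distance is just three Gram-matrix averages, with no transportation problem in sight. A small sketch with a Gaussian kernel and an arbitrary bandwidth:

```python
import numpy as np

def kernel_distance(X, Y, sigma=1.0):
    """Biased plug-in estimate of the kernel distance (MMD) with a Gaussian kernel."""
    def gram(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))
    val = gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()
    return np.sqrt(max(val, 0.0))

rng = np.random.default_rng(1)
X  = rng.normal(0.0, 1, size=(200, 2))
X2 = rng.normal(0.0, 1, size=(200, 2))    # drawn from the same distribution as X
Y  = rng.normal(0.7, 1, size=(200, 2))    # a shifted distribution

print(kernel_distance(X, X2), kernel_distance(X, Y))   # small vs. noticeably larger
```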
Metrics for PSD: In my continuing quest to find strange geometries to design algorithms for, I had this paper some time ago on doing geometry on the Riemannian manifold of positive definite matrices. It's a tricky business, and life gets sad quickly when you can't describe the convex hull of three points finitely.
Suvrit Sra proposed a new metric for the space of $n \times n$ positive definite matrices. It's what you get if you apply the Jensen-Shannon construction to the Burg entropy on matrices. It has the property of being more easily computable than the standard Riemannian metric, and also shares with that metric a nice closed form for the midpoint of two matrices. What remains open is the kind of geometry this metric induces: a weird property is that the associated kernel $\exp(-\gamma d^2)$ is p.d. iff $\gamma$ is half-integral below $n-1$, and any real above it.
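If I have the construction right, the squared distance is $\delta^2(A,B) = \log\det\frac{A+B}{2} - \frac{1}{2}\log\det(AB)$, which needs only determinants; the standard affine-invariant Riemannian metric needs generalized eigenvalues. A sketch (treat the formulas as my reading of the paper, not gospel):

```python
import numpy as np
from scipy.linalg import eigvalsh

def s_distance(A, B):
    """sqrt( logdet((A+B)/2) - 0.5*(logdet A + logdet B) ): determinants only."""
    ld = lambda M: np.linalg.slogdet(M)[1]
    return np.sqrt(ld((A + B) / 2) - 0.5 * (ld(A) + ld(B)))

def riemannian_distance(A, B):
    """Affine-invariant metric ||log(A^{-1/2} B A^{-1/2})||_F, via generalized eigenvalues."""
    return np.sqrt((np.log(eigvalsh(B, A)) ** 2).sum())

rng = np.random.default_rng(0)
M1, M2 = rng.standard_normal((2, 5, 5))
A = M1 @ M1.T + 5 * np.eye(5)      # two positive definite matrices
B = M2 @ M2.T + 5 * np.eye(5)
print(s_distance(A, B), riemannian_distance(A, B))
```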
Distance metric learning and MKL.
Our presentation at OPT was about multiple kernel learning, which is closely related to the general problem of distance metric learning (especially when the distance metric is defined by a kernel). There were a good number of distance metric learning papers at NIPS - different approaches that either did local learning of "multi-metrics", or learned a metric based on k-NN points, or even Hamming distance metric learning.
Distributed learning.
Distributed learning (in a communication-restricted environment) seems to be a growing concern, which of course makes me happy. There were a few papers on different kinds of communication-limited learning, including one on doing this for probabilistic PCA (which didn't have formal bounds on the amount of communication though) and one on distributed "non-stochastic" experts.
Wednesday, November 07, 2012
Data, Dimensions and Geometry oh my !
tl;dr: High dimensional geometry (and non standard geometry) is the unifying language that binds data mining problems together.
We're trying to find patterns all over the place, with very different kinds of data. A search engine is trying to find patterns in web pages that might indicate that they have similar content. Brain researchers are throwing MRIs of patients with diseases into an algorithm that attempts to infer patterns in brain scans that might yield clues about pathology and diagnosis. Genome-wide analysis takes what are essentially long strings of letters and tries to explain why certain populations might be susceptible to certain diseases.
Pandora, Shazam and other music sites analyze fragments of music to find related artists, or even just match a tune. While an infinite gangnam-style video might be a frivolous use of data mining on video streams, video scans are being analyzed by robots trying to drive unmanned cars or even just play soccer. Social networks are all the rage: Facebook, Twitter and Google are desperate to understand your circle of friends in order to sell things to you more effectively.
How are we able to find patterns, clusters and trends in such different kinds of data ? The key step in all of this is the idea of features. The trick is to describe each object we are studying as a sequence (or set) of features. For example, a feature set for a document could be the number of times each particular word appears. The feature set for an image could be the count of different colors. For a tune, it might be a collection of features identified by hiring artists to list out which of more than 700 characteristics a piece of music has.
And so on, and so on. What this allows us to do is represent each object as a set of features, whether it's a web page, a brain scan, a video, or even a social graph. Once we do that, we no longer have to worry about the original data (well, kind of !), and different types of data are all on an equal footing.
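For the document case, the feature map really is a few lines of code (a toy version that ignores all the real-world preprocessing):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]

vocab = sorted({w for d in docs for w in d.split()})
def featurize(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]      # one coordinate per word in the vocabulary

for d in docs:
    print(featurize(d))
```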
But what now ? Here's where a cool trick that dates back to the 1600s comes in.
I doubt that René Descartes ever heard of the term "data mining". But in a strange way, he's the father of the field. One of Descartes' claims to fame was establishing a link between geometry and algebra. He said that if we wanted to represent points, lines and other shapes, we could do so not abstractly as Euclid and other classical geometers did, but using algebra. A "point" in the plane could be described by two coordinates (x,y), and a line in the plane could be described by the equation y = mx + c.
This is a very simple idea - children learn it in middle school. And yet like most simple ideas, it was revolutionary, and completely changed our perspective on geometry. The unification of algebra and geometry is so powerful and so natural, we use it almost unconsciously today.
But what does Descartes have to do with data mining ?
Remember those features I was telling you about ? That every object can be represented by a sequence of numbers, each number describing some aspect of the object.
Let's do the opposite of what Descartes proposed ! In other words, let's pretend that the objects are actually points in a space. Now this space is not the plane (unless we had only two features). It's a high dimensional space, with one dimension for each feature.
This process is entirely artificial. There is no real reason why we must think of objects as points in a high dimensional space. But here's the cool thing that happens. We can now express all basic mining primitives as geometric concepts, by translating the language of the primitive to this high dimensional space.
A clustering of data becomes a grouping of points so that "nearby" points are grouped together. A classification of reviews into positive and negative statements becomes a way to separate "positive" points and "negative" points by a line. Finding trends in data becomes the problem of fitting a straight line to a collection of points.
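Here's what that dictionary looks like on made-up data (scikit-learn, purely illustrative): clustering, a linear separator, and a trend line are one call each.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])   # two groups of points
y = np.array([0] * 50 + [1] * 50)                                       # "negative" and "positive"

print(KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(X)[:5])   # clustering: group nearby points

clf = LogisticRegression().fit(X, y)
print(clf.coef_, clf.intercept_)        # classification: the equation of a separating line

t = np.arange(100).reshape(-1, 1)
trend = LinearRegression().fit(t, 2.0 * t.ravel() + rng.normal(0, 1, 100))
print(trend.coef_)                      # finding a trend: the slope of a fitted line (~2.0)
```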
It is hard to overstate how utterly bizarre this is. There is no underlying "geometry" in the problem we're solving. We're essentially creating a castle out of thin air in order to understand the problem. And yet it works, and is the bedrock of how we think about data mining.
But wait ! there's more. What exactly does it mean to say that points are "nearby" or they are "separated" ? To answer this question, it's not enough to view objects as points in a high dimensional space. You have to give this space a shape - a geometry (and also a topology, but I'll skip that for now).
For example, if I have two feature lists, how do I measure the distance between them ? If they were points on a map, I could do the usual straight line distance. But does that really capture how far apart they are ? After all, if you're in downtown Salt Lake with its grids, a "crow flies" distance doesn't help estimate how far things are. If you're on the surface of the earth, you can't really tunnel through the center to get from one point to another.
So we have to create the shape of the space by defining how far apart two points are. And this is one of the trickiest parts of data mining. Either we have to use some domain knowledge to estimate which features are significant and control more of the distance between objects, or we have to try and learn what seems like the right distance between objects based on user estimates or other information.
The good thing is that we have a huge collection of shapes to play with, and different shapes tend to work well for different classes of objects. Some are easy to interpret, others are easy to compute, and so on. So a good part of the "art" in data mining is understanding the data and estimating the right geometry for it, so that our tasks (expressed geometrically) give us meaningful answers.
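The same two feature vectors under three choices of shape (a trivial sketch, but that's the point: the answers disagree):

```python
import numpy as np

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])

euclidean = np.linalg.norm(a - b)                                 # "as the crow flies"
manhattan = np.abs(a - b).sum()                                   # downtown-Salt-Lake grid distance
cosine    = 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # only directions matter

print(euclidean, manhattan, cosine)     # 5.0, 7.0, ~0.008: same points, three different geometries
```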
Or as Dorothy famously said to her dog, "Toto, I don't think we're in the plane any more!"
Tuesday, November 06, 2012
On the elections, Nate Silver, and lessons for data mining
There are lessons here for the greater goal of using data mining for prediction and modelling, and this will get more and more important the more predictive analytics intrude on our lives.
People don't quite understand probability, even though they understand odds.
There's a lot of confusion about what it means for a one-time event to have a probability associated with it, even though people are quite comfortable with the idea of odds in (for example) sports. This to me reflects a deeper confusion between the frequentist and Bayesian views of probability: namely, is probability the long-run average of the frequency of an event, or an a priori expression of uncertainty in the likelihood of an event ?
I'm not going to say anything here that hasn't been said for more than a century, but from the point of view of interpreting the results of mining, this distinction will be important.
Aggregation is a GOOD THING.
It is entirely likely that all the poll aggregators will have egg on their face tomorrow when the results come in. But it does seem that aggregation helps smooth out the natural irregularities and biases in individual polls. This is a theme that comes up time and again in data mining, and especially in clustering: rather than trusting a single algorithm, you should try to run algorithms that return diverse answers and aggregate them in some fashion.
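A back-of-the-envelope simulation of the smoothing effect (this only captures the variance-reduction half of the story, not the bias corrections that poll aggregators actually sweat over):

```python
import numpy as np

rng = np.random.default_rng(0)
true_margin = 0.02                                    # the "real" margin
polls = true_margin + rng.normal(0, 0.03, size=30)    # 30 noisy polls, each off by ~3 points

print(np.abs(polls - true_margin).mean())    # typical error of a single poll
print(abs(polls.mean() - true_margin))       # error of the aggregate: much smaller
```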
It's not enough to predict: you must also explain.
Among the good reasons to feel uneasy about the aggregate predictions is that they don't give much insight into why things are going the way they are. To be fair, Nate Silver laid out some economic markers back in the spring, and tied them to possible outcomes via regression. While this provides some insight, it's still pattern matching as opposed to a mechanism.
More generally, it is very difficult to convince people that an answer is pertinent or "correct" unless there's some way to explain the answer without listing a sequence of coefficients or writing down a collection of centers. Much of the popularity of decision trees comes from the ease of explanation they seem to provide.
Conclusion.
Most of the controversy around data mining in the public sphere has centered around privacy issues. Indeed, privacy issues become a concern precisely because people worry that the mining algorithms are too accurate. But in fact we don't really understand the behavior of many algorithms that we use, and that is dangerous regardless of privacy concerns. The angst over the methods used to predict this election is an illustration of what will happen when the predictions we make start to matter, and matter to many people.
Monday, October 08, 2012
On why I'm excited about "big data"
At the risk of embarrassing myself even more than I usually do, here's the video. Note that this was recorded at a time of great crisis across the globe, when all hair styling products had mysteriously disappeared for a few days.
Wednesday, October 03, 2012
We're hiring FIVE (count 'em, FIVE) faculty this year.
Our department is looking to hire five new faculty (at least four are at the assistant professor level). I'm particularly excited that we're hiring two people in the general area of big data (following up on our data mining and database hires from two years ago).
One slot is in what I'll call "big data meets big performance": I'll have more to say about this shortly, but the challenges of large data analysis are not just about managing the data, but about managing large numbers of machines to crunch this data (MapReduce is perhaps the most well known example of this). We're looking for people who can "speak extreme data and extreme processors" fluently - these could be on the data/systems management side, or on the analysis side, or the modelling side.
Utah has a strong presence in high performance computing (the Supercomputing conference is happening in Salt Lake, and Mary Hall is the general chair), and we're one of the few places that has a good understanding of both sides of the large data story (i.e. machines and bits).
The second slot is in machine learning, with ties to language. Text (and language) provide one of the best large data sources for scalable machine learning, and we're looking for people interested in the challenges of doing ML at scale, especially when dealing with NLP. If you're that person, you'll be coming into a department that has the entire range of data analysis folks from algorithms to analysis to systems to NLP (with many other faculty that are heavy users of ML technology).
Our plan, once we fill these slots, is to make Utah a one-stop shop for large-scale data analysis and visualization - in addition to the above slots, we're also looking to hire in information visualization to complement our strong viz presence.
In addition to the above slots, we are also hiring in computer security and HCI/user interfaces. While I'm equally excited about these positions, I know much less about the areas :). I will point out that we have a big systems group that covers many aspects of security (language security, verification, and network security) already. We've also had strong demand from students and industry for research in HCI, which will complement our info-viz efforts (and even our data mining work).
For more details on how to apply, see our ad. We'll start reviewing applications after Dec 1. Feel free to email me if you have questions about the slots (but don't send me your application material - send it in directly).
Disclaimer: the above views are my own personal views, and don't represent the views of the department or the hiring subcommittees.
Monday, October 01, 2012
Things a _____ should do at least once...
Firstly, you should check out the G+ post for even more scathing commentary on my original post.
Next, you should see the followups (in case you haven't already):
- Complexity blog: Things a complexity theorist....
- Turing's invisible hand: Things an AGTer....
- Peter Krautzberger (on Boole's rings): Things a set theorist..
- Computational Combinatorics: Things a graph theorist...
- Quantum Pontiff: Things a quantum information theorist ....
Wednesday, September 19, 2012
FOCS 2012 Student Support
They note that they can fund 24-25 people, so people in the US should apply EVEN IF they don't have a paper.
The NSF has been very generous of late in supporting student travel to conferences, and the best way to encourage them to continue is to show that we use the money they give us :). My students have taken advantage of these opportunities in the past, and it's been a great experience for them.
So if you're a student reading this (and I know there are many of you!) or if you're an advisor who may not have the funding to send your student(s) to FOCS, do take advantage of this chance.
Sunday, September 09, 2012
Things a TCSer should have done at least once
Rules of the game:
- You lose points if you do it as part of a class. If you decide to round an LP in a class on approximations, that's something you're being taught to do. But if you do it as part of solving a problem, then you've also done the difficult job of recognizing that an LP needs to be rounded. That latter skill demonstrates some mastery.
- The goal is not complete coverage of all of TCS; rather, it's coverage of techniques that will come up fairly often, no matter what kinds of problems you look at.
- This list is necessarily algorithms-biased. I doubt you'll need many of these if you're doing (say) structural complexity.
- A similar caveat applies to classical vs quantum computing. While it seems more and more important even for classical computations that one knows a little quantumness, I don't know enough about quantum algorithm design to add the right elements. Comments ?
- Show that a problem is NP-hard (for bonus points, from some flavor of SAT via gadgets)
- Show a hardness of approximation result (bonus points for going straight from a PCP)
- Prove a lower bound for a randomized algorithm
- Prove a lower bound via communication complexity or even information theory
- Round an LP (bonus points for not just doing the obvious rounding)
- Show an integrality gap for an LP
- Design a primal-dual algorithm
- Use projective duality to solve a problem (bonus points for using convex duality)
- Apply a Chernoff bound (bonus for using negative dependence, Janson's inequality, and an extra bonus for Talagrand's inequality)
- Design an FPT algorithm (maybe using treewidth, bonus for using bidimensionality or kernelization)
- Design a nontrivial exponential time algorithm (i.e. an algorithm that doesn't just enumerate, but does something clever)
- Do an amortized analysis (for extra bonus, get a $\log^* n$ bound)
- Use an advanced data structure - something beyond van Emde Boas trees (extra bonus for exploiting word-size)
- Invoke VC dimension to solve a problem
Tuesday, August 28, 2012
FOCS 2012 Call for Posters
After the poster events at STOC 2011, STOC 2012 and SoCG 2012, FOCS 2012 makes it four times. Tim Roughgarden notes that the deadline for submitting posters to FOCS 2012 is in two weeks (Sep 11). So if you have interesting work you'd like to air out in the theory community, and a deep longing to visit New Brunswick, New Jersey, then submit your entry.
By this time, there's a good chance that you've already experienced the posters event either as presenter or audience at one of these venues. If not, I'll reiterate what I've said before: presenting a poster is a great way to disseminate your work with more personalized interaction than you often get with a paper presentation.
Wednesday, July 25, 2012
Data Streaming in Dortmund: Day II
Day II could roughly be summarized as:
There were a number of talks on sparse recovery, which includes compressed sensing, and asks the question: how best can we reconstruct a signal from linear probes if we know the signal is sparse ?
Eric Price led things off with a discussion of his recent breakthroughs on sparse FFTs. While the regular DFT takes $n \log n$ time, it's reasonable to ask if we can do better if we know the signal has only $k$ nonzero Fourier coefficients. He talked about a sequence of papers that do this "almost optimally", in that they improve the $n\log n$ running time as long as the sparsity parameter $k = o(n)$.
Anna Gilbert provided an interesting counterpoint to this line of work. She argued that for real analog signals, the process of sensing and recording them, even if the original signal was extremely sparse, can lead to discrete signals that have $\Theta(n)$ nonzero Fourier coefficients, and this is in some way intrinsic to the sensing process. This was part of an attempt to explain why a long line of sublinear methods (including Eric's work) don't do very well on real signals. This line of attack is called "off the grid" reconstruction, because you're off the (discrete) grid of a digital signal.
In sparse recovery, there are two parameters of interest: the number of measurements you make of the signal, and the time for reconstruction. Obviously, you'd like to get both to be as small as possible, and information-theoretic arguments show that you have to spend at least $k \log(n/k)$ measurements. Martin Strauss focused on speeding up the reconstruction time while maintaining measurement-optimality, in a setting known as the "for-all" formulation, where the adversary is allowed to pick a signal after seeing the probe matrix (there's a related "foreach" model that is subtly different, and I'm still a little confused about the two).
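For intuition, the basic pipeline is easy to simulate (plain orthogonal matching pursuit from scikit-learn, nothing like the optimized algorithms in these papers): random Gaussian probes of a $k$-sparse signal, recovered from far fewer than $n$ measurements.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n, k, m = 1000, 10, 200                    # ambient dimension, sparsity, number of measurements
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)

Phi = rng.standard_normal((m, n)) / np.sqrt(m)    # random Gaussian probe matrix
y = Phi @ x                                       # the m linear measurements

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False).fit(Phi, y)
print(np.linalg.norm(omp.coef_ - x))              # essentially 0: the sparse signal is recovered
```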
On a slightly different note (but not that different as it turns out), Atri Rudra talked about a fun problem: Given the "shadows" of a set of 3D points along three orthogonal planes, can you find the minimum number of points that could yield the specific shadows ? If all projections have complexity $n$, it's well known that the right bound is $n^{3/2}$. While this bound was known, it wasn't constructive, and part of Atri's work was providing an actual construction. There are all kinds of interesting connections between this problem, join queries, triangle counting in graphs, and even the scary-hairy 3SUM, but that's a topic for another time.
Other talks in the afternoon: Alex Andoni talked about finding the eigenvalues of a matrix efficiently on streams (specifically finding the "heavy" eigenvalues). I learnt about the Cauchy interlacing theorem from his talk - it's a neat result about how the eigenvalues of submatrices behave. Ely Porat talked about the problem of preventing evil entities
Joshua Brody returned to the basic streaming setup. Most stream algorithms that estimate some quantity introduce two-sided error (the reported value could be above or below the true value). He asked whether this is necessary to stay within sublinear bounds: it turns out that for some problems, limiting yourself to one-sided error can worsen the space complexity needed to solve the problem (note that for problems like the Count-Min sketch, the estimate is one-sided by design, and so there's no deterioration).
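As a reminder of what "one-sided by design" means for Count-Min, here's a toy version (with Python's built-in hash standing in for proper pairwise-independent hash functions): collisions can only inflate a count, never deflate it.

```python
import numpy as np

class CountMin:
    """Tiny Count-Min sketch: estimates are never below the true count."""
    def __init__(self, width=1024, depth=4, seed=0):
        rng = np.random.default_rng(seed)
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.salts = rng.integers(1, 2**31, size=depth)
        self.width = width

    def _cells(self, item):
        return [(r, hash((int(s), item)) % self.width) for r, s in enumerate(self.salts)]

    def add(self, item, count=1):
        for r, c in self._cells(item):
            self.table[r, c] += count

    def estimate(self, item):
        return min(self.table[r, c] for r, c in self._cells(item))

cm = CountMin()
for i in range(10000):
    cm.add(i % 100)            # each of 100 items appears 100 times
print(cm.estimate(7))          # >= 100, never less: the error is one-sided
```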
Coming up next: Andrew's summary of day three, and a report on our heroic tree-swinging adventures.
Tuesday, July 17, 2012
Permutation tests, graph non-isomorphism and kernels
1. I've known about permutation tests for a long time, but it only just occurred to me that the permutation test is exactly what you do in the AM protocol for Graph non-isomorphism. The principle is that if the two distributions are identical, your test statistic should not be able to tell them apart, and the way to test this is by randomly changing labels. Replace distributions by graphs and test by Merlin, and it looks the same. Is this a trivially trite observation, or are there other examples of protocols that use standard statistical tests in disguise ? (hmm I see a cstheory question here)
2. The kernel test he mentions, if you look closely, is merely computing the kernel distance between the two distributions. And the energy test he mentions later is doing something akin to earthmover. The kernel distance strikes again !!!
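Putting the two notes together, here's a bare-bones permutation two-sample test (the statistic below is just a difference of means; swap in a kernel two-sample statistic and you get the kernel test):

```python
import numpy as np

def permutation_test(x, y, stat=lambda a, b: abs(a.mean() - b.mean()), trials=2000, seed=0):
    """If x and y come from the same distribution, random relabelings should do as well as the true labels."""
    rng = np.random.default_rng(seed)
    observed = stat(x, y)
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        hits += stat(pooled[:len(x)], pooled[len(x):]) >= observed
    return (hits + 1) / (trials + 1)       # permutation p-value

rng = np.random.default_rng(1)
print(permutation_test(rng.normal(0, 1, 50), rng.normal(0.0, 1, 50)))   # large p: indistinguishable
print(permutation_test(rng.normal(0, 1, 50), rng.normal(0.8, 1, 50)))   # small p: the labels matter
```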
Wednesday, July 11, 2012
Guest Post: Women in Theory Workshop II
As an aside, many of the topics discussed would be quite useful to have at a general conference as well: it would be nice to have a similar panel discussion at a STOC/FOCS/SODA.
The opening talk of the workshop was by Joan Girgus (Princeton). She presented statistics about the percentage of women at the undergraduate and graduate levels in different fields of science and engineering over the last five decades. She mentioned that nearly fifty years ago, those who wanted to combine a career with raising a family were seen as anomalies. Today, combining family and career is the norm, with its own complexities and difficulties. However, even now women continue to be underrepresented in science and engineering, from the undergraduate level all the way to faculty and research positions. Joan presented several possible reasons for this and also suggested approaches that universities could take to improve the participation of women in academic and research careers.
The other interesting talk on the first day was by Maria Klawe (Harvey Mudd) who argued (and actually convinced us!) that it is the best time ever to be a woman in theory, and discussed opportunities for the future.
On the second day, there was a "work-life balance" panel led by Tal Rabin. All the speakers and organizers of the workshop were gathered to answer the questions by students. This panel was one of the most useful sessions in the workshop, because we could hear the real experiences and challenges that pioneering female researchers faced in their career and life.
The panel began by Tal asking speakers to just give one piece of advice to the audience. Some highlighted points were:
- " Be aware of young male talkers”
- “make conscious choices”
- “Make a right decision and go for it”,
- “Go for what really interests you”
- “The path is difficult, and so you must acquire the ability to talk about yourself and your work”,
- “do the work you like and be excited about that”,
- “Try to be more self promoting”
The floor was then opened for questions. There were different types of questions. Some of the questions were about family life for female researchers and the right time to have children.
Some speakers believed that having children is an option, rather than a default, and there should be no pressure to have children. While it might seem that everybody by default expects you to raise a family and have children, you don’t need to listen to people and do what they want. It was also mentioned that there is no "right time" to have children, and that it was a very personal decision. You should just decide for yourself when it's the right time for you.
Valerie King said that some of the conditions at one's job can affect decisions regarding one's family. She pointed out that in Canada there are child-friendly policies for women in academia. But she also mentioned that sometimes you have to sacrifice something because of your family life, and in those cases you should find some other alternative ways to minimize the impact, like choosing a less taxing problem to work on or...
There were different views on how family life and having children could affect the career life of young female researchers. Some believed that it wasn't a major issue - you could take off for a few years and wait till you reach a stable point for going back to your job - but some argued against this, pointing out that research in fields like CS develops very fast, and coming back after being away from research for a while can be very difficult and destructive for your research career without a careful and conscious plan. There were also discussions among speakers about how choosing a supportive partner can help young female researchers to deal with these difficulties more easily, (and that actually finding a proper partner is time-consuming and challenging by itself!).
Another question was about the challenges and pressures of graduate study. One of the highlighted issues was about challenges in working with colleagues in academia. It was mentioned that you should be careful with the people around you, and make sure to have some people around you that talk about your contribution and acknowledge you.
Panelists talked about different experiences they had with male colleagues, some of whom would make sure to acknowledge your contributions explicitly in their presentations, and some who would use your ideas without any acknowledgement. Clearly if you want to be more successful you should avoid being surrounded by this latter group of people. It was mentioned that one of the techniques in dealing with problems that you might face with male colleagues (if you find yourself unable to solve them by yourself) is to go to your manager or boss and push them to help you in that situation.
Another challenge that was highlighted was finding a research problem to work on during graduate study, and also for one's research career after that. Many of the speakers agreed that this was one of the biggest challenges in their research work. Other challenges discussed were choosing the right advisor and changing research problems or advisors during one's PhD.
It was mentioned that the most common mistake new students make in doing research is that they decide on some topic, do a wide search of the current and previous work, and then come to the conclusion that all the problems have already been solved and that there is nothing new to do. But in fact in most research topics there are always ways to make the area broader and find new directions: this is the creative aspect of research. This is the main distinction between doing research and "re-search".
There were also some discussions about the different aspects of working as a researcher at research labs or at a university as faculty. Lisa Zhang from Lucent mentioned that research labs have good quality of life and encourage a lot of flexibility. However, there are issues relating to job security versus tenure and there is a trade-off between these two kinds of research positions.
There was discussion about collaboration between researchers. Valerie King mentioned that one should not be afraid to contact people and ask to work with them. In fact, people like it that others come and work with them on interesting research problems. She related experiences where she got stuck in solving some problem and found someone more expert in that area to collaborate with. One such collaboration was with two other female researchers resulting in what she called the only “Three Women SODA paper”.
At the end of the panel, Shubhangi Saraf (IAS, Rutgers) talked about her experiences during graduate study. She said that it was very important for one's research career to do multiple internships, travel and visit different schools and research labs, find good people to work with and build a good network and connections. Shubhangi was one of the participants that first attended the Women In Theory workshop as a student four years ago and is now, at the third workshop, one of the organizers. She mentioned that this workshop was one of the ways that she was able to meet new people and make connection to do internships.
At the end of the second day there was a banquet in Palmer House at which we were able to meet other professors from Princeton University and talk with them.
To conclude this post, I think this workshop was successful in its main goal of bringing together theory women from different departments and fostering a sense of kinship and camaraderie among them. It was really a good feeling to talk about challenges you faced or times when you got stuck during your research, and to realize that other students and researchers have had the same experience! You feel a lot more powerful, because now when you're stuck with a problem and don't know what to do, you know there are other people in a similar situation whom you can just shoot an email to and say: “Hey! I'm stuck and need to talk with you!”
Tuesday, July 03, 2012
Guest Post: Women in Theory Workshop I
This past week, I was at the third “Women in Theory (WIT)” workshop, held at Princeton University with about 46 female students and 12 speakers. Most of the students came from schools in the US: University of Maryland College Park, Brown, Oregon State, UMass, MIT, Stony Brook, Berkeley, Princeton, Columbia, Boston U, Washington, Stevens Institute of Technology, UCSD, Northwestern, Harvard, UW-Madison, UCLA, CUNY, GMU, Harvey Mudd and Utah. There were also some international students from India (IIT), ETH Zurich, Germany, Canada (Toronto), City University of Hong Kong, the Moscow Engineering Institute and Israel (Technion, Weizmann, Hebrew).
Participants were from a wide range of research fields like Cryptography and Privacy, Graph theory and algorithms, Computational Geometry, Computational Biology, Complexity, Game theory and mechanism design, and Social Networks. But the interesting thing was that you could find more students working on cryptography than in any other field.
Tal Rabin, Shubhangi Saraf and Lisa Zhang were the organizers of the workshop, and there were many junior and senior women in the field as speakers: Gagan Aggarwal (Google), Glencora Borradaile (Oregon State), Joan Girgus (Princeton), Nicole Immorlica (Northwestern University), Valerie King (University of Victoria), Maria Klawe (Harvey Mudd), Vijaya Ramachandran (UT Austin), Jennifer Rexford (Princeton), Dana Ron (Tel Aviv University), Ronitt Rubinfeld (MIT and Tel Aviv University), Shubhangi Saraf (IAS), Rebecca Wright (Rutgers and DIMACS).
Beside the non-technical parts of the workshop, there were many interesting technical talks by different researchers on different topics.
Among them, one of the most wonderful and interesting talks was presented by Dana Ron (Tel Aviv University, Columbia) on sublinear-time algorithms. When we refer to “efficient algorithms” we usually mean “polynomial-time algorithms”, and at best we can try to lower the degree of the polynomial as much as possible, leading to “linear time algorithms“. However, when we are dealing with massive data sets, even linear time might be infeasible. In these cases we look to design even more efficient algorithms, namely “sublinear-time algorithms”. These algorithms do not even read the whole input, so we can only expect them to output approximately good answers. They are also probabilistic, and so are allowed to have a failure probability (i.e. they are Monte Carlo).
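The simplest possible example of the flavor (mine, not from the talk): estimate the fraction of 1s in a huge 0/1 array by reading only a couple of thousand random entries. The answer is approximate and wrong with some small probability, which is exactly the trade being made.

```python
import random

data = [1] * 300_000 + [0] * 700_000       # a "massive" 0/1 input; the true mean is 0.3
sample = [data[random.randrange(len(data))] for _ in range(2000)]   # read only 2000 entries

print(sum(sample) / len(sample))           # ~0.3, up to sampling error (and a small failure probability)
```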
One such class of algorithms is property testing algorithms, in which one has to decide with high success probability whether an input (like a graph) has a certain property (for example, is bipartite), or is relatively far from having that property (in this case, that a relatively large fraction of its edges would have to be removed for the graph to become bipartite). Several such properties and algorithms were discussed in the talk. Other types of sublinear algorithms discussed in her talk were algorithms for estimating various graph parameters, like the number of connected components, the size of a minimum vertex cover, and so on. I think Dana was able to give a flavor of the analysis techniques for sublinear-time algorithms with great clarity, and it was definitely one of the best talks of this workshop.
Another great talk was given by Ronitt Rubinfeld (Tel Aviv University and MIT), on estimating properties of distributions. In her talk, she surveyed a body of work regarding the complexity of testing and estimating various parameters of distributions over large domains, when given access to only a few samples from the distributions. Such properties include testing if two distributions have small statistical distance, testing if the marginal distributions induced by a joint distribution are independent, testing if a distribution is monotone, and approximating the entropy of the distribution. For these kinds of problems, classical techniques such as the Chi-squared test or the use of Chernoff bounds have sample complexities that are at least linear in the size of the support of the underlying discrete probability distributions. However, algorithms whose sample complexity is sublinear in the size of the support were shown for all of the above problems.
Nicole Immorlica (Northwestern University) also gave an interesting talk about cooperation in social networks, presented as a game among the students. In this talk, she explored the impact of networks on behavior through a study of the emergence of cooperation in the dynamic, anonymous social networks that occur in online communities. In these communities, cooperation (for example, in a business deal) results in mutual benefits, whereas cheating results in a high short-term gain.
Some of the other interesting talks were presented by Gagan Aggarwal (Google) on mechanism design for online advertising, Glencora Borradaile (Oregon State) on designing algorithms for planar graphs (the only talk given on the blackboard without any slides!), Vijaya Ramachandran (UT Austin) on parallel and multicore computing theory, and Valerie King (University of Victoria) on dynamic graph algorithms for maintaining connectivity. In her talk, Valerie discussed this problem, which has been studied for over 30 years; she reviewed the area and described a new direction for achieving polylogarithmic worst-case performance. At the end of her talk she also remembered Mihai Patrascu as a pathbreaking researcher who was taken from us too soon.
Unfortunately there were no talks on topics like Computational Geometry, but we had a few students working on CG related topics and they presented their research problems in the rump session. My presentation at the rump session was on building core sets for uncertain data.
Monday, July 02, 2012
Guest Post: Report from the Snowbird Math Research Communities program
Recently, there was an MRC workshop for Discrete and Combinatorial Geometry held right in my hometown, at Snowbird in Utah. Suresh invited me to share my experience of the event.
Working with trained mathematicians illustrated to me just how many tools are out there that I wasn't familiar with beyond an "oh, I heard that word somewhere" head-nod: ergodic theory, configuration spaces of cohomologies, measure-theoretic ideas, algebraic geometry techniques, variational calculus tools and more. Now, there's always a strong need for self-study and picking up techniques independently in a Ph.D., but I can't help but feel that most theoretical CS students would benefit from required courses and curricula more tailored to supplementing our math backgrounds.
But more than being exposed to the mathematical depth of knowledge out there, I loved the intoxicating energy of a room filled with curious mathematicians. One group would be eagerly folding origami cubes, another playing with colored bricks and crayons on a coloring problem, a trifecta of mathematicians lying passed out by a whiteboard decorated with scribbled figures and gazing dreamily into the distance. I finally understand the cliche of mathematicians having the enthusiasm and boundless energy of kindergarteners, playing with ideas and constructs -- no disinterested 'suits' here!
More so, it was good for me to associate real faces and people with many of the papers and fields I had read of. One grad student speaks of how he had been burned out for several months after his last paper, another stares blankly at his beer at 10:30 pm after a twelve-hour session discussing a tricky problem, others discuss the merits of wine vs. scotch, one raves about an advisor who recommends his students go out hiking on a beautiful day instead of staying in their labs (Suresh, hint, hint!!), another talks about handling the two-body problem in an already restricted job market. And of course, the common theme of how wonderful and at times frustrating math is.
There were many light-hearted moments too, to brighten a few faces. For example, after mathematicians A and B had spent all day on a problem only to realize their approach was all wrong:
Mathematician B: "$%#@! F-bomb. #@%#@ F-bomb." (Censored here for family audiences.)
Mathematician A: "I guess doing math is all about cursing sometimes."
Or another light-hearted conversation, between Organizer A and a student who had invited him to speak at a conference:
Organizer A: "So, am I the entertainment for the high school students then?"
Student: "Yes, we have your talk on geometry the evening after the sword-swallower performs."
Let me give a shout out to the wonderful facilities provided to us by the organizers, especially the amazing food. We were fed three square meals a day, plus tea twice a day and another informal snacks-and-beer session after nine pm. Most of the meals were supplemented by confectioneries including red velvet cake or pastries, the meals were generally 3-4 courses (including mushroom and cheese garlic pizza, salmon fillet, chicken teriyaki, beef broccoli and more), and there were several rounds of free wine and scotch during the week. I may or may not have been slightly tipsy on a few occasions, and most of us put on at least a couple of pounds in a gluttonous week of math and fine cuisine. Several of us also went on a hike up the nearby trails, or enjoyed the swimming pools. I'm from Utah, of course, so I've been spoiled to always have the outdoors easily available.
There was a lovely talk given by the organizers on the job hunt process and pointers on finding the best-fit institution. We've all heard the horror tales of how tight the academic job market is, but it's disconcerting nonetheless to hear firsthand of several hundred applicants for a single faculty position, or of how many of the workshop participants had sent in over 100 applications to various universities for their first job. Despite this, the independence of a research university position is still THE holy grail for those of a more mathematical bent - most of those attending seemed uninterested in the compromises involved in a teaching-intensive or industry position, and I can certainly understand that sentiment.
Finally, a shout out for my favorite screening of the session: Diana Davis showed us her entry for "Dance your Ph.D. thesis", which drew much approval from an audience worn out by the excessive number of dry beamer and powerpoint presentations we've seen.
Thursday, June 28, 2012
Geometry in the Field: Talk Slides
Many people had asked me when the talk slides would be available, and I'm happy to announce that all the talk slides are now up at the workshop site.
I strongly recommend looking over the slides if you have the time: these were excellent talks by great speakers that gave a masterful overview of their specific areas, as well as challenges for the future.
Friday, June 22, 2012
CGWeek II
The Cheeger inequality is a well known inequality in spectral graph theory that connects the "combinatorial expansion" of a graph with "spectral expansion". Among other things, it's useful for clustering, because you can split a graph using Cheeger's inequality to find a "thin cut" and then repeat. There's been work recently on a "higher-order" Cheeger inequality, that allows you to cut a graph into $k$ pieces instead of two by connecting this to observations about the first $k$ eigenvectors of the Laplacian.
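To ground the statement, here's the classical graph version on a toy example, two cliques joined by a single edge (a sketch of my own): $\lambda_2$ of the normalized Laplacian sandwiches the conductance via $\lambda_2/2 \le \phi(G) \le \sqrt{2\lambda_2}$, and a sweep cut along the second eigenvector finds the thin cut.

```python
import numpy as np

# Two 10-cliques joined by a single edge: a graph with one obviously sparse cut.
n = 20
A = np.zeros((n, n))
A[:10, :10] = 1
A[10:, 10:] = 1
np.fill_diagonal(A, 0)
A[0, 10] = A[10, 0] = 1

d = A.sum(axis=1)
L = np.eye(n) - A / np.sqrt(np.outer(d, d))       # normalized Laplacian
vals, vecs = np.linalg.eigh(L)
lam2 = vals[1]

# Sweep cut along the second eigenvector, keeping the cut of best conductance.
order = np.argsort(vecs[:, 1] / np.sqrt(d))
best = 1.0
for i in range(1, n):
    S, T = order[:i], order[i:]
    cut = A[np.ix_(S, T)].sum()
    best = min(best, cut / min(d[S].sum(), d[T].sum()))

print(lam2 / 2, best, np.sqrt(2 * lam2))          # lambda_2/2 <= phi(G) <= sqrt(2*lambda_2)
```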
The above paper generalizes the notion of combinatorial expansion versus spectral expansion in a very different way. Think of a graph as the 1-skeleton of a simplicial complex (i.e. the set of faces of dimension at most 1). Is there a way to define the Laplacian of the complex itself ? And is there then an equivalent of Cheeger's inequality ?
It turns out that this can indeed be done. Recent work by Gromov and others has shown how to define the notion of "edge expansion" for a simplicial complex. Roughly speaking, you compute the edge expansion of a cochain (a function of a chain) by endowing it with a norm and then looking at the norm of the edge operator with respect to the norm (sort of) of the cochain itself. What is interesting is that if you choose the underlying coefficient field as $\mathbb{R}$ and the norm as $\ell_2$, you get the spectral equivalent, and if you choose instead $\mathbb{Z}_2$ and the Hamming distance, you get the combinatorial equivalent.
It's known that for 1-skeletons, and even for things like persistence, the underlying field used to define homology doesn't really matter. However, for this problem, it matters a lot. The authors show that there is no equivalent Cheeger's inequality for simplicial complexes ! They also look at random complexes and analyze their properties (just like we can do for random graphs).
Suppose you have three Gaussians in the plane, and you look at the resulting normalized distribution. You expect to see three bumps (modes), and if the Gaussians merge together, you'd expect to see the modes come together in a supermode.
Can you ever get more than three modes ?
This is the question the above paper asks. It was conjectured that this cannot happen, and in fact in 2003 it was shown that it was possible to get 4 modes from three Gaussians (you can get a little bump in the middle as the three Gaussians pull apart). In this paper, they show that in fact you can get a super-linear number of "bumps" or critical points for $n$ Gaussians, and these modes are not transient - they "persist" in a certain sense.
This is quite surprising in and of itself. But it's also important. A plausible approach to clustering a mixture of Gaussians might look for density modes and assign cluster centers there. What this paper says is that you can't really do that, because of the "ghost" centers that can appear.
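You can see the ghost mode numerically (a toy of my own, not the construction in the paper): put three identical isotropic Gaussians at the corners of an equilateral triangle, with the width chosen inside the fairly narrow window where a shallow fourth bump appears at the centroid.

```python
import numpy as np

# Three identical isotropic Gaussians at the corners of an equilateral triangle (unit circumradius).
angles = 2 * np.pi * np.arange(3) / 3
centers = np.stack([np.cos(angles), np.sin(angles)], axis=1)
sigma = 0.72            # inside the (narrow) window where the extra central mode exists

xs = np.linspace(-1.6, 1.6, 801)
X, Y = np.meshgrid(xs, xs)
P = sum(np.exp(-((X - cx) ** 2 + (Y - cy) ** 2) / (2 * sigma ** 2)) for cx, cy in centers)

# Count strict local maxima of the density on the grid.
C = P[1:-1, 1:-1]
neighbors = [P[1:-1, :-2], P[1:-1, 2:], P[:-2, 1:-1], P[2:, 1:-1],
             P[:-2, :-2], P[:-2, 2:], P[2:, :-2], P[2:, 2:]]
modes = np.all([C > nb for nb in neighbors], axis=0).sum()
print(modes)   # 4: one bump per Gaussian, plus a shallow "ghost" mode at the centroid
```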
Other quick hits:
- Chris Bishop ran a very interesting workshop on Analysis and Geometry. While I couldn't attend all the talks, the first one, by Raanan Schul, gave an overview of the analyst's TSP, in which you're given a continuous set of points in the plane, and want to know if there's a finite-length curve that passes through all the points. The techniques used here relate to multiscale analysis of data and things like local PCA.
- One of the cool innovations in CGWeek was the fast-forward session: each afternoon before the workshops started, speakers were allowed to give a 1-slide overview of what would happen in their events. The slides were all placed in a single presentation ahead of time, and it was clocked so that people couldn't run on. It was great - I didn't have to do careful planning, and got a much better idea of what was coming up. We should do this for all talks at the conference !
Sunday, June 17, 2012
CGWeek Day I
There were a number of great talks today at the conference and the workshops -- too many to do justice to. Here are some random thoughts:
- This year, there's a best student presentation award. My student Amirali Abdullah is one of the competitors, so I don't want to say too much about it, but I'll say that everyone upped their game - the presentations by students were great ! Maybe they should have best presentation awards in general to improve everyone's presentation skills !
- Amir's talk went pretty well (if I say so myself). It was his first conference talk, but I thought it was well above the average (maybe this says something about the average :)). The talk was about approximate Bregman near neighbors.
- Very intriguing paper by Adamaszek and Stacho on the connection between homology structure of flag complexes and matchings that induce maximal independent sets.
- The workshop on geometric learning was fantastic. Lots to think about on multiscale representations, and distances to measures.
- Jeff Erickson has a four part liveplussing (one, II, trois, fear) of the business meeting
- Most importantly: Rio in 2013 !! Kyoto in 2014 (boo) !! OSU no way (I kid, I kid).
- Even more importantly: Herbert proposes changing the name of the conference to "Symposium on Computational Geometry and Topology". Major concern - what about the sausage ? Audience votes confusedly after Herbert promises a Klein bottle of beer for everyone.
A Reflection On Big Data (Guest post)
There seem to be no shortage of Big Data events at the moment, reflecting some of the enthusiasm around this topic. Last week, there were meetings at NIST and at Duke and there are upcoming events at Stanford and SAMSI to name but a few. (Ed: also the new Simons Foundation program on big data)
I attended the Duke event, and was hoping to blog about the meeting, but suffered a big data problem of my own: with 25 speakers, plus panels and breakout sessions, how to pull out any meaningful summary? So instead, I'll just pick on one issue that was the subject of much discussion: what exactly is new about big data? In particular, there has been much interest in what we called 'massive' data over the last years (as captured by MADALGO, the center for massive data algorithmics and the accompanying MASSIVE workshop). Is Big Data just a rebranding of Massive Data? Which is bigger, Big Data, Massive Data, or Very Large Data ?
What came out from our discussions is that we believe that Big Data is qualitatively different from the challenges that have come before. The major change is the increasing heterogeneity of data that is being produced. Previously, we might have considered data that was represented in a simple form, as a very high-dimensional vector, over which we want to capture key properties. While Big Data does include such problems, it also includes many other types of data: database tables, text data, graph data, social network data, GPS trajectories, and more. Big Data problems can consist of a mixture of examples of such data, not all of which are individually "big", but making sense of which represents a big problem. Addressing these challenges requires not only algorithmics and systems, but machine learning, statistics and data mining insights.
This is by no means a new observation: the axes of Volume, Velocity and Variety have been widely discussed before. But it did remind me that we should not get too hung up on how big our big data is, since size is not the sole defining characteristic of handling big data (although it is a necessary characteristic).
Many other familiar topics came up in discussions, such as: how can big data sets be made more accessible to researchers? how can privacy concerns with big data be addressed? what are the right models for big data, for both the data and the computation? what are the key techniques for working with big data? do we even know enough to teach a course on big data?
What struck me about these discussions is that not only were we far from reaching a consensus on any point, but also no one seemed to think they had any strong candidate solutions. Consequently, I left assured that Big Data is much more than a rebranding of existing work. It represents a significant research agenda that will challenge us for years to come.
Friday, June 08, 2012
Geometry in the Field: Abstracts are up !
The abstracts for the talks are now up at the workshop page. We hope to see you there !
Wednesday, June 06, 2012
Mihai Patrascu, R.I.P
I can only hope that he's out there proving lower bounds in the sky, and arguing fiercely with the clouds while he does so.
Update: There is now a memorial page for Mihai at mipmemorial.blogspot.com. Please visit there to post your comments, and have a drink for Mihai !
Friday, June 01, 2012
Minimum language complexity needed to construct exponential time algorithm
I was at dinner with some faculty colleagues (assistant professor beer night - best invention known to mankind!) and we were in the traditional "all our students are terrible" phase of the evening *. The discussion turned to "what programming concepts do students understand after an intro-CS class", which led to the following claim:
"you can't write an exponential time algorithm without knowing recursion"While this seemed plausible, I was skeptical. Thankfully my PL colleague (and super blogger) Matt Might was at the table, and he pointed out that the lambda calculus formally has no recursive constructs and yet is Turing-complete, via the magic of the Y-combinator (which essentially computes fixed points directly).
Of course one might argue that a fixed point operator does give you recursion (indeed the Y-combinator is often referred to as anonymous recursion). So my question really is:
Is there a well defined notion of "programming languages without recursion or recursion-like substances" and if so, what is the most expensive algorithm that can be expressed in such a language ?
p.s. BDDs seem like an obvious candidate...
* (other regularly discussed phases include "I can't believe < latest cryptic thing > senior faculty member said", "How many frequent flyer miles do you have" and "chickens!". Beer night is awesome !)
Extracurricular events at STOC
For a while now, people (including me) have been clamoring for an expansion of the traditional three-days-of-talks format at theory conferences, and now STOC and FOCS (and SoCG soon!) are doing workshops, tutorials and poster sessions on a regular basis. Chandra Chekuri, Sergei Vassilvitskii and I organized the poster session at STOC this year, and Avrim Blum and Muthu (Muthu!) were in charge of workshops.
I'm really glad that we had these events, and even happier that FOCS 2012 plans to continue them. Speaking of which, Avrim is looking for workshop proposals for FOCS 2012, and specifically asked me to encourage geometry folks to propose a workshop. The deadline is June 20, and all you need is a 2-page proposal and 2-3 people who'd organize it.
The only complaint was that the poster session was too short, and that we should have left the posters up for people to browse while taking breaks. I think this is an excellent idea that the next posters committee (at FOCS 2012) should take into account.
Wednesday, May 30, 2012
Workshop on coding theory, complexity and sparsity
Efficient and effective transmission, storage, and retrieval of information on a large-scale are among the core technical problems in the modern digital revolution. The massive volume of data necessitates the quest for mathematical and algorithmic methods for efficiently describing, summarizing, synthesizing, and, increasingly more critical, deciding when and how to discard data before storing or transmitting it. Such methods have been developed in two areas: coding theory, and sparse approximation (SA) (and its variants called compressive sensing (CS) and streaming algorithms).
Coding theory and computational complexity are both well-established fields that enjoy fruitful interactions with one another. On the other hand, while significant progress has been made on the SA/CS problem, including a number of algorithmic innovations that leverage coding theory techniques, much of that progress concerns the feasibility of the problems; a systematic computational complexity treatment of these problems is sorely lacking.
The workshop organizers aim to develop a general computational theory of SA and CS (as well as related areas such as group testing) and its relationship to coding theory. This goal can be achieved only by bringing together researchers from a variety of areas. We will have several tutorial lectures that will be directed to graduate students and postdocs.
These will be hour-long lectures designed to give students an introduction to coding theory, complexity theory/pseudo-randomness, and compressive sensing/streaming algorithms.
We will have a poster session during the workshop and everyone is welcome to bring a poster but graduate students and postdocs are especially encouraged to give a poster presentation.
Confirmed speakers:
- Eric Allender, Rutgers
- Mark Braverman, Princeton
- Mahdi Cheraghchi, Carnegie Mellon University
- Anna Gal, The University of Texas at Austin
- Piotr Indyk, MIT
- Swastik Kopparty, Rutgers
- Dick Lipton, Georgia Tech
- Andrew McGregor, University of Massachusetts, Amherst
- Raghu Meka, IAS
- Eric Price, MIT
- Ronitt Rubinfeld, MIT
- Shubhangi Saraf, IAS
- Chris Umans, Caltech
- David Woodruff, IBM
We have some funding for graduate students and postdocs. For registration and other details, please look at the workshop webpage: https://sites.google.com/site/umccsworkshop2012/
Thursday, May 17, 2012
8F Computational Geometry
Well, this June it will be known for geometry, as SoCG will be held this year very near Durham, NC. Unfortunately, it will be at UNC and not Duke. (I kid, I know Jack will do a great job organizing.)
If those things are not enough, Suresh and I (Jeff) are organizing an 8F-Computational Geometry workshop at SoCG. In case that's not clear, 8F stands for AITF, and AITF stands for Algorithms in the Field.
The idea is to both showcase how computational geometry has had impact on "the field" and highlight some areas that we hope are ripe for future impact from the CG community. There are two days of talks: Tuesday, June 19, on ongoing impact, and Wednesday, June 20, on potential (more) impact. We already have a great line-up of speakers; on the first day we have talks by
On the second day we have talks by
We encourage you all to attend, and to see what the Durham area has to offer. There are actually four full days of activities at SoCG this year. The early registration deadline is next week (May 23), and it's only $100 for students!
We should mention that this 8F-Computational Geometry workshop is part of a larger push (through sponsorship) by NSF. There was a more general 8F Workshop last year, and there are more in the works.
Wednesday, May 16, 2012
"In the long run..."
It's also an apt response to occasional claims that asymptotics will eventually win out, especially with large data. Asymptotics will eventually win out, as long as everything else stays fixed.
But that's the precise problem. Everything else doesn't stay fixed. Well before your $C n \log n$ algorithm beats the $c n^2$ algorithm, we run out of memory, or local cache, or something else, and the computational model changes on us.
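To put a toy number on it (the constants here are made up purely for illustration, not measured from any real implementation):

```python
# A "slow" O(n log n) algorithm with a big constant vs. a "fast" O(n^2)
# algorithm with a tiny one: the asymptotically better algorithm wins
# eventually, but only past a crossover size.
import math

C, c = 100.0, 1.0   # hypothetical constants
n = 2
while C * n * math.log2(n) >= c * n * n:
    n += 1
print("crossover at n =", n)   # around n = 997 for these constants
```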
We come up with external memory models, and are informed that in fact even a factor of $\log N$ is too much. We design streaming models and MapReduce and so on and so forth, and realize that all this communication is burning a hole in our atmosphere.
Lesson to be learned (and re-learned): asymptotic analysis is a powerful tool, but it's only as good as the model of computation you're REALLY working with.
Wednesday, May 09, 2012
Multiplicative Weight Updates as Zombie Binary Search
After pondering the use of MWU for a while now, it seems to me that this should be taught as a standard algorithms tool that naturally follows from divide and conquer and prune-and-search. In what follows, I'll sketch out how this might work.
A quick caveat: the MWU, like many powerful tools, has multiple interpretations, and the survey hints at many of them (for example, MWU = derandomization of Chernoff bound). I don't intend to imply that there's only one way to interpret the technique, but that the approach I'll describe is the most accessible one when learning the method for the first time.
The divide and conquer unit in an algorithms class might cover sorting and FFTs, driven by the standard D&C recurrence relation. Prune and search is introduced as a variation, with median finding and binary search being the prototypical examples.
Prune-and-search works like magic, if you're seeing it for the first time. Do some work, throw away the rest, and your $n \log n$ running time goes down to linear. (As an Indian of a certain age, this always reminds me of "thoda khao, thoda phenko" (eat a bit, throw a bit) from Jaane Bhi Do Yaaro).
But what makes it tick is determining the constant factor that needs to be thrown away. The most direct way to do this is deterministically: on a linear time budget, find a point that splits the input into two roughly balanced parts.
But that's expensive in practice. Indeed, algorithms that use median finding as a subroutine try to avoid using a deterministic procedure because it can be slow. When you get to more advanced methods like parametric search, it gets even worse.
The neat trick to apply here is to randomize ! We know that we don't have to find an exact split point - merely one that will approximately balance the two sides of the recursion. For median finding, we can pick three points at random, and take their median. With constant probability, this median creates a split no worse than 1/3-2/3, and away we go !
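Here's roughly what that looks like as code: a quickselect-style sketch with a median-of-three-random-points pivot (my own toy version, not tuned in any way):

```python
import random

def select(items, k):
    """Return the k-th smallest element (0-indexed) by prune-and-search."""
    while True:
        if len(items) <= 3:
            return sorted(items)[k]
        # pivot = median of three random picks: with constant probability this
        # splits the input into two reasonably balanced pieces
        pivot = sorted(random.sample(items, 3))[1]
        lo = [x for x in items if x < pivot]
        eq = [x for x in items if x == pivot]
        hi = [x for x in items if x > pivot]
        if k < len(lo):
            items = lo                                  # keep only the left part
        elif k < len(lo) + len(eq):
            return pivot
        else:
            items, k = hi, k - len(lo) - len(eq)        # keep only the right part

print(select(list(range(1000)), 500))   # 500
```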
This idea surfaces again and again, especially in the many randomized algorithms in computational geometry. As a design strategy, it's quite effective - design an algorithm that works if you can do balanced splits, and then find the split by choosing randomly. Invoke the Chernoff God and some satellite deities, and you're done.
Which brings us to the MWU.
Randomized splitting says that we're willing to lie a little about the split process, and things still work out. But whether you do randomized splitting or deterministic, the end result is still that you throw away some reasonable fraction of the input, NEVER TO SEE IT AGAIN.
Suppose you can't even do that ?
Think of noisy binary search (or 20 questions with a liar). Now, even your decision on what to prune is error-prone. You might be eliminating things that you need to consider later. So it's not clear that you can make any kind of search progress. But let's be reasonable and limit the adversary (or the noise) in some manner. Let's say that you'll only misclassify a small fraction of the input in each iteration (where small is "strictly less than half"). Let's also assume that I can tell you how important certain points are, so that you have to take that into consideration when defining your "small fraction".
So how does MWU now work ? I tell you a distribution over the input, and you give me back a rule that's reasonably good at (say) classifying points. I increase the importance of things you made mistakes on (so you try not to do it again), and repeat.
Eventually, what happens is that the weight of things you make mistakes on increases rapidly. But the overall weight of the input can't increase too much, because I only increase the weight of things you make mistakes on, which is not a lot. Intuitively, the first quantity catches up with the second, at which point the algorithm is complete. You can show that if the update schedule is chosen correctly, the process terminates in a logarithmic number of steps.
Essentially, MWU functions as a zombie binary search. You want to kill elements, but they keep coming back. Thankfully, each time they come back they're slightly weaker, so you have a better chance of hitting them next time. Eventually, head is severed from neck, and your zombies are dead (and your points are found).
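To make the zombie picture concrete, here's a toy sketch of MWU-as-noisy-binary-search. The constants (the downweighting factor, the number of rounds, the 20% lying rate) are placeholders I picked for illustration; the real analysis pins them down.

```python
import random

def zombie_search(n, noisy_le, beta=0.5, rounds=400):
    """Find an unknown index in range(n) using an oracle noisy_le(m) that
    answers "is the target <= m ?" but may lie (with probability < 1/2).
    No candidate is ever discarded; blamed candidates just lose weight."""
    w = [1.0] * n
    for _ in range(rounds):
        # query the weighted median
        total, acc, m = sum(w), 0.0, 0
        for i in range(n):
            acc += w[i]
            if acc >= total / 2:
                m = i
                break
        if noisy_le(m):
            # oracle claims the target lies in [0, m]: downweight (m, n)
            for i in range(m + 1, n):
                w[i] *= beta
        else:
            # oracle claims the target lies in (m, n): downweight [0, m]
            for i in range(m + 1):
                w[i] *= beta
    return max(range(n), key=lambda i: w[i])

target = 37
liar = lambda m: (target <= m) == (random.random() > 0.2)   # lies 20% of the time
print(zombie_search(100, liar))   # 37, with good probability
```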
My only wish at this point is that whenever you think of MWU you think of zombies. Now that's impact :)
Wednesday, April 18, 2012
Distributed Learning: A new model
So suppose you want to learn in this setting ? Suppose you have data sitting on different nodes (you have a data center, or a heterogeneous sensor network, and so on) and you'd like to learn something on the union of the data sets. You can't afford to ship everything to a single server for processing: the data might be too large to store, and the time to ship might be prohibitive.
So can you learn over the (implicit) union of all the data, with as little discussion among nodes as possible ? This was the topic of my Shonan talk, as well as two papers that I've been working on with my student Avishek Saha, in collaboration with Jeff Phillips and Hal Daume. The first one will be presented at AISTATS this week, and the second was just posted to the arxiv.
We started out with the simplest of learning problems: classification. Suppose you have data sitting on two nodes (A and B), and you wish to learn a hypothesis over the union of A and B. What you'd like is a way for the nodes to communicate as little as possible with each other while still generating a hypothesis close to the optimal solution.
It's not hard to see that you could compute an $\epsilon$-sample on A, and ship it over to B. By the usual properties of an $\epsilon$-sample, you guarantee that any classifier on B's data combined with the sample will also classify A correctly to within some $\epsilon$-error. It's also not too hard to show a lower bound that matches this upper bound. The amount of communication is nearly linear in $1/\epsilon$.
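In code, the one-way baseline is almost trivial (this is only my sketch of the baseline, not the protocol from the paper; train is an assumed black-box learner, and the sample-size constant is a placeholder for the actual $\epsilon$-sample bound):

```python
import math
import random

def one_way_protocol(data_A, data_B, train, eps=0.1, d=2):
    """Node A ships a random sample to node B; B trains on its data + the sample.
    The sample size is a stand-in for the eps-sample bound, roughly
    O((d / eps) * log(1 / eps)) points for halfspaces."""
    m = min(len(data_A), int(10 * (d / eps) * math.log(1.0 / eps)))  # placeholder constant
    sample_from_A = random.sample(data_A, m)
    return train(data_B + sample_from_A)   # hypothesis over the union of A and B
```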
But can you do better ? In fact yes, if you let the nodes talk to each other, rather than only allowing one-way communication. One way of gaining intuition for this is that $A$ can generate classifiers, and send them over to $B$, and $B$ can tell $A$ to turn the classifier left or right. Effectively, $B$ acts as an oracle for binary search. The hard part is showing that this is actually a decimation (in that a constant fraction of points are eliminated from consideration as support points in each step), and once we do that, we can show an exponential improvement over one-way communication. There's a trivial way to extend this to more than 2 players, with a $k^2$ blow up in communication for $k$ players.
This binary search intuition only works for points in 2D, because then the search space of classifiers is on the circle, which lends itself naturally to a binary search. In higher dimensions, we have to use what is essentially a generalization of binary search - the multiplicative weight update method. I'll have more to say about this in a later post, but you can think of the MWU as a "confused zombie" binary search, in that you only sort of know "which way to go" when doing the search, and even then points that you dismissed earlier might rise from the dead.
It takes a little more work to bring the overhead for k-players down to a factor k. This comes by selecting one node as a coordinator, and implementing one of the distributed continuous sampling techniques to pass data to the coordinator.
You can read the paper for more details on the method. One thing to note is that the MWU can be "imported" from other methods that use it, which means that we get distributed algorithms for many optimization problems for free. This is great because a number of ML problems essentially reduce to some kind of optimization.
A second design template is multipass streaming: it's fairly easy to see that any multipass sublinear streaming algorithm can be placed in the k-player distributed setting, and so if you want a distributed algorithm, design a multipass streaming algorithm first.
One weakness of our algorithms was that we didn't work in the "agnostic" case, where the optimal solution itself might not be a perfect classifier (or where the data isn't separable, to view it differently). This can be fixed: in an arxiv upload made simultaneously with ours, Blum, Balcan, Fine and Mansour solve this problem very neatly, in addition to proving a number of PAC-learning results in this model.
It's nice to see different groups exploring this view of distributed learning. It shows that the model itself has legs. There are a number of problems that remain to be explored, and I'm hoping we can crack some of them. In all of this, the key is to get from a bound that's near-linear in $1/\epsilon$ to one that's logarithmic in $1/\epsilon$, by replacing sampling with active sampling (or binary search).
Monday, April 09, 2012
Approximate Bregman near neighbors.
One of the things that has haunted me over the years (yes, haunted!) is the geometry of Bregman divergences. A Bregman divergence is a very interesting creature. It's defined as the difference between a convex function and its linear approximation starting at another point:
$$ d_\phi(p,q) = \phi(p) - \phi(q) - \langle \nabla \phi(q), p - q\rangle $$
Because Bregman divergences are parametrized by the function $\phi$, they yield a family of divergences. The most familiar one is $\ell_2^2$, but the most important one is the Kullback-Leibler divergence. There are a ton of reasons why one should study Bregman divergences, and in a shameless plug I'll refer you to my slides (one, two, three) from a geometry summer school last year. Suffice it to say that it's possible to unify a number of different algorithms in machine learning under a single framework by realizing that they're all particular instances of doing something with a Bregman divergence: two notable examples of this are AdaBoost and information-theoretic clustering.
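As a quick illustration of how the choice of $\phi$ changes the divergence, here's a small sketch (my own, just plugging into the definition above):

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    """d_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    return phi(p) - phi(q) - np.dot(grad_phi(q), p - q)

# phi(x) = ||x||^2 gives the squared Euclidean distance
sq, sq_grad = lambda x: np.dot(x, x), lambda x: 2 * x

# phi(x) = sum_i x_i log x_i (negative entropy) gives the KL divergence
# when p and q both lie on the probability simplex
negent = lambda x: np.sum(x * np.log(x))
negent_grad = lambda x: np.log(x) + 1

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
print(bregman(sq, sq_grad, p, q))          # equals ||p - q||^2
print(bregman(negent, negent_grad, p, q))  # equals KL(p || q)
```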
So what's the geometric perspective on Bregman divergences ? They generalize duality !
There's a standard way to think about traditional point-hyperplane duality via a lifting map: take a point, lift it to the paraboloid one dimension up, and find the tangent plane. But suppose we replace the paraboloid by a generic convex function ? What we get is a general convex duality (technically, a Fenchel-Legendre duality) defined by
$$\phi^*(u) = \sup_v \langle u,v \rangle - \phi(v)$$
The reason this doesn't come up in Euclidean space is that the mapping is self-dual if we set $\phi(x) = (1/2) \|x\|^2$, which explains the symmetry in $\ell_2^2$.
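A one-dimensional numerical sketch of the conjugate (grid-based, so only approximate) makes the self-duality of $\frac{1}{2}x^2$ easy to check:

```python
import numpy as np

def conjugate(phi, vs):
    """phi*(u) = sup_v <u, v> - phi(v), approximated over the grid vs."""
    return lambda u: np.max(u * vs - phi(vs))

vs = np.linspace(-10, 10, 100001)
phi = lambda v: 0.5 * v ** 2
phi_star = conjugate(phi, vs)
for u in [-2.0, 0.5, 3.0]:
    print(u, phi_star(u), 0.5 * u ** 2)   # the last two should nearly agree
```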
It turns out that this observation is enough to import much of standard combinatorial geometry over to general Bregman spaces. We can compute convex hulls, Voronoi diagrams, Delaunay triangulations and so on, as long as we're careful to keep primal and dual straight. For example, the locus of points equidistant from two points under a Bregman divergence is either a primal straight line or a dual straight line (depending on how you measure things). This was all elegantly worked out by Nielsen, Nock and Boissonnat a while ago.
Alas, this beautiful connection doesn't help us with approximation problems.
One of the most interesting questions regarding Bregman divergences is solving the near-neighbor problem. This is interesting because the Kullback-Leibler distance is often used (for example) to compare images, and so there's been a lot of empirical work on this problem.
But here's the problem. Let's consider the low dimensional ANN problem. In Euclidean space (or even in spaces of bounded doubling dimension), here's how you solve the problem. You build some kind of quad-tree-like data structure, using the triangle inequality to reason about which cells you need to explore, and using packing bounds to bound the number of cells explored. You also need a crude bound on the near neighbor to start with, and to do this, you use some variant of a ring-tree.
The key ingredients: triangle inequality (twice over) and packing bounds.
Bregman divergences don't even satisfy a directed triangle inequality in general. And to date, we didn't even know how to define packing bounds properly for these directed distances.
In an upcoming paper at SoCG 2012 with my students Amirali Abdullah and John Moeller, we finally figured out how to get some "approximation" of the ANN machinery to work with Bregman divergences, and get logarithmic query times with small space.
If I say so myself, there are some nice ideas in the paper. Firstly, in a somewhat surprising twist, a Bregman divergence satisfies a "reverse triangle inequality" on the line:
\[ d_\phi(a,b) + d_\phi(b,c) \le d_\phi(a, c), \qquad a \le b \le c \in \mathbb{R} \]
This is neat, because it gives packing bounds ! Intuitively, if the sum of lengths of subintervals along an interval is at most the length of the interval, then you can't have too many subintervals.
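A quick numerical sanity check of the reverse triangle inequality (using $\phi(x) = x \log x$ on the positive line; this is just a spot check, not a proof):

```python
import math
import random

def d(a, b, phi=lambda x: x * math.log(x), dphi=lambda x: math.log(x) + 1):
    """1-d Bregman divergence for phi(x) = x log x."""
    return phi(a) - phi(b) - dphi(b) * (a - b)

for _ in range(10000):
    a, b, c = sorted(random.uniform(0.1, 10.0) for _ in range(3))
    assert d(a, b) + d(b, c) <= d(a, c) + 1e-9
print("reverse triangle inequality held on all sampled triples")
```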
The next trick is an observation that the square root of a Bregman divergence satisfies a kind of relaxed triangle inequality that we call $\mu$-defectiveness:
\[ |d(x, y) - d(x,z)| \le \mu\, d(y, z), \qquad \text{where } d = \sqrt{d_\phi} \]
This allows us to import some of the ring-tree machinery to get a coarse approximation.
And I should mention that the $\mu$ values involved here are quite small. If you're comparing distributions with the KL-divergence, then the value of $\mu$ is less than 2 even quite close to the boundary of the simplex.
Even with all of this in place, the quad-tree argument breaks. This is because of $\mu$-defectiveness: we can't assume that cell sizes are "reasonably large" at some number of levels below the root of the tree, and so our packing bounds look terrible. It takes a lot more work to fix this problem: essentially we exploit second-order structure of the Bregman divergences to bound the cell sizes and get the desired packing bound.
Afterthoughts:
- Most of the complexity of the algorithm is in the analysis: the actual algorithm looks mostly like a Euclidean ANN procedure. While we haven't implemented it, I'm hopeful that when we do, we'll be able to enjoy the empirical behavior of our Euclidean cousins.
- We're looking at high dimensional Bregman near neighbors, as well as some other approximate Bregman geometry questions. While our low-D ANN result comes from the "I will break your limbs one by one till you submit" school of algorithms, the hope is that we can start to exploit more of the dual geometry as we learn more.