Saturday, December 15, 2012

NIPS II: Deep Learning and the evolution of data models

(tl;dr: some rambles and musings on deep learning and data, as I attempt to sort out in my head what this all means)

Over the years, as we've engaged with "big data" more and more, the way we construct mental models of data has changed. And as I've argued before, understanding how we think about data, and what shape we give it, is key to the whole enterprise of finding patterns in data.

The model that one always starts with is Euclidean space. Data = points, features = dimensions, and so on. And as a first approximation of a data model, it isn't terrible.

There are many ways to modify this space. You can replace the $\ell_2$ norm by $\ell_1$. You can normalize the points (again with $\ell_2$ or $\ell_1$, sending you to the sphere or the simplex). You can weight the dimensions, or even do a wholesale scale-rotation.
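To make these modifications concrete, here is a minimal sketch in plain Python. The function names are mine, chosen for illustration only. One caveat: $\ell_1$ normalization lands on the simplex only when the coordinates are non-negative, which the example point assumes.

```python
import math

def l1_norm(x):
    # Sum of absolute values of the coordinates.
    return sum(abs(v) for v in x)

def l2_norm(x):
    # The usual Euclidean length.
    return math.sqrt(sum(v * v for v in x))

def normalize(x, norm):
    # Rescale x so that norm(x) == 1 (assumes x is not the zero vector).
    n = norm(x)
    return [v / n for v in x]

def weighted_l2_distance(x, y, weights):
    # Euclidean distance after weighting each dimension (weights >= 0).
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

point = [3.0, 4.0]
on_sphere = normalize(point, l2_norm)   # unit l2 length: a point on the sphere
on_simplex = normalize(point, l1_norm)  # coordinates sum to 1: a point on the simplex
```

The same point ends up in two different places depending on the norm, which is the sense in which the choice of norm changes the shape you give the data.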

But that's not all. Kernels take this to another level. You can encode weak nonlinearity in the data by assuming that it's flat once you lift it. In a sense, you are still working in an $\ell_2$ space, but lifting gives you a larger class of spaces to work with. The entire SVM enterprise was founded on this principle.
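Here is a small sketch of what "flat once you lift it" can mean, using a degree-2 polynomial lift as an illustrative choice (the toy points and names are assumptions of mine). Points inside and outside the unit circle cannot be separated by a line in the plane, but after lifting they are separated by a plane. The kernel computes the lifted inner product without ever building the lifted vectors.

```python
import math

def lift(p):
    # Degree-2 feature map: (x, y) -> (x^2, sqrt(2) * x * y, y^2).
    x, y = p
    return (x * x, math.sqrt(2.0) * x * y, y * y)

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def poly2_kernel(p, q):
    # Homogeneous polynomial kernel of degree 2: equals dot(lift(p), lift(q)).
    return dot(p, q) ** 2

# Toy data: one class inside the unit circle, one class outside it.
inner = [(0.5, 0.0), (0.0, -0.5), (0.3, 0.3)]
outer = [(2.0, 0.0), (0.0, -2.0), (1.5, 1.5)]

# In the lifted space the circle x^2 + y^2 = 1 is the plane w . z = 1.
w, threshold = (1.0, 0.0, 1.0), 1.0

def is_inside(p):
    # A linear test in the lifted space, nonlinear in the original plane.
    return dot(w, lift(p)) < threshold
```

An SVM never needs `lift` explicitly. It only needs `poly2_kernel`, which is why the lifted space can be much larger than this three-dimensional one.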

But that's not all either. The curse of dimensionality means that it's difficult to find patterns in such high dimensional data. Arguably, "real data" is in fact NOT high dimensional, or is not generated by a process with many parameters, and so sparsity-focused methods like compressed sensing start playing a role.

But it gets even more interesting. Maybe the data is low-dimensional, but doesn't actually lie in a subspace. This gets you into manifold learning and variants: the data lies on a low-dimensional curved sheet of some kind, and you need to learn on that space.

While the challenge for geometry (and algorithms) is to keep up with the new data models, the challenge for data analysts is to design data models that are realistic and workable.

So what does this have to do with deep learning?

Deep learning networks "work" in that they appear to be able to identify interesting semantic structures in data that can be quite noisy. But to me it's not entirely clear why that is. So I've been mulling the question of what kind of structure in data might be "visible" to such a network. In the absence of formal results ("if your data can be separated in some RKHS, then an SVM will find it"), what follows is some speculation, based on talks I attended and conversations I had with NIPS attendees.

Observation 1: Both Stéphane Mallat's talk and a nice paper by Coates, Karpathy and Ng talk about the idea of "first-level" features that identify (low-dimensional) invariants and eliminate noise. In the Coates et al. paper, they start with a very fine $k$-means clustering ($k$ very large), and attempt to "glue" the centers together into low dimensional subspace pieces. These are then propagated up a network of higher-order feature processors.
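For readers who want to see the first stage in code, here is a rough sketch of $k$-means used as a feature extractor. It shows only the clustering step, not the gluing of centers into subspaces, and it is not the procedure from the Coates et al. paper: the toy data, the fixed initial centers, and the one-hot encoding are all assumptions of mine.

```python
def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def nearest(point, centers):
    # Index of the center closest to the point.
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, centers, iterations=20):
    # Lloyd's algorithm: alternate between assigning points and recomputing means.
    centers = [list(c) for c in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def encode(point, centers):
    # "First-level" feature: a one-hot indicator of the nearest center.
    code = [0] * len(centers)
    code[nearest(point, centers)] = 1
    return code

# Two noisy blobs; in practice k would be very large, here k = 2.
points = [(0.0, 0.1), (0.1, 0.0), (-0.1, 0.0), (5.0, 5.1), (5.1, 5.0), (4.9, 5.0)]
centers = kmeans(points, centers=[points[0], points[3]])
```

The encoding throws away where a point sits within its cluster, which is one crude way of reading "eliminate noise": nearby points get the same feature.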

Observation 2: Yoshua Bengio's survey of deep learning (a very readable account) points to work by Geoff Hinton as part of the reinvigoration of the field. A central idea of this work is that deep belief networks can be trained "layer by layer", where each layer uses features identified from the previous layer.

If you stare at these things long enough, you begin to see a picture not of sparse data, or low-rank data, or even manifold data. What you see is a certain hierarchical collection of subspaces, where low-dimensional spaces interact in a low dimensional way to form higher level spaces, and so on. So you might have a low-level "lip" feature described by a collection of 2-3 dimensional noisy subspaces in an image space. These "lip" features in turn combine with "eye" features and so on.

This might seem rather far fetched, and a strange way to think about data. But I can't claim originality even here. Back in June, Mauro Maggioni gave a talk at CGWeek in Chris Bishop's workshop on the connection between analysis and geometry. In this talk, he described a multi-resolution view on data that admits structure at different scales, and admits different structures at these scales.

The upshot of all of this: it is possible that deep learning is trying to capture hierarchies of low dimensional subspaces that interact in a low dimensional way. This would explain how one is able to avoid the curse of dimensionality, and might also explain why it sometimes can find structure that other methods (kernels, manifold methods, etc) might miss.

Problem is: I'm not sure how one tests this hypothesis. Almost any data set could be viewed this way if you allow enough features and enough "depth" in the hierarchy.

1. Yoshua Bengio had an interesting paper at the NIPS deep learning workshop, "Knowledge Matters: Importance of Prior Information for Optimization".

The task is to tell whether all the pieces shown in the image are the same. Starting from a pixel representation, all the methods tried, including deep learning, fail to perform better than random. However, if you pre-train the feature extraction layer to discriminate individual shapes first, a network with the same architecture can perform perfectly. Now they are trying to reproduce the same result using fully unsupervised pre-training.

2. Interesting analysis, and I think you're right about the connection between hierarchical structure and deep learning. The key to preventing overfitting, as usual, is complexity control, where we let the depth of the hierarchy be another model parameter. Or, one could do a kind of compression of the hierarchy through averaging or something.

But I want to raise a concern (which I'm not the first to voice). Computer science is supposed to be a serious, fundamental discipline; but today it is overrun with hype-driven terminology with little descriptive value ('deep learning', 'big data'). In some cases ('cloud computing'), the terminology is ideologically colored as well, inviting an attitude of comfort toward the idea of sharing sensitive data.

It is not too surprising that these terms would develop, since our research takes place in close interaction with the computing industry. And when everyone else is using such hype-terms in their grant proposals, I imagine it's hard to avoid using them oneself. But I think that collectively we are hurting our own credibility.

I believe we should move toward adopting norms from a field like medicine. How many of us would go to a doctor with a sign out front offering 'Deep Healing'? Few, I hope. Legit physicians and pharmaceuticals have become circumspect in their claims, to our benefit. We may not be able to get computing companies to tone it down, but computer scientists should at least maintain a separate integrity.

In some cases it is possible to have terminology that is both exciting and properly descriptive, e.g. 'nanoscale computing'. In other cases we'll simply have to start with neutral-sounding terms, and then do enough good science around them that the work generates buzz on its own merit.

4. I'd be the first to agree that the Big data hype has become out of control. Having said that, the term 'deep' in deep learning doesn't refer to the "seriousness" of the method. It merely refers to the depth of the neural networks involved in the learning task: the point being that the perceived advantages come from using more than 1 or 2 layers of feature collapse.

And to be fair, these are usually referred to in the research literature as deep belief networks, or restricted Boltzmann machines, and so on. It's only in the larger computing community that the terms get hyped out of control, much like the benefits of drugs get hyped by drug companies (since you use the analogy to medicine).

5. Thanks, Suresh. I understand that "deep" isn't meant to outright mean "profound", but it not-so-subtly suggests it, just as "cloud computing" not-so-subtly suggests "magical and friendly and trustworthy". I find this troubling and think it is likely to be detrimental to our thinking and credibility.
It is also question-begging: how deep is "deep"? (In other parts of CS, a structure of depth 3 or 10 may be considered shallow.) For big data, how big is "big"? What gives us the right to pretend to answer these questions? And isn't TCS united in its effort to find solutions that scale well in the limit of extremely large input size?

6. You look at DL from the perspective of big data, which makes me wonder: is DL strictly a big data tool? Or is it useful for relatively small data sets (i.e. sub-terabyte, possibly even sub-gigabyte)?

Also, from what I have seen so far, the elemental building blocks of DL architectures are NNs and, as you mention, usually RBMs. Is there something about the method that requires this? Or is it like ensemble learning methods in that other types of learners can be used?

7. Both Anon and Dustin raise an interesting point: how big (or how small) the "big" data must be for these methods to be effective, and how many layers are needed.

I have absolutely no idea, having never implemented these techniques myself. I've heard anecdotal evidence that even 2 layers is quite decent, and the more the better.

Dustin, on the issue of what the elements look like. My understanding is that common elements are things like soft-max and thresholds. I assume that the key is nonlinearity. In the Coates paper I mentioned, the first-line elements are k-means clustering modules. I'd imagine that other learners could be used, and it depends on the problem. Again, I'll emphasize that I really have no clue because I have no experience: the DL folks know a lot more about these things.

8. I suspect that many deep learning researchers would be mildly insulted to have the term "deep learning" compared to "big data." At least among deep learning researchers, there is a lot of agreement about what is and isn't a deep architecture. The reason big data is such a successful industry buzzword is that no one can ever tell you something isn't big data. That being said, deep learning has partly started to become a brand, but there is an important distinction between a brand and a buzzword.

Deep learning has nothing to do with big data other than deep learning being an attitude towards machine learning and machine learning generally having to deal with so-called "big data."

Generally I prefer to talk about deeper and shallower models instead of deep or shallow models since it isn't always clear how we want to define different architectural depths in an absolute sense.

9. In principle there could be deep models based on decision trees or SVMs or something else. Increasing depth is about composing elementary modules and learning each piece, ideally optimizing all the pieces jointly although this isn't strictly necessary.

Boosting adds one "layer" of depth, for instance. So I would say that a boosted ensemble of some classifier type has a depth of k+1, where k is whatever we define the depth of the base models being boosted to be.