Comments on The Geomblog: "NIPS II: Deep Learning and the evolution of data models"

In principle there could be deep models based on decision trees or SVMs or something else. Increasing depth is about composing elementary modules and learning each piece, ideally optimizing all the pieces jointly, although this isn't strictly necessary.

Boosting adds one "layer" of depth, for instance. So I would say that a boosted ensemble of some classifier type has a depth of k+1, where k is whatever we define the depth of the base models being boosted to be.

Posted by George (http://www.cs.toronto.edu/~gdahl), 2012-12-15 21:47

I suspect that many deep learning researchers would be mildly insulted to have the term "deep learning" compared to "big data." At least among deep learning researchers, there is a lot of agreement about what is and isn't a deep architecture. The reason "big data" is such a successful industry buzzword is that no one can ever tell you something isn't big data.
That being said, deep learning has partly started to become a brand, but there is an important distinction between a brand and a buzzword.

Deep learning has nothing to do with big data, other than deep learning being an attitude towards machine learning, and machine learning generally having to deal with so-called "big data."

Generally I prefer to talk about deeper and shallower models rather than deep or shallow models, since it isn't always clear how we want to define different architectural depths in an absolute sense.

Posted by George (http://www.cs.toronto.edu/~gdahl), 2012-12-15 21:35

Both Anon and Dustin raise an interesting point: how big (or how small) does the "big" data need to be for these methods to be effective, and how many layers are needed?

I have absolutely no idea, having never implemented these techniques myself. I've heard anecdotal evidence that even 2 layers is quite decent, and the more the better.

Dustin, on the issue of what the elements look like: my understanding is that common elements are things like soft-max units and thresholds. I assume that the key is nonlinearity. In the Coates paper I mentioned, the first-line elements are k-means clustering modules. I'd imagine that other learners could be used, and it depends on the problem. Again, I'll emphasize that I really have no clue because I have no experience: the DL folks know a lot more about these things.

Posted by Suresh Venkatasubramanian, 2012-12-15 13:42

You look at DL from the perspective of big data, which makes me wonder: is DL strictly a big data tool?
Or is it useful for relatively small data sets (i.e., sub-terabyte, possibly even sub-gigabyte)?

Also, from what I have seen so far, the elemental building blocks of DL architectures are NNs and, as you mention, usually RBMs. Is there something about the method that requires this? Or is it like ensemble learning methods, in that other types of learners can be used?

Posted by daemonmaker, 2012-12-15 11:18

Thanks, Suresh. I understand that "deep" isn't meant to outright mean "profound," but it not-so-subtly suggests it, just as "cloud computing" not-so-subtly suggests "magical and friendly and trustworthy." I find this troubling, and think it is likely to be detrimental to our thinking and credibility.

It is also question-begging: how deep is "deep"? (In other parts of CS, a structure of depth 3 or 10 may be considered shallow.) For big data, how big is "big"? What gives us the right to pretend to answer these questions? And isn't TCS united in its effort to find solutions that scale well in the limit of extremely large input size?

Posted by Anon again, 2012-12-15 11:15

I'd be the first to agree that the big data hype has gotten out of control. Having said that, the term "deep" in deep learning doesn't refer to the "seriousness" of the method. It merely refers to the depth of the neural networks involved in the learning task: the point being that the perceived advantages come from using more than 1 or 2 layers of feature collapse.

And to be fair, these are usually referred to in the research literature as deep belief networks, restricted Boltzmann machines, and so on.
It's only in the larger computing community that the terms get hyped out of control, much like the benefits of drugs get hyped by drug companies (since you use the analogy to medicine).

Posted by Suresh Venkatasubramanian, 2012-12-15 09:59

Very helpful post!
But I want to raise a concern (which I'm not the first to voice). Computer science is supposed to be a serious, fundamental discipline; but today it is overrun with hype-driven terminology that has little descriptive value ('deep learning', 'big data'). In some cases ('cloud computing'), the terminology is ideologically colored as well, inviting an attitude of comfort toward the idea of sharing sensitive data.

It is not too surprising that these terms would develop, since our research takes place in close interaction with the computing industry. And when everyone else is using such hype-terms in their grant proposals, I imagine it's hard to avoid using them oneself. But I think that collectively we are hurting our own credibility.

I believe we should move toward adopting norms from a field like medicine. How many of us would go to a doctor with a sign out front offering 'Deep Healing'? Few, I hope. Legit physicians and pharmaceutical companies have become circumspect in their claims, to our benefit. We may not be able to get computing companies to tone it down, but computer scientists should at least maintain a separate integrity.

In some cases it is possible to have terminology that is both exciting and properly descriptive, e.g. 'nanoscale computing'. In other cases we'll simply have to start with neutral-sounding terms, and then do enough good science around them that the work generates buzz on its own merit.

Posted by Anonymous, 2012-12-15 09:53

Interesting analysis, and I think you're right about the connection between hierarchical structure and deep learning. The key to preventing overfitting, as usual, is complexity control, where we let the depth of the hierarchy be another model parameter.
Or, one could do a kind of compression of the hierarchy through averaging or something.

Posted by Aaron Clauset (http://www.structureandstrangeness.com), 2012-12-15 02:25

Yoshua Bengio had an interesting paper at the NIPS deep learning workshop, "Knowledge Matters: Importance of Prior Information for Optimization."

The task is to tell whether all pieces shown in the image are the same. Starting from the pixel representation, all methods tried, including deep learning, fail to perform better than random. However, if you first pre-train the feature-extraction layer to discriminate individual shapes, the same network architecture can perform perfectly. Now they are trying to reproduce the same result using fully unsupervised pre-training.

Posted by Yaroslav Bulatov, 2012-12-15 02:14
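As a postscript to the thread: the "compose elementary modules, learn each piece" idea that George describes, and the k-means first-layer modules Suresh mentions from the Coates paper, can be made concrete with a toy sketch. This is a minimal illustration under assumed conditions, not anyone's actual setup: an unsupervised k-means "first layer" learned without labels, followed by a trivial supervised readout. All data, names, and parameters below are hypothetical.

```python
# Toy sketch of an unsupervised first layer (k-means) plus a supervised
# readout. Pure stdlib; the data and parameters are illustrative only.
import random

random.seed(0)

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; returns k centroids."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[j].append(p)
        for j, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if its cluster went empty
                centroids[j] = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
    return centroids

# Two well-separated 2-D blobs: class 0 near (0, 0), class 1 near (5, 5).
data = [((random.gauss(0, 0.5), random.gauss(0, 0.5)), 0) for _ in range(100)] + \
       [((random.gauss(5, 0.5), random.gauss(5, 0.5)), 1) for _ in range(100)]

# "Layer 1" (unsupervised): learn centroids without ever seeing labels.
centroids = kmeans([p for p, _ in data], k=4)

def nearest(p):
    return min(range(len(centroids)), key=lambda c: dist2(p, centroids[c]))

# "Layer 2" (supervised readout): majority label among the points
# assigned to each centroid.
votes = [{} for _ in centroids]
for p, y in data:
    v = votes[nearest(p)]
    v[y] = v.get(y, 0) + 1
readout = [max(v, key=v.get) if v else 0 for v in votes]

def predict(p):
    return readout[nearest(p)]

accuracy = sum(predict(p) == y for p, y in data) / len(data)
```

Each piece is learned separately, which is exactly the composition the thread discusses; a "deeper" version would stack more such modules (or, as George notes, optimize the pieces jointly).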