Wednesday, August 13, 2014

Interdisciplinary research and the intellectual richness of data analysis

Slides on a brief introduction to themes in machine learning from an algorithms perspective, and some thoughts on the mathematical richness of the study of data.

The MADALGO Summer School on Learning at Scale is on. As I mentioned earlier, we have a great collection of speakers talking about many different aspects of algorithms and learning.

To kick off the school, I gave a very short introduction to some of the main themes in learning. My talk was roughly an hour long, so I obviously couldn't be comprehensive in my coverage. But I tried to cover some of the main "eigenvectors" of the space. My (lack of) coverage of the very large Bayesian side of things was a function of my own ignorance.

Which brings me to another point.

What topics do you need a more than passing familiarity with to do data analysis? (When I say "do", I don't mean the ability to run and understand standard methods. I mean a level of facility where you understand the underlying principles behind different methods and are capable of designing new methods yourself: so roughly at the level of a graduating Ph.D. student.)

At the very least, the list of topics includes:
  • Linear algebra (and a sound understanding of matrices, transformations, projections, eigenspace analysis, matrix decompositions, and the like)
  • Probability (including slightly more advanced notions like conditional probability and conditional expectation, mutual information, entropy, and sampling)
  • Basic statistics (estimators, distributions, hypothesis testing, Bayesian analysis)
  • Functional analysis (vector spaces, norms, Hilbert spaces and so on)
  • Optimization (from linear to convex and beyond, plus some understanding of gradient descent and the methods that go beyond it)
  • Basic algorithm design (design primitives, randomization, approximation)
  • Scalable algorithm design (streaming, distributed, sketching)
  • Limits of computing (a passing familiarity with basic complexity theory, just to know what you can't expect to do)
  • High-dimensional geometry (and topology)
  • Core ML (basic tools in learning, some smattering of learning theory)
That's a lot of knowledge!! And much of this is still evolving, and is not always in textbook form or in any way easy to assimilate. What's worse is that different areas might discuss the same topics with completely different vocabulary.

Here are some implications of this:
  • Data analysis (or data science, or big data, or whatever you want to call it) is truly interdisciplinary. Not in the fake "neither here nor there" sense, but as a genuinely rich nexus of different traditional disciplines where everyone brings something to the table, and everyone needs to appreciate different perspectives in order to make a useful contribution.
  • It's awfully fun! It's breathtaking to look back on my (not that many) years spent pondering problems of data analysis and realize how many different things I've been able to learn about in the process.
  • You can't learn everything ahead of time. I certainly didn't. The trick here (as with most of the things we learn for doing research) is to learn just enough about a topic to use it. And then, over time, you develop a series of shorthands and intuitions to recognize a concept when you see it ("kernels = infinite D l_2", "submodular is convex++", "communication is information theory in disguise", and so on). Of course you often have to update your incorrect intuitions and make them richer.
  • It's a lot more uncomfortable than sitting in a single area. So it's really a matter of taste as to whether you enjoy working in this area or not. 
