Thursday, March 20, 2014

Data Mining, machine learning and statistics.

How does one tell data mining, machine learning and statistics apart ?

If you spend enough time wandering the increasingly crowded landscape of Big Data-istan, you'll come across the warring tribes of Datamine, MachLearn and Stat, whose constant bickering will make you think fondly of the People's front of Judea:

Cosma Shalizi has what I think is a useful delineation of the three tribes that isn't prejudicial to any of them ("Stats is just inefficient learning !", "MachLearn is just the reinvention of statistics!" "DataMine is a series of hacks!"). It goes something like this:

  • Data mining is the art of finding patterns in data.
  • Statistics is the mathematical science associated with drawing reliable inferences from noisy data
  • Machine learning is [the branch of computer science] that develops technology for automated inference (his original characterization was as a branch of engineering).
I like this characterization because it emphasizes the different focus: data mining is driven by applications, machine learning by algorithms, and statistics by mathematical foundations. 

This is not to say that the foci don't overlap: there's a lot of algorithm design in data mining and plenty of mathematics in ML. And of course applied stats work is an art as much as a science. 

But the primary driving force is captured well. 

Wednesday, March 19, 2014

Reproducibility in data mining/learning

Is reproducibility in data mining/learning different from reproducibility in systems research ?

+John Regehr and +Shriram Krishnamurthi have discussions (John, Shriram) arising from a new paper on  reproducibility in computer systems research from people at the U. of Arizona. You should read their posts/G+ discussions for the specifics.

But it got me thinking: is there anything intrinsically different about reproducibility in my neck of the woods ? Many concerns are shared:

  • The need for a standard computing environment to build and test code: MATLAB and scikit-learn make that a lot easier for us: we don't need to worry too much about details of the computing environment/hardware just to get the code to run. 
  • Discrepancies between what is stated in the paper and what the code actually does: that's a problem with any experimental work: John in fact mentions one such issue in one of his recent papers.
  • Code rot: otherwise known as 'graduation' :). I think if more people used public repos like github from the beginning, some of these problem might go away. I would be remiss if I didn't also plug +Robert Ricci's new system AptLab for helping distribute research.
But here's one that may not be shared: data preparation.

It's no secret that data preparation takes most of the time in a data mining pipeline. This includes
  • data cleaning (removing errors, reformatting data)
  • feature engineering (deciding how to truncate, modify or retain certain features)
  • training (cross-validation levels, what data is used for model building and so on)
As with any of the above, a proper specification can ensure true reproducibility, but there are lots of things we might do to data without necessarily even realizing it (if there are bugs in the transformation pipeline) or without thinking we need to mention it (truncating ranges, defining null values away). 

Feature engineering can also be a problem, especially with models that use many features (for example, deep learning systems have lots of "knobs" to tune, and they can be quite sensitive to the knobs).

So one thing I often look for when reviewing such papers is sensitivity: how well can the authors demonstrate robustness with respect to the parameter/algorithm choices. If they can, then I feel much more confident that the result is real and is not just an artifact of a random collection of knob settings combined with twirling around and around holding one's nose and scratching one's ear. 

Tuesday, March 11, 2014

A short not-review of the first episode of Cosmos

If you've been living on a rogue planet wandering through the far reaches of the Local group, you may not have heard of the Cosmos reboot on FOX with Neil deGrasse Tyson.

We finally got around to seeing the first episode yesterday. I was nervous because I loved the original so much and hoped that my kids would like it as well (as all parents know, you can't show fear or the sharks will smell it :)).

Things I liked:

  • The homage to Carl Sagan was done beautifully. If you're going to reboot a franchise, you don't have to pretend that nothing existed before, you know. 
  • The mix of CG + real person and animation was a good way to tell the different stories. I was a little worried about the denouement of the Giordano Bruno tale because my six-year old was watching, but it was done without being too graphic. 
  • The graphics were non-cheesy: but then I wasn't ever worried about the quality of graphics in the original. I'm pretty sure that 20 years from now these graphics will look cheesy as well. The year-long calendar of the universe was a fantastic demonstration of the immensity of time. 
Things I could live without:

Granted, it's the first episode, so there's no point in being too harsh. But 
  • If I had any heartstrings ever, they've been pulled out of shape into long strands of spaghetti. The constant sweeping orchestral background made me feel like I was watching NBC's coverage of the Winter Olympics. I almost expected to see a personal profile of how the universe overcame the adversity of the Big Bang to become the star performer it is today. 
  • As with all modern American shows purporting to be about science, I found the balance between sweeping rhetoric and actual facts to be disturbingly skewed towards the soaring fluffy. Watch some David Attenborough, people ! or even, I don't know, COSMOS itself. But see below...
I liked it. No question that I'm watching it again. And the eight-year old loved it ! Thought it was full of information. So maybe my assessment of graphics-to-information ratio is completely off. 

The funniest thing: when it was over, this happened:
Kids: "Let's see the next episode now!"
Me: We'll have to wait a week for it.
Kids: "What do you mean, we have to wait a week ?"
Netflix should be proud. Binge watching on demand is now the default mode. 

Tuesday, March 04, 2014

Two big data talks at Haverford College

I'm giving two talks at Haverford College just before SDM 2014 in Philadelphia. If you're in the area. stop by and say hello !

Disqus for The Geomblog