Monday, September 03, 2007

"Data Mining" = voodoo science ?

On the Statistical Modeling blog, Aleks Jakulin has a rant on the virtues of data mining:

I view data analysis as summarization: use the machine to work with large quantities of data that would otherwise be hard to deal with by hand. I am also curious about what would the data suggest, and open to suggestions. Automated model selection can be used to list a few hypotheses that stick out of the crowd: I was not using model selection to select anything, but merely to be able to quantify how much a hypothesis sticks out from the morass of the null.

The response from several social scientists has been rather unappreciative along the following lines: "Where is your hypothesis? What you're doing isn't science! You're doing DATA MINING !"

I got almost the same reaction a while back when I was visiting JPL: the climatologists there were horrified at the idea of trolling for patterns in climate data, and to a person asked me the dreaded "But what is the science question?" question. Of course, given the general hot-potato-ness of climatology right now, one might sympathize with their skittishness.

Data mining is a tricky area to work in, and I've discussed this problem earlier. It's a veritable treasure-chest of rich algorithmic problems, especially in high dimensional geometry, and especially over large data sets. However, it's often difficult to get a sense of forward progress, especially since the underlying analysis questions often seem like elaborate fishing expeditions.

In that context, the distinction Aleks makes between confirmatory data analysis (check if the data validates or invalidates a hypothesis) and exploratory data analysis (play with the data to create a non-uniform distribution on plausible hypotheses) is quite helpful. It also emphasizes the interactive and very visual nature of data mining: interactive tools and visualizations are as important as the underlying analysis tools.

Update: Chris Wiggins points me to some of the earlier references to 'data mining'. One of the most vituperative is a paper by Michael Lovell in 1983 in The Review of Economics and Statistics. This paper drips with scorn for 'data miners', but makes a point that is at the very least worthy of consideration: because of the large dimensionality of the space of hypotheses that a data mining application typically explores (here couched in terms of explanatory variables for a regression), patterns with apparently low p-values (that is, apparently high significance) might not actually be that significant. Stated another way, in high-dimensional spaces there are many seemingly rare patterns that aren't that rare.
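To make the point about large hypothesis spaces concrete, here is a minimal sketch (mine, not Lovell's procedure). It assumes independent tests, pure Gaussian noise for both the response and every candidate explanatory variable, and a tabulated critical correlation of roughly 0.361 for 30 observations at the two-sided 5% level. The names `familywise_error` and `count_spurious` and all parameter values are hypothetical choices for illustration.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def familywise_error(alpha, k):
    """Chance of at least one false positive among k independent tests,
    each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** k


def count_spurious(n_obs=30, n_vars=200, r_crit=0.361, seed=0):
    """Count noise variables that look 'significantly' correlated with a
    noise response. r_crit is an assumed table value for n_obs=30."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_vars):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson(x, y)) > r_crit:
            hits += 1
    return hits


if __name__ == "__main__":
    print("P(at least one false positive):", familywise_error(0.05, 200))
    print("noise variables passing the 5% test:", count_spurious())
```

Under these assumptions, about 5% of the 200 noise variables should clear the bar in expectation, even though none of them has any real relationship to the response. That is the sense in which a seemingly rare pattern stops being rare once enough hypotheses are searched.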

6 comments:

  1. I read your blog and thought I could tell you about something else which would be useful for seeing maps and directions in India.

    I am writing to tell you about MapmyIndia.com, a free interactive maps and directions portal for all India.

    We have recently launched our next generation v2008, whose main focus is to provide meaningful and useful directions from anywhere to anywhere in India. Next time you need to go somewhere unfamiliar, we want you to know you have a reliable and easy option to get directions – MapmyIndia.com!

    As a company and individual enthusiasts, we dream only of solving the problem of reliable directions and navigation for India. For your blog specifically, you can map enable it by using our youtube-style embeddable maps, and links to specific searches (of maps, directions, local and eLocation) on MapmyIndia.

    Do give us feedback, suggestions, or get involved yourself by mailing me back at tarun@mapmyindia.com or marketing@mapmyindia.com.

    And if you find the different services useful, we would be grateful to you for writing and telling your readers about us.

    Warm Regards,
    Tarun Gupta
    The MapmyIndia Team
    For directions in India, just search print and go with MapmyIndia.com

  2. Perhaps someone should come up with a variant of data mining and call it "automatic scientific hypothesis generation."

  3. [Translated from Polish] I invite you to take a look at my site http://serwisybukmacherskie.com, which covers bookmaking. There you will find descriptions of bookmakers, advice on how to pick bets, free picks, information on current promotions, and many other interesting things.

  4. FYI, the last post is in Polish (and not particularly interesting, unless you are into gambling). And no, it is not an advertisement of my startup.

    I am impressed that your blogging contributions have been recognized even in that nascent Eastern European country. Therefore, hereby I grant you the title of an "Honorary Pole".

    Cheers;)

    Piotr

  5. Hey, as long as I get to be the Polish plumber ;)

  6. Data mining is a tricky area to work in, and I've discussed this problem earlier. It's a veritable treasure-chest of rich algorithmic problems, especially in high dimensional geometry, and especially over large data sets. However, it's often difficult to get a sense of forward progress, especially since the underlying analysis questions often seem like elaborate fishing expeditions.

    I agree that this work is subtle, but it is not a "fishing expedition". Data mining work in practice tends to be focused on one or a small number of variables of interest, and mechanisms exist to prevent accidental findings. When practiced with rigor, data mining delivers results which are useful and durable. The predictive models I have built for banks, for instance, have frequently continued to perform accurately for years.

