The following is a summary of a talk I gave to undergraduates interested in going on to graduate school. It's targeted at the layperson, and tries to convey a sense of the interplay between data mining and geometry. I gave this talk partly because I realized that things we take utterly for granted in the rarefied world of high dimensional data mining are completely foreign to people who don't think about this for a living.
tl;dr: High dimensional geometry (and non-standard geometry) is the unifying language that binds data mining problems together.
We're trying to find patterns all over the place, with very different kinds of data. A search engine is trying to find patterns in web pages that might indicate that they have similar content. Brain researchers are throwing MRIs of patients with diseases into an algorithm that attempts to infer patterns in brain scans that might yield clues about pathology and diagnosis. Genome-wide analysis takes what are essentially long strings of letters and tries to explain why certain populations might be susceptible to certain diseases.
Pandora, Shazam and other music sites analyze fragments of music to find related artists, or even just match a tune. While an infinite Gangnam Style video might be a frivolous use of data mining on video streams, video streams are also being analyzed by robots trying to drive unmanned cars or even just play soccer. Social networks are all the rage: Facebook, Twitter and Google are desperate to understand your circle of friends in order to sell things to you more effectively.
How are we able to find patterns, clusters and trends in such different kinds of data? The key step in all of this is the idea of features. The trick is to describe each object we are studying as a sequence (or set) of features. For example, a feature set for a document could be the number of times each particular word appears. The feature set for an image could be the count of different colors. For a tune, it might be a collection of features identified by hiring artists to list out which of more than 700 characteristics a piece of music has.
And so on, and so on. What this allows us to do is represent each object as a set of features, whether it's a web page, a brain scan, a video, or even a social graph. Once we do that, we no longer have to worry about the original data (well, kind of!), and different types of data are all on an equal footing.
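To make this concrete, here is a minimal sketch (plain Python, with made-up toy documents) of the word-count feature idea for text:

```python
from collections import Counter

# A hypothetical toy corpus: each "document" is just a short string.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices rose as markets rallied",
]

# Build a shared vocabulary so every document gets a vector of the same length.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_feature_vector(doc):
    """Represent a document as word counts over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_feature_vector(doc) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(doc, "->", vec)
```

The same recipe applies to the other examples: count colors for an image, count musical characteristics for a tune, and so on.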
But what now? Here's where a cool trick that dates back to the 1600s comes in.
I doubt that René Descartes ever heard of the term "data mining". But in a strange way, he's the father of the field. One of Descartes' claims to fame was establishing a link between geometry and algebra. He said that if we wanted to represent points, lines and other shapes, we could do so not abstractly as Euclid and other classical geometers did, but using algebra. A "point" in the plane could be described by two coordinates (x, y), and a line in the plane could be described by the equation y = mx + c.
This is a very simple idea - children learn it in middle school. And yet like most simple ideas, it was revolutionary, and completely changed our perspective on geometry. The unification of algebra and geometry is so powerful and so natural, we use it almost unconsciously today.
But what does Descartes have to do with data mining?
Remember those features I was telling you about? That every object can be represented by a sequence of numbers, each number describing some aspect of the object.
Let's do the opposite of what Descartes proposed! In other words, let's pretend that the objects are actually points in a space. Now this space is not the plane (unless we had only two features). It's a high dimensional space, with one dimension for each feature.
This process is entirely artificial. There is no real reason why we must think of objects as points in a high dimensional space. But here's the cool thing that happens. We can now express all basic mining primitives as geometric concepts, by translating the language of the primitive to this high dimensional space.
A clustering of data becomes a grouping of points so that "nearby" points are grouped together. A classification of reviews into positive and negative statements becomes a way to separate "positive" points and "negative" points by a line. Finding trends in data becomes the problem of fitting a straight line to a collection of points.
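To give a flavor of this, here is a rough sketch, assuming NumPy and scikit-learn are available and using made-up points, of what those three primitives look like once the data is just a cloud of points:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend each row is the feature vector of some object (2 features for readability).
points = rng.normal(size=(100, 2))

# Clustering: group nearby points together.
cluster_labels = KMeans(n_clusters=3, n_init=10).fit_predict(points)

# Classification: separate "positive" from "negative" points with a line (hyperplane).
sentiment = (points[:, 0] + points[:, 1] > 0).astype(int)  # stand-in labels
separator = LogisticRegression().fit(points, sentiment)

# Trend finding: fit a straight line to the points.
slope, intercept = np.polyfit(points[:, 0], points[:, 1], deg=1)

print(cluster_labels[:10], separator.coef_, slope, intercept)
```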
It is hard to overstate how utterly bizarre this is. There is no underlying "geometry" in the problem we're solving. We're essentially creating a castle out of thin air in order to understand the problem. And yet it works, and is the bedrock of how we think about data mining.
But wait! There's more. What exactly does it mean to say that points are "nearby" or that they are "separated"? To answer this question, it's not enough to view objects as points in a high dimensional space. You have to give this space a shape - a geometry (and also a topology, but I'll skip that for now).
For example, if I have two feature lists, how do I measure the distance between them? If they were points on a map, I could use the usual straight-line distance. But does that really capture how far apart they are? After all, if you're in downtown Salt Lake City with its grid of streets, an "as the crow flies" distance doesn't help you estimate how far apart things really are. And if you're on the surface of the earth, you can't tunnel through the center to get from one point to another.
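For instance, here is a tiny sketch (plain Python, made-up feature vectors) of two different distances on the same pair of points: the straight-line distance and the grid-like "taxicab" distance you would actually walk downtown:

```python
import math

def euclidean(p, q):
    """Straight-line, 'as the crow flies' distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Grid distance: total blocks walked along the axes."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Two hypothetical feature vectors.
p, q = [0, 0, 3], [4, 0, 0]
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```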
So we have to create the shape of the space by defining how far apart two points are. And this is one of the trickiest parts of data mining. Either we use domain knowledge to decide which features are significant and should contribute more to the distance between objects, or we try to learn what seems like the right distance between objects from user estimates or other information.
The good thing is that we have a huge collection of shapes to play with, and different shapes tend to work well for different classes of objects. Some are easy to interpret, others are easy to compute, and so on. So a good part of the "art" in data mining is understanding the data and estimating the right geometry for it, so that our tasks (expressed geometrically) give us meaningful answers.
Or as Dorothy famously said to her dog, "Toto, I don't think we're in the plane any more!"
Thanks for the explanation. It's helpful to think of it this way.
I'm no theoretician, but I've applied data mining algorithms for years. One thing that has always been counterintuitive to me is that objects can't always be well represented by a vector of values. Or maybe I'm just thinking of it in the wrong way. For example, let's say your object is a house and you want to compare it against other house objects. Some features might be simple, like the height, width, square footage, exterior paint color, etc. But some features might be complex, like the configuration of the power wiring in the fourth bedroom. I guess you could find a way to represent such a complex feature with numerical values, but the point I'm trying to make is that some features are nested and complex while others are straightforward. So I guess my question is whether there are data-mining approaches designed to handle complex objects rather than vectors? Or is the field more about finding ways to convert complex objects to vectors?
This talk reminds me of my high school teacher, who always taught us to imagine geometrical objects as living objects in our minds with our eyes closed. This kind of analogy and imagination is a wonderful way of illustrating things to a layperson.
Hi Steve,
One answer to your question is that we do try to find a higher dimensional vector space to represent the complex objects. Roughly speaking, if there is some way to measure the distance between objects, and this distance has some nice properties, then there is a vector space in which Euclidean distance is exactly the object distance. This method is called the "kernel approach".
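To sketch the idea (a standard construction, not code from the post or the comment): a kernel behaves like an inner product in some implicit feature space, so a distance in that space can be computed from kernel values alone, via d(x, y)^2 = k(x, x) + k(y, y) - 2 k(x, y). A minimal example with the Gaussian (RBF) kernel, on made-up feature lists:

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: an inner product in an implicit feature space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def kernel_distance(x, y, kernel=rbf_kernel):
    """Euclidean distance between the images of x and y in the kernel's feature space."""
    return math.sqrt(kernel(x, x) + kernel(y, y) - 2 * kernel(x, y))

# Two hypothetical objects, already summarized as small feature lists.
a, b = [1.0, 2.0], [2.0, 0.5]
print(kernel_distance(a, b))
```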
This is very clear, but to nitpick:
"And yet like most simple ideas, it was revolutionary […]"
Really? I find it easy to believe the converse, that most revolutionary ideas are simple, but believing that most simple ideas are revolutionary seems hard.
Point taken. I intended the converse, as you indicate.
Just a curiosity from a person outside the field.
When it comes to data mining (in the broad sense, i.e. including pattern recognition, for example), the following point has always seemed problematic to me. If we want to find out things more complicated than handwritten digits, or if we simply do not have the time or resources to work out the small set of features that characterizes a given problem, we will probably have to deal with high dimensional spaces, as you note. But shouldn't the concentration of measure effect then interfere somehow with our efforts to separate class A from class B? In case you have some indications: what really happens with regard to this in real world problems (especially the ones off the beaten track)?
This is definitely a problem. There's no specific answer just yet: what people do is a mixture of careful validation, trying to find low dimensional structures that explain the phenomenon, and other ad hoc methods.
Steve,
Another way to deal with complex objects, related to kernel methods, is dissimilarity. Instead of kernels, which represent inner products between objects and which can be turned into distances, you quantify how similar or dissimilar objects are. The dissimilarities don't even have to satisfy the triangle inequality. You can then move to a vector representation via embedding methods (such as multidimensional scaling), or try to do the learning using the dissimilarities themselves.
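As a rough illustration of the embedding route (NumPy, with a made-up dissimilarity matrix; not the commenter's own code), here is classical multidimensional scaling, which turns pairwise dissimilarities into coordinates whose Euclidean distances approximate them:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed objects into `dim` dimensions from a symmetric dissimilarity matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squared dissimilarities
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:dim]  # keep the largest eigenvalues
    scales = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * scales        # coordinates, one row per object

# A made-up 4x4 dissimilarity matrix (symmetric, zero diagonal).
D = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.5, 2.5],
              [2.0, 1.5, 0.0, 1.0],
              [3.0, 2.5, 1.0, 0.0]])
print(classical_mds(D))
```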
Minor typo: it's "René Descartes", not "Réne Descartes".
Fixed, thanks.