
Monday, July 25, 2016

Carl Zimmer's series on exploring his genome

If you haven't yet read Carl Zimmer's series of articles (one, two, three), you should go out and read it now!

Because after all, it's Carl Zimmer, one of the best science writers around, especially when it comes to biology.

But even more so because when you read the story of his personal quest to understand his genetic story in all its multifaceted glory, you understand the terrifying opportunities and dangers in the use of genetic information for predictive and diagnostic medicine. You also realize the intricate way that computation is woven into this discovery, and how sequences of seemingly arbitrary choices lead to actual conclusions about your genome that you now have to evaluate for risk and likelihood.

In a sense, this is the tale of the use of all computational approaches right now, whether it be in science, engineering, the social sciences, or yes, even algorithmic fairness. Zimmer uses the analogy with telescopes to describe his attempts to look at his genome, and this explanation is right on the money:
Early telescopes weren’t terribly accurate, either, and yet they still allowed astronomers to discover new planets, galaxies, and even the expansion of the universe. But if your life depended on your telescope — if, for example, you wanted to spot every asteroid heading straight for Earth — that kind of fuzziness wouldn’t be acceptable.
And this quote from Robert Green, one of the geneticists who was helping Zimmer map out his genome:
Ultimately, the more you know, the more frustrating an exercise it is. What seemed to be so technologically clear and deterministic, you realize is going through a variety of filters — some of which are judgments, some of which are flawed databases, some of which are assumptions about frequencies, to get to a best guess.
There's a message in this for all of us doing any kind of data mining.

Monday, February 02, 2015

(the dangers of) Data Mining hits the Super Bowl (ads) and prime time comedy

I get strange looks from people in my work circles when I admit to watching American football, and stranger looks from people in my not-work circles when I admit to watching the Super Bowl for the game.

I didn't avoid the ads though. And two ads (both from Esurance) riffed off the idea of segmenting: that the business goal of data mining customer data is to segment people into little categories that can be targeted the same way. The ads are a little heavy-handed, but then I don't think most people actually know how data mining works to segment them. And getting Bryan Cranston to play a "sort-of" pharmacist was brilliant.

Esurance: "Sorta Pharmacy"

 

Esurance: "Sorta Mom"


And finally, an entire 30-minute show centered around the idea of a Facegoogle-like company invading our privacy by reading our texts/emails etc. If you don't watch Parks and Recreation, it's too difficult to summarize six-odd seasons of shenanigans, but in brief: a Facebook-Google mashup of a company is setting up shop in a small town in Indiana and is sending people free gifts via Amazon-like shopping drones based on analyzing their private communications, and then shenanigans ensue. If you can't stand the horror of watching network TV, there's a short clip here:

   

And the full episode is on Hulu (for now) here.

Thursday, December 11, 2014

Accountability in data mining

For a while now, I've been thinking about the endgame in data mining. Or more precisely, about a world where everything is being mined for all kinds of patterns, and the contours of our life are shaped by the patterns learnt about us.

We're not too far from that world right now - especially if you use intelligent mail filtering, or get your news through social media. And when you live in such a world, the question is not "how do I find patterns in data", but rather
How can I trust the patterns being discovered ? 
When you unpack this question, all kinds of questions come to mind: How do you verify the results of a computation ? How do you explain the results of a learning algorithm ? How do you ensure that algorithms are doing not just what they claim to be doing, but what they should be doing ?

The problem with algorithms is that they can often be opaque: but opacity is not the same as fairness, and this is now a growing concern amongst people who haven't yet drunk the machine learning kool-aid.

All of this is a lead up to two things. Firstly, the NIPS 2014 workshop on Fairness, Accountability and Transparency in Machine Learning, organized by +Moritz Hardt and +Solon Barocas and written about by Moritz here.

But secondly, it's about work that +Sorelle Friedler, +Carlos Scheidegger and I are doing on detecting when algorithms run afoul of a legal notion of bias called disparate impact. What we're trying to do is pull out from a legal notion of bias something more computational that we can use to test and certify whether algorithms are indulging in legally proscribed discriminatory behavior.

This is preliminary work, and we're grateful for the opportunity to air it out in a forum perfectly designed for it.

And here's the paper:
What does it mean for an algorithm to be biased?
In U.S. law, the notion of bias is typically encoded through the idea of disparate impact: namely, that a process (hiring, selection, etc) that on the surface seems completely neutral might still have widely different impacts on different groups. This legal determination expects an explicit understanding of the selection process. 
If the process is an algorithm though (as is common these days), the process of determining disparate impact (and hence bias) becomes trickier. First, it might not be possible to disclose the process. Second, even if the process is open, it might be too complex to ascertain how the algorithm is making its decisions. In effect, since we don't have access to the algorithm, we must make inferences based on the data it uses. 
We make three contributions to this problem. First, we link the legal notion of disparate impact to a measure of classification accuracy that while known, has not received as much attention as more traditional notions of accuracy. Second, we propose a test for the possibility of disparate impact based on analyzing the information leakage of protected information from the data. Finally, we describe methods by which data might be made "unbiased" in order to test an algorithm. Interestingly, our approach bears some resemblance to actual practices that have recently received legal scrutiny.
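
To make the legal threshold concrete (this is the standard "80% rule" used in disparate impact cases, not the certification procedure from our paper), here's a minimal Python sketch of the disparate impact ratio on made-up numbers:

    # Sketch of the "80% rule": compare selection rates across groups.
    def disparate_impact_ratio(outcomes, groups, protected, majority):
        def selection_rate(g):
            decisions = [o for o, grp in zip(outcomes, groups) if grp == g]
            return sum(decisions) / len(decisions)
        return selection_rate(protected) / selection_rate(majority)

    # Toy example: 4 of 10 protected applicants selected vs 7 of 10 in the majority group.
    outcomes = [1]*4 + [0]*6 + [1]*7 + [0]*3
    groups = ["protected"]*10 + ["majority"]*10
    ratio = disparate_impact_ratio(outcomes, groups, "protected", "majority")
    print(ratio)  # about 0.57, below the 0.8 threshold, so disparate impact is indicated
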
In one form or another, most of my recent research touches upon questions of accountability in data mining. It's a good thing that it's now a "trend for 2015". I'll have more to say about this in the next few months, and I hope that the issue of accountability stays relevant for more than a year.

Wednesday, April 30, 2014

SIAM Data Mining 2014: On differential privacy

After my trip to Haverford, I attended the SIAM Data Mining (SDM) conference in Philly. For those who aren't that familiar with the data mining universe, SDM is the SIAM entrant in the data mining conference sweepstakes, along with ACM (KDD) and IEEE (ICDM). SDM is probably also the smallest of the three venues, which makes it comparable in feel to SODA (also because of SIAM organization). The conference attracts the usual data mining suspects, but also more of the applied math folks.

I was the tutorials chair this year, and there were a number of very well-attended tutorials ranging from applications to core mining to theory. In particular, +Moritz Hardt and +Aleksandar Nikolov did a very nice tutorial on differential privacy entitled 'Safer Data Mining'.


SDM is a good venue for theory folks wanting to "test the waters" with data mining: the papers are consistently more mathematically oriented and less "business-heavy", and it's a friendly crowd :).

Shameless plug: I'm the PC co-chair next year along with Jieping Ye and I'd encourage more algorithms folks to submit, and visit Vancouver in April.

In a future post I'll talk more about a panel I also ran at the conference, titled 'Ethics in Data Mining'.

Tuesday, April 22, 2014

The Shape of Information

A brief synopsis of my general-audience talk at Haverford College. 

I'm currently visiting Haverford College at the invitation of +Sorelle Friedler as part of Haverford's big data lecture series. Today's talk was a general audience talk about data mining, titled 'The Shape Of Information': (GDrive link)
The Shape Of Information 
What makes data mining so powerful, and so ubiquitous? How can the same set of techniques identify patients at risk for a rare genetic disorder, consumers most likely to like Beyoncé's latest album, or even a new star from a sky survey ?
The answer starts with an idea Descartes had nearly 400 years ago. He suggested expressing geometry in terms of numbers (coordinates). This turned out to be a powerful technique that led (among other things) to the development of the calculus. Data mining returns the favor. It starts with sets of numbers that describe a collection of objects. To find patterns in these objects, we create a geometry in which the numbers are coordinates. And just like that, objects become shapes, and the search for information becomes a quest for common structure in these shapes. 
In this search, we are not limited by the geometry of our world: we can dream up ever more intricate geometries that capture the shape of the information that we seek to find in our data. In this sense, data mining is the best kind of science fiction come to life: we craft a world out of our imagination, and let the laws of this world lead us to fascinating discoveries about the data that inhabits it.
I had a great time visiting with Sorelle's students in their data mining class. Haverford College continues to impress me with the quality of their undergrads (and their faculty !)

Thursday, March 20, 2014

Data Mining, machine learning and statistics.

How does one tell data mining, machine learning and statistics apart ?

If you spend enough time wandering the increasingly crowded landscape of Big Data-istan, you'll come across the warring tribes of Datamine, MachLearn and Stat, whose constant bickering will make you think fondly of the People's Front of Judea:


Cosma Shalizi has what I think is a useful delineation of the three tribes that isn't prejudicial to any of them ("Stats is just inefficient learning !", "MachLearn is just the reinvention of statistics!" "DataMine is a series of hacks!"). It goes something like this:

  • Data mining is the art of finding patterns in data.
  • Statistics is the mathematical science associated with drawing reliable inferences from noisy data.
  • Machine learning is [the branch of computer science] that develops technology for automated inference (his original characterization was as a branch of engineering).
I like this characterization because it emphasizes the different focus: data mining is driven by applications, machine learning by algorithms, and statistics by mathematical foundations. 

This is not to say that the foci don't overlap: there's a lot of algorithm design in data mining and plenty of mathematics in ML. And of course applied stats work is an art as much as a science. 

But the primary driving force is captured well. 

Wednesday, March 19, 2014

Reproducibility in data mining/learning

Is reproducibility in data mining/learning different from reproducibility in systems research ?

+John Regehr and +Shriram Krishnamurthi have discussions (John, Shriram) arising from a new paper on  reproducibility in computer systems research from people at the U. of Arizona. You should read their posts/G+ discussions for the specifics.

But it got me thinking: is there anything intrinsically different about reproducibility in my neck of the woods ? Many concerns are shared:

  • The need for a standard computing environment to build and test code: MATLAB and scikit-learn make that a lot easier for us: we don't need to worry too much about details of the computing environment/hardware just to get the code to run. 
  • Discrepancies between what is stated in the paper and what the code actually does: that's a problem with any experimental work: John in fact mentions one such issue in one of his recent papers.
  • Code rot: otherwise known as 'graduation' :). I think if more people used public repos like github from the beginning, some of these problems might go away. I would be remiss if I didn't also plug +Robert Ricci's new system AptLab for helping distribute research.
But here's one that may not be shared: data preparation.

It's no secret that data preparation takes most of the time in a data mining pipeline. This includes:
  • data cleaning (removing errors, reformatting data)
  • feature engineering (deciding how to truncate, modify or retain certain features)
  • training (cross-validation levels, what data is used for model building and so on)
As with any of the above, a proper specification can ensure true reproducibility, but there are lots of things we might do to data without necessarily even realizing it (if there are bugs in the transformation pipeline) or without thinking we need to mention it (truncating ranges, defining null values away). 
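
As a minimal sketch of what making these choices explicit could look like (assuming scikit-learn; the clipping range, imputation strategy and synthetic data are all illustrative, not recommendations), the entire preparation can be packaged as a single pipeline object that ships with the paper:

    # Data preparation made explicit as one pipeline: imputation, truncation,
    # scaling and the cross-validation protocol are all spelled out in code.
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import FunctionTransformer, StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
    X[rng.random(X.shape) < 0.05] = np.nan            # a few missing values

    prep_and_model = Pipeline([
        ("impute", SimpleImputer(strategy="median")),                 # how nulls are defined away
        ("clip", FunctionTransformer(lambda Z: np.clip(Z, -3, 3))),   # truncating ranges
        ("scale", StandardScaler()),
        ("clf", LogisticRegression()),
    ])
    print(cross_val_score(prep_and_model, X, y, cv=5).mean())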

Feature engineering can also be a problem, especially with models that use many features (for example, deep learning systems have lots of "knobs" to tune, and they can be quite sensitive to the knobs).

So one thing I often look for when reviewing such papers is sensitivity: how well the authors can demonstrate robustness with respect to the parameter/algorithm choices. If they can, then I feel much more confident that the result is real and is not just an artifact of a random collection of knob settings combined with twirling around and around holding one's nose and scratching one's ear. 




Wednesday, October 02, 2013

Call for tutorials at SIAM Data Mining

I'm the tutorials chair for the 2014 SIAM conference on data mining (SDM) (for readers not aware of the data mining landscape, SDM is a SODA-style data mining conference run by SIAM - i.e. it's a CS-style venue with peer-reviewed papers, rather than a math-style conference like SIAM discrete math). SDM is one of the major venues for data mining research, especially the more statistically focused kind. It will be held in Philadelphia, April 24-26, 2014.

SDM runs 4-5 tutorials each year on various aspects of data mining (ranging from theory/algorithms to specific application areas). I'm personally very interested in encouraging submissions from people in the theory community working with data who might want to share their expertise about new methods/techniques from theoryCS land with the larger data analysis community.

The deadline is Oct 13, and all you need is a few pages describing the tutorial content, target audience, and some sample material (more details here). If you have an idea and are not sure whether it will fit, feel free to email me directly as well.

Monday, February 25, 2013

Data analysis, interpretation and explanations

There was a recent data-related kerfuffle between the New York Times and the makers of the Tesla electric car. If you haven't read the articles, (and the NYT public editor's post on this has good links), the crude summary is this:

  • NYT reporter takes Tesla on long test drive, and reports problems with running out of charge.
  • Tesla Motors CEO Elon Musk writes a long takedown of the reporter's review, complete with graphs generated from data the car recorded during the drive
  • NYT comes back with their own interpretation of data, rebutting Musk's claims.
  • Others attempt to reproduce the reporter's experience and fail, but arguably in different weather conditions that might or might not make a difference.
In an insightful meta-analysis of the dustup, Taylor Owen of the Tow Center for Digital Journalism discusses the implications for journalism in a data-driven world. He also references an article by David Brooks that makes the point:
People are really good at telling stories that weave together multiple causes and multiple contexts. Data analysis is pretty bad at narrative and emergent thinking...
I've felt for a while now that it's time to design mechanisms for providing context and narrative to data analysis*. Some of the research my student Parasaran does is on metaclustering: essentially the meta-analysis of different perspectives (clusterings) of data to draw out a larger meaning. We've also just submitted a paper on how to derive local markers of prediction validity in clustering (rather than just relying on global measures of quality). And my seminar this semester is about the larger problems of explanations and accounting in data mining.

I think as computer scientists, we have a lot to offer in the realm of data mining - not just in the design of tools for prediction, but in the design of tools to facilitate better understanding.


* None of this is surprising to experimental scientists, who will routinely attack a problem from multiple directions in order to acquire a larger understanding of a phenomenon rather than just the ability to predict. 


Monday, January 21, 2013

Accountability in Data Mining: A new seminar

Glencora Borradaile makes a number of interesting points about the implications of Aaron Swartz's life and work for us academic computer scientists. 
As computer science academics we are in a very powerful position. We are trusted with shaping the next generation that will make very important decisions that will have far-reaching social implications. Decisions like those over Facebook’s privacy defaults, motivating technology that enables autonomous private vehicles at the expense of the public interest, defining ownership of electronic media. We make those decisions ourselves in our research; what we research, how we allow our research to be used.
 Whether or not you agree with her particular policy preferences, the point remains that the technologies we develop can have lasting consequences for the "shape" of the world to come. And if we don't speak up (either through our work, or through our own advocacy), others will take the technologies we develop and find their own ways of using or misusing them.

All of this leads up to my current interests in data mining. I've been thinking about the problems of accountability in data mining for a while now: initially mostly in private, but now more and more in public (along with +Graham Cormode and +Andrew McGregor) as I see the importance of the issue.

What is accountability in data mining ? It means many things really, but for me, for now, it means this:

The fruits of data mining pervade every aspect of our life. We are recommended books and movies; given differential pricing for insurance; screened for potential terror threats; diagnosed with various diseases; and targeted for political advertising. The ability to sift through massive data sets with sophisticated algorithms has resulted in applications with impressive predictive power. 
Since the internals of a learning process can be complex and opaque, it is important to verify that the results satisfy the properties claimed by the algorithm. The importance of this goes beyond merely checking the algorithm itself. Validation mechanisms also provide a way for users to demand accountability from authorities who might use the results of mining to affect users in some way (by changing their insurance rates, or even putting them on a terrorism watchlist). As the results of data mining affect more and more of our lives, the more crucial it is that the user be able to validate decisions made on their behalf and that affect them. 
The above is an introduction to a seminar I'm running this semester on this topic. I'm a little nervous about it, because the topic is vast and unstructured, and almost anything I see nowadays on data mining appears to be "in scope". I encourage you to check out the outline, and comment on topics you think might be missing, or on other things worth covering. Given that it's a 1-credit seminar that meets once a week, I obviously can't cover everything I'd like, but I'd like to flesh out the readings with related work that people can peruse later.

It's entirely possible that we don't care about our privacy any more (as Facebook seems to think*). But we will care about the consequences of the use of our data. My perspective is that a better understanding of what is technically possible (and not possible) to demand and get accountability will be critical to informing larger policy discussions on this topic.

* In an earlier version of this post I wrongly attributed this sentiment to an individual. Apologies.

Saturday, December 15, 2012

NIPS II: Deep Learning and the evolution of data models

(tl;dr: some rambles and musings on deep learning and data, as I attempt to sort out in my head what this all means)

Over the years, as we've engaged with "big data" more and more, the way we construct mental models of data has changed. And as I've argued before, understanding how we think about data, and what shape we give it, is key to the whole enterprise of finding patterns in data.

The model that one always starts with is Euclidean space. Data = points, features = dimensions, and so on. And as a first approximation of a data model, it isn't terrible.

There are many ways to modify this space. You can replace the $\ell_2$ norm by $\ell_1$. You can normalize the points (again with $\ell_2$ or $\ell_1$, sending you to the sphere or the simplex). You can weight the dimensions, or even do a wholesale scale-rotation.
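
As a tiny illustration of those last few moves (plain numpy on synthetic points; nothing here is specific to any data set), normalization really is just a projection onto the sphere or the simplex:

    # l2-normalizing rows puts points on the unit sphere; l1-normalizing
    # nonnegative rows puts them on the probability simplex.
    import numpy as np

    X = np.abs(np.random.randn(5, 3))    # 5 points with 3 nonnegative features
    on_sphere = X / np.linalg.norm(X, ord=2, axis=1, keepdims=True)
    on_simplex = X / np.linalg.norm(X, ord=1, axis=1, keepdims=True)
    print(np.linalg.norm(on_sphere, axis=1))   # all ones: points lie on the sphere
    print(on_simplex.sum(axis=1))              # all ones: points lie on the simplex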

But that's not all. Kernels take this to another level. You can encode weak nonlinearity in the data by assuming that it's flat once you lift it. In a sense you're still working in an $\ell_2$ space, but you now have a much larger class of spaces to work with. The entire SVM enterprise was founded on this principle.

But that's not all either. The curse of dimensionality means that it's difficult to find patterns in such high dimensional data. Arguably, "real data" is in fact NOT high dimensional, or is not generated by a process with many parameters, and so sparsity-focused methods like compressed sensing start playing a role.

But it gets even more interesting. Maybe the data is low-dimensional, but doesn't actually lie in a subspace. This gets you into manifold learning and variants: the data lies on a low-dimensional curved sheet of some kind, and you need to learn on that space.

While the challenge for geometry (and algorithms) is to keep up with the new data models, the challenge for data analysts is to design data models that are realistic and workable.

So what does this have to do with deep learning ?

Deep learning networks "work" in that they appear to be able to identify interesting semantic structures in data that can be quite noisy. But to me it's not entirely clear why that is. So I've been mulling the question of what kind of structure in data might be "visible" to such a network. In the absence of formal results ("if your data can be separated in some RKHS, then an SVM will find it"), what follows is some speculation, based on talks I attended and conversations I had with NIPS attendees.

Observation 1: Both Stéphane Mallat's talk and a nice paper by Coates, Karpathy and Ng talk about the idea of "first-level" features that identify (low-dimensional) invariants and eliminate noise. In the Coates et al paper, they start with a very fine $k$-means clustering ($k$ very large), and attempt to "glue" the centers together into low dimensional subspace pieces. These are then propagated up a network of higher-order feature processors.

Observation 2: Yoshua Bengio's survey of deep learning (a very readable account) points to work by Geoff Hinton as part of the reinvigoration of the field. A central idea of this work is that deep belief networks can be trained "layer by layer", where each layer uses features identified from the previous layer.

If you stare at these things long enough, you begin to see a picture not of sparse data, or low-rank data, or even manifold data. What you see is a certain hierarchical collection of subspaces, where low-dimensional spaces interact in a low dimensional way to form higher level spaces, and so on. So you might have a low-level "lip" feature described by a collection of 2-3 dimensional noisy subspaces in an image space. These "lip" features in turn combine with "eye" features and so on.

This might seem rather far-fetched, and a strange way to think about data. But I can't claim originality even here. Back in June, Mauro Maggioni gave a talk at CGWeek in Chris Bishop's workshop on the connection between analysis and geometry. In this talk, he described a multi-resolution view of data that admits structure at different scales, and admits different structures at these scales.

The upshot of all of this: it is possible that deep learning is trying to capture hierarchies of low dimensional subspaces that interact in a low dimensional way. This would explain how one is able to avoid the curse of dimensionality, and might also explain why it sometimes can find structure that other methods (kernels, manifold methods, etc) might miss.

Problem is: I'm not sure how one tests this hypothesis. Almost any data set could be viewed this way if you allow enough features and enough "depth" in the hierarchy.

Wednesday, November 07, 2012

Data, Dimensions and Geometry oh my !

The following is a summary of a talk I gave to undergraduates interested in going on to graduate school. It's targeted at the layperson, and tries to convey a sense of the interplay between data mining and geometry. I gave this talk partly because I realized that things that we take utterly for granted in the rarified world of high dimensional data mining are completely foreign to people who don't think about this for a living. 

tl;dr: High dimensional geometry (and non standard geometry) is the unifying language that binds data mining problems together.

We're trying to find patterns all over the place, with very different kinds of data. A search engine is trying to find patterns in web pages that might indicate that they have similar content. Brain researchers are throwing MRIs of patients with diseases into an algorithm that attempts to infer patterns in brain scans that might yield clues about pathology and diagnosis. Genome-wide analysis takes what are essentially long strings of letters and tries to explain why certain populations might be susceptible to certain diseases.

Pandora, Shazam and other music sites analyze fragments of music to find related artists, or even just match a tune. While an infinite gangnam-style video might be a frivolous use of data mining on video streams, video scans are being analyzed by robots trying to drive unmanned cars or even just play soccer. Social networks are all the rage: Facebook, Twitter and Google are desperate to understand your circle of friends in order to sell things to you more effectively.

How are we able to find patterns, clusters and trends in such different kinds of data ? The key step in all of this is the idea of features. The trick is to describe each object we are studying as a sequence (or set) of features. For example, a feature set for a document could be the number of times each particular word appears. The feature set for an image could be the count of different colors. For a tune, it might be a collection of features identified by hiring artists to list out which of more than 700 characteristics a piece of music has.
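
To make "features" concrete (a toy sketch in plain Python, with two made-up documents), here is the word-count representation of a document:

    # One dimension per word in the vocabulary; the value is how often the word appears.
    from collections import Counter

    docs = ["the cat sat on the mat", "the dog sat on the log"]
    vocab = sorted(set(" ".join(docs).split()))

    def features(doc):
        counts = Counter(doc.split())
        return [counts[w] for w in vocab]

    for d in docs:
        print(features(d))   # e.g. [1, 0, 0, 1, 1, 1, 2] for the first document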

And so on, and so on. What this allows us to do is represent each object as a set of features, whether it's a web page, a brain scan, a video, or even a social graph. Once we do that, we no longer have to worry about the original data (well, kind of !), and different types of data are all on an equal footing.


But what now ? Here's where a cool trick that dates back to the 1600s comes in.

I doubt that René Descartes ever heard of the term "data mining". But in a strange way, he's the father of the field. One of Descartes' claims to fame was establishing a link between geometry and algebra. He said that if we wanted to represent points, lines and other shapes, we could do so not abstractly as Euclid and other classical geometers did, but using algebra. A "point" in the plane could be described by two coordinates (x,y), and a line in the plane could be described by the equation y = mx + c.

This is a very simple idea - children learn it in middle school. And yet like most simple ideas, it was revolutionary, and completely changed our perspective on geometry. The unification of algebra and geometry is so powerful and so natural, we use it almost unconsciously today.

But what does Descartes have to do with data mining ?

Remember those features I was telling you about ? That every object can be represented by a sequence of numbers, each number describing some aspect of the object.

Let's do the opposite of what Descartes proposed ! In other words, let's pretend that the objects are actually points in a space. Now this space is not the plane (unless we had only two features). It's a high dimensional space, with one dimension for each feature.

This process is entirely artificial. There is no real reason why we must think of objects as points in a high dimensional space. But here's the cool thing that happens. We can now express all basic mining primitives as geometric concepts, by translating the language of the primitive to this high dimensional space.

A clustering of data becomes a grouping of points so that "nearby" points are grouped together. A classification of reviews into positive and negative statements becomes a way to separate "positive" points and "negative" points by a line. Finding trends in data becomes the problem of fitting a straight line to a collection of points.

It is hard to overstate how utterly bizarre this is. There is no underlying "geometry" in the problem we're solving. We're essentially creating a castle out of thin air in order to understand the problem. And yet it works, and is the bedrock of how we think about data mining.

But wait ! There's more. What exactly does it mean to say that points are "nearby" or that they are "separated" ? To answer this question, it's not enough to view objects as points in a high dimensional space. You have to give this space a shape - a geometry (and also a topology, but I'll skip that for now).

For example, if I have two feature lists, how do I measure the distance between them ? If they were points on a map, I could do the usual straight line distance. But does that really capture how far apart they are ? After all, if you're in downtown Salt Lake with its grids, a "crow flies" distance doesn't help estimate how far things are. If you're on the surface of the earth, you can't really tunnel through the center to get from one point to another.

So we have to create the shape of the space by defining how far apart two points are. And this is one of the trickiest parts of data mining. Either we have to use some domain knowledge to estimate which features are significant and control more of the distance between objects, or we have to try and learn what seems like the right distance between objects based on user estimates or other information.
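
As a tiny sketch of that choice (toy coordinates, plain Python), even the two most familiar distances can disagree about how far apart the same pair of points is:

    # "Crow flies" distance vs "downtown grid" distance between the same two points.
    import math

    p, q = (0, 0), (3, 4)
    euclidean = math.hypot(q[0] - p[0], q[1] - p[1])     # straight-line: 5.0
    manhattan = abs(q[0] - p[0]) + abs(q[1] - p[1])      # along the grid: 7
    print(euclidean, manhattan)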

The good thing is that we have a huge collection of shapes to play with, and different shapes tend to work well for different classes of objects. Some are easy to interpret, others are easy to compute, and so on. So a good part of the "art" in data mining is understanding the data and estimating the right geometry for it, so that our tasks (expressed geometrically) give us meaningful answers.

Or as Dorothy famously said to her dog, "Toto, I don't think we're in the plane any more!"

Tuesday, November 06, 2012

On the elections, Nate Silver, and lessons for data mining

One interesting side story from this election has been the intense focus on Nate Silver's election predictions, and the matter of aggregate polling statistics. While there's certainly a partisan element to much of the discussion, there's also a larger sense of unease about what the predictions are actually saying.

There are lessons here for the greater goal of using data mining for prediction and modelling, and this will get more and more important the more predictive analytics intrude on our lives.

People don't quite understand probability, even though they understand odds. 
There's a lot of confusion about what it means for a one-time event to have a probability associated with it, even though people are quite comfortable with the idea of odds in (for example) sports. This to me reflects a deeper confusion between the frequentist and Bayesian view of probability: namely, is probability the long-run average of the frequency of an event, or an a priori expression of uncertainty in the likelihood of an event ?

I'm not going to say anything here that hasn't been said for more than a century, but from the point of view of interpreting the results of mining, this distinction will be important.

Aggregation is a GOOD THING. 
It is entirely likely that all the poll aggregators will have egg on their face tomorrow when the results come in. But it does seem that aggregation helps smooth out the natural irregularities and biases in individual polls. This is a theme that comes up time and again in data mining, and especially in clustering: rather than trusting a single algorithm, you should try to run algorithms that return diverse answers and aggregate them in some fashion.
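
A toy simulation of the point (my numbers, not anyone's actual polls, and it assumes the polls are independent and unbiased, which real house effects violate):

    # Each "poll" is a noisy estimate of the same margin; the average of many
    # polls fluctuates far less than any single poll does.
    import numpy as np

    rng = np.random.default_rng(1)
    true_margin = 2.0                                            # hypothetical margin, in points
    polls = true_margin + rng.normal(0, 3.0, size=(1000, 15))    # 15 polls, simulated 1000 times
    print(np.std(polls[:, 0]), np.std(polls.mean(axis=1)))       # single-poll spread vs aggregate spread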

It's not enough to predict: you must also explain. 
Among the good reasons to feel uneasy about the aggregate predictions is that they don't give much insight into why things are going the way they are. To be fair, Nate Silver laid out some economic markers back in the spring, and tied them to possible outcomes via regression. While this provides some  insight, it's still pattern matching as opposed to a mechanism.

More generally, it is very difficult to convince people that an answer is pertinent or "correct" unless there's some way to explain the answer without listing a sequence of coefficients or writing down a collection of centers. Much of the popularity of decision trees comes from the ease of explanation they seem to provide.

Conclusion. 

Most of the controversy around data mining in the public sphere has centered around privacy issues. Indeed, privacy issues become a concern precisely because people worry that the mining algorithms are too accurate. But in fact we don't really understand the behavior of many algorithms that we use, and that is dangerous regardless of privacy concerns. The angst over the methods used to predict this election is an illustration of what will happen when the predictions we make start to matter, and matter to many people.

Friday, December 23, 2011

Thoughts on ICDM II: Social networks

The other trend that caught my eye at ICDM is the dominance of social networking research. There was a trend line at the business meeting that bore this out, showing how topics loosely classified as social networking had a sharp rise among accepted papers in ICDM over the past few years.

There were at least three distinct threads of research that I encountered at the conference, and in each of them, there's something to interest theoreticians.
  • The first strand is modelling: is there a way to describe social network graphs using abstract evolution models or random graph processes ? I spent some time discussing this in a previous post, so I won't say more about it here. Suffice it to say that there's interesting work in random graph theory underpinning this strand, as well as a lot of what I'll call 'social network archaeology': scouring existing networks for interesting structures and patterns that could be the basis for a future model.
  • The second strand is pattern discovery, and the key term here is 'community': is there a way to express natural communities in social networks in a graph-theoretic manner ? While modularity is one of the most popular ways of defining community (the standard definition is written out just after this list), it's not the only one, and has deficiencies of its own. In particular, it's not clear how to handle "soft" or "overlapping" communities. More generally, there appears to be no easy way to capture the dynamic (or time-varying) nature of communities, something Tanya Berger-Wolf has spent a lot of energy thinking about. Again, while modelling is probably the biggest problem here, I think there's a lot of room for good theory, especially when trying to capture dynamic communities.
  • The final strand is influence flow. After all, the goal of all social networking research is to monetize it (I kid, I kid). A central question here is: can you identify the key players who can make something go viral for cheap ? Is the network topology a rich enough object to identify these players, and even if you do, how can you maximize flow (on a budget, efficiently) ?
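
For reference (this is the standard Newman-Girvan definition, in my notation): given a graph with adjacency matrix $A$, degrees $k_i$, $m$ edges and a community assignment $c_i$ for each vertex, the modularity of the partition is
$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j),$$
which rewards communities that have more internal edges than a random graph with the same degree sequence would be expected to have.
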
There were many papers on all of these topics -- too many to summarize here. But the landscape is more or less as I laid it out. Social networking research is definitely in its bubble phase, which means it's possible to get lots of papers published without necessarily going deep into the problem space. This can be viewed as an invitation to jump in, or a warning to stay out, depending on your inclination. And of course, the definitive tome on this topic is the Kleinberg-Easley book

This concludes my ICDM wrap-up. Amazingly, it only took me a week after the conference concluded to write these up.

Thursday, December 22, 2011

Thoughts on ICDM I: Negative results (part C)

This is the third of three posts (one, two) on negative results in data mining, inspired by thoughts and papers from ICDM 2011.

If you come up with a better way of doing classification (for now let's just consider classification, but these remarks apply to clustering and other tasks as well), you have to compare it to prior methods to see which works better. (Note: this is a tricky problem in clustering that my student Parasaran Raman has been working on; more on that later.)

The obvious way to compare two classification methods is how well they do compared to some ground truth (i.e. labelled data), but this is a one-parameter system, because by changing the threshold of the classifier (or if you like, translating the hyperplane around), you can change the false positive and false negative rates.

Now the more smug folks reading this are waiting with "ROC" and "AUC" at the tip of their tongues, and they'd be right ! You can plot a curve of the false positive vs false negative rate and take the area under the curve (AUC) as a measure of the effectiveness of the classifier.

For example, if the y-axis measured increasing false negatives, and the x-axis measured increasing false positives, you'd want a curve that looked like an L with the apex at the origin, and a random classifier would look like the line x+y = 1. The AUC score would be zero for the good classifier and 0.5 for the bad one (there are ways of scaling this to be between 0 and 1).
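
In practice the computation usually runs over the standard (true positive rate vs false positive rate) version of the curve; here's a sketch assuming scikit-learn, with synthetic scores:

    # Sweep the threshold, trace out the ROC curve, and summarize it with the AUC.
    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    rng = np.random.default_rng(0)
    y_true = np.concatenate([np.zeros(500), np.ones(500)])
    y_score = np.concatenate([rng.normal(0, 1, 500), rng.normal(1, 1, 500)])  # a mediocre classifier

    fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
    print(roc_auc_score(y_true, y_score))               # around 0.76 for this much separation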

The AUC is a popular way of comparing methods in order to balance the different error rates. It's also attractive because it's parameter-free and is objective: seemingly providing a neutral method for comparing classifiers independent of data sets, cost measures and so on.

But is it ?

There's a line of work culminating (so far) in the ICDM 2011 paper 'An Analysis of Performance Measures For Binary Classifiers' by Charles Parker of BigML.com (sorry, no link yet). The story is quite complicated, and I doubt I can do it justice in a blog post, so I'll try to summarize the highlights, with the caveat that there's nuance that I'm missing out on, and you'll have to read the papers to dig deeper.

The story starts with "Measuring classifier performance: a coherent alternative to the area under the ROC curve" ($$ link) by David Hand (copy of paper here). His central result is a very surprising one:
The AUC measure implicitly uses different misclassification costs when evaluating different classifiers, and thus is incoherent as an "objective" way of comparing classifiers.
To unpack this result a little, what's going on is this. Suppose you have a scenario where correct classification costs you nothing, and misclassification costs you a certain amount (that could be different for the two different kinds of misclassification). You can now write down an overall misclassification cost for any threshold used for a classifier, and further you can compute the optimal threshold (that minimizes this cost). If you don't actually know the costs (as is typical) you can then ask for the expected misclassification cost assuming some distribution over the costs.
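
Schematically (my notation, not Hand's exactly): with class priors $\pi_0, \pi_1$ and costs $c_0, c_1$ for the two kinds of misclassification, the cost at threshold $t$, the optimal threshold, and the quantity being averaged are
$$L(t; c_0, c_1) = c_0\,\pi_0\,\Pr[\text{false positive at } t] + c_1\,\pi_1\,\Pr[\text{false negative at } t], \qquad t^*(c_0, c_1) = \arg\min_t L(t; c_0, c_1),$$
$$H = \int L\big(t^*(c_0, c_1);\, c_0, c_1\big)\, w(c_0, c_1)\, \mathrm{d}c_0\, \mathrm{d}c_1,$$
where $w$ is the distribution expressing our uncertainty about the costs.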

If you run this computation through, what you end up with is a linear transformation of the AUC, where the distribution over the costs depends on the distribution of scores assigned by the classifier ! In other words, as Hand puts it,
It is as if one measured person A’s height using a ruler calibrated in inches and person B’s using one calibrated in centimetres, and decided who was the taller by merely comparing the numbers, ignoring the fact that different units of measurement had been used
This is a rather devastating critique of the use of the AUC. While there's been pushback (case in point is an ICML 2011 paper by Flach, Hernandez-Orallo and Ferri which is a very interesting read in its own right), the basic premise and argument is not contested (what's contested is the importance of finding the optimal threshold). Hand recommends a few alternatives, and in fact suggests that the distribution of costs should instead be made explicit, rather than being implicit (and subject to dependence on the data and classifiers).

What Parker does in his ICDM paper is take this further. In the first part of his paper, he extends the Hand analysis to other measures akin to the AUC, showing that such measures are incoherent as well. In the second part of his paper, he unleashes an experimental tour de force of classifier comparisons under different quality measures, showing that
  • nearly 50% of the time, measures disagree on which classifier in a pair is more effective. He breaks down the numbers in many different ways to show that if you come up with a new classification algorithm tomorrow, you'd probably be able to cherry pick a measure that showed you in a good light. 
  • It's the measures perceived to be more "objective" or parameter-less that had the most trouble reconciling comparisons between classifiers. 
  • It's also not the case that certain classifiers are more likely to cause disagreements: the problems are spread out fairly evenly.
  • His experiments also reinforce Hand's point that it's actually better to define measures that explicitly use domain knowledge, rather than trying to achieve some objective measure of quality. Measures that were either point-based (not integrating over the entire range) or domain specific tended to work better.  
I'm not even close to describing the level of detail in his experiments: it's a really well-executed empirical study that should be a case study for anyone doing experimental work in the field. It's especially impressive because from personal experience I've found it to be REALLY HARD to do quality methodological studies in this area (as opposed to the "define algorithm, find toy data, profit" model that most DM papers seem to follow).

At a deeper level, the pursuit of objective comparisons that can be reduced to a single number seems fundamentally misguided to me. First of all, we know that precise cost functions are often the wrong way to go when designing algorithms (because of modelling issues and uncertainty about the domain). Secondly, we know that individual methods have their own idiosyncrasies - hence the need for 'meta' methods. And finally, we're seeing that even the meta-comparison measures have severe problems ! In some ways, we're pursuing 'the foolish hobgoblin of precision and objectivity' in an area where context is more important than we as mathematicians/engineers are used to.


Tuesday, December 20, 2011

Thoughts on ICDM I: Negative Results (part B)

Continuing where I left off on the idea of negative results in data mining, there was a beautiful paper at ICDM 2011 on the use of Stochastic Kronecker graphs to model social networks. And in this case, the key result of the paper came from theory, so stay tuned !

One of the problems that bedevils research in social networking is the lack of good graph models. Ideally, one would like a random graph model that evolves into structures that look like social networks. Having such a graph model is nice because
  • you can target your algorithms to graphs that look like this, hopefully making them more efficient
  • You can re-express an actual social network as a set of parameters to a graph model: it compacts the graph, and also gives you a better way of understanding different kinds of social networks: Twitter is a (0.8, 1, 2.5) and Facebook is a (1, 0.1, 0.5), and so on.
  • If you're lucky, the model describes not just reality, but how it forms. In other words, the model captures the actual social processes that lead to the formation of a social network. This last one is of great interest to sociologists.
But there aren't even good graph models that capture known properties of social networks. For example, the classic Erdos-Renyi (ER) model of a random graph doesn't have the heavy-tailed degree distribution that's common in social networks. It also doesn't capture densification: the fact that the network gets denser as it grows (edges accumulate faster than nodes), even as the diameter stays small or shrinks.

One approach to fixing this models a social network as a Stochastic Kronecker graph. You can read more about these graphs here: a simple way of imagining them is that you add an edge to the graph by a random process that does a (kind of) quad-tree-like descent down a partitioning of the adjacency matrix and places a 1 at a leaf. SKGs were proposed by Leskovec, Chakrabarti, Kleinberg and Faloutsos, and include ER graphs as a special case. They appear to capture heavy tailed degree distributions as well as densification, and have become a popular model used when testing algorithms on social networks. They're also used as the method to generate benchmark graphs for the HPC benchmark Graph500.
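
Here's what that descent looks like in code (a sketch in the R-MAT style of edge placement, with an illustrative 2x2 initiator matrix; real benchmarks use carefully fitted parameters):

    # Sample one edge of a stochastic Kronecker graph on 2**levels nodes by
    # descending the adjacency matrix one quadrant at a time.
    import random

    theta = [[0.9, 0.5],
             [0.5, 0.2]]                        # illustrative 2x2 initiator matrix
    total = sum(sum(row) for row in theta)

    def sample_edge(levels):
        row = col = 0
        for _ in range(levels):
            r = random.uniform(0, total)        # pick quadrant (i, j) with prob theta[i][j] / total
            for i in (0, 1):
                for j in (0, 1):
                    r -= theta[i][j]
                    if r <= 0:
                        row, col = 2 * row + i, 2 * col + j
                        break
                else:
                    continue
                break
        return row, col

    print([sample_edge(10) for _ in range(5)])  # five sampled edges in a 1024-node graph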

But a thorough understanding of the formal properties of SKGs has been lacking. In "An In-Depth Analysis of Stochastic Kronecker Graphs", Seshadhri, Pinar and Kolda show some rather stunning results. Firstly, they provide a complete analysis of the degree distribution of an SKG, and prove a beautiful result showing that it oscillates between having a lognormal and exponential tail. Their theorems are quite tight: plots of the actual degree distribution match their theorems almost perfectly, and convincingly display the weird oscillations in the degree frequencies (see Figure 2 in the paper).

Secondly, they also formally explain why a noisy variant of SKGs appears to have much more well-behaved degree distribution, proving that a slightly different generative process will indeed generate the desired distribution observed in practice.

Finally, they also show that the graphs generated by an SKG have many more isolated nodes than one might expect, sometimes up to 75% of the total number of vertices ! This has direct implications for the use of SKGs as benchmarks. Indeed, they mention that the Graph500 committee is considering changing their benchmarks based on this paper - now that's impact :)

What I like about this paper is that it proves definitive theoretical results about a popular graph model, and very clearly points out that it has significant problems. So any methodology that involves using SKGs for analysis will now have to be much more careful about the claims it makes.

p.s. There's also more supporting evidence on the lack of value of SKGs from another metric (the clustering coefficient, which measures how many configurations uv, uw also have the third edge vw). Real social networks have a high CC, and SKGs don't. This was first mentioned by Sala, Cao, Wilson, Zablit, Zheng and Zhao, and Seshadhri/Pinar/Kolda have more empirical evidence for it as well. (Disclaimer: I was pointed to these two references by Seshadhri: my opinions are my own though :))


Sunday, December 18, 2011

Thoughts on ICDM I: Negative Results (part A)

I just got back from ICDM (the IEEE conference on Data Mining). Data mining conferences are quite different from theory conferences (and much more similar to ML or DB conferences): there are numerous satellite events (workshops, tutorials and panels in this case), many more people (551 for ICDM, and that's on the smaller side), and a wide variety of papers that range from SODA-ish results to user studies and industrial case studies.

While your typical data mining paper is still a string of techniques cobbled together without rhyme or reason (anyone for spectral manifold-based correlation clustering with outliers using MapReduce?), there are some general themes that might be of interest to an outside viewer. What I'd like to highlight here is a trend (that I hope grows) in negative results.

It's not particularly hard to invent a new method for doing data mining. It's much harder to show why certain methods will fail, or why certain models don't make sense. But in my view, the latter is exactly what the field needs in order to give it a strong inferential foundation to build on (I'll note here that I'm talking specifically about data mining, NOT machine learning - the difference between the two is left for another post).

In the interest of brevity, I'll break this trend down in three posts. The first result I want to highlight isn't even quite a result, and isn't even from ICDM ! It's actually from KDD 2011 (back in August this year). The paper is Leakage in Data Mining: Formulation, Detection, and Avoidance, by Shachar Kaufman, Saharon Rosset, and Claudia Perlich, and got the Best Paper Award at KDD this year.

The problem they examine is "leakage", or the phenomenon that even in a "train model and test model" framework, it is possible for valuable information to "leak" from the test data or even from other sources to the training system, making a learned model look surprisingly effective and even giving it predictive powers beyond what it really can do. Obviously, the problem is that when such models are then applied to new data, their performance is worse than expected, compared to a model that wasn't "cheating" in this way.

They cite a number of examples, including many that come from the data challenges that have become all the rage. The examples highlight different kinds of leakage, including "seeing the future", cross contamination of data from other data sets, and even leakage by omission, where a well-intentioned process of anonymization actually leaks data about the trend being analyzed.

While there are no results of any kind (experimental or theoretical), the authors lay out a good taxonomy of common ways in which leakage can happen, and describe ways of avoiding leakage (when you have control over the data) and detecting it (when you don't). What makes their paper really strong is that they illustrate this with specific examples from recent data challenges, explaining how the leakage occurs, and how the winners took advantage of this leakage explicitly or implicitly.


There are no quick and dirty fixes for these problems: ultimately, leakage can even happen through bad modelling, and sometimes modellers fail to remove leaky data because they're trying to encourage competitors to build good predictive models. Ironically, it is this encouragement that can lead to less predictive models on truly novel data. But the paper makes a strong argument that the way we collect data and use it to analyze our algorithms is fundamentally flawed, and this is especially true for the more sophisticated (and mysterious) algorithms that might be learning models through complex exploitation of the trained data.

It remains to be seen whether the recommendations of this paper will be picked up, or if there will be more followup work along these lines. I hope it does.

Tuesday, November 01, 2011

Large-Data Clustering Part I: Clusters of clusters

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

I don't know why, but I've been struggling with a series of posts on large-data clustering for a while now. Distilling out the key ideas in a way that is intuitive and yet rigorous has been trickier than I thought. But here goes ...

Let's consider a prototypical clustering algorithm: $k$-means. In a $k$-means iteration, you have to scan each point and determine its nearest cluster center. Then you have to organize all the points that share a cluster center and compute a new one by averaging. Each iteration takes $O(nk)$ time, and this isn't even accounting for high dimensions.
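
Here's that iteration spelled out (a numpy sketch on synthetic points), just to make the $n \times k$ distance computations visible:

    # One Lloyd iteration of k-means: n*k distance evaluations, then k averages.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))                       # n points
    centers = X[rng.choice(len(X), 5, replace=False)]    # k initial centers, chosen from the data

    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # n x k distance table
    labels = dists.argmin(axis=1)                        # assign each point to its nearest center
    centers = np.array([X[labels == j].mean(axis=0) for j in range(len(centers))])  # recompute centers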

This algorithm doesn't scale well to large data, needless to say. But clustering, like most data mining tasks, is needed most acutely when we're dealing with large data. So what do we do ?

In the next series of posts, I'll try to walk you through some of the key principles that get used when designing large data clustering algorithms. As always, the twin principles of sampling and streaming will be the keys to the kingdom.

Clusters can be formed by clustering clusters.

One of the ideas that makes large-data clustering tractable is that you can cluster clusters (say that three times quickly, will ya ?). $k$-means is a particularly good example of this. I can think of a cluster center as a "representative weighted point" and discard the data associated with it. This is critical, because now I can just worry about the cluster centers themselves, and repeat.

The advantage of this is locality. I can take small chunks of data and cluster them efficiently. Then I can combine the cluster reps from each chunk to get a smaller set of reps, and so on.  Many of the "standard" large-data clustering methods (BIRCH, CLARA, CLARANS and the like) are based on this principle. As a concept, it's sound. But does it guarantee anything ? 

The catch of course is error propagation. How do I make sure that in this repeated (hierarchical) clustering process, I didn't commit to a bad clustering  early on because of locality ? It seems like I could pay a lot of error at each level, limiting my ability to group data sets into small chunks (because small chunks mean many levels).

It turns out that there are clever ways around this problem by careful merging. The first example of this is a paper from 1997 by Charikar, Chekuri, Feder and Motwani. They look at the $k$-center problem in a general metric space (minimize the maximum radius) and their technique illustrates this principle well.

Their algorithm is quite elegant, and works in phases. In each phase, they first merge cluster centers together to ensure that all resulting cluster centers are far apart. Then they insert new points into clusters as long as they don't get too big, or create new clusters. This happens as long as the number of clusters stays below $k$. When it crosses $k$, the phase ends, and then the process repeats.

Note that the algorithm reads in points and processes them on the fly in a single "pass". The reason it works in spite of the repeated mergings is because of a nifty property of the $k$-center algorithm (which lies at the heart of the Gonzalez 2-approximation for $k$-center):
If you can produce $k+1$ points that are all at distance at least $d$ from each other, then the optimal $k$-clustering has diameter at least $d$.
This is because of a 'quantitative pigeonhole principle': At least two of these points must be in a single cluster, which then has diameter at least $d$.
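
The Gonzalez heuristic itself is only a few lines (a sketch for points under Euclidean distance): it repeatedly picks the point farthest from the centers chosen so far, and its first $k+1$ picks are exactly the mutually far-apart witnesses that the property above needs.

    # Farthest-point traversal (Gonzalez): a 2-approximation for k-center.
    import numpy as np

    def gonzalez(X, k):
        centers = [0]                                    # start from an arbitrary point
        d = np.linalg.norm(X - X[0], axis=1)             # distance of each point to its nearest center
        while len(centers) < k:
            far = int(d.argmax())                        # the point farthest from all current centers
            centers.append(far)
            d = np.minimum(d, np.linalg.norm(X - X[far], axis=1))
        return centers, d.max()                          # chosen centers and the resulting radius

    X = np.random.default_rng(0).normal(size=(500, 2))
    print(gonzalez(X, 5))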

There are some tricks needed to figure out how much you can expand each cluster when inserting, and how far apart cluster centers should be in the phase after the current one. I'll spare you the details, but they lead to an 8-approximation, which can be improved using a neat randomization trick to a $2e$-approximation.



The above trick works because we can build a witness structure that lower bounds OPT, and so the merging can be done carefully so that error doesn't propagate badly. What happens if you change the cost function to the $k$-median though, where the witness isn't as straightforward ?

An excellent source here is the work by Guha, Meyerson, Mishra, Motwani and O'Callaghan. This paper, which combines prior work on both the theory and practice of $k$-median clustering, presents the following idea:
Break the data set into pieces. Cluster each set "loosely", using many more than $k$ clusters. Weight each cluster by the number of points assigned to it. Cluster these weighted cluster centers to find the desired $k$ clusters.
The key observations they make are:
  1. The total cost of clustering the pieces isn't too much larger than the overall clustering cost (so it gives a good lower bound). This follows from the triangle inequality.
  2. The new "weighted" instance admits a solution that again is not too much larger than the optimal cost (is actually within a factor of 6 for any partition). This again follows from the triangle inequality.
The last piece of this puzzle is: how to cluster each set "loosely" ? They propose using a bicriterion method, whose cost is compared to the optimal $k$-median cost, but is allowed to use more medians. Such bicriterion methods are often easier to design.

Note that there's nothing stopping them from using their own algorithm recursively to cluster the weighted cluster centers. What's neat about this is that it can be adapted to a streaming setting using another standard trick.

Specifically, if you have an algorithm that breaks things into pieces and then recombines them (a two-level scheme), you can first recursively apply it to get a multi-level hierarchical scheme, and then you can implement this in a streaming setting by updating (potentially) an entire side of the computation tree whenever a new item appears. This can also be viewed as a lazy implementation of the hierarchical strategy.
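Here's a generic sketch of that standard trick (the merge-and-reduce pattern in the abstract, not the specific algorithm from the paper): summaries live at levels like the digits of a binary counter, a new chunk triggers at most one cascade of merges, and only logarithmically many summaries are alive at any time.

```python
def merge_and_reduce(chunks, reduce_fn):
    """Streaming hierarchical clustering skeleton.

    `chunks` yields lists of (point, weight) pairs; `reduce_fn` maps a weighted
    point set to a smaller weighted summary (e.g. a bicriterion clustering)."""
    levels = {}                                  # level -> one pending summary
    for chunk in chunks:
        summary, lvl = reduce_fn(chunk), 0
        while lvl in levels:                     # cascade: merge equal-level summaries
            summary = reduce_fn(levels.pop(lvl) + summary)
            lvl += 1
        levels[lvl] = summary
    # Final step: reduce whatever summaries remain into one answer.
    leftovers = [p for s in levels.values() for p in s]
    return reduce_fn(leftovers)
```

Plugging in the "loose" weighted clustering described above as the reduce step gives roughly the streaming $k$-median scheme, modulo the error analysis and the exact choice of parameters.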

Wednesday, September 07, 2011

ECML-PKDD Conference Report: Part II

My student Parasaran Raman continues his conference report from ECML-PKDD in Athens. For his first post, see here.

Christopher Bishop gave an excellent plenary talk on the first day about the need to embrace uncertainty. He talked about the Kinect technology that uses an infra-red camera and sensors to measure the distance of a person from the screen. Instead of modeling the problem as motion tracking, the folks at MSR approached the challenge as a pattern recognition problem: each frame is cold-started and understood as a picture on its own, rather than tracking motion across a video. The goal is then to classify sets of pixels enclosed inside an axis-aligned rectangle. The training data is collected as a series of motion-capture videos, by having test subjects wear sensors on the body a la Hollywood animation, and is combined with some synthetic data, corrupted artificially since the synthetic examples are "too pure". The training set consists of a whopping 1 million examples of body positions.

They then use a decision tree to classify the pixels into one of 31 predefined body parts by maximizing an information gain function. Another sub-goal that Bishop talked about was to identify human joints. For this, they use a kind of mean-shift technique, applying a simple Gaussian smoothing function on joints.

The second half of the talk revolved around model-based ML, where Bishop recommends constructing the model equivalent of an algorithm. He said, and I quote, "Don't look at the algorithms; look at the models". He talked about an application where the goal is to rank players who play each other in a common environment. They have a new rating system, TrueSkill, that extends the popular Elo rating system [which typically only works for two-player games like chess] to multi-player team games. TrueSkill performs a kind of Bayesian ranking inference on a factor graph by local message passing.

He also introduced Infer.NET, a framework for running Bayesian inference in graphical models.
[http://research.microsoft.com/en-us/um/cambridge/projects/infernet/]. This looks like a cool tool to play with, especially since it provides a single platform for classification and clustering problems.

The plenary talk on day two was by Rakesh Agrawal. In a talk focused on a very interesting application, Rakesh introduced the idea of using data mining to enrich education. The key idea is to take concepts from textbooks, map each one to a relevant Wikipedia article, and then induce relationships between concepts.

The goal is to enrich sections of the textbooks by providing recommendations for illustrative pictures to go with the text. Dispersion [a measure of how unrelated the concepts are] and syntactic complexity play a big role in deciding "which picture should go where". Since search engines fail when given long keyword queries, one challenge is to find the key concepts in a section that can then be matched to the most similar Wikipedia article. The underlying assumption is that most high school textbook sections will have an associated Wikipedia article.

Rakesh showed results from applying this to a huge corpus from NCERT, which drafts the high school textbooks used by most schools in India. He also talked about computing an immaturity score to see whether a section is well written, and not surprisingly, subjects like physics and chemistry scored over commerce and economics!

To summarize, two things go into solving the problem of augmenting textbooks with images: "Comity" [which identifies lots of short key queries by concatenating two or three important concepts at a time] and "Affinity". Comity ranks images by computing a relevance score for each image in the context of the concepts picked out from a section, while affinity identifies relevant webpages and computes an importance score. Another challenge is that these techniques cannot operate one section at a time, since the same image should probably not repeat across a textbook!

Tuesday, September 06, 2011

ECML-PKDD conference report: Part I

My student Parasaran Raman is in Athens attending ECML-PKDD and associated workshops, and will be updating this blog regularly with details from the conference. His posts are edited lightly for format and to add links.


I am in Athens [+1 to location-driven research] to attend ECML PKDD 2011, and today I am at the second MultiClust workshop, held in conjunction with the main conference.

The first talk of the morning was an invited talk by Michael Houle on "Combinatorial approaches to clustering". Michael Houle began the talk with how, in high dimensions, all our intuitions about space go for a toss. One example he gave illustrated how most of the points in a high-dimensional data set are concentrated near the boundary. The talk revolved around how similarity measures between partitions cannot be fully trusted: a range of combinatorial and spatial measures have been introduced, but each one suffers from some kind of drawback. He also talked about constructing secondary similarity measures ["Secondary similarity between two points v and w is defined in terms of data objects in the common intersection of neighborhoods based at v and w, where the neighborhoods themselves are determined according to a supplied primary similarity measure"] and how they can get around the curse of dimensionality.
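For intuition, here is one very simple instance of a secondary similarity (the shared-nearest-neighbor construction; this is my own illustration, not necessarily the construction Houle uses): the primary distance only supplies neighbor rankings, and the secondary similarity of two points is the overlap of their k-nearest-neighbor sets.

```python
import numpy as np

def shared_neighbor_similarity(D, k):
    """Secondary similarity from a primary distance matrix D (n x n, zero diagonal)."""
    n = D.shape[0]
    # Each point's k nearest neighbors under the primary distance (excluding itself).
    knn = [set(np.argsort(D[i])[1:k + 1]) for i in range(n)]
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            S[i, j] = len(knn[i] & knn[j]) / k   # fraction of shared neighbors
    return S
```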

One interesting question that was thrown out was "Can we do a fully automated clustering?", i.e., can we convert similarity functions into rankings on the neighbors, assume no knowledge of the distribution of the data, and automatically pick 'k'? The talk moved on to approaches for computing the quality of a partition via a significance score for each point, which depends on how a member of a cluster correlates with the members of neighboring clusters and with the other members of its own cluster, along with the cluster size and the feature set. Once we know an object's significance, we can use it to change the shape of a cluster by adding or removing points.

Bart Goethals gave another invited talk, titled "Cartification: from similarities to item set frequencies". He talked about computing the k-nearest neighbors of every data point in order to find the right cluster centers: the idea is that true 'centers' occur in many k-NN baskets.
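The intuition is easy to prototype (this is a sketch of the intuition only, not of Goethals's actual method): build each point's k-NN "basket" and count how often every point appears across baskets; frequently occurring points are candidate centers.

```python
import numpy as np

def knn_basket_counts(D, k):
    """Count how often each point appears in other points' k-NN baskets."""
    n = D.shape[0]
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        basket = np.argsort(D[i])[1:k + 1]       # k nearest neighbors of point i
        counts[basket] += 1
    return counts                                # high counts ~ likely cluster centers
```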

There were many interesting ideas out of the other talks at the workshop. I will highlight a few of them:

  1. One of the first talks, on "Subjectively interesting alternate clusters", used an information-theoretic framework to find interesting alternate clusterings.
  2. Jilles Vreeken gave a very good talk about current approaches in pattern mining and how we can use that knowledge in data mining. In this talk, titled "where pattern met subspace cluster", he highlighted the similarities between subspace clustering and pattern mining, where the goal is to find informative local structures.
  3. The talk on "Factorial clustering" was about using latent variables to generate multiple clusterings. One question the speakers wanted an answer to was how to compute the right distance between partitions, and I said, "Hey! Use ours!"
  4. Another talk, on "Browsing clustering alternatives", outlined constructing a Merge-Split tree structure on the space of partitions to enable user-driven browsing of partitions.
My talk was the final one of the session. I talked about some recent work with Suresh Venkatasubramanian and Jeff Phillips on exploring the landscape of clusterings by sampling from the space of partitions in proportion to a quality measure. It was generally good to see a lot of head-nods during the talk. A couple of people from industry agreed that multi-clusterable data exists and that the hard part is finding it, so presenting a landscape to pick from looked like a useful idea!

The day ended with a rooftop reception with a magnificent view of the Acropolis and the temple of Zeus! I had very useful dinner conversations with Michael Houle and Thomas Seidl about general problems in the area of meta-clustering and how to compute distance metrics on partitions. We chatted about combinatorial distances, spatial distances, and secondary-level distances, where one constructs a ranking on top of an existing distance between the members.
