Tuesday, September 06, 2011

ECML-PKDD conference report: Part I

My student Parasaran Raman is in Athens attending ECML-PKDD and associated workshops, and will be updating this blog regularly with details from the conference. His posts are edited lightly for format and to add links.

I am in Athens [+1 to location-driven-research] to attend ECML PKDD 2011 and I am at the second MultiClust workshop today. This workshop is held in conjunction with ECML PKDD 2011.

The first talk of the morning was a invited talk by Michael Houle on "Combinatorial approaches to clustering". Michael Houle began the talk with how in high dimensions, all out intuitions about space go for a toss. One example that he gave illustrated how most of the points are concentrated along the boundaries in high dimensions. The talk revolved around how similarity measures between partitions cannot be fully trusted though there have been a range of combinatorial and spatial measures that have been introduced, since each measure suffers from some kind of drawback. He also talked about constructing secondary similarity measures [“Secondary similarity between two points v and w is defined in terms of data objects in the common intersection of neighborhoods based at v and w, where the neighborhoods themselves are determined according to a supplied primary similarity measure”] and how they can get around the curse of dimensionality.

One interesting question that was thrown out was "Can we do a fully automated clustering?", i.e., can we convert similarity functions into ranking on the neighbors, assume no knowledge of distribution of the data, and automatically pick 'k'. The talk moved on to approaches towards computing quality of a partition by computing a significance score for each point which depends on the correlation between a member of the set to the members in neighboring sets and other members of the same set, along with the size of cluster and the feature set. Once we know an object's significance, we can use this to change the shape of a cluster, by adding points and removing them.

Bart Goethals gave another invited talk titled “Cartification: from similarities to item set frequencies”. He talked about doing a k-nearest neighbors for all data points to find the right centers of clusters. The idea is that true 'centers' occurs in many knn baskets.

There were many interesting ideas out of the other talks at the workshop. I will highlight a few of them:

  1. One of the first talks about “Subjectively interesting alternate clusters”, uses an information theoretical framework to find interesting alternate clusters.
  2. Jilles Vreeken gave a very good talk about the current approaches in pattern mining and how we can use that knowledge in data mining. On this talk titled, “where pattern met subspace cluster”, he highlighted the similarities between subspace clustering and pattern mining where the goal is to find informative local structures.
  3. The talk on “Factorial clustering” was about using latent variables to generate multiple clusterings. One question that they wanted an answer for was a way to compute the right distance between partitions and I said, "Hey ! Use ours!”.
  4. Another talk on “Browsing clustering alternatives" outlined constructing a Merge-Split tree structure on the space of partitions and enable a user driven browsing of partitions.
My talk was the final one of the session. I talked about some recent work with Suresh Venkatasubramanian and Jeff Phillips on exploring the landscape of clusterings by sampling from the space of partitions proportional to a quality measure. It was generally good to see a lot of head-nods during the talk. A couple of people from the industry who were present agreed to the fact that there exists data that is multi-clusterable and the hard part was finding them. So presenting a landscape where you can pick from looked like a useful idea!

The day ended with a rooftop reception with a magnificent view of the Acropolis and the temple of Zeus! I had very useful dinner conversations with Michael Houle and Thomas Seidl about general problems in the area of meta-clustering and on how to compute distances metrics on partitions. We chatted about combinatorial distances and spatial distances, and secondary level distances, where one constructs a ranking on top of a existing distance between many members.


  1. The size of the PC seems to exceed the number of papers!

  2. In non-theory conferences, the "PC" usually includes people you'd normally think of as external reviewers for a theory conference. You get the tag of being on the PC, which is why it's easy to pile up the PC memberships, but the numbers are similar if you look at the external reviewer count for STOC/FOCS.

    PC members can submit papers as well, and usually only review 8-10 papers. It's the area chairs/vice chairs who act like a theory PC (sort of)

  3. Very interesting post about clustering, I liked a lot the idea of fractals or subspace clusters.

    I think that the main problem with clustering methods is that in order to make the best possible grouping with a certain method, you need to, understand first the data you are working with, because in that way you have much more chances than when this data set you are grouping is not reaching satisfactory results (or they don't even make any logical seance) of having correct intuitions of what is possible to improve, or which areas/concepts might be causing this problems. This is specially important and difficult while using a certain amount of metrics that where created by you and also other researchers, which might be very difficult to have a decent idea of what kind of information can that summary of metrics derive into

    Also the method internals must be clearly understood, along with implications of parameters variations in the way the data is processed.


Disqus for The Geomblog