Saturday, April 30, 2011

SDM 2011: Day 2

This is part II of Parasaran Raman's conference report from SDM 2011. His earlier report is here.

I was anxious about my first talk at a conference. It did help that I was scheduled to be the second speaker in the morning; that is when you are fresh with all the caffeine running through your system. I thought my talk went well, thanks to the many practice sessions my advisor and other people in my lab coerced me into doing! I talked to a few meta-clustering researchers here and they seemed to buy/appreciate our idea of a new spatially-aware metric and consensus procedure that we proposed in our paper.

There were a couple of other interesting talks in the clustering session today. Geng Li introduced a new method called ABACUS to mine arbitrarily shaped clusters in the data. They essentially collapse points until a "spine" of each cluster emerges. After this, since they only have to deal with a smaller set of points, the technique is scalable and can work with large datasets. And yeah, as expected, the Swiss roll dataset popped out!

Claudia Plant talked about a new approach to robust correlation clustering called SONAR. Her technique works in 3 steps: collect signals by probing the data with different primitive clusters, extract response patterns, and identify true clusters with techniques like Minimum Description Length. It was a nice talk and I made a mental note to read this paper more carefully.

Next up was the best student paper talk, by Pu Wang, on Nonparametric Bayesian Co-clustering Ensembles. This is the most recent work in a series of clustering-ensembles papers by Wang, Domeniconi, and Laskey. The paper tries to address the issues of clustering bias, parameter setting, and the curse of dimensionality through clustering ensembles and co-clustering. This line of research looks interesting, at least in the kinds of issues it is trying to solve.

Later at lunch, I met a few data mining people interested in general meta aspects of clustering, like subspace clustering, generating alternate clusterings, and computing consensus solutions. Thomas Seidl, Stephan Günnemann and Ines Färber from RWTH Aachen University, and Emmanuel Müller, who is currently at the Karlsruhe Institute of Technology, are giving a tutorial tomorrow on "Discovering Multiple Clustering Solutions". We had a very interesting conversation about the challenges facing researchers who ask these meta questions about clustering, including my favorite: "How do we obtain/generate data that will admit multiple good partitions?". These folks are also organizing the second MultiClust workshop, in conjunction with ECML-PKDD in Athens; the first MultiClust workshop happened at KDD last year. So, let's get down to "clustering the clusters in the clusterings!".

Friday, April 29, 2011

SDM 2011: Day 1

My student Parasaran Raman and I are at the SIAM Conference on Data Mining (SDM) 2011, where he's presenting this paper on "spatially aware" consensus. In between preparation for his very first conference talk (yay!) he had enough time to jot down his thoughts on day 1. As usual, I've done light editing for formatting, and take responsibility for all errors in transmission. Note that SIAM hasn't yet placed papers online, and so when available I'll add links.

Collecting real data has always been a real problem for me. All the experiments with synthetically generated data, and shouting "bingo!" after getting good results on them for the meta-clustering problems that I work on, were getting too boring. SDM11 had a tutorial on "Data Mining for Healthcare Management", and I was interested to see what kind of healthcare-related data I could get to play with, especially because almost all of the papers that use consensus clustering techniques on real data have been motivated by analysis of either microarray gene expression or protein sequences. Prasanna Desikan of Allina Hospitals and Clinics talked about how terabytes of data generated by multiple sources drive the need for efficient knowledge-based systems. It was heartening to see many publicly available datasets that require mining. The key challenges from a strictly CS perspective are the usual ones: (1) dealing with missing data/attributes and (2) outliers and noisy data, both of which are even more pronounced in the healthcare sector.

I also sat through a few talks in the classification session. Byron Wallace talked about involving multiple experts in active learning, and asking "who should label what?". The talk focused on how a single infallible oracle is impractical, and stressed the need for multiple experts. The key idea was to save the hard questions, which are usually few, for the experienced (and expensive) labelers, and let the inexperienced labelers do the bulk of the work when the questions are easy. An interesting question that came up was about the feasibility of partitioning the question set into easy and hard.

Sanjay Chawla talked about supervised learning for skewed data, where classes of different sizes cause the hyperplane separating the classes to be pushed towards the class with fewer points (ed: the imbalanced classification problem). The proposed solution was to use the quadratic mean of the per-class losses instead of the arithmetic mean. The resulting loss function is convex and can be minimized by standard convex solvers.
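The quadratic-mean idea is easy to see on a toy example. Below is a minimal numpy sketch of my own (not the paper's code, and the per-class hinge loss is just one illustrative choice): the quadratic mean is never below the arithmetic mean, and it penalizes a gap between the two per-class losses, so a classifier that sacrifices the minority class looks worse under it.

```python
import numpy as np

def per_class_hinge(w, b, X, y):
    """Mean hinge loss computed separately for each class (y in {-1, +1})."""
    losses = np.maximum(0.0, 1 - y * (X @ w + b))
    return np.array([losses[y == c].mean() for c in (-1, 1)])

def arithmetic_mean_loss(class_losses):
    return class_losses.mean()

def quadratic_mean_loss(class_losses):
    # Square root of the mean of squares: never below the arithmetic mean,
    # and larger when the per-class losses are unequal.
    return np.sqrt((class_losses ** 2).mean())

# A classifier that fits the majority class well but ignores the minority
# class looks fine under the arithmetic mean, but worse under the quadratic mean.
imbalanced = np.array([0.05, 0.90])   # hypothetical per-class losses
balanced = np.array([0.475, 0.475])   # same arithmetic mean, no gap
print(arithmetic_mean_loss(imbalanced), arithmetic_mean_loss(balanced))  # equal
print(quadratic_mean_loss(imbalanced), quadratic_mean_loss(balanced))    # ~0.64 vs 0.475
```

Minimizing the quadratic mean therefore pulls the optimizer towards solutions where both classes are fit comparably well.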

Liang Sun spoke about a boosting framework to learn the optimal combination of individual base kernel classifiers. The algorithm goes through multiple boosting trials. During each trial, one classifier is learned for each kernel. The 'multi-kernel' classifier at each stage of the boosting process is a function of these input classifiers; their choice is to pick the best one. The boosting process itself is otherwise similar to AdaBoost.
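As a rough illustration of the per-round selection (a sketch under my own assumptions, not the authors' algorithm), here is an AdaBoost-style loop where each round fits one weak learner per "view" -- decision stumps on different features standing in for classifiers built on different kernels -- and keeps the one with the lowest weighted error:

```python
import numpy as np

def stump_fit(X, y, w, feature):
    """Best threshold stump on one feature under sample weights w."""
    best = (np.inf, None, None)  # (weighted error, threshold, polarity)
    for t in np.unique(X[:, feature]):
        for pol in (1, -1):
            pred = np.where(X[:, feature] <= t, pol, -pol)
            err = w[pred != y].sum()
            if err < best[0]:
                best = (err, t, pol)
    return best

def boost(X, y, views, rounds=10):
    """Each round, fit one stump per view and keep the best (lowest
    weighted error); then do the standard AdaBoost reweighting."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(rounds):
        fits = [(f,) + stump_fit(X, y, w, f) for f in views]
        feature, err, t, pol = min(fits, key=lambda z: z[1])
        err = max(err, 1e-12)                    # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)    # AdaBoost classifier weight
        pred = np.where(X[:, feature] <= t, pol, -pol)
        w *= np.exp(-alpha * y * pred)           # upweight the mistakes
        w /= w.sum()
        ensemble.append((alpha, feature, t, pol))
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(X[:, f] <= t, p, -p) for a, f, t, p in ensemble)
    return np.sign(score)
```

The only difference from plain AdaBoost is the inner `min` over views; swapping the stumps for kernel classifiers recovers the structure of the talk's method.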

There was also a lightning session and, true to its name, there were 25 talks in 75 minutes! I could relate to a meta-clustering work by Yi Zhang and Tao Li where the goal was to find a k-consensus solution from a bunch of input partitions of the same data. The method works in four steps: (1) generate multiple input partitions, (2) use the Mallows distance (Zhou et al.) to compute distances between partitions, (3) build a hierarchy over the partitions and cut it to get a 'clustering' of partitions, and (4) generate a consensus partition for each 'cluster' in the cut. I have found it difficult to find data that admits multiple partitions that are simultaneously good and different from each other, which made me ask "How do you generate the different partitions?". The answer, sadly, was "We run kmeans 38 times!". I am still looking for a motivation for multiple alternate partitions.
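The four steps are easy to mock up. The numpy-only sketch below is my own toy version, not the authors' code: it substitutes a pair-counting disagreement for the Mallows distance, greedy single-linkage merging for the hierarchy, and a medoid partition for the per-group consensus.

```python
import numpy as np

def kmeans(X, k, rng, iters=20):
    """Tiny Lloyd's algorithm, standing in for any base clusterer."""
    C = X[rng.choice(len(X), k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                C[j] = X[labels == j].mean(axis=0)
    return labels

def partition_distance(a, b):
    """Fraction of point pairs co-clustered in one partition but not the
    other -- a simple stand-in for the Mallows distance."""
    sa = a[:, None] == a[None, :]
    sb = b[:, None] == b[None, :]
    n = len(a)
    return (sa != sb).sum() / (n * (n - 1))

def k_consensus(X, n_runs=8, k=2, n_groups=2, seed=0):
    rng = np.random.RandomState(seed)
    # (1) generate multiple input partitions via restarts of the base clusterer
    parts = [kmeans(X, k, rng) for _ in range(n_runs)]
    m = len(parts)
    # (2) pairwise distances between the partitions
    D = np.array([[partition_distance(parts[i], parts[j])
                   for j in range(m)] for i in range(m)])
    # (3) 'hierarchy': greedy single-linkage merging, cut at n_groups groups
    groups = [{i} for i in range(m)]
    while len(groups) > n_groups:
        i, j = min(((a, b) for a in range(len(groups))
                    for b in range(a + 1, len(groups))),
                   key=lambda ab: min(D[p, q] for p in groups[ab[0]]
                                      for q in groups[ab[1]]))
        groups[i] |= groups[j]
        del groups[j]
    # (4) one consensus per group -- here, simply the medoid partition
    out = []
    for g in groups:
        idx = sorted(g)
        medoid = idx[int(np.argmin([sum(D[p, q] for q in idx) for p in idx]))]
        out.append(parts[medoid])
    return out
```

Even this crude version makes the paper's point visible: if the base clusterer keeps finding the same structure (as with 38 runs of k-means on unambiguous data), the hierarchy in step (3) has nothing interesting to cut.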

The reception had just cheese and a few veggies, but I have been having some good food here in Phoenix and I am not complaining.

Tuesday, April 19, 2011

POTD: Witnessed k-distance

Distance function to a compact set plays a central role in several areas of computational geometry. Methods that rely on it are robust to the perturbations of the data by the Hausdorff noise, but fail in the presence of outliers. The recently introduced distance to a measure offers a solution by extending the distance function framework to reasoning about the geometry of probability measures, while maintaining theoretical guarantees about the quality of the inferred information. A combinatorial explosion hinders working with distance to a measure as an ordinary (power) distance function. In this paper, we analyze an approximation scheme that keeps the representation linear in the size of the input, while maintaining the guarantees on the inference quality close to those for the exact (but costly) representation.

The idea of defining distances between (or to) shapes robustly by replacing the shape by a distribution is near to my heart (see this paper for more). The authors provide an algorithmic perspective on a formulation first proposed by Chazal, Cohen-Steiner and Mérigot. The idea is that instead of constructing a shape from a point cloud by growing balls of fixed radius and measuring the distance to this union of balls, one constructs a measure by growing balls out to a fixed measure, and then defining a distance. The distance measure has the nice property of being Lipschitz-related to the EMD, making it stable. In a manner reminiscent of the Aurenhammer et al. work, they relate this distance to a power diagram construction, and then design efficient approximations for it (because the exact version involves terms exponential in the thresholding parameter).
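For the uniform empirical measure on a point cloud, "growing balls out to a fixed measure" boils down to averaging squared distances to the k nearest neighbors, with k set by the mass parameter. A minimal sketch (my own illustration, not the paper's algorithm) shows why this is robust to outliers where the ordinary distance function is not:

```python
import numpy as np

def distance_to_measure(x, P, m):
    """Distance to the uniform empirical measure on the point cloud P at
    mass parameter m in (0, 1]: average the squared distances from x to
    its k = ceil(m * |P|) nearest neighbors, then take a square root."""
    k = max(1, int(np.ceil(m * len(P))))
    sq = ((P - x) ** 2).sum(axis=1)
    return np.sqrt(np.partition(sq, k - 1)[:k].mean())

# A tight cluster near the origin plus one outlier far away.
rng = np.random.RandomState(0)
P = np.vstack([rng.randn(10, 2) * 0.1, [[10.0, 0.0]]])
x = np.array([10.0, 0.0])
# The ordinary distance function to the set is 0 at the outlier, but the
# distance to the measure stays large: the outlier carries mass only 1/11.
print(np.sqrt(((P - x) ** 2).sum(axis=1).min()))  # 0.0
print(distance_to_measure(x, P, 0.3))             # large (roughly 8-9 here)
```

The combinatorial explosion the paper tackles comes from treating this as a power distance over all k-subsets; the witnessed variant keeps only k-tuples actually realized by the input points.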

It seems to me that they're going to a lot of work to recover from what I think is a tactical problem: fixing a sharp threshold for the ball measure. It might be interesting to explore alternate ways of defining the measure that are "smoother" - maybe using a kernel instead of a hard measure boundary. It might yield an alternate measure that serves the same purpose but is easier to compute.

Sunday, April 10, 2011

POTD: A unified approach to Approximate Proximity searching

A Unified Approach to Approximate Proximity Searching
Sunil Arya, Guilherme D. da Fonseca, and David M. Mount

The inability to answer proximity queries efficiently for spaces of dimension d > 2 has led to the study of approximation to proximity problems. Several techniques have been proposed to address different approximate proximity problems. In this paper, we present a new and unified approach to proximity searching, which provides efficient solutions for several problems: spherical range queries, idempotent spherical range queries, spherical emptiness queries, and nearest neighbor queries. In contrast to previous data structures, our approach is simple and easy to analyze, providing a clear picture of how to exploit the particular characteristics of each of these problems. As applications of our approach, we provide simple and practical data structures that match the best previous results up to logarithmic factors, as well as advanced data structures that improve over the best previous results for all aforementioned proximity problems.
When a problem becomes interesting, papers get written quickly, and techniques start pouring out of the firehose. Not all tricks are needed, and not all machinery is effective, but cleanup only comes later, once the frenzy has settled. Approximate nearest neighbor research is like this: there are many tricks, and lots of machinery, but there are also some core ideas that keep showing up, and are sufficient for many variants of the problem.

Near-neighbor searching in low dimension is "easy" if you're given data that's uniformly sampled. Simple divide-and-conquer will give balanced search trees, and thus low query times. The problem comes when you have regions of sparsity. Informally, they prevent you from making progress as you divide the space up, and so the root-to-leaf length increases, increasing query time.
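A toy experiment (my own illustration, not from any of these papers) makes the point concrete: a plain point quadtree stays shallow on uniformly distributed data, but data concentrated in tight clusters separated by large empty regions forces long root-to-leaf paths, since many levels of subdivision make no progress.

```python
import random

def quadtree_depth(points, box=(0.0, 0.0, 1.0, 1.0), leaf=1):
    """Depth of a simple (uncompressed) point quadtree over `points`."""
    if len(points) <= leaf:
        return 0
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    quads = {q: [] for q in range(4)}
    for (x, y) in points:
        quads[(x >= mx) + 2 * (y >= my)].append((x, y))
    boxes = [(x0, y0, mx, my), (mx, y0, x1, my),
             (x0, my, mx, y1), (mx, my, x1, y1)]
    return 1 + max(quadtree_depth(quads[q], boxes[q], leaf) for q in range(4))

random.seed(0)
uniform = [(random.random(), random.random()) for _ in range(256)]
# two tight clusters in opposite corners, separated by a large empty region
clustered = ([(0.001 * random.random(), 0.001 * random.random())
              for _ in range(128)]
             + [(1 - 0.001 * random.random(), 1 - 0.001 * random.random())
                for _ in range(128)])
print(quadtree_depth(uniform), quadtree_depth(clustered))
```

Compressing the chains of empty splits is exactly what the compressed quadtree in the POTD does.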

While the ANN problem in low dimensions is "solved", in that there's a fairly good theoretical understanding of the bounds and tradeoffs for query time vs storage, the methods themselves are quite complex. I learnt this first-hand when, in an attempt to adapt the method to a different setting, I tried to reverse-engineer the Har-Peled paper that uses approximate Voronoi diagrams for near-neighbor search.

The beauty of the POTD is that it starts with a very simple data structure: the compressed quadtree. It shows that this structure can be used to isolate "sparse" and "dense" regions of space, and uses a hybrid strategy for processing these regions, with coresets to reduce size in dense regions, and optimized data structures for sparse regions (which necessarily contain only a few points). While the paper itself has no experimental results, I'd imagine that this approach would lend itself far more easily to experimentation.

Thursday, April 07, 2011

MAD ALGOrithms

It's time for some large data sets! Once again, MADALGO is organizing MASSIVE, the workshop on massive data algorithmics. As usual, it will be colocated with SoCG in Paris (actually the day after, on June 16). In case you haven't noticed, big data is all the rage now!

The deadline is Apr 27, so get those 10-page submissions in, and see you in Paris!
