
Sunday, May 01, 2011

SDM 2011: Other thoughts

My student Parasaran Raman did a two-part review of SDM 2011 (here, and here), and I thought I'd add my own reflections.

There was an excellent talk on the second day by Haesun Park from Georgia Tech on non-negative matrix factorization (NMF) methods. This problem has become all the rage in ML circles, and is defined as follows.
Given an $m \times n$ matrix $A$ and a parameter $k$, find an $m \times k$ matrix $U$ and a $k \times n$ matrix $V$, both with only nonnegative entries, such that $\|A - UV\|_F$ is minimized.
The main difference between this problem and the SVD is the non-negativity requirement, and not surprisingly, it changes the complexity quite a bit: the problem is NP-hard in general, so you can no longer expect to compute an optimal solution.
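
To make the contrast with the SVD concrete, here is a minimal sketch (my own, not from the talk) of the classic Lee-Seung multiplicative-update heuristic, one of the standard strategies for attacking this problem; the function name, iteration count, and seed are illustrative choices.

```python
import numpy as np

def nmf_multiplicative(A, k, iters=200, eps=1e-9):
    """Lee-Seung multiplicative updates for min ||A - UV||_F with U, V >= 0."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    U = rng.random((m, k))
    V = rng.random((k, n))
    for _ in range(iters):
        # Each update is guaranteed not to increase the reconstruction error.
        V *= (U.T @ A) / (U.T @ U @ V + eps)
        U *= (A @ V.T) / (U @ V @ V.T + eps)
    return U, V

# Example: factor a random nonnegative 100 x 40 matrix with k = 5.
A = np.abs(np.random.randn(100, 40))
U, V = nmf_multiplicative(A, k=5)
print(np.linalg.norm(A - U @ V))
```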

There appears to be relatively little theory-algorithms work on this problem (there's some mathematical work, including the famous Perron-Frobenius theorem), and her talk presented a good overview of the various heuristic strategies people have used to try to solve the problem.

One of the reasons this formulation is interesting is that for many problems, you'd like an "explanation" of the data that doesn't have negative coefficients (negative coefficients being the bane of PCA-like methods). She also says that, for reasons unknown, the matrices produced by this formulation tend to be quite useful "in practice" at factoring out the different interactions present in the data. In fact, one of her big open questions is whether there's a more rigorous way of explaining this.

The talk was very enjoyable, and I sat there thinking that it would be perfect as an ALENEX (or even SODA) invited talk.

There was also a panel discussion on the future of data mining. The panelists comprised two industry folks, two academics, and one researcher from a national lab, with questions being thrown at them by Chandrika Kamath from LLNL. My twitter stream gave a blow-by-blow, so I won't rehash it here. I was intrigued (and a little disappointed) when I realized that almost all the answers centered around the needs of business.

In one respect this is not surprising: the primary reason why data mining is so hot right now is because of the vast opportunities for data mining to sell ads, products, or even just model consumer behavior. But it makes the field a little more shallow as a consequence: inevitably, industry needs solutions, not deep problems, and a relentless focus on solutions doesn't facilitate deeper analysis of the problems. Maybe that's the nature of data mining, and I'm expecting too much to ask it to be more profound. But I wouldn't mind a more ML-like sense of rigor in the formulation of computational questions.

Overall, I quite enjoyed the conference. I've been to KDD, and have been on PCs for both KDD and ICDM (the third major data mining conference). To me, SDM has the feel of a SODA/SoCG-like conference: a smaller, more academic crowd, more mathematics/statistics in the talks, and less of a big industrial presence. I can definitely see myself publishing there and going back again.

Saturday, April 30, 2011

SDM 2011: Day 2

This is part II of Parasaran Raman's conference report from SDM 2011. His earlier report is here.

I was anxious about my first talk at a conference. It did help that I was scheduled to be the second speaker in the morning; that is when you are fresh with all the caffeine running through your system. I thought my talk went well, thanks to the many practice sessions my advisor and other people in my lab coerced me into doing! I talked to a few meta-clustering researchers here and they seemed to buy/appreciate our idea of a new spatially-aware metric and consensus procedure that we proposed in our paper.

There were a couple of other interesting talks in the clustering session today. Geng Li introduced a new method called ABACUS to mine arbitrarily shaped clusters in the data. They essentially collapse points until a spine of each cluster emerges. After this, since they only have to deal with a smaller set of points, their technique is scalable and can work with large datasets. And yeah, as expected, the Swiss roll dataset popped up!

Claudia Plant talked about a new approach to robust correlation clustering called SONAR. Her technique works in 3 steps: collect signals by probing the data with different primitive clusters, extract response patterns, and identify true clusters with techniques like Minimum Description Length. It was a nice talk and I made a mental note to read this paper more carefully.

Next up was the best student paper talk by Pu Wang on Nonparametric Bayesian Co-clustering Ensembles. This is the most recent work in a series of clustering ensembles papers by Wang, Domeniconi and Laskey. The paper tries to address the issues of clustering bias, parameter setting and the curse of dimensionality through clustering ensembles and co-clustering. This line of research looks interesting, at least in terms of the types of issues they are trying to solve.

Later at lunch, I met a few data mining people interested in general meta aspects of clustering, like subspace clustering, generating alternate clusterings and computing consensus solutions. Thomas Seidl, Stephan Günnemann and Ines Färber from RWTH Aachen University, along with Emmanuel Müller, who is currently at Karlsruhe Institute of Technology, are giving a tutorial session tomorrow on "Discovering Multiple Clustering Solutions". We had a very interesting conversation about the challenges facing researchers who ask these meta questions about clustering, including my favorite: "How do we obtain/generate data that will admit multiple good partitions?". These folks are also organizing the second MultiClust workshop, in conjunction with ECML-PKDD in Athens; the first MultiClust workshop happened at KDD last year. So, let's get down to "clustering the clusters in the clusterings!"

Friday, April 29, 2011

SDM 2011: Day 1

My student Parasaran Raman and I are at the SIAM Conference on Data Mining (SDM) 2011, where he's presenting this paper on "spatially aware" consensus. In between preparations for his very first conference talk (yay!), he found enough time to jot down his thoughts on day 1. As usual, I've done light editing for formatting, and take responsibility for all errors in transmission. Note that SIAM hasn't yet placed the papers online; I'll add links when they become available.


Collecting real data has always been a real problem for me. Running experiments on synthetically generated data and shouting "bingo!" after getting good results on the meta-clustering problems that I work on was getting too boring. SDM11 had a tutorial on "Data Mining for Healthcare management", and I was interested to see what kind of healthcare-related data I could get to play with, especially because almost all of the papers that use consensus clustering techniques on real data have been motivated by the analysis of either microarray gene expression or protein sequences. Prasanna Desikan of Allina Hospitals and Clinics talked about how terabytes of data generated by multiple sources drive the need for efficient knowledge-based systems. It was heartening to see many publicly available datasets that require mining. The key challenges from a strictly CS perspective are the usual ones: (1) dealing with missing data/attributes, and (2) outliers and noisy data, both of which are even more pronounced in the healthcare sector.

I also sat through a few talks in the classification session. Byron Wallace talked about involving multiple experts in active learning and the question of "who should label what?". The talk focused on how a single infallible oracle is impractical, and stressed the need for multiple experts. The key idea was to save the hard questions, which are usually not many, for the experienced (and expensive) labelers, and to let the inexperienced labelers do the bulk of the work when the questions are rather easy. An interesting question that came up was about the feasibility of partitioning the question set into easy and hard.
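
As a deliberately simplified sketch of the routing idea (not the paper's actual allocation scheme): below, model confidence stands in for question difficulty, and the threshold and function name are invented for illustration.

```python
import numpy as np

def route_to_labelers(probs, hard_threshold=0.65):
    """probs: (n_queries x n_classes) class probabilities from the current model.
    Low-confidence queries are treated as 'hard' and routed to the few,
    expensive expert labelers; the rest go to cheaper, less experienced ones."""
    confidence = probs.max(axis=1)
    hard = np.flatnonzero(confidence < hard_threshold)   # indices for experts
    easy = np.flatnonzero(confidence >= hard_threshold)  # indices for novices
    return easy, hard
```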

Sanjay Chawla talked about supervised learning for skewed data, where classes of different sizes cause the hyperplane separating the classes to be pushed towards the class with fewer points (ed: the imbalanced classification problem). The proposed solution was to use a quadratic mean loss instead of the usual arithmetic mean. The resulting loss function is convex and can be minimized with standard convex solvers.
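
My reading of the idea, as a hedged sketch rather than the paper's exact formulation: replace the arithmetic mean of per-example losses with the quadratic mean of the two per-class average losses, so the minority class cannot be averaged away. The use of hinge loss and the function names below are illustrative choices.

```python
import numpy as np

def hinge(scores, y):
    # Per-example hinge loss; labels y are in {-1, +1}.
    return np.maximum(0.0, 1.0 - y * scores)

def arithmetic_mean_loss(scores, y):
    # The usual empirical risk: with skewed data it is dominated by the majority class.
    return hinge(scores, y).mean()

def quadratic_mean_loss(scores, y):
    # Quadratic mean of the two per-class average losses: the minority class
    # contributes on equal footing. This quantity is still convex in the scores.
    losses = hinge(scores, y)
    loss_pos = losses[y == +1].mean()
    loss_neg = losses[y == -1].mean()
    return np.sqrt((loss_pos**2 + loss_neg**2) / 2.0)
```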

Liang Sun spoke about a boosting framework to learn the optimal combination of individual base kernel classifiers. The algorithm goes through multiple boosting trials: during each trial, one classifier is learned for each kernel, and the 'multi-kernel' classifier for that stage of the boosting process is a function of these input classifiers; their choice is simply to pick the best one. The boosting process itself is otherwise similar to AdaBoost.
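
Here is a rough sketch of that general shape, not the authors' algorithm: an AdaBoost-style loop that, in each round, trains one base classifier per kernel on the weighted sample and keeps the best one. Using scikit-learn SVMs as the per-kernel base classifiers is my own stand-in.

```python
import numpy as np
from sklearn.svm import SVC

def boost_over_kernels(X, y, kernels=("linear", "rbf", "poly"), rounds=10):
    """AdaBoost-style loop: each round fits one classifier per base kernel on
    the weighted sample and keeps the one with the lowest weighted error."""
    n = len(y)                           # labels y are in {-1, +1}
    w = np.full(n, 1.0 / n)              # example weights
    ensemble = []                        # list of (alpha, fitted classifier)
    for _ in range(rounds):
        candidates = [SVC(kernel=kern).fit(X, y, sample_weight=w) for kern in kernels]
        errors = [w[clf.predict(X) != y].sum() for clf in candidates]
        best = int(np.argmin(errors))
        err = np.clip(errors[best], 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)                  # usual AdaBoost weight
        w *= np.exp(-alpha * y * candidates[best].predict(X))  # reweight examples
        w /= w.sum()
        ensemble.append((alpha, candidates[best]))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(alpha * clf.predict(X) for alpha, clf in ensemble))
```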

There was also a lightning session, and true to its name, there were 25 talks in 75 minutes! I could relate to a meta-clustering work by Yi Zhang and Tao Li, where the goal was to find a k-consensus solution from a bunch of input partitions of the same data. The method works in four steps: (1) generate multiple input partitions, (2) use the Mallows distance (Zhou et al.) to compute distances between partitions, (3) build a hierarchy over the partitions and cut it to get a 'clustering' of partitions, and (4) generate a consensus partition for each 'cluster' from the cut. I have found it difficult to find data that admits multiple partitions that are good but different from each other at the same time, and that made me ask the question "How do you generate the different partitions?". The answer, sadly, was "We run k-means 38 times!". I am still looking for a motivation for multiple alternate partitions.
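
As an illustration of the four-step pipeline (a sketch of how it might look, not the authors' implementation): I substitute 1 - adjusted Rand index for the Mallows distance and use a naive co-association consensus in step (4); all parameters are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def k_consensus(X, n_inputs=20, n_clusters=5, n_groups=3):
    # (1) Generate multiple input partitions (here: k-means with random restarts).
    partitions = [KMeans(n_clusters, n_init=1, random_state=i).fit_predict(X)
                  for i in range(n_inputs)]
    # (2) Pairwise distances between partitions; 1 - adjusted Rand index is a
    #     simple stand-in for the Mallows distance used in the paper.
    m = len(partitions)
    D = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            D[i, j] = D[j, i] = 1.0 - adjusted_rand_score(partitions[i], partitions[j])
    # (3) Build a hierarchy over the partitions and cut it into groups.
    groups = fcluster(linkage(squareform(D), method="average"),
                      t=n_groups, criterion="maxclust")
    # (4) One consensus partition per group, via a naive co-association matrix.
    consensus = []
    for g in np.unique(groups):
        members = [p for p, gi in zip(partitions, groups) if gi == g]
        co = np.mean([(p[:, None] == p[None, :]).astype(float) for p in members], axis=0)
        tree = linkage(squareform(1.0 - co, checks=False), method="average")
        consensus.append(fcluster(tree, t=n_clusters, criterion="maxclust"))
    return consensus
```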

The reception had just cheese and a few veggies, but I have been having some good food here in Phoenix and I am not complaining.
