Ruminations on computational geometry, algorithms, theoretical computer science and life

## Friday, September 23, 2011

### Finding the right reference for finding the majority element

Probably everyone knows the standard trick for determining in one pass whether a stream contains an element occurring more than half the time. But what not everyone might know is the origin of this method.

Stuart Shieber writes about the algorithm (discovered by Boyer and Moore) and the mishaps it had on the way to publication. The article also describes how the follow-up paper by Misra and Gries came into being. Interestingly, I almost always see the Misra-Gries generalization cited, and not the original Boyer-Moore method.

(HT Fernando Pereira)
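For the record, the Boyer-Moore majority vote (and the Misra-Gries generalization it inspired) can be sketched in a few lines. This is a minimal rendering of the two algorithms, not the authors' original presentation:

```python
def majority(stream):
    """Boyer-Moore majority vote: one pass with O(1) state to find a
    candidate, then a verification pass. Returns the majority element
    (occurring more than half the time) or None if there isn't one."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate if stream.count(candidate) > len(stream) // 2 else None

def misra_gries(stream, k):
    """Misra-Gries generalization with k-1 counters: every element
    occurring more than n/k times survives as a key in the result
    (frequencies reported are only lower-bound estimates)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:  # decrement all counters; drop any that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

For example, `majority([1, 2, 1, 1, 3, 1, 1])` returns `1`, while `majority([1, 2, 3])` returns `None`; `misra_gries` with `k = 2` is exactly the Boyer-Moore candidate pass.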

## Wednesday, September 07, 2011

### ECML-PKDD Conference Report: Part II

*My student Parasaran Raman continues his conference report from ECML-PKDD in Athens. For his first post, see here.*

Christopher Bishop gave an excellent plenary talk on the first day about the need to embrace uncertainty. He talked about the Kinect technology, which uses an infra-red camera and sensors to measure the distance of a person from the screen. Instead of modeling the problem as motion tracking, the folks at MSR approached it as a pattern recognition problem: each frame is cold-started and understood as a standalone picture, rather than as one step of tracking in a video. The goal is then to classify sets of pixels enclosed inside an axis-aligned rectangle. The training data is collected as a series of motion-capture videos, with a test subject wearing sensors on the body Hollywood-animation style, and is combined with some synthetic data, artificially corrupted since the synthetic examples are “too pure”. The training set consists of a whopping

**1 million examples** of body positions.

They then use a decision tree to classify the pixels into one of 31 predefined body parts by maximizing an information gain function. Another sub-goal that Bishop talked about was to identify human joints. For this, they use a kind of mean-shift technique, applying a simple Gaussian smoothing function on joints.
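As a generic illustration of the split criterion mentioned above (not the actual MSR training pipeline), an entropy-based information gain for a binary decision-tree split looks like this, with hypothetical label lists standing in for the pixel data:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Reduction in entropy when `labels` is split into `left` + `right`.
    A decision-tree learner picks the split maximizing this quantity."""
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
```

A split that perfectly separates two classes, e.g. `information_gain(['head', 'head', 'hand', 'hand'], ['head', 'head'], ['hand', 'hand'])`, achieves the maximum gain of 1 bit.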

The second half of the talk revolved around model-based ML, where Bishop recommends constructing the model equivalent of an algorithm. He said, and I quote, “Don't look at the algorithms; look at the models”. He talked about an application where the goal is to rank players who play each other in a common environment. They have a new rating system, TrueSkill, that extends the popular Elo rating system [which typically only works in a two-player game like chess] to multi-player team games. TrueSkill performs a kind of Bayesian ranking inference on graphs by local message passing.
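TrueSkill's Bayesian message passing is fairly involved, so as a point of comparison here is just the classic Elo update it generalizes; the `k`-factor of 32 is a conventional choice, not something from the talk:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo rating update for a two-player game.
    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses."""
    # A's expected score under the Elo logistic model
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b
```

Two equally rated players each have expected score 0.5, so `elo_update(1500, 1500, 1)` moves the winner up by 16 points and the loser down by 16.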

He also introduced Infer.NET [http://research.microsoft.com/en-us/um/cambridge/projects/infernet/], a framework for running Bayesian inference in graphical models. This looks like a cool tool to play with, especially since it provides a single platform for classification and clustering problems.

The plenary talk on day two was by Rakesh Agrawal. In a talk focused on a very interesting application, Rakesh introduced the idea of using data mining to enrich education. The key idea is to take concepts from textbooks, map them to relevant Wikipedia articles, and then induce relationships between the concepts.

The goal is to enrich sections in the textbooks by providing recommendations for illustrative pictures to go with the text. Dispersion [a measure of how unrelated the concepts are] and syntactic complexity play a big role in deciding “which picture should go where”. Since search engines fail when given long queries, one challenge is to find key concepts in a section that one can then match to the most similar article in Wikipedia. The underlying assumption is that most high school textbook sections will have an associated Wikipedia article.
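As a toy illustration of matching a section to its most similar Wikipedia article (my own bag-of-words sketch, not the system Rakesh described):

```python
from collections import Counter
from math import sqrt

def cosine(words_a, words_b):
    """Cosine similarity between two bags of words."""
    ca, cb = Counter(words_a), Counter(words_b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (sqrt(sum(v * v for v in ca.values()))
            * sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def best_article(section_words, articles):
    """Return the name of the article (name -> word list) whose
    bag of words is most similar to the section's key concepts."""
    return max(articles, key=lambda name: cosine(section_words, articles[name]))
```

For instance, a physics section whose extracted concepts are `["force", "mass", "acceleration"]` would match an article about Newton's laws rather than one about market economics. A real system would use tf-idf weights rather than raw counts, but the matching step has this shape.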

Rakesh showed results from applying this to a huge word corpus from NCERT, which drafts the high school textbooks used by most schools in India. He also talked about computing an immaturity score to see if a section is well-written, and not surprisingly, subjects like physics and chemistry scored over commerce and economics!

To summarize, two things go into solving the problem of augmenting textbooks with images: “Comity” [where they identify lots of short key queries by concatenating two or three important concepts at a time] and “Affinity”. Comity ranks images by computing a relevance score for each image in the context of the concepts picked out from a section, and Affinity identifies relevant webpages and computes an importance score. Another challenge is that these techniques cannot operate one section at a time, since the textbook authors would probably not like the images to repeat!

Labels:
clustering,
data-mining,
ecml-pkdd

## Tuesday, September 06, 2011

### ECML-PKDD conference report: Part I

*My student Parasaran Raman is in Athens attending ECML-PKDD and associated workshops, and will be updating this blog regularly with details from the conference. His posts are edited lightly for format and to add links*.

I am in Athens [+1 to location-driven research] to attend ECML PKDD 2011, and today I am at the second MultiClust workshop, held in conjunction with the main conference.

The first talk of the morning was an invited talk by Michael Houle on "Combinatorial approaches to clustering". Michael Houle began with how, in high dimensions, all our intuitions about space go for a toss. One example he gave illustrated how most of the points are concentrated along the boundaries in high dimensions. The talk revolved around how similarity measures between partitions cannot be fully trusted: although a range of combinatorial and spatial measures has been introduced, each suffers from some kind of drawback. He also talked about constructing secondary similarity measures [“Secondary similarity between two points v and w is defined in terms of data objects in the common intersection of neighborhoods based at v and w, where the neighborhoods themselves are determined according to a supplied primary similarity measure”] and how they can get around the curse of dimensionality.
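Following the quoted definition, a minimal shared-nearest-neighbor version of secondary similarity might look like this; the dict-of-dicts `similarity` input and the overlap-fraction score are my own simplifying assumptions:

```python
def knn(point, similarity, k):
    """The k objects most similar to `point` under the primary measure.
    `similarity` maps each object to a dict of similarities to the others."""
    others = [p for p in similarity[point] if p != point]
    return set(sorted(others, key=lambda p: similarity[point][p], reverse=True)[:k])

def secondary_similarity(v, w, similarity, k):
    """Secondary similarity of v and w: the (normalized) size of the
    intersection of their primary k-neighborhoods."""
    return len(knn(v, similarity, k) & knn(w, similarity, k)) / k
```

For five points on a line with primary similarity equal to negative distance, adjacent points share half of their 2-neighborhoods, so their secondary similarity is 0.5. The appeal in high dimensions is that the score depends only on neighborhood *ranks*, not on raw distances.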

One interesting question that was thrown out was "Can we do fully automated clustering?", i.e., can we convert similarity functions into rankings on the neighbors, assume no knowledge of the distribution of the data, and automatically pick 'k'? The talk then moved on to approaches for computing the quality of a partition via a significance score for each point, which depends on the correlation between a member of a set and the members of neighboring sets as well as the other members of the same set, along with the size of the cluster and the feature set. Once we know an object's significance, we can use it to change the shape of a cluster by adding and removing points.
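The talk's significance score involves correlations and feature sets as described; as a rough stand-in with the same "cohesion versus neighboring clusters" flavor, here is the classic silhouette coefficient (a substitution of mine, not the measure from the talk):

```python
def silhouette(point, own_cluster, other_clusters, dist):
    """Silhouette score in [-1, 1]: close to 1 means the point sits
    well inside its own cluster, negative suggests reassigning it."""
    # a: mean distance to the other members of the point's own cluster
    a = (sum(dist(point, p) for p in own_cluster if p != point)
         / max(len(own_cluster) - 1, 1))
    # b: mean distance to the nearest neighboring cluster
    b = min(sum(dist(point, p) for p in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)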

Bart Goethals gave another invited talk, titled “Cartification: from similarities to item set frequencies”. He talked about computing the k-nearest neighbors of every data point to find the right cluster centers. The idea is that true 'centers' occur in many k-NN baskets.
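The "centers occur in many k-NN baskets" idea can be sketched directly (a naive O(n²) illustration of the heuristic, not Goethals' actual method):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def knn_basket_counts(points, k):
    """For each point, count how many k-NN 'baskets' it appears in,
    i.e., how many other points list it among their k nearest
    neighbors. High-count points are cluster-center candidates."""
    counts = [0] * len(points)
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        for j in others[:k]:  # the k-NN basket of point i
            counts[j] += 1
    return counts
```

On a toy set with a tight cluster and one outlier, e.g. `knn_basket_counts([(0, 0), (1, 0), (0, 1), (10, 10)], 2)`, the outlier appears in no basket while the cluster members appear in several.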

There were many interesting ideas out of the other talks at the workshop. I will highlight a few of them:

- One of the first talks, on “Subjectively interesting alternate clusters”, used an information-theoretic framework to find interesting alternate clusterings.
- Jilles Vreeken gave a very good talk about current approaches in pattern mining and how we can use that knowledge in data mining. In this talk, titled “Where pattern met subspace cluster”, he highlighted the similarities between subspace clustering and pattern mining, where the goal is to find informative local structures.
- The talk on “Factorial clustering” was about using latent variables to generate multiple clusterings. One question they wanted answered was how to compute the right distance between partitions, and I said, “Hey! Use ours!”
- Another talk, on “Browsing clustering alternatives”, outlined constructing a Merge-Split tree structure on the space of partitions to enable user-driven browsing of partitions.

The day ended with a rooftop reception with a magnificent view of the Acropolis and the temple of Zeus! I had very useful dinner conversations with Michael Houle and Thomas Seidl about general problems in the area of meta-clustering and how to compute distance metrics on partitions. We chatted about combinatorial distances, spatial distances, and secondary-level distances, where one constructs a ranking on top of an existing distance between many members.

Labels:
clustering,
data-mining,
ecml-pkdd

## Monday, September 05, 2011

### FOCS 2011 Registration open

Marek Chrobak informs me that FOCS 2011 registration is open. The conference is being held from Oct 22-25 in Palm Springs CA, home of the Palm Springs Aerial Tramway that does a record 8500 ft (2600 m) elevation gain to give you views of the valley.

When you're not being made dizzy by the views, attend a tutorial or two or three! Cynthia Dwork, Kirk Pruhs and Vinod Vaikuntanathan are the headliners for a day of tutorials on Saturday Oct 22. Shafi Goldwasser will be awarded the 2011 IEEE Emmanuel R. Piore Award, given for "outstanding contributions in the field of information processing, in relation to computer science".

I'm told there are also some talks, including some that smash through barriers for the k-server conjecture, metric TSP (in graphs) (twice over), and exact 3SAT.

Register now so that local organizers don't go crazy (I can assure you that they do :)). The deadline is


**Sep 29** for cheap rates on hotels and registration. And if you'd like to be a FOCS 2011 reporter and write a set of blog posts summarizing the conference for cstheory.blogoverflow.com, please add a note to this post.
Labels:
community,
conferences,
focs

## Thursday, September 01, 2011

### MADALGO Summer School: Wrapup

*Speakers from the MADALGO summer school*

I had mentioned the MADALGO summer school back in June. I'm happy to be informed by Piotr Indyk that lecture notes from the summer school are now up, along with additional reading material. Good stuff on similarity search, high dimensional combinatorics, the Hirsch non-conjecture, embedding/sketching and metrics/nets/dimension and measures.

Labels:
madalgo
