Saturday, January 21, 2006

SODA report: ALENEX...

Today is the pre-SODA workshop today; ALENEX (Experimental Algorithms) and ANALCO (Analytic Algorithms and Combinatorics). David Mount was the plenary speaker at ALENEX, and he gave what I thought was an excellent overview of the area of nearest neighbour methods.

As befits an "experimental" talk, it wasn't your usual survey of techniques. He structured the talk around seven points (the "seven habits of highly effective nearest neighbour searching" as he put it), and skilfully weaved the known universe of NN methods around this structure.

The points: to summarize
  • Grids are good: quite good for many problems
  • Quad trees are better, if you need higher dimensions
  • kd-trees are even better, if you want the best bounds
  • In high dimensions, only locality sensitive hashing can save us.
The central theme was that there are a few basic ideas that work extremely well in near-neighbour searching, and depending on the kind of data you have (uniform, highly skewed, intrinsically high dimensional, ever-changing), some techniques are better suited than others.

Listening to his talk, I realized that we need something like this for clustering methods as well. Like the nearest-neighbour problem, clustering is a fundamental problem in a variety of domains (maybe it's the 3rd-most important problem in all of science.... more on this later). Like NN, there are a huge number of papers out there on different ways of solving the problem, and like NN, what a practitioner really needs is a way of organizing and classifying the different clustering techniques.

  • The hotel is in the middle of downtown, which apparently means 'always down'. The closest restaurant to the hotel is a Burger King, and the next closest is a Checkers. Much anger will be vented at the business meeting on Monday.
  • People who haven't arrived yet will be happy to know that the wireless in the conference area works fine, and the wireless ghetto is up and running.
  • The SODA proceedings is once again twice as heavy (and probably 5 times as thick) as my laptop, once again raising the question:
    Why can't we get PDF copies of the papers before the conference, or even via password-protection at the conference itself.

    There is a bigger issue here about electronic proceedings that I don't really care to get into; all I really want is the ability to read the papers without getting a sprained wrist (or neck).
  • Last year all the proceedings went for a tour of Minnesota, and we ended up downloading papers from a protected SIAM site. People didn't seem to mind too much.
  • Did I mention that there are few good eating places near the hotel ? There is only one place that looks like it might have decent coffee, and it's a bit of a hoof away.
  • In a talk early this morning, John Hershberger started his talk by asserting that determining the shape of points was "the most important problem in science". This apparently stole Dave Mount's thunder, who was then reduced to asserting that the nearest neighbour problem was "the second most important problem in science".
  • I like the fact that ALENEX and ANALCO have joint registration, so that we can walk into talks in either workshop. There are a number of interesting talks at ANALCO that I will attempt to attend.



  1. I'm sorry I missed David's talk.

    -kd-trees are even better, if you want the best bounds

    You mean "performance" here, right? I'd pretty much agree in that case, for low enough dimension.

    -In high dimensions, only locality sensitive hashing can save us.

    Well, there are some other techniques that are sometimes helpful. Surely there's nothing that's *always* helpful, is there?

    Posted by Ken Clarkson

  2. I'd love to learn more about locality-sensitive hashing! My exposition to hashing was from cryptography, which is the anti-thesis of locality sensitivity ;-) Did David mention anything in particular that would be good introductory reading? What is state-of-the-art there? 

    Posted by Ingo

  3. Ken:
    well I paraphrased, but that is basically what he said :). Essentially he argued that it's the most successful technique, but he did point out that LSH doesn't exist for all metrics, and even important metrics like the edit distance.

    Ingo: a good starting point is the web page at 

    Posted by Suresh

  4. Hi,

    A few more points:

    - from my experience, kd-trees and related techniques are doing surprisingly well even in high dimensions, if you use those algorithms "beyond" the provable bounds. E.g., you can stop the search after a fixed number of steps, or you can set the approximation factor to a very high value. Amazingly, the actual points reported are often of pretty good quality (e.g., you get the actual nearest neighbor 50% the time). Of course, no guarantees, but in many applications you can live without them.

    - there are techniques that one can use if your data is nominally high-dimensional, but "really" the points live on a "low-dimensional manifold". See the following survey  for more info. 

    Posted by Piotr


Disqus for The Geomblog