Monday, July 18, 2011

Axioms of Clustering

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

``I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description; and perhaps I could never succeed in intelligibly doing so. But I know it when I see it...''
-- Justice Potter Stewart in Jacobellis v. Ohio.
What makes one clustering better than another? To answer that question we have previously assumed that a well motivated objective function was given to us. Then the situation is easy: we compute the value of the objective and decide accordingly. In practice, however, clustering is often an intermediate step in a longer chain of computations, and there is no specific motivation for one objective function over another. Debating the pros and cons of the multitude of possible objective functions can easily take hours. Instead, we can further the ``I know it when I see it'' intuition by writing down the properties that any good clustering objective should satisfy, and see if this axiomatic approach guides us towards the right solution. 

This approach was undertaken by Kleinberg in his work on clustering axioms. A clustering function is one that takes a set of points (and their pairwise distances) and returns a partition of the data. The function should abide by three simple axioms (a small sanity-check sketch follows the list):

  • Scale-Invariance: If all distances are scaled by a constant factor, the clustering should not change. Put another way, measuring the distance in miles or kilometers (or nanometers) should not change the final clustering partition.
  • Richness: Depending on the pairwise distances, any partition should be possible. For example, the clustering should not always put two specific points $x$ and $y$ together, regardless of the distance metric.
  • Consistency: If the pairwise distances between points within a cluster decrease, while the distances between points in different clusters increase, the clustering should not change. Indeed, such a transformation makes clusters `tighter' and better separated from each other, so why should the clustering change? 
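To make the first axiom concrete, here is a small property check in Python (my own sketch, not Kleinberg's; it assumes numpy and scipy are available). It uses single linkage cut into $k$ clusters as a stand-in clustering function and verifies Scale-Invariance numerically on one random instance -- a sanity check, not a proof.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def single_linkage_k(points, k):
    # A stand-in clustering function: single linkage, cut into k clusters.
    return fcluster(linkage(pdist(points), method='single'), t=k, criterion='maxclust')

def same_partition(a, b):
    # Two label vectors induce the same partition iff the pairing of labels is a bijection.
    return len(set(zip(a, b))) == len(set(a)) == len(set(b))

rng = np.random.default_rng(0)
pts = rng.random((30, 2))
base = single_linkage_k(pts, 3)
# Scale-Invariance: rescaling all coordinates (hence all pairwise distances) by a
# constant should leave the returned partition unchanged.
for c in (1e-3, 2.0, 1e6):
    assert same_partition(base, single_linkage_k(c * pts, 3))
```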
Each of the axioms seems reasonable, and it is surprising that there is no clustering function that is consistent with all three axioms ! The trouble lies with the last axiom: by changing the distances we require that a clustering stay the same, but an even better clustering may emerge. For example, consider points lying in several ``obvious'' clusters, and proceed to move one of these clusters out to infinity. As we move the points further and further out, the resulting dataset appears to be two-clusterable (see the image below).


This leads to a slightly different tack on this problem. Instead of thinking about a specific clustering, or a specific partition of the data, we try to define an objective function, so that the optimum clustering under that objective behaves in a reasonable manner. This is the approach adopted by Ackerman and Ben-David. Let $m$ be a function measuring the quality of the clustering. As before, Scale-Invariance requires the measure $m$ to be independent of the units on the distances, and Richness requires that every clustering can be the optimum (under $m$) clustering.

The Consistency axiom is the one that undergoes a subtle change. Instead of requiring that the clustering stay the same if such a perturbation to the distances occurs, we only require that the score, or measure, of the clustering improve under such a transformation. This breaks the counterexample above -- while the score of the 1-clustering improves as we move the points further out, the score of the 2-clustering improves even more and eventually overtakes it. Indeed, the authors demonstrate a set of different clustering quality measures that are consistent with this set of axioms.

-- Sergei Vassilvitskii

Friday, July 01, 2011

SODA 2012 submission question

This is what the SODA 2012 submission guidelines say (emphasis mine):
The submission, excluding title page and bibliography, must not exceed 10 pages (authors should feel free to send submissions that are significantly shorter than 10 pages.) If 10 pages are insufficient to include a full proof of the results, then a complete full version of the paper (reiterating the material in the 10-page abstract) must be appended to the submission after the bibliography. The length of the appended paper is not limited, and it will be read at the discretion of the committee. A detailed proof in the appended complete version is not a substitute for establishing the main ideas of the validity of the result within the 10-page abstract.
If I'm reading this right, you can't just shunt certain proofs and exposition to an appendix, as was previously done. You have to make an entire full version of the paper and append it to the shortened version. Is that correct ?

Update (7/5): I have an update from Yuval Rabani, the PC chair, which resolves the question (in the affirmative):
This is an experiment I introduced in response to conflicting attitudes in the PC. The idea is that some PC members would like to read the entire paper, and it's hard to follow the paper when you have to jump back and forth between the 10 page abstract and the appendix. So the idea is that most committee members will just print the 10 page abstract, and those
interested in the entire paper will print the attached full version. Since this is an experiment, I don't intend to enforce the guidelines very strictly, but if it's not too much of an effort on your part, we do prefer a submission according to the rules.

Thursday, June 30, 2011

Bob Morris and stream algorithms

Bob Morris, one of the early contributors to UNIX and an NSA cryptographer, died on Sunday. The New York Times has a remembrance that talks about his contribution to password security and cryptography in general.

What's not mentioned in the NYT profile is that he's the Morris of the 'Morris counter'. This was arguably the very first streaming algorithm, nearly 20 years before the AMS paper and the HRR tech report first introduced streaming algorithms, five years before the Flajolet-Martin paper on counting distinct elements in a stream (also not framed as a streaming problem), and two years before the Munro-Paterson algorithm for streaming median finding.

The Morris paper is titled "Counting large numbers of events in small registers":
It is possible to use a small counter to keep approximate counts of large numbers. The resulting expected error can be rather precisely controlled. An example is given in which 8-bit counters (bytes) are used to keep track of as many as 130,000 events with a relative error which is substantially independent of the number n of events. This relative error can be expected to be 24 percent or less 95 percent of the time (i.e. σ = n/8). The techniques could be used to advantage in multichannel counting hardware or software used for the monitoring of experiments or processes.
In modern language, this can be re-rendered as:
It is possible to approximately count the number of elements in a stream using $O(\log \log n)$ bits, where $n$ is the length of the stream.
The idea is very simple. The trick is to count $C = \log n$ instead of $n$. Clearly, doing so requires only $\log\log n$ bits. Then any additive approximation to $C$ yields a multiplicative approximation to $n$.

So how do we count $\log n$ ? Let's assume for now that we're using base 2 logarithms. If the current counter value is $C$, it represents a count of roughly $2^C$, so the counter should next be incremented only after roughly another $2^C$ elements have arrived. However, we don't know that the counter value reflects the true count, so we instead do it probabilistically. Specifically, on each new element, toss a coin whose probability of coming up heads is $2^{-C}$, and only increment the counter if the coin comes up heads.

Philippe Flajolet did a masterful analysis of this approach, and showed that the expected value of $2^C$ is $n+2$, so that $2^C - 2$ is an unbiased estimator of the true count. He also showed that the variance of $2^C$ is $n(n+1)/2$.
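Here is a minimal Python simulation of this scheme (my own sketch, not code from the paper). The counter is assumed to start at 1, which is the convention under which $E[2^C] = n + 2$ holds for the increment rule above, so $2^C - 2$ is the estimate.

```python
import random

def morris_count(stream_length, seed=None):
    """Simulate a Morris counter: increment C with probability 2^{-C} per element."""
    rng = random.Random(seed)
    C = 1                                  # starting at 1 gives E[2^C] = n + 2
    for _ in range(stream_length):
        if rng.random() < 2.0 ** (-C):
            C += 1
    return C, 2 ** C - 2                   # C needs O(log log n) bits; 2^C - 2 estimates n

n = 10_000
estimates = [morris_count(n, seed=s)[1] for s in range(200)]
print("true count:", n, " mean estimate:", sum(estimates) / len(estimates))
```

Averaging over independent counters, as in the last two lines, is the simplest way to reduce the large variance; the base-change trick described next trades bits for accuracy within a single counter.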

This approach yields a fixed relative error for the resulting count. The next trick is to change the base from 2 to some parameter $a$, and then reanalyze the method. Much work later, it can be shown that with roughly $\log \log n + \log(1/\epsilon)$ bits, you can get a multiplicative approximation of $1+\epsilon$.

Postscript

Stream algorithms are a good icebreaker in an algorithms class: they're unusual, seem to solve an impossible problem, and do it really well. I also like to point out to students that three of the more important streaming algorithms were designed well before people cared about "big data" or streaming. It's a valuable lesson in the importance of understanding "old" tools.

In case you're wondering if anyone still uses Morris counters, here's a paper from IJCAI 2009 that plugs them into a larger system to maintain counts of $n$-grams when building a statistical language model.

Friday, June 24, 2011

Workshop on Coding, Complexity and Sparsity

Anna Gilbert, S Muthukrishnan, Hung Ngo, Ely Porat, Atri Rudra,  and Martin Strauss are organizing a workshop on coding, complexity and sparsity at Ann Arbor in August, and are soliciting participation for a Dagstuhl-style event. Here's the blurb:


Workshop on Coding, Complexity and Sparsity
Ann Arbor, Aug 1--4, 2011. 
Organized by Anna Gilbert, S Muthukrishnan, Hung Ngo, Ely Porat, Atri Rudra, Martin Strauss.

There has been a considerable amount of fruitful interaction between coding and complexity theories, and a fledgling interaction between sparse representation and coding theories. We believe that academic research will be far richer exploring more of the connections between sparse representation and coding theories, as well as, between sparse representation and complexity theories. Could there be a general structural complexity theory of sparse  representation problems and could techniques from algorithmic coding  theory help sparse representation problems? 
The plan is to get researchers from different areas together in a  Dagstuhl-like setting and explore the question. There will be tutorials,  talks and time to research. Registration is free and the deadline is July 15th.  We have limited funding to (partially) support the participation of some  graduate students.  
For more details on the workshop (including the tutorials, list of invited  speakers and how to apply for participation support), please go to the  workshop webpage.

Thursday, June 23, 2011

MADALGO Summer School

There are lots of good things happening at MADALGO. The institute was recently renewed for another 5 years, they recently organized the now-regular MASSIVE workshop in conjunction with SoCG (maybe it should also be organized in conjunction with SIGMOD or GIS ?), and now they're organizing a summer school on high dimensional geometric computing in conjunction with the new Center for the Theory of Interactive Computation (CTIC) run jointly by Aarhus and Tsinghua.

There's a stellar list of lecturers:
  • Alexandr Andoni (Microsoft Research Silicon Valley)

  • Ken Clarkson (IBM Research)

  • Thomas Dueholm Hansen (Aarhus University)

  • Piotr Indyk (MIT)

  • Nati Linial (Hebrew University)
and they're soliciting participation. Here's the blurb:
The summer school will take place on August 8-11, 2011 at Center for Massive Data Algorithmics (MADALGO) and Center for the Theory of Interactive Computation (CTIC) in the Department of Computer Science, Aarhus University, Denmark.
The school is targeted at graduate students, as well as researchers interested in an in-depth introduction to high-dimensional geometric computing.
The capacity of the summer school is limited. Prospective participants should register using the online registration form available here as soon as possible. Registering graduate students must also have their supervisor send a letter confirming their graduate student status directly to Gerth Brodal: gerth@madalgo.au.dk; the subject line of the email should be 'student_last_name/SS_2011/confirming'. Registration is on a first-come-first-serve basis and will close on Monday July 4, 2011.
Registration is free; handouts, coffee breaks, lunches and a dinner will be provided by MADALGO, CTIC and Aarhus University.
 I'd be remiss if I didn't point out that any MADALGO-sponsored event will also have rivers of beer, if you needed more incentive to attend. 

Wednesday, June 22, 2011

Applying for Jobs: the Job Talk

This is the fifth post in a series on applying for faculty positions. It is written by Jeff Phillips, frequent guest blogger, who will be starting as an Assistant Professor at the University of Utah in Fall 2011.



The final component of applying for jobs is the job talk. It will usually be about an hour, but this may vary, so ask. Usually this means you can talk for about 58 minutes before they want to stop you for questions. But again, it's best to confirm this; ask your host specifically. Also ask how frequently you should expect to be interrupted with questions during the talk. Of course, this may vary greatly from talk to talk, even at the same place (as I have seen in my experience). But, from my perspective, many instances where the speaker was delayed for more than 5 minutes due to questions in the talk were the fault of the speaker. This usually happened because the speaker left ambiguities in the talk, failed to answer questions clearly and concisely (which invites more questions), or did not take charge and insist on taking longer questions off-line.
The most important failure (ambiguity) will hopefully be avoided by taking the steps below.



First, what should you talk about?
  • Talk about your best work!
  • Talk about the work that appeals to the broadest audience and whose main ideas you can convey most clearly.
  • Talk about the project you are best known for; it is probably your best-received and likely your actual best work, even if you have some "personal" connection to some other topic.
  • Talk about the work that is most likely to get you hired. This last one can be tricky, but try to ask your host or anyone else you know at the institution what specific type of person they are looking to hire. No one will fit a mold perfectly, but make sure the faculty there don't have to squint too hard to see you fit in that mold.
If it is the same topic that fits all of these criteria, then you have a much easier task. Otherwise, you may need to compromise on some of these. Try not to sacrifice strength of work and clarity of key ideas. If you are not sure which topic is your strongest, ask around.



What should go into the talk? Not too much.
Present at most 2 or 3 main ideas. Most papers have one main idea, even if there are several smaller ideas in the paper. The lasting contribution is usually one key insight (even if it took a lot of hard work and many other technical details to get everything to work). Try to abstract away those technical details, and convey the main new ideas your work added. You can mention that there were difficult technical details, but do not describe them!

So 2-3 main ideas equates to 2-3 papers. Perhaps you have a series of papers that layer on interesting ideas towards a single goal. In this case, you can possibly convey more, but do not try to cram more at the expense of not making clear what the main conceptual contribution is.

It is also helpful to have the talk tell a story, to have a single narrative. But if you have two pieces of work that are both your strongest and whose key contributions are easiest to express, then give the talk in two parts: it is better to convey your best work than to give a cleaner, but weaker, story.

If you are a theoretician, give at most 1 proof. (The possible exception: if you are being hired into a strong theory group and they alone make the hiring decision, then, maybe, give 2 proofs.) Most people will get lost in the details. I gave one proof sketch, but "intuitively explained" why several things were true, usually each in one to two sentences. I am guessing that for non-theory people there is an equivalent trade-off between the number of "proofs" and the number of other technical descriptions. The point is, it might be good to spend 3 minutes at some point showing off your technical prowess, just to convince the audience you are an expert in the area -- but there are better ways of doing this than diving so far into the technical details that you somewhat intentionally lose most of the audience.

You should aim to teach the audience something. That is, go into a bit more depth on some general technique in your field that you feel is not well understood, and that would serve the general audience to know. This does not need to be explicitly your work (and often will not be), but you may have extended this work. If you did extend it, the audience will appreciate it much more if they understand the importance of the general version. In this "teaching segment" the goal should be to allow the audience to understand how to operate some heavy machinery (at least in principle). Spend time developing simple examples that clearly demonstrate how it works and its power.
This segment is also important in demonstrating your teaching abilities. If someone comes away from your talk feeling they learned something, even if it was not your specific contribution, then they will have a positive impression. I know that when I spend an hour in a talk and don't come away with this impression, I am very disappointed in the speaker.



How should you structure your talk?
Spend the first 10 minutes motivating the entire area you are working in, and then the set of key problems in the area. Start at a very high level, and use well-planned examples to zero in on the major hard problems in the area, and why they are the critical remaining obstacles to advancing the sub-area. This should allow you to outline the rest of the talk by just saying which of these problems you solved (and then spending 40 minutes explaining how).
These first 10 minutes of motivation can be modified if you give job talks at several places, without changing much of the forthcoming meat of the talk. This may be necessary to paint yourself in slightly different roles depending on what each university/lab is looking to hire. Minor changes in the setup can give the talk a very different flavor, perhaps focusing on different aspects that potential collaborators at each institution could appreciate as fodder for possible proposals they could write with you.

Then spend about 15-20 minutes each covering your contribution to 2-3 core problems. Probably don't spend more than this on any one topic: if you lose some audience members, you can re-captivate them for the next section. With any less than this, you probably will not be emphasizing the contribution enough, and if you don't feel it deserves that much time, then your talk might be cleaner if you did not mention it at all.

Within each of these sections, explain the core problem in more detail, and explain what other approaches are available and why they do not work. This should lead naturally into what you did. Do not just show results saying that your approach was better; make sure to explain the key idea of why your approach was better. What made your approach succeed where no one had before? If an audience member cannot immediately answer that question, then they may well come away with the impression that you did not do anything interesting, other than tightening the existing bolts. You can conclude each section by mentioning extensions to this work, either yours or others'. These extensions are especially useful if they demonstrate how your key idea was integral in advancing the field.

Finally, there is likely other work you have done that is not one of the 2-3 key contributions. You probably want to talk about it more than you actually should :). I dealt with this in two ways. Given that I had described a set of important problems (with several sub-problems) in the first 10 minutes, after the couple of sub-problems for which I described my contribution, I listed on a single slide all of the other work I had done in that related sub-area. This allowed me to mention it and also have it in the context of the field I was overviewing.
The other way was just to have 1 or 2 slides at the very end of the talk listing other work I had done. It's good to show you are broad, and that you have completed a lot of work. But it's also easy to see this from glancing at your CV. So in these 1-2 slides, attempt to cast each project you did not talk about in relative comparison to the work you did talk about. If you mentioned 2 projects, and you had 4 other ones of similar proportion, try to convey this. This is more informative than CV paper-counting.

Conclude your talk with your vision for the next 5-10 years worth of research. I felt it best not to focus on specific problems, but on how the work I had done would have an impact down the road. Anyone can do work in a certain area or solve specific problems, but if you convince the audience that your work will be a foundation for a lot of important future work, then not only did you have the foresight to produce that work, but you are well-positioned to continue to contribute to the field.



Finally, practice, practice, practice!!! I think I practiced my talk at least 50 times. That's 50 hours of just speaking. When I mentioned 58 minutes to speak earlier, I meant more-or-less exactly 58 minutes (plus or minus 15 seconds). If you practice enough, you will know exactly how long the talk takes as a whole, and how long each sub-section takes. If you get a lot of questions in some early part and lose 3 minutes, you will know where you can make it up in a later part.

Also, practicing it will help you realize what is explained well and what is not. If you repeatedly find yourself explaining a slide and wishing you had a figure to help explain that, then add the figure. If there is another slide that has a figure that you don't really need, then remove it.

I only had the opportunity to practice in front of an audience (other than at actual interviews) twice. So, I spent a lot of time sitting in a room by myself talking to myself :).

So, the key take-aways are: (1) motivate the field to convince the audience your problems are important, (2) make sure you convey 2 or 3 main conceptual contributions, focusing on key new ideas, (3) teach the audience something, (4) demonstrate vision for the future, and (5) if you practice enough and keep these in mind, you will definitely give an excellent talk.

Wednesday, June 15, 2011

Applying for Jobs: On-Site Interviews

This is the fourth post in a series on applying for faculty positions. It is written by Jeff Phillips, frequent guest blogger, who will be starting as an Assistant Professor at the University of Utah in Fall 2011.



Today's post is about what to do once you actually get a job interview. I'll try to cover everything except the job talk, which will come in the final post.

You will usually be assigned a host. Talk to this person and try to get a feel for what the department is looking for. Perhaps the AI person is retiring and the department is happy to hire an Algorithms person as long as you can also teach AI. Or you would be the only theory person in the department and some of the other faculty are nervous about this, so the more you can talk about real-world applications the better. These are very important things to find out.

And if you know other people in the department, by all means ask them as well. If you get hired, you will need an advocate to make your case. Contact whoever you think might be an advocate for you. Don't be shy!

Try to figure out who the big shots are in the department. These people may be more senior or outspoken, bring in more money, and most importantly can potentially have way more influence than other people. If you can connect with these people, all the better. Sometimes hiring is done entirely within an area, so then the big shot is relative to the area.

Starting a week to a few days before the interview, try to get a schedule of your visit, even if it is only a draft. Research everyone on your schedule; know their research areas and what classes they teach. All of this can usually be found on their webpages. Also try to read at least one of their recent papers, ideally the one with the best chance of having some intersection with your research. And prepare several questions or topics for potential discussion.

Several people may be on your schedule who do research that is not really similar to yours. These people may be on the hiring committee, and should not be ignored. Try to make a connection, even if it's not entirely about research -- for example, if you went to a common school, or know common people.

There is a good chance for last minute changes to your schedule, so research others in the department as well even if they are not on your schedule.

Also prepare answers to the same sort of questions as for phone interviews. If you suggest that you would teach a class, make sure you have many of those details sketched out, because you may get asked specific questions about what you will cover.



On the actual visit I always carried a print-out of all of my notes, but never got a chance to look at it. So you'll have to memorize most of the information on there.

It is important to try to make positive connections with as many faculty as you can. Be friendly, outgoing, and smile. Make eye contact. Be very excited about your research. And obviously, don't say anything sexist or racist. Seriously, save the dirty jokes for later.




I've heard you should follow up with a short email to everyone you met, but I did not do this. I only sent something if it was meaningful: a specific follow-up to a problem we discussed, or a reply to a specific question. Usually about 3-5 emails per visit.


Monday, June 13, 2011

Clustering with outliers

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

(ed. note: I'm happy to welcome Sergei Vassilvitskii (of k-means++ and mapreduce-algos fame, among other things) to this series. Sergei has great insights on clustering, and we decided to join forces as we continue this series. Here's his first post, on clustering with outliers)

When abstracting the clustering problem, we often assume that the data is perfectly clusterable and so we only need to find the right clusters. But what if your data is not so perfect? Maybe there's background noise, or a few random points were added to your data by an adversary. Some clustering formulations, in particular k-center or k-means, are not stable -- the addition of a single point can dramatically change the optimal clustering. In the case of k-center, if a single point $x$ is added far away from all of the original data points, it will become its own center in the optimum solution, forcing the remaining points to be clustered with only $k-1$ centers.

It is easy to appeal to intuition for the definition of outliers, but formally specifying when a point becomes an outlier is a non-trivial task. Consider for example a set of points generated by a single Gaussian with small variance. Now we pick one data point and slowly start moving it away from this set. We agree that originally that point is not an outlier; as we move it out to infinity, it certainly becomes one. But where exactly did it cross this threshold? 

There are two ways to define clustering with outliers. First, we can simply choose to ignore some fraction of the points. In the case when there are no outliers, ignoring a small (say 1%) fraction of the points should not materially affect the final clustering results. In the case that outliers are present, however, the algorithm has the choice of ignoring them in computing the final metric. Consider again the example above; although we don't know when the point we are moving becomes an outlier, in some sense we don't care, because we are always looking at the optimum clustering without it. 

A generalization of the above approach assigns a cost for not classifying a particular point and gives a budget for unclassified points. Thus, if we know ahead of time that some points should be classified and some can be ignored, one can guide the algorithm to focus on the points that matter. Of course assigning each point an identical weight leads exactly to the first formulation. 

Algorithms

The ability to ignore a fraction of the points seems like it would make the problem easier, since the algorithm now has explicit permission to 'sweep things under the rug.' It actually makes the problem harder, since the optimum solution can also do this. Put another way, the algorithm needs to not only find the right clustering of points, but also decide which points to cluster in the first place. 

There are two extreme approaches to the latter problem: one can first decide which points are outliers and then run the baseline clustering algorithm, or one can choose not to ignore any of the points and decide which ones are outliers given a clustering. Unfortunately neither one of these works well -- the first scales poorly, as the number of choices for outlier points grows exponentially; the second leads to poor solutions -- as we saw before, the presence of outliers can materially change the final outcome. 

The key to finding clustering algorithms that handle outliers is to determine which points will be ignored during the algorithm's execution, and not before or after. As an example, consider the $k$-center problem. Recall that we are trying to find a set of centers $\mathcal{C}$ and a radius $r$ so that every point is within $r$ of some point in $\mathcal{C}$. There is a lower bound that says that unless $\mathsf{P}=\mathsf{NP}$ no algorithm can find better than a 2-approximation on the radius. There are many ways to achieve a 2-approximation: we previously saw the furthest point algorithm of Gonzalez: repeatedly choose the point that is furthest away from the current set $\mathcal{C}$ and add it as a center. The algorithm breaks down if there are outliers --- we will surely choose an outlier as a cluster center. 

Here is an alternative algorithm for $k$-center: 
  1. Guess the optimal radius $r$. 
  2. While there are uncovered points, choose one uncovered point $p$ as a center, and mark all points within distance $2r$ of it as covered.
It is easy to see that in every iteration we mark as covered all points in the optimal cluster that $p$ belongs to. If after $k$ iterations there are still uncovered points remaining, it must be that the original guess of $r$ was incorrect.
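Here is a minimal sketch of this guess-and-cover routine, assuming Euclidean points and numpy (the function and variable names are mine):

```python
import numpy as np

def cover_with_k_centers(points, k, r):
    """For one guess of r: try to cover every point with at most k balls of radius 2r.
    Returns the chosen centers, or None if the guess r is too small."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    uncovered = np.ones(len(points), dtype=bool)
    centers = []
    while uncovered.any():
        if len(centers) == k:
            return None                      # needed more than k centers: r was too small
        p = int(np.argmax(uncovered))        # an arbitrary uncovered point
        centers.append(p)
        uncovered &= dists[p] > 2 * r        # everything within 2r of p is now covered
    return centers
```

In practice the "guess" of $r$ is a search over the $O(n^2)$ pairwise distances for the smallest value on which this routine succeeds.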

To adapt this algorithm to the setting of outliers, we make two changes. First, we increase the slack on the radius from a factor of 2 to a factor of 3 (this turns out to be necessary; unless $\mathsf{P}=\mathsf{NP}$ no algorithm can find better than a 3-approximation to the radius). Second, instead of selecting a point arbitrarily, we select the point that has the maximum number of uncovered points within radius $r$ of it.


Thus the final algorithm is as follows: 
  1. Guess the optimal radius $r$. 
  2. Repeat $k$ times: select the point $p$ that has the largest number of uncovered points within distance $r$. Mark as covered all points within distance $3r$ of $p$. 
The points that are left over in the end are precisely the outliers: they are both far away from any individual cluster center, and are not well clustered themselves (otherwise one of them would have been chosen as a potential cluster center). Charikar et al. show that this algorithm attains a 3-approximation: if the optimum can cover all but $\ell$ points with radius $r$, then the algorithm above will cover all but $\ell$ points with radius $3r$.
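A sketch of this greedy procedure, again assuming Euclidean points and numpy, with the guess of $r$ handled outside the function; this is in the spirit of the Charikar et al. algorithm rather than a faithful reimplementation:

```python
import numpy as np

def k_center_with_outliers(points, k, r):
    """For one guess of r: greedily pick k centers; whatever stays uncovered is an outlier."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    uncovered = np.ones(len(points), dtype=bool)
    centers = []
    for _ in range(k):
        # pick the point whose radius-r ball contains the most still-uncovered points
        gain = ((dists <= r) & uncovered[None, :]).sum(axis=1)
        p = int(np.argmax(gain))
        centers.append(p)
        uncovered &= dists[p] > 3 * r        # mark everything within 3r of p as covered
    return centers, np.flatnonzero(uncovered)   # (centers, outliers)
```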

Saturday, June 11, 2011

Clustering as compression

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)
Numquam ponenda est pluralitas sine necessitate.
    -- Plurality must never be posited without necessity -- William of Ockham.

Clustering as compression generalizes mixture model clustering, as well as distance based methods. The central idea in this view of clustering is Occam's razor. Can you express the data as concisely as possible in a lossy manner ?

The devil in the details here is the definition of "concisely". There are two different notions of bounded space that are common, but they are actually not that different.

The information-theoretic approach.

The information theoretic approach dates back to Shannon. Consider sending information across a pipe.  How big on average does the pipe need to be in order to transmit the information ? One of Shannon's celebrated results is that this can be quantified by the mutual information of the channel, or the amount of shared information contained between the input and output. 

Thinking of mutual information as a measure of space is quite easy once you get used to it. It's helpful to consider the boundary cases. If the channel merely outputs a random flip of the input, then the output has no information about the input and the mutual information is zero, corresponding to the idea that you don't need to transmit much information through the channel.

On the other hand, if the output is a deterministic function of the input, then the mutual information of the channel equals the entropy of the source, corresponding to the idea that the channel needs to send over a complete description of the input to compute the output correctly.

What does this have to do with clustering ?  Think of the channel as the process by which the input points are mapped to clusters. In particular, the channel can be encoded by the conditional probability $p(t | x)$ of assigning a point $x$ to cluster $t$. Once we have that, the mutual information $I(T;X)$ between the set of clusters $T$ and the set of points $X$ measures how concisely $T$ represents $X$.

There's an intuitive explanation for this as well. Suppose that points were assigned to exactly one cluster each. Then $I(T;X)$ equals the entropy of $T$, which is just the average number of bits needed to write down a description of $T$.

But what about the quality of the embedding ? Given some distance function on the input and output domains, you can compute the expected error as $\sum_{x,t} p(t | x)\, p(x)\, d(x,t)$. The study of how this varies with $I(T;X)$ yields what's known in information theory as rate-distortion theory, and for an appropriate choice of the distance measure $d$ leads to the well-known Information Bottleneck Method.
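As a small illustration of these two quantities, here is a Python sketch (the function name and the toy example are mine) that computes $I(T;X)$ and the expected distortion for a soft assignment $p(t|x)$:

```python
import numpy as np

def rate_and_distortion(p_x, p_t_given_x, d):
    """p_x: (n,) marginal over points; p_t_given_x: (n, m), rows sum to 1; d: (n, m) costs d(x, t).
    Assumes strictly positive probabilities so the logarithm is well-defined."""
    p_xt = p_x[:, None] * p_t_given_x                               # joint p(x, t)
    p_t = p_xt.sum(axis=0)                                          # cluster marginal p(t)
    mi = float(np.sum(p_xt * np.log2(p_t_given_x / p_t[None, :])))  # I(T;X) in bits
    distortion = float(np.sum(p_xt * d))                            # sum_{x,t} p(x) p(t|x) d(x,t)
    return mi, distortion

# Four points softly assigned to two clusters.
p_x = np.full(4, 0.25)
p_t_given_x = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
d = np.array([[0.1, 1.0], [0.2, 0.9], [0.9, 0.2], [1.0, 0.1]])
print(rate_and_distortion(p_x, p_t_given_x, d))
```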

Kolmogorov complexity.

The second approach takes the Kolmogorov view: that compactness of description can be expressed as the size of the smallest program expressing the data. In many ways, the Kolmogorov complexity of a string serves as a point-wise proxy for entropy, and the equivalent of mutual information can be defined as well (for more on this, see the note by Grunwald and Vitanyi).

While the rate-distortion framework can be extended to the Kolmogorov setting, it's a lot more technical, and is hard to explain in a short note such as this. An easier application of the Kolmogorov method is due to Cilibrasi and Vitanyi, who showed how to do clustering by defining a distance metric based on compression (roughly, as the complement of a normalized mutual information), and then doing hierarchical clustering.
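The compression distance itself is easy to approximate with an off-the-shelf compressor. Here is a rough sketch using zlib as a crude stand-in for the idealized Kolmogorov compressor; the resulting pairwise matrix can then be fed to any hierarchical clustering routine:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the length of the compressed string."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

docs = [b"the quick brown fox jumps", b"the quick brown dog jumps", b"completely unrelated bytes 12345"]
print([[round(ncd(a, b), 3) for b in docs] for a in docs])
```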

In summary, the main value of the compression-based viewpoint is the idea of a space-quality tradeoff, as opposed to a more artificial k vs quality tradeoff.  What's nice about the information-theoretic approach is that the units for space and quality are the same, and so you don't need ugly balancing constants to compare them.

p.s. There are those who might argue that the information-theoretic view of clustering is merely a kind of mixture density estimation. In a sense, this is true: specifically, the information bottleneck method can be viewed as estimating a mixture of multinomials if we reinterpret the Kullback-Leibler divergence as the associated Bregman divergence for the multinomial family. This in turn connects to methods like LDA and probabilistic PCA. However, I'd argue that the viewpoint of compression is quite different, even though the end-result looks similar, and it's the viewpoint that's valuable here.

Tuesday, June 07, 2011

Applying for Jobs : Phone Interviews

This is the third post in a series on applying for faculty positions. It is written by Jeff Phillips, frequent guest blogger, who will be starting as an Assistant Professor at the University of Utah (yay!) starting Fall 2011.



A part of faculty interviews I had not heard or thought as much about was the phone interview. I'm not sure this was my strength in the interview process, since more places where I had phone interviews did not follow up with on-site interviews than did. So if you have any different experiences, please post in the comments.

Some places do not have phone interviews, as some of my on-site interviews did not first have a phone interview. From my perspective, the places that did have phone interviews were places that may have wanted to confirm I was actually interested in coming. There were always several questions about why I was interested in that particular location/university.

Anyways, I'll try and provide some advice for performing well on phone interviews, even if I did not necessarily follow all of my advice.

First, prepare for these as if they were an on-site interview. Practice answering questions with someone over the phone (or in a room without eye contact). You need to keep answers relatively short and to the point. It's harder to listen to a long answer over the phone. And it's harder to take hints if you are giving too long or too short an answer. It's probably more important here than in an on-site interview that you are well prepared for questions, since you can't, say, move to a whiteboard to explain something in more detail with a picture.

Try to figure out who will be on the other side of the interview ahead of time. I've had one-on-one interviews as well as group interviews where the entire hiring committee was on the other side. Usually in that case one person asked most of the questions, but others would add follow-up questions. It could be a bit disorienting. When it was just one person, I usually tried to have their webpage up on my computer so I had a picture of them.

And research who your interviewers are, what their research areas are, and what classes they teach. If they bring up a topic, make sure you don't disrespect their research area, and try to relate to it positively if it's relevant. If they realize you put in this effort, it will definitely help show you are serious about that university, which, I believe, is a key reason they are calling you.

Most importantly, make a list of potential questions and prepare and practice answers for them. I tried writing down answers, but this was not necessarily good in the interview, since at some point I really felt like I was just reading answers. I'd suggest writing them down just to organize your ideas, but don't necessarily have them in front of you during the interviews. Bullet points might be ok.

Here are a list of questions similar to those I had:
- why do you want to go to UofX?
- why is the location of UofX attractive?
- who will you collaborate with? (try to answer with actual names, in the department)
- how will you fit in the department?
- what will your first proposal be about? (here they want to get a feeling that you have thought about proposals; name specific funding agencies and calls if you can. Don't be shy to mention any experience you have writing proposals, even if you are not asked.)
- what will you teach? (try not to duplicate existing classes, especially not special-topics ones)
- what are your research areas?/describe your research. (You may choose a topic that is easy to describe over the phone.)
- Do you have any questions? (This one always gets asked. Sometimes you get the option to ask this at the beginning or wait till the end. I recommend waiting, since it allows the interview to get underway first and you can get a sense of your interviewers before you ask.)

For faculty interviews, you probably won't get any technical questions (although you might - I did for research lab interviews).

And finally, you will probably have no idea how well it went after it is over. Don't worry this is normal, but quite frustrating. And you may not hear back for a couple weeks or more, so hang in there.

Applying for Jobs : Sending out Applications

This is the second post in a series on applying for faculty positions. It is written by Jeff Phillips, frequent guest blogger, who will be starting as an Assistant Professor at the University of Utah (yay!) starting Fall 2011.



I found actually applying for faculty jobs takes a lot of time. Maybe I spent too much time and others were successful with a less time-consuming approach; if so please add your perspective in the comments.

The first step is to find which jobs to apply for. I basically relied completely on the CRA jobs website to find out about job openings. Often I found that job listings appeared there within a day or so of the listing becoming available on the school's website.
Comically, CRA also posts the same ads in a written newsletter which I get every couple months. The print edition appears several months later, often after the deadline has passed.

Some jobs will not appear on the CRA website. Some top schools don't need to advertise, so for those you need to either look for some ambiguous statement of faculty recruiting on the school's website, or better ask someone you know at that school (or have your advisor ask).

In fact, it is very important to have an advocate within any school where you hope to get hired. So, as you see friends or colleagues at conferences, ask them "are you hiring next/this year?" For one, it is often a great conversation starter, and two, you can start implanting the thought in their heads that you would be an excellent person to hire. I've heard that the head of the hiring committee can craft the description of the job posting to aim towards a particular type of candidate, or even a single particular candidate they are aiming for. Now is the time to approach your colleagues at universities where you might want to apply and subtly try to get them to build an opening designed for you!



OK, now that you have found (or even crafted) the opening you are applying for, what is left to do? You already have your research statement, your teaching statement, and letters lined up -- all of which you have been working on, on and off, all summer.

The most customizable item is the cover letter. I've been given advice that you don't even need to write a cover letter unless it is required (presumably from people who, when on the committee, never even looked at them). While this may often be true, I felt it was worth trying to make a point of why this university is right for you.

Some middle-tier universities get many applicants with somewhat comparable resumes. How can you associate yourself with that university so they think that if they made you an offer you would actually come? Have you lived nearby and liked the area? Do you have family nearby? Is there a particular program that the university is strong in that would be a great fit for you?

What I spent the most time on was describing who I might collaborate with within the department. Don't spend too much time on people outside the department, since they usually have no bearing on the hiring decision. But if you have a lot of similar interests with someone on the hiring committee, definitely point those out. This usually took me 45-60 minutes per application, because not only did I want to find someone, I did not want to leave anyone out. Perhaps I could have skipped spending this time, but I felt it made a difference.

One thing I did not do much of, but could have done, is personalize the type of classes I would teach as described in my teaching statement. If someone, especially someone with similar background to you, already teaches a class very similar to a "new" one you propose, then many people outside that area would interpret that as you duplicating that person, and they would rather hire someone who could provide more breadth to the department. Rather, what I should have done is to try to identify which types of classes were missing from the university and adapt my teaching statement accordingly.

I've had advice saying I should prepare a few classes I would develop in my statement, and then choose which one to submit based on what the need was in the department.


Finally, the most important advice is to not rush the application process. I tried to do 1-2 applications every night for about 3 weeks (I submitted a lot of applications). There were two reasons for this: first, it took me about an hour per application (some customization, but sometimes because of stupid forms); second, if you do too many in a row, you get burnt out and start cutting corners.
Treat every application as if it is your top choice. Every time I applied, I convinced myself that this was a great place for me, and got into that mind set. For a few places where I just could not convince myself of that, I ended up not applying. I figured it wasn't worth anyone's time for me to submit a half-hearted application.




One final note. Although I intended for this series to be mainly about applying for faculty jobs, most of the advice carries over for research labs. I knew I wanted to apply for faculty positions, but I also applied to some research lab positions, and as it turned out, I almost went to one instead of accepting the Utah position. It seems each lab is somewhat different, and you might be pleasantly surprised if you get a chance to visit.

Most labs have some sort of hiring every year. But you generally need to know someone (or your advisor/mentor does) to get your foot in the door. Internships are a great way to meet these people and get the proper perspective. So, again, ask colleagues at labs about hiring, and go through them to apply.

Saturday, June 04, 2011

Applying for Jobs: Application Material

This post is written by Jeff Phillips, frequent guest blogger, who will be starting as an Assistant Professor at the University of Utah (yay!) starting Fall 2011. Content edited lightly for formatting only.



Several people have recently asked me about applying for jobs, so I thought I would write a post (or a series of posts) about it. Other posts will be on:
  • Sending out Applications
  • Phone Interviews
  • On-Site Interviews
  • Job Talks
You may be saying: isn't this the wrong time to give advice about applying for jobs? My first piece of advice is: Now is the time to start planning your job search.

If you are planning to apply next year, then put together an application now. You will need a research statement, a teaching statement, a CV, and a list of 4-5 people to write letters.

The Research Statement should accomplish three things:
  1. Show you have breadth to your research.
  2. Show you have depth to your research.
  3. Show your research has potential to get funded in the immediate and long term future.
I tried to structure mine around several subareas I have worked on, describing one project in each area in a bit more depth (2-3 sentences' worth). Then I had a section at the end outlining some future directions.

I felt it was also important to try to shape this into a single research story. This meant I left out some elements of my work. Don't try to fit everything in if it does not fit the best cohesive story. No one will have time to read through all the details of what you did, and if they do, they can find them on your CV or webpage.

Also, this needs to be written at a level that is accessible to a general person in computer science -- not an expert in your area. Too much technical detail will not help; the committee does not have time to try to understand it. Describe the motivation and the key technical insights you added to the field.

The Teaching Statement is secondary. What I have heard is that it will not affect how your overall application is judged at many research institutions, and that their best indication of how good a teacher you are will be how well you execute your job talk.

The best advice I've been given on the teaching statement is that it should support your research statement. So, discuss how you will teach about your own contributions to the field. Describe a new class or two you will teach that builds on what you talk about in your research statement. Assume a committee member might spend an extra 2-3 minutes reading this document. So don't waste that time on what classes you have taught; rather, use this as another opportunity to demonstrate the importance of, and excitement about, your research direction.

Your CV should list everything you can think of. There is rarely a space limitation on this document, so go wild. List all papers, talks, awards of any kind, and service. But make sure the first page highlights the most important things you want the committee to notice. And if you want them to notice it, then by all means make it bold. And of course, use lists, structure, and formatting nicely so they can easily find what they are looking for. They may spend 1-2 minutes (or less) on this document, so make sure it's very easy to navigate.

I also numbered my papers so I could easily refer to them in my research statement without repeating that information. Think of this all as a single document, and don't repeat information.

You will need 4-5 Letters of Reference, and these can be (ed: and usually are) the most important part of your application. Most places ask for 3 letters, and some 4. But if they allow it, as long as each letter is strong and tells a different story about you, extra references don't hurt. Also, depending on the job description, I sometimes added a 5th letter writer who might have more connections at that particular place. You don't need to ask these people now, but you should start thinking about who would best represent you.

One person will almost always be your advisor, unless you have a unique situation. The best letter writers, I've been told, are well-respected people in your field, who you have not written papers with, and who are not at your university. But only ask for letters from people who can say meaningful things about you.



OK, so why should you be doing all of this now? Wouldn't it be more up-to-date if you prepared this material right before you applied? Yes! But getting these documents right will take many iterations. If they are done ahead of time, you can ask your very busy advisor for feedback and be ok if it takes her/him 1-2 months. If you have other friends who have recently been on the market, ask them too, and ask for their research documents as examples. Ask professors in your department for advice; they are usually happy to give it, and it will always be different.

You don't need to ask for letter writers now, but it might be good to sit down with your advisor with an extended possible list, and plan together who would be good to ask. If there is some communication or project with someone well-known in your area who might be a good letter writer, you still have half a year to make a good impression. If you have them in mind as a possible letter writer, then make an effort to respond promptly to them when you have interactions, and try to say hi and talk to them when you see them at conferences.

When you have written your research statement now, look for holes in it. What is one more project you can complete or make significant progress on that would most enhance the statement ? I would argue that one project that fills out your research story and introduces new ideas or connects to a nice hot area is much better than two papers submitted to a journal, or an extension to a project that makes your thesis more complete but does not introduce a new idea to the field. Save this more incremental work for when you are stressed out waiting to hear back on interviews (or for future students to help them get into your area :) ).

At this stage last year I basically planned out several projects I would work on, with the aim of completing them in time to talk about them in my application. I had not necessarily published them when I applied, but I had the results. Between this time last year and when I submitted my applications, any project that was not directly influencing the main story was made secondary.


One final thing to keep in mind is that application deadlines vary greatly. Some will be due in September or October. Some not until February. A few slots will not open until March or even April. So, if you are prepared early, you will make some early deadlines that others will not.

Sunday, May 15, 2011

Weekend cstheory review.

I haven't done a cstheory review in a while. If you're not there surfing around (and possibly getting addicted!), then you're missing out on being part of a group of more than 4000 people who've asked and answered more than 1800 questions thus far.

Some high-profile interesting questions in the recent past:

NSF Workshop: 8F (Algorithms in the field)

I'm off to DIMACS for the NSF workshop 8F ("Algorithms in the field", AITF, get it?):
Computer Science (CS) is rapidly evolving. We need to understand the evolving CS systems and their algorithmic foundations. This needs close collaboration between the researchers in algorithmic foundations and expert researchers in various areas. The premise is that researchers from different communities should work jointly “in the field”, constantly informing each other, inventing in their respective areas, and forging systems that are simultaneously validated by both algorithmic and systems communities (and without needing to make independent research threads meet ex post).

The purpose of this workshop is to provide a working vision for examples of Algorithms in the Field, near term goals and directions for research in the future. The outcome of this workshop will be a report contributed by the participants that will inform the academic community and future funding programs. Scientifically, we hope bringing people with widely varied research interest together with algorithms researchers will lead to happy collisions, and unexpected directions of research.
There's a great mix of researchers from within theory and from more applied areas, and of course the topic is near and dear. Along with a long list of talks, the idea is to have a number of breakout sessions on different topics in this area. One that I'm involved with is 'Core Algorithms', whose mandate (roughly speaking) is to figure out how basic algorithms research might evolve over the next many years, in relation to events in the larger arena of computing research and technology.

Of course it's a mug's game to make predictions about the future of technology, or worse still, about the future of research. Unlike many other subfields of CS, theoryCS doesn't really lend itself to long-term prognostications or discussions of the 'this is what we should work on' variety.

But from a funding perspective, if we can identify interesting directions to explore even if we look completely stupid in retrospect, this could be a fun exercise. One framing device that we might try to use is:
View core algorithmic developments in two directions. In one direction, there are many concepts from the world of theoryCS that slowly make their way into the more applied realms. In this context, one might ask which paradigms are ripe for "tech transfer", which are likely to get there within a few years, and which need more 'baking' before they're ready, but have potential.

    The reverse direction is the idea that core algorithmic questions and models are motivated by applications, whether they be models of computation like external memory/streaming/multicore/mapreduce, or conceptual developments like smoothed analysis, complexity classes for iterative algorithms, numerical algorithms and precise computations, and so on.

    In this context, interesting points for discussion would be: what kinds of paradigms do we see approaching from "applied land" and how might they influence the kinds of core algorithmic questions we ponder.
 If you're at the workshop, you can make your opinions known. But even if you're not, you can voice them right here ! I'll incorporate comments posted here into the discussions (and sign your name if you'd like credit). 

Sunday, May 01, 2011

SDM 2011: Other thoughts

My student Parasaran Raman did a two part review of SDM 2011 (here, and here), and I thought I'd add my own reflections.

There was an excellent talk on the second day by Haesun Park from Georgia Tech on non-negative matrix factorization methods (NMF). This problem has become all the rage in ML circles, and is defined as follows.
Given an $m \times n$ matrix $A$ and a parameter $k$, find an $m \times k$ matrix $U$ and a $k \times n$ matrix $V$, both with nonnegative entries, such that $\|A - UV\|_F$ is minimized.
The main difference between this problem and the SVD is the non-negativity requirement, and not surprisingly, it changes the complexity quite a bit: the problem is NP-hard in general, so unlike the SVD you can no longer compute an optimal solution efficiently.

There appears to be relatively little theory-algorithms work on this problem (there's some mathematical work, including the famous Perron-Frobenius theorem), and her talk presented a good overview of the various heuristic strategies people have used to try and solve the problem.
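For concreteness, here is the classic Lee-Seung multiplicative-update heuristic, probably the simplest member of this family. This is a generic sketch of that textbook method, not the algorithm from her talk, and like all the heuristics it only converges to a local optimum.

```python
import numpy as np

def nmf_multiplicative(A, k, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for min ||A - UV||_F with U, V >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    U = rng.random((m, k))
    V = rng.random((k, n))
    for _ in range(iters):
        # Each update never increases the objective; eps guards against
        # division by zero when entries shrink to zero.
        V *= (U.T @ A) / (U.T @ U @ V + eps)
        U *= (A @ V.T) / (U @ V @ V.T + eps)
    return U, V

# Example: approximate rank-5 nonnegative factorization of a random matrix.
A = np.abs(np.random.default_rng(1).normal(size=(50, 30)))
U, V = nmf_multiplicative(A, k=5)
print(np.linalg.norm(A - U @ V) / np.linalg.norm(A))
```

Different random starts can land in quite different local optima, which is part of why the "unreasonable effectiveness" she described is still poorly understood.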

One of the reasons this formulation is interesting is that, for many problems, you'd like an "explanation" of the data that doesn't have negative coefficients (negative coefficients being the bane of PCA-like methods). She also says that, for reasons unknown, the matrices produced by this formulation tend to be quite useful "in practice" at factoring out the different interactions present in the data. In fact, one of her big open questions is whether there's a more rigorous way of explaining this.

The talk was very enjoyable, and I sat there thinking that it would be perfect as an ALENEX (or even SODA) invited talk.

There was also a panel discussion on the future of data mining. The panelists comprised two industry folks, two academics, and one researcher from a national lab, with questions being thrown at them by Chandrika Kamath from LLNL. My twitter stream gave a blow-by-blow, so I won't rehash it here. I was intrigued (and a little disappointed) when I realized that almost all the answers centered around the needs of business.

In one respect this is not surprising: the primary reason data mining is so hot right now is the vast opportunity to use it to sell ads, sell products, or just model consumer behavior. But it makes the field a little more shallow as a consequence: inevitably, industry needs solutions rather than deep problems, and a relentless focus on solutions doesn't facilitate deeper analysis of the problems themselves. Maybe that's the nature of data mining, and I'm expecting too much in asking it to be more profound. But I wouldn't mind a more ML-like sense of rigor in the formulation of computational questions.

Overall, I quite enjoyed the conference. I've been to KDD, and have been on PCs for both KDD and ICDM (the third major data mining conference). To me, SDM has the feel of a SODA/SoCG-like conference: a smaller, more academic crowd, more mathematics and statistics in the talks, and less of a big industrial presence. I can definitely see myself publishing there and going back again.

Saturday, April 30, 2011

SDM 2011: Day 2

This is part II of Parasaran Raman's conference report from SDM 2011. His earlier report is here.

I was anxious about my first talk at a conference. It did help that I was scheduled to be the second speaker in the morning; that is when you are fresh with all the caffeine running through your system. I thought my talk went well, thanks to the many practice sessions my advisor and other people in my lab coerced me into doing! I talked to a few meta-clustering researchers here and they seemed to buy/appreciate our idea of a new spatially-aware metric and consensus procedure that we proposed in our paper.

There were a couple of other interesting talks in the clustering session today. Geng Li introduced a new method called ABACUS to mine arbitrarily shaped clusters in the data. They essentially collapse points until a 'spine' of each cluster emerges; after this, since they only have to deal with a smaller set of points, the technique is scalable and can work with large datasets. And yeah, as expected, the Swiss roll dataset popped out!

Claudia Plant talked about a new approach to robust correlation clustering called SONAR. Her technique works in 3 steps: collect signals by probing the data with different primitive clusters, extract response patterns, and identify true clusters with techniques like Minimum Description Length. It was a nice talk and I made a mental note to read this paper more carefully.

Next up was the best student paper talk, by Pu Wang, on Nonparametric Bayesian Co-clustering Ensembles. This is the most recent work in a series of clustering-ensembles papers by Wang, Domeniconi and Laskey. The paper tries to address the issues of clustering bias, parameter setting, and the curse of dimensionality through clustering ensembles and co-clustering. This line of research looks interesting, at least in terms of the types of issues they are trying to solve.

Later at lunch, I met a few data mining people interested in general meta aspects of clustering like subspace clustering, generating alternate clusterings, and computing consensus solutions. Thomas Seidl, Stephan Günnemann and Ines Färber from RWTH Aachen University, and Emmanuel Müller, who is currently at Karlsruhe Institute of Technology, are giving a tutorial tomorrow on "Discovering Multiple Clustering Solutions". We had a very interesting conversation about the challenges facing researchers who ask these meta questions about clustering, including my favorite: "How do we obtain/generate data that will admit multiple good partitions?". These folks are also organizing the second MultiClust workshop, in conjunction with ECML-PKDD in Athens; the first MultiClust workshop was held at KDD last year. So, let's get down to "Clustering the clusters in the clusterings!".

Friday, April 29, 2011

SDM 2011: Day 1

My student Parasaran Raman and I are at the SIAM Conference on Data Mining (SDM) 2011, where he's presenting this paper on "spatially aware" consensus. In between preparing for his very first conference talk (yay!) he had enough time to jot down his thoughts on day 1. As usual, I've done light editing for formatting, and take responsibility for all errors in transmission. Note that SIAM hasn't yet placed the papers online; I'll add links when they become available.


Collecting real data has always been a real problem for me. Running experiments on synthetically generated data and shouting "bingo!" when the meta-clustering methods I work on do well on them was getting too boring. SDM11 had a tutorial on "Data Mining for Healthcare Management", and I was interested to see what kind of healthcare-related data I could get to play with, especially because almost all of the papers that use consensus clustering techniques on real data are motivated by the analysis of either microarray gene expression or protein sequences. Prasanna Desikan of Allina Hospitals and Clinics talked about how terabytes of data generated by multiple sources drive the need for efficient knowledge-based systems. It was heartening to see many publicly available datasets that require mining. The key challenges from a strictly CS perspective are the usual ones: (1) dealing with missing data/attributes and (2) outliers and noisy data, both of which are even more pronounced in the healthcare sector.

I also sat through a few talks in the classification session. Byron Wallace talked about involving multiple experts in active learning, and about "who should label what?". The talk focused on how a single infallible oracle is impractical and stressed the need for multiple experts. The key idea was to save the hard questions, which are usually few, for the experienced (and expensive) labelers, and to let the inexperienced labelers handle the bulk of the easier questions. An interesting question that came up was about the feasibility of partitioning the question set into easy and hard.

Sanjay Chawla talked about supervised learning for skewed data, where classes of very different sizes cause the separating hyperplane to be pushed towards the class with fewer points (ed: the imbalanced classification problem). The proposed solution is to use a quadratic mean loss instead of the usual arithmetic mean. The resulting loss function is convex and can be minimized by standard convex solvers.
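(ed: here is a rough sketch of how I read the idea, with a logistic loss and a linear classifier standing in for whatever the paper actually uses, so the details below are my assumptions rather than theirs: compute the average loss separately on each class and minimize the quadratic mean of the two, so the minority class cannot be drowned out.)

```python
import numpy as np
from scipy.optimize import minimize

def per_example_logistic_loss(w, X, y):
    # log(1 + exp(-y * <w, x>)) for labels y in {-1, +1}, computed stably.
    return np.logaddexp(0.0, -y * (X @ w))

def quadratic_mean_loss(w, X, y):
    # Quadratic mean of the two per-class average losses, instead of a
    # single arithmetic mean over all examples.
    losses = per_example_logistic_loss(w, X, y)
    l_pos = losses[y == 1].mean()
    l_neg = losses[y == -1].mean()
    return np.sqrt((l_pos ** 2 + l_neg ** 2) / 2.0)

# Toy imbalanced problem: 500 negatives vs. 20 positives in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(500, 2)),
               rng.normal(+1.0, 1.0, size=(20, 2))])
y = np.array([-1] * 500 + [1] * 20)

result = minimize(quadratic_mean_loss, x0=np.zeros(2), args=(X, y))
print("learned weights:", result.x)
```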

Liang Sun spoke about a boosting framework for learning the optimal combination of individual base kernel classifiers. The algorithm goes through multiple boosting rounds; during each round, one classifier is learned for each kernel. The 'multi-kernel' classifier for each stage of the boosting process is a function of these input classifiers, and their choice is simply to pick the best one. The boosting process itself is otherwise similar to AdaBoost.
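(ed: a sketch of how I understood the scheme, using SVMs with a few standard kernels as the base classifiers; the choice of base learner and the reweighting details are my guesses, not the paper's.)

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

def multi_kernel_boost(X, y, kernels=("linear", "rbf", "poly"), rounds=10):
    """AdaBoost-style loop: each round trains one base classifier per kernel
    on the current weights and keeps the best one.  Labels y are in {0, 1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(rounds):
        # One candidate per kernel; keep the one with the smallest weighted
        # error (the "pick the best" combination rule).
        candidates = [SVC(kernel=k).fit(X, y, sample_weight=w) for k in kernels]
        errors = [np.sum(w * (c.predict(X) != y)) for c in candidates]
        best = int(np.argmin(errors))
        clf, err = candidates[best], np.clip(errors[best], 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        ensemble.append((alpha, clf))
        # Standard AdaBoost reweighting: upweight mistakes, downweight hits.
        w *= np.exp(np.where(clf.predict(X) != y, alpha, -alpha))
        w /= w.sum()
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(c.predict(X) == 1, 1.0, -1.0) for a, c in ensemble)
    return (score >= 0).astype(int)

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
ens = multi_kernel_boost(X, y)
print("training accuracy:", (predict(ens, X) == y).mean())
```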

There was also a lightning session and, true to its name, there were 25 talks in 75 minutes! I could relate to a meta-clustering work by Yi Zhang and Tao Li where the goal was to find a k-consensus solution from a collection of input partitions of the same data. The method works in four steps: (1) generate multiple input partitions, (2) use the Mallows distance (Zhou et al.) to compute distances between partitions, (3) build a hierarchy over the partitions and cut it to get a 'clustering' of partitions, and (4) generate a consensus partition for each 'cluster' from the cut. I have found it difficult to find data that admits multiple partitions that are good and yet different from each other, which made me ask "How do you generate the different partitions?". The answer, sadly, was "We run kmeans 38 times!". I am still looking for a motivation for multiple alternate partitions.
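(ed: the four-step pipeline is easy to prototype. In the sketch below, 1 - adjusted Rand index stands in for the Mallows distance and a simple co-association vote stands in for the paper's consensus step; both substitutions are mine.)

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Step 1: generate input partitions (here, k-means from different seeds,
# which is exactly the practice being complained about above).
partitions = [KMeans(n_clusters=4, n_init=1, random_state=s).fit_predict(X)
              for s in range(10)]

# Step 2: pairwise distances between partitions.
m = len(partitions)
D = np.zeros((m, m))
for i in range(m):
    for j in range(i + 1, m):
        D[i, j] = D[j, i] = 1.0 - adjusted_rand_score(partitions[i], partitions[j])

# Step 3: build a hierarchy over the partitions and cut it into two groups.
groups = fcluster(linkage(squareform(D), method="average"), t=2, criterion="maxclust")

# Step 4: a consensus partition per group via a co-association vote: two
# points end up together if they co-cluster in most of the group's partitions.
for g in sorted(set(groups)):
    members = [p for p, lab in zip(partitions, groups) if lab == g]
    coassoc = np.mean([np.equal.outer(p, p) for p in members], axis=0)
    _, consensus = connected_components(csr_matrix(coassoc > 0.5))
    print(f"group {g}: {len(members)} partitions, consensus cluster sizes",
          np.bincount(consensus))
```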

The reception had just cheese and a few veggies, but I have been having some good food here in Phoenix and I am not complaining.

Tuesday, April 19, 2011

POTD: Witnessed k-distance

Distance function to a compact set plays a central role in several areas of computational geometry. Methods that rely on it are robust to the perturbations of the data by the Hausdorff noise, but fail in the presence of outliers. The recently introduced distance to a measure offers a solution by extending the distance function framework to reasoning about the geometry of probability measures, while maintaining theoretical guarantees about the quality of the inferred information. A combinatorial explosion hinders working with distance to a measure as an ordinary (power) distance function. In this paper, we analyze an approximation scheme that keeps the representation linear in the size of the input, while maintaining the guarantees on the inference quality close to those for the exact (but costly) representation.

Notes:
The idea of defining distances between (or to) shapes robustly by replacing the shape by a distribution is near to my heart (see this paper for more). The authors provide an algorithmic perspective on a formulation first proposed by Chazal, Cohen-Steiner and Mérigot. The idea is that instead of constructing a shape from a point cloud by growing balls of fixed radius and measuring the distance to this union of balls, one constructs a measure by growing balls out to a fixed measure and then defines a distance to it. This distance has the nice property of being Lipschitz-related to the EMD, making it stable. In a manner reminiscent of the Aurenhammer et al. work, they relate this distance to a power diagram construction, and then design efficient approximations for it (because the exact version involves terms exponential in the thresholding parameter).

It seems to me that they're going to a lot of work to recover from what I think is a tactical problem: fixing a sharp threshold for the ball measure. It might be interesting to explore alternate ways of defining the measure that are "smoother" - maybe using a kernel instead of a hard measure boundary. It might yield an alternate measure that serves the same purpose but is easier to compute.
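To make the object concrete: for the uniform empirical measure on $n$ sample points with mass parameter $m_0$, the distance to the measure at a query $x$ is just the root-mean-squared distance from $x$ to its $k = \lceil m_0 n \rceil$ nearest samples. Here is a toy sketch of that quantity (mine, not code from the paper); the kernel idea above would replace the sharp top-$k$ cutoff inside this average with smooth weights.

```python
import numpy as np
from scipy.spatial import cKDTree

def distance_to_measure(queries, points, m0=0.05):
    """Empirical distance to a measure: root-mean-squared distance from each
    query point to its k = ceil(m0 * n) nearest sample points (the
    sharp-threshold version discussed above)."""
    k = max(1, int(np.ceil(m0 * len(points))))
    dists, _ = cKDTree(points).query(queries, k=k)
    dists = np.asarray(dists).reshape(len(queries), k)
    return np.sqrt(np.mean(dists ** 2, axis=1))

# Samples on a noisy circle plus one far-away outlier.  The ordinary distance
# function to the point set is near zero at the outlier, making it look like
# part of the shape; the distance to the measure stays large there.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
pts = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))
pts = np.vstack([pts, [[5.0, 5.0]]])               # the outlier

queries = np.array([[1.0, 0.0], [5.0, 5.0]])
print(distance_to_measure(queries, pts, m0=0.05))  # small on the circle, large at the outlier
```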

Sunday, April 10, 2011

POTD: A unified approach to Approximate Proximity searching

A Unified approach to Approximate Proximity Searching
Sunil Arya, Guilherme D. da Fonseca, and David M. Mount

Abstract:
The inability to answer proximity queries efficiently for spaces of dimension d > 2 has led to the study of approximation to proximity problems. Several techniques have been proposed to address different approximate proximity problems. In this paper, we present a new and unified approach to proximity searching, which provides efficient solutions for several problems: spherical range queries, idempotent spherical range queries, spherical emptiness queries, and nearest neighbor queries. In contrast to previous data structures, our approach is simple and easy to analyze, providing a clear picture of how to exploit the particular characteristics of each of these problems. As applications of our approach, we provide simple and practical data structures that match the best previous results up to logarithmic factors, as well as advanced data structures that improve over the best previous results for all aforementioned proximity problems.
Notes:
When a problem becomes interesting, papers get written quickly, and techniques start pouring out of the firehose. Not all tricks are needed, and not all machinery is effective, but cleanup only comes later, once the frenzy has settled. Approximate nearest neighbor research is like this: there are many tricks, and lots of machinery, but there are also some core ideas that keep showing up, and are sufficient for many variants of the problem.

Near-neighbor searching in low dimension is "easy" if you're given data that's uniformly sampled. Simple divide-and-conquer will give balanced search trees, and thus low query times. The problem comes when you have regions of sparsity. Informally, they prevent you from making progress as you divide the space up, and so the root-to-leaf length increases, increasing query time.

While the ANN problem in low dimensions is "solved", in that there's a fairly good theoretical understanding of the bounds and tradeoffs for query time vs. storage, the methods themselves are quite complex. I learnt this first-hand when I tried to reverse-engineer, for a different setting, the Har-Peled method that uses approximate Voronoi diagrams for near-neighbor search.

The beauty of the POTD is that it starts with a very simple data structure: the compressed quadtree. It shows that this structure can be used to isolate "sparse" and "dense" regions of space, and uses a hybrid strategy for processing them, with coresets to reduce size in dense regions and optimized data structures for sparse regions (which necessarily contain only a few points). While the paper itself has no experimental results, I'd imagine that this approach would lend itself far more easily to experimentation.
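To see why uneven data is the troublesome case, here's a toy plain quadtree. It is emphatically not the compressed quadtree of the paper (which also shortcuts chains of useless splits), but it shows how a tight cluster sitting in mostly empty space blows up the tree depth in a way uniform data never does.

```python
import random

def build_quadtree(points, box, leaf_size=8, depth=0, max_depth=32):
    """A plain (uncompressed) point quadtree: recurse on a cell until it
    holds at most leaf_size points.  Toy illustration only; the compressed
    quadtree also shortcuts long chains of single-child cells."""
    x0, y0, x1, y1 = box
    if len(points) <= leaf_size or depth >= max_depth:
        return {"leaf": True, "points": points, "box": box}
    xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    children = []
    for cell in [(x0, y0, xm, ym), (xm, y0, x1, ym),
                 (x0, ym, xm, y1), (xm, ym, x1, y1)]:
        cx0, cy0, cx1, cy1 = cell
        sub = [(x, y) for (x, y) in points if cx0 <= x < cx1 and cy0 <= y < cy1]
        children.append(build_quadtree(sub, cell, leaf_size, depth + 1, max_depth))
    return {"leaf": False, "children": children, "box": box}

def tree_depth(node):
    if node["leaf"]:
        return 0
    return 1 + max(tree_depth(c) for c in node["children"])

random.seed(0)
# Uniform data: the tree stays shallow and balanced.
uniform = [(random.random(), random.random()) for _ in range(2000)]
# A tight cluster in empty space: many splits that barely make progress,
# which is exactly what a compressed quadtree is designed to shortcut.
clustered = ([(0.5 + random.gauss(0, 1e-4), 0.5 + random.gauss(0, 1e-4))
              for _ in range(1000)] +
             [(random.random(), random.random()) for _ in range(1000)])
print(tree_depth(build_quadtree(uniform, (0.0, 0.0, 1.0, 1.0))))
print(tree_depth(build_quadtree(clustered, (0.0, 0.0, 1.0, 1.0))))
```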

Thursday, April 07, 2011

MAD ALGOrithms

It's time for some large data sets ! Once again, MADALGO is organizing MASSIVE, the workshop on massive data algorithmics. As usual, it will be colocated with SoCG in Paris (actually the day after, on the 16th of June). In case you haven't noticed, big-data is all the rage now !

Deadline is Apr 27, so get those 10 page submissions in, and see you in Paris !
