Saturday, April 30, 2011

SDM 2011: Day 2

This is part II of Parasaran Raman's conference report from SDM 2011. His earlier report is here.

I was anxious about my first talk at a conference. It did help that I was scheduled to be the second speaker in the morning; that is when you are fresh, with all the caffeine running through your system. I thought my talk went well, thanks to the many practice sessions my advisor and other people in my lab coerced me into doing! I talked to a few meta-clustering researchers here, and they seemed to buy/appreciate the new spatially-aware metric and consensus procedure that we proposed in our paper.

There were a couple of other interesting talks in the clustering session today. Geng Li introduced a new method called ABACUS to mine arbitrarily shaped clusters in the data. They essentially collapse points till a spine of each cluster emerges. After this, since they only have to deal with a smaller set of points, their technique is scalable and can work with large datasets. And yeah, as expected, the Swiss roll dataset popped out!

Claudia Plant talked about a new approach to robust correlation clustering called SONAR. Her technique works in 3 steps: collect signals by probing the data with different primitive clusters, extract response patterns, and identify true clusters with techniques like Minimum Description Length. It was a nice talk and I made a mental note to read this paper more carefully.

Next up was the best student paper talk by Pu Wang on Nonparametric Bayesian Co-clustering Ensembles. This is the most recent work in a series of clustering-ensembles papers by Wang, Domeniconi and Laskey. This paper tries to address the issues of clustering bias, parameter setting and the curse of dimensionality through clustering ensembles and co-clustering. This line of research looks interesting, at least in terms of the issues it is trying to solve.

Later at lunch, I met a few data mining people interested in general meta aspects of clustering like subspace clustering, generating alternate clusterings, and computing consensus solutions. Thomas Seidl, Stephan Günnemann and Ines Färber from RWTH Aachen University, and Emmanuel Müller, who is currently at Karlsruhe Institute of Technology, are giving a tutorial tomorrow on "Discovering Multiple Clustering Solutions". We had a very interesting conversation about the challenges faced by researchers who ask these meta questions about clustering, including my favorite: "How do we obtain/generate data that will admit multiple good partitions?". These folks are also organizing the second MultiClust workshop, in conjunction with ECML-PKDD in Athens. The first MultiClust workshop happened at KDD last year. So, let's get down to "Clustering the clusters in the clusterings!".

Friday, April 29, 2011

SDM 2011: Day 1

My student Parasaran Raman and I are at the SIAM Conference on Data Mining (SDM) 2011, where he's presenting this paper on "spatially aware" consensus. In between preparing for his very first conference talk (yay!) he had enough time to jot down his thoughts on day 1. As usual, I've done light editing for formatting, and take responsibility for all errors in transmission. Note that SIAM hasn't yet placed the papers online; I'll add links when they become available.


Collecting real data has always been a real problem for me. Running experiments on synthetically generated data and shouting "bingo!" after getting good results on the meta-clustering problems that I work on was getting too boring. SDM11 had a tutorial on "Data Mining for Healthcare management" and I was interested to see what kind of healthcare-related data I could get to play with, especially because almost all of the papers that use consensus clustering techniques on real data are motivated by the analysis of either microarray gene expression data or protein sequences. Prasanna Desikan of Allina Hospitals and Clinics talked about how terabytes of data generated by multiple sources drive the need for efficient knowledge-based systems. It was heartening to see many publicly available datasets that require mining. The key challenges from a strictly CS perspective are the usual ones: (1) dealing with missing data/attributes and (2) handling outliers and noisy data, both of which are even more pronounced in the healthcare sector.

I also sat through a few talks in the classification session. Byron Wallace talked about involving multiple experts in active learning, and about "who should label what?". The talk focused on how a single infallible oracle is impractical and stressed the need for multiple experts. The key idea was to save the hard questions, which are usually few, for the experienced (and expensive) labelers, and to let the inexperienced labelers carry the bulk of the load when the questions are rather easy. An interesting question that came up was about the feasibility of partitioning the question set into easy and hard.

Sanjay Chawla talked about supervised learning for skewed data, where classes of different sizes cause the hyperplane separating the classes to be pushed towards the class with fewer points (ed: the imbalanced classification problem). The proposed solution was to use a quadratic mean loss instead of just the arithmetic mean. The loss function remains convex and can be minimized by the usual convex solvers.
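For concreteness, here's a rough sketch of what I took the idea to be (my reading of the talk, not the paper's exact formulation): average the loss within each class first, and then combine the two class-level losses with a quadratic mean, so that the minority class isn't drowned out.

```python
import numpy as np

def quadratic_mean_hinge_loss(w, b, X, y):
    """Quadratic mean of the per-class average hinge losses (y in {-1, +1}).
    A sketch of the idea as described in the talk, not the paper's exact loss."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    loss_pos = hinge[y == +1].mean()   # average loss on the + class
    loss_neg = hinge[y == -1].mean()   # average loss on the - class
    return np.sqrt((loss_pos ** 2 + loss_neg ** 2) / 2.0)

def mean_hinge_loss(w, b, X, y):
    """The usual arithmetic-mean hinge loss, for comparison."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b)).mean()
```

Since the quadratic mean of two nonnegative convex functions is itself convex, the modified objective can still be handed to a standard convex solver.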

Liang Sun spoke about a boosting framework to learn the optimal combination of individual base kernel classifiers. The algorithm goes through multiple boosting trials. During each trial, one classifier is learned for each kernel. The 'multi-kernel' classifier for each stage in the boosting process is a function of these input classifiers; their choice is to pick the best one. The boosting process itself is otherwise similar to AdaBoost.
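Here's how I understood the scheme, modulo details that are my guesses and not the paper's (the base learner and the exact weighting in particular): run AdaBoost, but in each round train one base classifier per kernel on the weighted sample and keep the one with the lowest weighted error.

```python
import numpy as np
from sklearn.svm import SVC

def multikernel_boost(X, y, kernels=("linear", "rbf", "poly"), T=20):
    """AdaBoost-style loop where each round fits one SVC per kernel on the
    weighted data and keeps the best one.  y must be in {-1, +1}.  This is a
    guess at the flavor of the method, not a reimplementation of the paper."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    hs, alphas = [], []
    for _ in range(T):
        cands = [SVC(kernel=k).fit(X, y, sample_weight=w) for k in kernels]
        errs = [np.sum(w * (c.predict(X) != y)) for c in cands]
        h, err = cands[int(np.argmin(errs))], np.clip(min(errs), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # usual AdaBoost weight
        w *= np.exp(-alpha * y * h.predict(X))
        w /= w.sum()
        hs.append(h)
        alphas.append(alpha)
    return lambda Xn: np.sign(sum(a * h.predict(Xn) for a, h in zip(alphas, hs)))
```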

There was also a lightning session and, true to its name, there were 25 talks in 75 minutes! I could relate to a meta-clustering work by Yi Zhang and Tao Li where the goal was to find a k-consensus solution from a bunch of input partitions of the same data. The method works in four steps: (1) generate multiple input partitions, (2) use the Mallows distance (Zhou et al.) to compute distances between partitions, (3) build a hierarchy over the partitions and make a cut to get a 'clustering' of partitions, and (4) generate a consensus partition on each 'cluster' from the cut. I have found it difficult to find data that admits multiple partitions that are good but at the same time different from each other, and that made me ask the question "How do you generate the different partitions?". The answer, sadly, was "We run kmeans 38 times!". I am still looking for a motivation for multiple alternate partitions.
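Since the pipeline is easy to describe, here is a toy sketch of the four steps, with stand-ins for the pieces I don't have: (1 - adjusted Rand index) in place of the Mallows distance, and a simple co-association consensus within each group. It is meant only to convey the shape of the method, not to reproduce the authors' results.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_the_clusterings(X, n_inputs=38, k=5, n_groups=3, seed=0):
    """Toy version of the four-step pipeline (toy-sized data only)."""
    rng = np.random.RandomState(seed)
    # (1) generate input partitions ("we run kmeans 38 times!")
    parts = [KMeans(k, n_init=1, random_state=rng.randint(1 << 30)).fit_predict(X)
             for _ in range(n_inputs)]
    # (2) pairwise distances between partitions (stand-in for the Mallows distance)
    D = np.array([[1.0 - adjusted_rand_score(p, q) for q in parts] for p in parts])
    np.fill_diagonal(D, 0.0)
    # (3) build a hierarchy over the partitions and cut it into groups
    groups = fcluster(linkage(squareform(D, checks=False), "average"),
                      t=n_groups, criterion="maxclust")
    # (4) one consensus partition per group, via the average co-association matrix
    consensus = []
    for g in range(1, n_groups + 1):
        members = [parts[i] for i in range(n_inputs) if groups[i] == g]
        if not members:
            continue
        coassoc = np.mean([p[:, None] == p[None, :] for p in members], axis=0)
        consensus.append(fcluster(linkage(squareform(1.0 - coassoc, checks=False),
                                          "average"), t=k, criterion="maxclust"))
    return consensus
```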

The reception had just cheese and a few veggies, but I have been having some good food here in Phoenix and I am not complaining.

Tuesday, April 19, 2011

POTD: Witnessed k-distance

Distance function to a compact set plays a central role in several areas of computational geometry. Methods that rely on it are robust to the perturbations of the data by the Hausdorff noise, but fail in the presence of outliers. The recently introduced distance to a measure offers a solution by extending the distance function framework to reasoning about the geometry of probability measures, while maintaining theoretical guarantees about the quality of the inferred information. A combinatorial explosion hinders working with distance to a measure as an ordinary (power) distance function. In this paper, we analyze an approximation scheme that keeps the representation linear in the size of the input, while maintaining the guarantees on the inference quality close to those for the exact (but costly) representation.

Notes:
The idea of defining distances between (or to) shapes robustly by replacing the shape by a distribution is near to my heart (see this paper for more). The authors provide an algorithmic perspective on a formulation first proposed by Chazal, Cohen-Steiner and Mérigot. The idea is that instead of constructing a shape from a point cloud by growing balls of fixed radius and measuring the distance to this union of balls, one constructs a measure by growing balls out to a fixed measure, and then defines a distance. The distance measure has the nice property of being Lipschitz-related to the EMD, making it stable. In a manner reminiscent of the Aurenhammer et al. work, they relate this distance to a power diagram construction, and then design efficient approximations for it (because the exact version involves terms exponential in the thresholding parameter).
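To make the object concrete: with the uniform empirical measure on $n$ points and mass parameter $m = k/n$, the distance to the measure at a query point is (as I recall the definition; check the paper for the exact normalization) the root-mean-square distance from the query to its $k$ nearest data points. Here's a minimal sketch of that exact quantity, i.e., the thing the witnessed k-distance is designed to approximate cheaply.

```python
import numpy as np
from scipy.spatial import cKDTree

def k_distance(query_pts, data_pts, k):
    """Distance to the uniform empirical measure on data_pts at mass m = k/n:
    RMS distance from each query point to its k nearest data points.
    (Sketch from memory; this is not the paper's witnessed approximation.)"""
    dists, _ = cKDTree(data_pts).query(query_pts, k=k)
    dists = dists.reshape(len(query_pts), -1)   # fix shape when k == 1
    return np.sqrt((dists ** 2).mean(axis=1))
```

Unlike the plain distance to the point cloud (the case k = 1), this barely moves when a few outliers are added, which is the robustness the whole framework is after.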

It seems to me that they're going to a lot of work to recover from what I think is a tactical problem: fixing a sharp threshold for the ball measure. It might be interesting to explore alternate ways of defining the measure that are "smoother" - maybe using a kernel instead of a hard measure boundary. It might yield an alternate measure that serves the same purpose but is easier to compute.

Sunday, April 10, 2011

POTD: A Unified Approach to Approximate Proximity Searching

A Unified Approach to Approximate Proximity Searching
Sunil Arya, Guilherme D. da Fonseca, and David M. Mount

Abstract:
The inability to answer proximity queries efficiently for spaces of dimension d > 2 has led to the study of approximation to proximity problems. Several techniques have been proposed to address different approximate proximity problems. In this paper, we present a new and unified approach to proximity searching, which provides efficient solutions for several problems: spherical range queries, idempotent spherical range queries, spherical emptiness queries, and nearest neighbor queries. In contrast to previous data structures, our approach is simple and easy to analyze, providing a clear picture of how to exploit the particular characteristics of each of these problems. As applications of our approach, we provide simple and practical data structures that match the best previous results up to logarithmic factors, as well as advanced data structures that improve over the best previous results for all aforementioned proximity problems.
Notes:
When a problem becomes interesting, papers get written quickly, and techniques start pouring out of the firehose. Not all tricks are needed, and not all machinery is effective, but cleanup only comes later, once the frenzy has settled. Approximate nearest neighbor research is like this: there are many tricks, and lots of machinery, but there are also some core ideas that keep showing up, and are sufficient for many variants of the problem.

Near-neighbor searching in low dimension is "easy" if you're given data that's uniformly sampled. Simple divide-and-conquer will give balanced search trees, and thus low query times. The problem comes when you have regions of sparsity. Informally, they prevent you from making progress as you divide the space up, and so the root-to-leaf length increases, increasing query time.

While the ANN problem in low dimensions is "solved" in that there's a fairly good theoretical understanding of the bounds and tradeoffs needed for query time vs storage, the methods themselves are quite complex. I learnt this first-hand when reading the Har-Peled paper that uses approximate Voronoi diagrams for near-neighbor search in an attempt to reverse-engineer the method for a different setting.

The beauty of the POTD is that it starts with a very simple data structure - the compressed quad tree. It shows that this structure can be used to isolate "sparse" and "dense" regions of space, and uses a hybrid strategy for processing these regions, with coresets to reduce size in dense regions, and optimized data structures for sparse regions (which necessarily contain only a few points). While the paper itself has no experimental results, I'd imagine that this approach would lend itself far more easily to experimentation.
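To get a feel for the starting point, here is a toy, uncompressed point quadtree with a best-first (1+eps)-approximate nearest neighbor search. None of this is the paper's machinery (no compression, no dense/sparse hybrid, no coresets); it's just the textbook structure the paper builds on, and all the names here are mine.

```python
import heapq
import numpy as np

class QuadNode:
    """A plain 2-d point quadtree node (not the paper's compressed quadtree)."""
    def __init__(self, pts, center, half, leaf_size=8):
        self.center, self.half, self.children, self.pts = center, half, None, pts
        if len(pts) > leaf_size and half > 1e-9:   # half-guard avoids looping on duplicates
            quad = (pts[:, 0] >= center[0]).astype(int) * 2 + (pts[:, 1] >= center[1]).astype(int)
            self.children = []
            for q in range(4):
                off = np.array([1.0 if q >= 2 else -1.0, 1.0 if q % 2 else -1.0]) * (half / 2)
                self.children.append(QuadNode(pts[quad == q], center + off, half / 2, leaf_size))
            self.pts = None

def build_quadtree(pts, leaf_size=8):
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    return QuadNode(pts, (lo + hi) / 2, (hi - lo).max() / 2 + 1e-9, leaf_size)

def box_dist(q, center, half):
    """Distance from query q to the square cell [center - half, center + half]^2."""
    return np.linalg.norm(np.maximum(np.abs(q - center) - half, 0.0))

def ann(root, q, eps=0.5):
    """Best-first (1+eps)-approximate nearest neighbor: visit cells in order of
    distance to q, and stop once no unvisited cell can beat the current best
    by more than a (1+eps) factor."""
    best, best_pt = np.inf, None
    heap = [(0.0, id(root), root)]
    while heap:
        d, _, node = heapq.heappop(heap)
        if d * (1.0 + eps) > best:
            break
        if node.children is None:
            if len(node.pts):
                dists = np.linalg.norm(node.pts - q, axis=1)
                i = int(dists.argmin())
                if dists[i] < best:
                    best, best_pt = dists[i], node.pts[i]
        else:
            for ch in node.children:
                heapq.heappush(heap, (box_dist(q, ch.center, ch.half), id(ch), ch))
    return best_pt, best
```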

Thursday, April 07, 2011

MAD ALGOrithms

It's time for some large data sets ! Once again, MADALGO is organizing MASSIVE, the workshop on massive data algorithmics. As usual, it will be colocated with SoCG in Paris (actually the day after, on the 16th of June). In case you haven't noticed, big-data is all the rage now !

Deadline is Apr 27, so get those 10 page submissions in, and see you in Paris !

Tuesday, March 29, 2011

STOC 2011 Poster Session

STOC 2011 is experimenting with a poster session. This is excellent news - kudos to the organizing committee for taking the initiative to do this.

What I'm a little puzzled about is the format though: rather than the usual  "some papers become posters" or "all papers get a slot in the poster session", the format appears to be "submit posters about other, possibly published, work". This is a nice idea, and should help with dissemination of results from other venues, and drawing more folks into the conference. However, priority in poster acceptance will be given to registered STOC attendees, which doesn't do much for the participation numbers.

But I complain too much. This is a nice step forward, and I encourage people to submit their posters. Unfortunately, (fortunately?) I'll be in Paris instead of San Jose.

Friday, March 25, 2011

Permanent record of work

In our hurry to tar and feather the ACM, the IEEE and other LargePubs, I'm not sure we are quite ready to face the universe that will result, and the amount of work we'll need to do on our own.

Consider:
  • I was recently asked if I had presentation slides for my paper on universal MDS. I managed to ferret them out from my collection of talks, and sent them over. What I should also have done was add them to the paper page, but I've been busy and haven't got around to it (I have other talks that I need to upload as well)
  • This CS professor asks on reddit: "Where should I host code for the paper I just wrote ?". Good answers are provided, with github.com being the most popular choice.
Researchers are peripatetic: join company, leave company, join university, leave university for other university, leave university for company, rinse and repeat.  The obvious way to keep a single fixed repository of your work is to maintain your own website with your own domain, and many researchers now do that.

But it is a pain in the neck. Granted, it's a necessary pain in the neck, but there was something to be said for being able to write a paper, ship it off somewhere, and have it be maintained by someone else.

The arxiv is doing a pretty good job of this for papers, as long as it can continue to receive funding, but what about talks, code fragments, and other related material that goes into research ?

Saturday, March 19, 2011

Lego lectures

I've been pondering the matter of lecture notes of late, and Luca's announcement of his new set of notes prompted me to write this.

Here's what often happens to me. I'm looking to either hand out or prepare an outline for a lecture (or set of lectures) on topic X. Under the principle of (code) reuse, I go hunting for lecture notes that I can link to. I'll often find three or four examples of almost what I need, but either there'll be background information that I have to provide links for as well, or maybe the treatment isn't quite what I wanted.

It seems to me that what one needs are Lego lectures (this is the term coined by my colleague Matt Might when I described my solution to him). My inspiration for this idea comes from reading Dexter Kozen's book on complexity theory.

So what are Lego lectures ?
  • One set of notes, a few pages or less, on a SINGLE topic. Very focused, and usually one main theorem. In Kozen's complexity notes, each lecture is (almost) one main result in complexity theory.
  • As little referencing of prior notes as possible, and notation declared when needed. 
The idea is that if you have a collection of lego lectures on all topics, you can cobble together a class on a topic relatively easily, and give it your own angle.

I've probably written only two lego lectures in my life: one on tail bounds, and one on the FFT. But they have turned out to be immensely useful, and get reused all the time. I think that from now on I'll model my lecture notes on the lego principle.

Sunday, March 13, 2011

SoCG 2011: New Venue and hotel recommendations

I had mentioned a few days ago that the SoCG 2011 venue was set to change. The new venue has been announced: it's at UICP, right near the Eiffel tower (which I might add, looks like a snapshot of a negatively curved mesh). There are hotel recommendations as well, but it sounds as if even if you made reservations at a hotel near the previous location, it won't be too much trouble to get around. Aah, the joys of efficient public transport.

In any case, do those registrations now, and get those hotel rooms booked. As a former organizer, I know how nerve-wracking it can be while you wait for the registration count to climb towards respectability :).

Monday, February 28, 2011

NSF and government shutdown

Amidst all the other chaos that will ensue if the government shuts down, there will also be NSF problems. Heard on the grapevine:
If the Federal government does shut down on March 4, you will not have access to FastLane for the duration of the shutdown.  On the NSF side, no one will have access to NSF email, and we will not be permitted to conduct government business using alternate email.
For example, this could be rather annoying for anyone trying to submit a CPS proposal.

Saturday, February 26, 2011

Arch-blogging

The Utah blogging juggernaut (random fact: juggernaut is a corruption of Jagannath, from the festival of Lord Jagannath in Orissa in which people would throw themselves in front of a giant chariot carrying an image of the god) chugs on. Our newest member is Rajeev Balasubramonian, who might very well be the first computer architecture blogger. The blog itself is intended as a group blog of the Utah architecture group, whose cleverly-chosen logo is the Delicate Arch (get it ? Arch?). Read the blog manifesto here.

Idle thought: which academic department has the highest ratio of bloggers/faculty ? Or blogs/faculty ?

Phone interviews ?

Phone interviews are common in the industrial hiring scene. As a candidate, you go through a phone interview (or even two) and if you make the cut, you fly out for an actual interview.

I'm seeing phone interviews become more common in the academic world. You get a phone interview with some member of the recruiting committee, and the actual interview follows if you make the cut.

Has this always been the case (and am I just misremembering?), or is it a new trend ? Phone interviews can make a lot of sense if you want to screen candidates who look promising on paper, and they're a clear cost-saver (and time-saver), but I haven't seen them be this common in academic settings. Maybe it's because there are more people chasing each slot, and so these filters are more necessary now ?

Tuesday, February 22, 2011

If you're making plans for SoCG 2011...

Don't. Yet.
Important notice: in order to solve some logistical issues and to reduce the costs, the exact location of the conference is likely to change within the next two weeks. The final location (still downtown Paris) will be specified here by March 8th at the latest. Please accept our apologies for the inconvenience, and feel free to contact us if you have any questions or concerns.
Although if you don't need to find a hotel right next to the conference venue, this will not affect you.

Saturday, February 19, 2011

Agnostic/Non-agnostic Learning and Nets/Samples of Range Spaces

There are deep connections between geometry and learning that go back to the origins of VC dimension and include concepts like support vector machines, lifting maps, kernel functions, and iterative reweighting. What is often difficult is the translation problem of going between machine learning concepts and the corresponding (often identical) geometric concepts. One example of this that I've talked about is the two different viewpoints on characterizing isometric embeddability in an $\ell_2$ space.

I strongly believe that with a better "dictionary", we might find a lot more cross-fertilization between geometry and learning, and this would be beneficial for both communities. The 2009 summer school on geometry and learning was a good example of this: alas, it almost completely coincided with SoCG 2009. This summer, there is again a summer school on geometry and learning, held in conjunction with (and just before) SoCG 2011 (shameless plug: I'll be lecturing there), and I look forward to seeing more such events.

In this post, my postdoc Jeff Phillips and student Avishek Saha explore a fascinating connection between nets and samples in range spaces on the one hand, and agnostic/non-agnostic learning on the other. Avishek is a machine learning specialist, and Jeff is of course an expert in approximate geometry, sampling and uncertainty. So it is fitting that their joint conversation led to this post.




There is a deep connection between some concepts in combinatorial geometry and learning. It turns out that $\varepsilon$-approximations and $\varepsilon$-nets of range spaces correspond to the sample complexity of agnostic learning and non-agnostic learning, respectively. We noticed this last fall when we were reading both a cute paper by Gandhi, Suri, and Welzl and a book by Anthony and Bartlett.

The geometry:
Given a data set $P$ and a family of subsets $\mathcal{A} \subset 2^P$, the pair $(P,\mathcal{A})$ is a range space. An $\varepsilon$-approximation of $(P,\mathcal{A})$ is a subset $Q \subset P$ such that
$\max_{R \in \mathcal{A}} \left| \frac{|R|}{|P|} - \frac{|R \cap Q|}{|Q|} \right| \leq \varepsilon.$
An $\varepsilon$-net of $(P,\mathcal{A})$ is a subset $N \subset P$ such that for all $R \in \mathcal{A}$ with $|R| \geq \varepsilon |P|$, the intersection satisfies $|R \cap N| \geq 1$. Informally, the $\varepsilon$-approximation approximately preserves the density of every range, and the $\varepsilon$-net hits every large enough range.

The VC-dimension $v$ of a range space $(P,\mathcal{A})$ informally describes the description complexity of a range $R \in \mathcal{A}$. For instance, halfspaces in $\mathbb{R}^d$ have VC-dimension $d+1$. A random sample $Q \subset P$ of size $O(v/\varepsilon^2)$ is an $\varepsilon$-approximation of $(P,\mathcal{A})$ with constant probability, and a random sample $N \subset P$ of size $O((v/\varepsilon) \log (1/\varepsilon))$ is an $\varepsilon$-net of $(P,\mathcal{A})$ with constant probability.
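As a toy illustration (mine, not from either source), here is a brute-force check of the $\varepsilon$-approximation condition for the simplest interesting range space: points on a line with intervals as ranges (VC-dimension 2). Since $Q \subset P$, it suffices to check intervals whose endpoints are points of $P$.

```python
import numpy as np

def max_interval_discrepancy(P, Q):
    """Max over intervals R = [P[i], P[j]] of | |P cap R|/|P| - |Q cap R|/|Q| |.
    Assumes Q is a subsample of P.  Brute force, so keep |P| modest."""
    P, Q = np.sort(np.asarray(P, float)), np.sort(np.asarray(Q, float))
    n, m, worst = len(P), len(Q), 0.0
    for i in range(n):
        for j in range(i, n):
            fP = (np.searchsorted(P, P[j], side="right")
                  - np.searchsorted(P, P[i], side="left")) / n
            fQ = (np.searchsorted(Q, P[j], side="right")
                  - np.searchsorted(Q, P[i], side="left")) / m
            worst = max(worst, abs(fP - fQ))
    return worst

# Q is an eps-approximation of (P, intervals) iff the value above is <= eps.
# Try P = np.random.rand(1000) and Q = np.random.choice(P, 200, replace=False);
# the discrepancy should come out on the order of sqrt(v / |Q|) with v = 2.
```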

Learning theory:
Agnostic and non-agnostic learning can also be described with respect to a range space $(P,\mathcal{A})$. We are given a data set $P$ where each element $p \in P$ is labeled either $+$ or $-$. We want to find a classifier $R \in \mathcal{A}$ such that all elements $p \in P$ labeled $+$ are in $R$ and all elements $p \in P$ labeled $-$ are not in $R$. Non-agnostic learning, also referred to as realizable learning, is the setting where we assume that there exists a classifier $R^*$ in the "concept class" $\mathcal{A}$ that has zero error: it perfectly separates the $+$ points from the $-$ points. Agnostic learning is the setting without this assumption of a zero-error classifier in $\mathcal{A}$, so the aim is to find a classifier $\hat R \in \mathcal{A}$ that misclassifies the fewest points. This weakened target-function assumption can alternatively be viewed as traditional PAC learning that accommodates arbitrary noise.

The connection:
Now here is where the concepts start looking similar: In non-agnostic learning, if we are given a random sample $N \subset P$ of size $O((v/\varepsilon) \log (1/\varepsilon))$, and run an algorithm to perfectly classify $+$ and $-$ points from $N$, then the same classifier will misclassify at most $\varepsilon |P|$ points on $P$, with constant probability.
In agnostic learning, if we are given a random sample $Q \subset P$ of size $O(v/\varepsilon^2)$, and run an algorithm to find a classifier on $Q$ that misclassifies the fewest points, then the same classifier will misclassify at most an $\varepsilon$ larger fraction of points on $P$, with constant probability.

Why are those bounds the same?

Let's first look at $\varepsilon$-nets and non-agnostic learning. The key is to construct a range space $(P,\mathcal{S})$ where each $T \in \mathcal{S}$ is the symmetric difference of two ranges $R^*, R^\prime \in \mathcal{A}$; that is, it contains all points contained in exactly one of the two ranges. If $(P,\mathcal{A})$ has VC-dimension $v$, then $(P,\mathcal{S})$ has VC-dimension $O(v)$.
Now consider the optimal classifier $R^* \in \mathcal{A}$ on $P$ (which, in the non-agnostic setting, misclassifies no points), and any other classifier $R^\prime \in \mathcal{A}$. The points misclassified by $R^\prime$ are exactly those in the symmetric difference of $R^*$ and $R^\prime$, which corresponds to some range $T \in \mathcal{S}$. If $N \subset P$ is an $\varepsilon$-net of $(P,\mathcal{S})$, then any classifier $R^\prime \in \mathcal{A}$ learned perfectly on $(N,\mathcal{A})$ cannot misclassify more than $\varepsilon |P|$ points: otherwise it would have more than $\varepsilon |P|$ points in its symmetric difference with the perfect classifier $R^*$ on $P$, and $N$ would contain at least one of them.

Now to understand $\varepsilon$-approximations and agnostic learning, we need to see why this same approach won't work: namely, because there is, in general, no perfect classifier $R^\prime \in \mathcal{A}$ even on a subset $Q \subset P$. Thus, when we look at the symmetric difference between $\hat R$ and $R^\prime$, we cannot argue that $R^\prime$ misclassifies no points of $Q$ when the symmetric difference has fewer than $\varepsilon |P|$ points, and so we can't apply the $\varepsilon$-net result.

But we can apply the stronger $\varepsilon$-approximation result in this agnostic setting. The difference between any two ranges $R, R^\prime \in \mathcal{A}$ on $P$ that contain the same points of $Q$ is at most $\varepsilon |P|$ points. Thus the fraction of points of $P$ misclassified by the optimal range $\hat R$ found on $Q$ is no more than $\varepsilon$ larger than its fraction on $Q$, since $\hat R$ differs by at most an $\varepsilon$-fraction in the number of points it contains in $Q$ versus $P$.

An intuitive way to see this connection is to think of agnostic and non-agnostic learning in the context of the presence and absence of noise. Non-agnostic learning is the ideal setting where we have no noise, and standard VC-dimension based learning bounds tell us that we need $O((v/\varepsilon)\log(1/\varepsilon))$ samples (an $\varepsilon$-net) to learn a target function with $\varepsilon$ classification error. On the other hand, agnostic learning accommodates noise by allowing points from either class to spill over the decision boundary. This means that some of the randomly sampled points might be noisy and hence misleading for the purposes of learning the target classifier. To nullify the effect of these noisy sampled points we need to oversample, which increases the sample complexity by roughly a $(1/\varepsilon)$ factor (yielding an $\varepsilon$-approximation).

Wednesday, February 09, 2011

My talks at ITA and at the college of engineering in Montana State

This is the abstract for the talk I'm giving in brief at the ITA Workshop and in expanded form at a College of Engineering colloquium at Montana State. Thanks to the ITA program committee and Brendan Mumey (at Montana State) for the invitations.

Dimensionality reduction for distributions: the good and the bad

In many application areas, the natural representation of data is in the form of an empirical probability distribution. Documents are represented as normalized frequency counts of words, images are represented by color (or other feature) histograms, and speech signals can be represented by spectral histograms.

There are many natural and meaningful ways of measuring similarity (or distance) between such distributions, and these define different geometries in which we might wish to analyze collections of distributions for interesting patterns. However, a practical bottleneck is the high dimensionality of these representations: for example, a 256 x 256 image might be represented by a vector of over 1000 features, and a document might be represented as a sparse vector with hundreds of attributes.

Thus, a key problem is: can we reduce the dimensionality of collections of distributions to make data analysis tractable while preserving the distances in the collection ?
In this talk, I'll discuss a collection of recent results centered around this theme, that provide both good news (and bad) for dimensionality reduction on distributions in theory and in practice.
 The above draws on information mostly from this paper with Arvind Agarwal and Jeff Phillips, and this work-in-progress with Rasmus Kyng and Jeff Phillips.

Tuesday, February 08, 2011

SoCG Videos Call

You drew all those snazzy diagrams using your fancy Mac drawing program. Now it's time to make some videos:
Video and multimedia presentations are sought for the 20th Annual Video and Multimedia Review of Computational Geometry, to accompany the 27th Annual Symposium on Computational Geometry in Paris, France.  This review showcases the use of visualization in computational geometry for exposition and education, for visual exploration of geometry in research, and as an interface and a debugging tool in software development.  Algorithm animations, visual explanations of structural theorems, descriptions of applications of computational geometry, and demonstrations of software systems are all  appropriate.

Three to five minutes is ideal for most presentations; eight minutes is the  upper limit.  Accepted video and multimedia presentations will have an abstract in the published conference proceedings; video/multimedia authors will have an opportunity to present their work at the Symposium during a dedicated video session.  Accepted presentations will be available online in various formats in a web proceedings.  See http://www.computational-geometry.org/ for examples of previous years' proceedings.

Send email to the Video/Multimedia committee chair, Jonathan Shewchuk, jrs@cs.berkeley.edu by 23:59 Pacific Standard Time, Friday, February 18, 2011, with the following information: the names and institutions of the authors, the email address of the corresponding author, and instructions for downloading the submission.

Monday, February 07, 2011

DIMACS Workshop on Parallelism: Call for Talks

The DIMACS Workshop on Parallelism is soliciting contributed talks (deadline Feb 9!). Sergei Vassilvitskii offers the following synopsis with information on submission.

Decades after its origins and its study in academia, parallel computing is finally becoming pervasive. Today's PCs have multiple cores, and some predict  1000-core machines in the near future. Google, Yahoo and others run MapReduce or Hadoop on thousands of machines to answer search queries, among other things. D. E. Shaw Research is building a massively parallel machine to simulate molecular dynamics. Climate scientists predict the evolution of the earth's climate on parallel machines. Amazon's EC2 enables users to run jobs on a "cloud" of PCs.

The evolution of parallel computing from primarily an academic subject in the '80s to its realization today is an exciting development. This DIMACS workshop will bring together some of the leading researchers and practitioners involved in parallel computing to describe their work. Attendees will discuss, for example:

  • how parallel computing in its various forms is used today;
  • what new uses and programming abstractions will arise by 2020;
  • what parallel computers will look like in 2020; and
  • how to model parallelism theoretically.

Those wishing to give a contributed talk must submit a one-page description of their work, with additional material (such as a paper) optional, to  howard@research.att.com by February 9, 2011. Should there be an excess of submissions, the organizers will select the contributed talks according to the summaries submitted.

Friday, February 04, 2011

POTD: Reproducing Kernel Banach Spaces with the ℓ1 Norm

Reproducing Kernel Banach Spaces with the ℓ1 Norm
Guohui Song, Haizhang Zhang, and Fred J. Hickernell

Abstract:
Targeting at sparse learning, we construct Banach spaces B of functions on an input space X with the properties that (1) B possesses an l1 norm in the sense that it is isometrically isomorphic to the Banach space of integrable functions on X with respect to the counting measure; (2) point evaluations are continuous linear functionals on B and are representable through a bilinear form with a kernel function; (3) regularized learning schemes on B satisfy the linear representer theorem. Examples of kernel functions admissible for the construction of such spaces are given. 

Notes:
This one probably requires some explanation, for the non-ML folks. Reproducing Kernel Hilbert spaces are the coin of the realm in machine learning and for good reason. They allow much of ML to be "ported" from linear classifiers to non-linear classifiers: the kernel mapping essentially linearizes (via lifting) the nonlinear classifiers so you can get the benefit of the nonlinearity while operating algorithmically in a linear world. Even though the induced Hilbert space is typically a function space and is therefore infinite-dimensional, the representer theorem allows us in most cases to operate in a finite dimensional space (where the dimension is bounded by the number of samples). From a metric embedding perspective, kernels completely characterize the class of metrics isometrically embeddable in Euclidean space.

So RKHSs are great. So what's the deal with this paper ? What it tries to do is combine the power of RKHSs with the regularity and sparsity properties guaranteed by $\ell_1$ norms. Even though your typical Banach space doesn't admit an inner product (which is what you need for the kernel mapping), they show that you can define special Banach spaces in which kernels can be defined as before and the representer theorem holds, while you also get sparse bases for solutions because of the nice $\ell_1$ properties.
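To see what the representer theorem buys you computationally, here's a toy numerical sketch (mine). It fits $f(x) = \sum_i c_i k(x_i, x)$ with either the usual squared (RKHS) penalty or an $\ell_1$ penalty on the coefficient vector $c$; the latter is only a stand-in for the paper's RKBS construction, but it shows the flavor: many of the $c_i$ tend to come out zero, so the fitted function is supported on a sparse set of kernel "bumps".

```python
import numpy as np
from sklearn.linear_model import Lasso

def gaussian_kernel(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_fit(X, y, lam=0.1, sparse=False, sigma=1.0):
    """By the representer theorem we only need the coefficients c in
    f(x) = sum_i c_i k(x_i, x).  sparse=False: kernel ridge regression.
    sparse=True: l1 penalty on c (a toy stand-in, NOT the paper's RKBS)."""
    K = gaussian_kernel(X, X, sigma)
    if not sparse:
        c = np.linalg.solve(K + lam * np.eye(len(X)), y)
    else:
        c = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(K, y).coef_
    return (lambda Xnew: gaussian_kernel(np.atleast_2d(Xnew), X, sigma) @ c), c
```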

I promised SHORT summaries, so I'm not going to go further. But the takeaway message here is the ability to extend the nice properties of RKHSs to Banach spaces. For completeness I should mention that there are other approaches that have tried to do this, but using different mathematical constructs that are in some way less well suited.

Thursday, February 03, 2011

On the future of conferences

There's a superb article out today in the CACM by Jonathan Grudin (non-paywall version here, thanks to Alexandre Passos) on the future (and past) of our CS conference-driven model. I'm biased in favor because it takes a generally negative view of the current model, but that's not why I like it.

I like it because he systematically tracks back to the origins of the conference culture, discusses why it's so prevalent, and avoids the glib "let's do everything in journals" argument that even I've been guilty of in the past.

Some of the tl;dr points (but do read the whole thing):
  • Technology and a Professional Organization Drove the Shift to Conference Publication: not speed of development of the field, as is commonly stated. It was also a more American-centered phenomenon.
  • Formal archiving of conference proceedings made creating journal versions more difficult (because of the 30% rule and so on)
  • "When conferences became archival, it was natural to focus on quality and selectivity." so the focus of conferences became more gatekeeping and less community.
  • This in turn has an impact on community: when your papers are rejected, you don't go to the conference. For more on the impact of rejection, see Claire Mathieu's post.
  • A further consequence is that computer scientists do not develop the skills needed to navigate large, community-building conferences. This is so true ! As someone who frequents SODA, SoCG and occasionally FOCS/STOC, I often find myself gasping for breath at VLDB (600+ participants) or KDD (800+). It's overwhelming to keep track of everything. And it makes it harder for me to attend such conferences regularly, even though it's important for me to go.
His analysis of where we should be heading is also sound. Much like the Blu-ray/HD-DVD wars of a few years ago, the whole journal vs conference argument seems like an argument between two dinosaurs as the meteor arrives. We have many different ways of disseminating, commenting on, and reviewing works of research now, and it's time to think beyond the current models. Some of his ideas:
  • Accept many more papers at conferences, but designate some to be the 'best in show'
  • Center attention on the work, rather than the conference, by keeping wikipedia-like entries for pieces of work as they evolve. This is similar to Dan Wallach's idea.
p.s. for those of you who want to complain about ACM's closed policy on the CACM, and how you'll never read an article in the CACM because of the paywall, consider your opinion expressed.

Wednesday, February 02, 2011

Sample Complexity for eps-approximations of Range Spaces

This is a guest post by Jeff Phillips, who among other things ponders the computation of $\epsilon$-samples for range spaces under uncertainty. Jeff is looking for permanent research positions right now, so uncertainty is a major part of his life !

At some point I had plans to write a blog post detailing a simpler proof of the number of random samples needed for an $\varepsilon$-approximation of a range space. This is a very powerful result that is simple to state (I'll define it below) and simple to use. I view it as very geometric, but it originated in learning theory. I had used it in several papers, but until recently I had not actually read the full proof, and until even more recently I was unaware of the best bound. The original paper by Vapnik and Chervonenkis is hard to find, but I eventually found a proof in a learning-theory book by Anthony and Bartlett. The proof uses a series of tricks I was not aware of, and the notation is different from what I was used to.

After reading the 20+ page proof, I felt there was a simpler geometric way to think about the problem, and even sketched out what I thought was a simpler proof. I thought that would be a great thing to post on this blog! But slowly things began to break down. I prepared a pdf writeup that started out at under one page. Then, as I realized gaps in the reasoning, it grew to 2-3 pages, and then to 4 as I attempted to write up the tighter result. While presenting this approach at the informal theory seminar here at Utah, I realized that one of the main geometric tricks I was hoping to use was insufficient for proving what was needed. Looking back at the Anthony and Bartlett proof, I finally realized why all of the tricks and machinery they use are really needed!

Anyways, I'd like to sketch a proof of the $\varepsilon$-approximation theorem and try to include some intuition for why the various techniques are needed.



Let $P$ be a set of objects (i.e. points) and $\mathcal{R}$ be a set of subsets of $P$ (i.e. ranges, like disks). Then $(P,\mathcal{R})$ is called a range space. We will concern ourselves with range spaces of bounded VC-dimension $d$, which implies that the total number of distinct ranges is at most $c |P|^d$ for some fixed constant $c$. Now a subset $Q$ of $P$ is an $\varepsilon$-approximation of $(P,\mathcal{R})$ (for $0 \le \varepsilon \le 1$) if for all ranges $A \in \mathcal{R}$
$\left| \frac{|P \cap A|}{|P|} - \frac{|Q \cap A|}{|Q|} \right| \leq \varepsilon. \hspace{.2in} (1)$

The main result is that:
Theorem. If $Q$ is a random sample from $P$ of size $h = O((1/\varepsilon^2)(d + \log(1/\delta)))$ then it is an $\varepsilon$-approximation of $(P,\mathcal{R})$ with probability at least $1-\delta$.

This was first proven by Talagrand in 94. The original result of Vapnik and Chervonenkis had a slightly worse bound of $h = O((1/\varepsilon^2)(d \log(d/\varepsilon) + \log(1/\delta)))$. The main reason this is a powerful tool is that $h$ is independent of $|P|$. As I found out, even this weaker result is difficult to prove.
I'll also note that I first became aware of this stronger result as a reduction from a more general result by Li, Long, and Srinivasan 01, as cited by Har-Peled and Sharir 07.



There is a nice progression to the full result through a series of weaker results.

Step 1.
We can show that a random sample of $r = (1/(2 \varepsilon^2)) \ln(2/\delta)$ points is sufficient for (1) to hold for any one fixed range, by a Chernoff-Hoeffding (C-H) bound. Then, since there are only $c |P|^d$ distinct ranges, by the union bound we only need $h_1 = (1/(2 \varepsilon^2))(d \ln |P| + \ln(2c/\delta))$ samples for (1) to hold for all ranges.
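(Filling in the standard calculation: for a single fixed range $A$, the two-sided Hoeffding bound gives
$\Pr\left[\, \left| \frac{|P \cap A|}{|P|} - \frac{|Q \cap A|}{|Q|} \right| > \varepsilon \right] \leq 2 e^{-2 r \varepsilon^2},$
which is at most $\delta$ once $r \geq (1/(2\varepsilon^2)) \ln(2/\delta)$. Union bounding over the at most $c |P|^d$ distinct ranges, we need $2 c |P|^d e^{-2 h_1 \varepsilon^2} \leq \delta$, which rearranges to exactly the $h_1$ above.)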

Note that $h_1$ is a function of $|P|$; we want to get rid of this dependence.

This C-H + union bound argument will be used again. Here the C-H bound is over $h_1$ events, one for each random sample; we will reduce this in Step 3. The dependence on $|P|$ comes in through the union bound; we improve this in Step 2.


Step 2.
To get rid of the dependence on $|P|$ we need to apply the union bound over a smaller number of ranges. We use a technique called symmetrization, where we consider two random samples $S$ and $Q$ of $P$, both of size $h_2$. Then we compare them against each other, with the goal of showing that for all $A \in \mathcal{R}$
$\left| \frac{|Q \cap A|}{|Q|} - \frac{|S \cap A|}{|S|} \right| \leq \varepsilon/2. \hspace{.2in} (2)$
We can intuitively see that showing (2) holds with probability $\geq 1-\delta$ is sufficient to show (1): If $h_2$ is too small, it is likely both $Q$ and $S$ have at least some range $A$ in which they are far from $P$, but since there are many ranges, it is likely that they are far on different ranges. Thus if $h_2$ is too small $Q$ and $S$ are likely different from each other on some range $A$ also. Likewise, if $h_2$ is large enough so $Q$ and $S$ are close on all ranges, then it is likely they are close to $P$ on each of those ranges too.

Another technical trick is required to show that (2) holds. We pair each element $q_i \in Q$ with an element $s_i \in S$. Then it is sufficient (once we have sampled $Q$ and $S$) to show that (2) holds with probability $\geq 1-\delta$ with respect to random swaps of the corresponding pairs, i.e. over the choice of ($s_i \in Q$ and $q_i \in S$) vs. ($s_i \in S$ and $q_i \in Q$) for each pair.

Since the events in the permutation have a bounded effect on each range, and there are only $c |Q \cup S|^d$ ranges we need to consider, we can apply the C-H + union bound as in Step 1, but now we only have $O((d/\varepsilon)^d)$ ranges to union over (independent of $|P|$), and we get the original V-C bound of $h_2 = O((1/\varepsilon^2)(d \log(d/\varepsilon) + \log(1/\delta)))$.

Another neat effect of symmetrization is that it allows us to consider continuous sets $P$. The analysis in Step 1 required us to compare $Q$ to all ranges defined by $P$, which would have been infinite if $P$ were continuous. But now since we compare two random samples from $P$, even if it is continuous, the random samples are finite, and hence induce a finite number of ranges.


Step 3.
We now show how to get rid of the extra $\log(d/\varepsilon)$ term. We do this through a technique called chaining which allows us to decompose ranges and change the type of events we measure with the C-H bound.

Using symmetrization consider two subsets $Q$ and $S$ of $P$ and let $P^\prime = Q \cup S$. We define a series of range spaces $(P^\prime, \mathcal{T}_0)$, $(P^\prime, \mathcal{T}_1)$, $\ldots$ so that
  • (a) each range $A \in \mathcal{R}$ is a disjoint union $A = \bigcup_j B_j$ where each $B_j \in \mathcal{T}_j$.
  • (b) each $B_j \in \mathcal{T}_j$ is of size $|B_j| \leq |P^\prime|/2^j$.
  • (c) $Q$ is an $\varepsilon_j$-approximation of $(P^\prime, \mathcal{T}_j)$ with probability $\geq 1 - \delta_j$, where $\varepsilon_j = \sqrt{j/2^j}/12$ and $\delta_j = \delta/2^{j+2}$.
Fact (a) allows us to analyze $(P^\prime,\mathcal{R})$ (with $\varepsilon, \delta$) by considering each $(P^\prime,\mathcal{T}_j)$ separately (with $\varepsilon_j, \delta_j$) since by (c) $\varepsilon \leq \sum_j \varepsilon_j$ and $\delta \leq \sum_j \delta_j$.

From (c), the extra $\sqrt{j}$ term in $\varepsilon_j$ cancels out the $\log(1/\varepsilon)$ term in the technical analysis, leading to $h = O((1/\varepsilon^2)(d + \log(1/\delta)))$ as desired.

To prove (c) we need to use (b). The key insight is that for large $j$, each range $B_j$ is small. We can then apply the C-H bound using a different type of event, of which there are fewer. For a range $B_j$, instead of measuring the event that a given $q \in Q$ is inside $B_j$ (as in Steps 1 and 2), we measure whether a given $b \in B_j$ is in $Q$ -- thus there are only $|B_j| \leq |P^\prime|/2^j$ events. Since $|B_j|$ decreases geometrically as $j$ increases, we can also decrease $\varepsilon_j$ and $\delta_j$ geometrically as $j$ increases, proving (c).

It only remains to show the existence of each $(P^\prime, \mathcal{T}_j)$.
  • We first consider a series of range spaces $(P^\prime, \mathcal{R}_j)$ where $\mathcal{R}_j$ corresponds to all ranges in $(Q_j, \mathcal{R})$, where $Q_j$ is a $(1/2^j)$-approximation of $(P^\prime,\mathcal{R})$. We know $|Q_j| = O((1/\varepsilon^2) d \log(d/\varepsilon))$ by Step 2.
  • Then $\mathcal{T}_j$ is defined inductively as follows. $\mathcal{T}_0 = \mathcal{R}_0$. Each $B \in \mathcal{T}_j$ corresponds to a range $A \in \mathcal{R}_j$. We know there exists another range $A^\prime \in \mathcal{R}_{j-1}$ such that $A$ and $A^\prime$ differ by at most $|P^\prime|/2^j$ points; let $B$ be those points.
Note: Most proofs construct $\mathcal{T}_j$ using a combinatorial result of Haussler 95, but this approach does not affect the asymptotics of the final solution and may actually improve the constants.
