If the Federal government does shut down on March 4, you will not have access to FastLane for the duration of the shutdown. On the NSF side, no one will have access to NSF email, and we will not be permitted to conduct government business using alternate email.For example, this could be rather annoying for anyone trying to submit a CPS proposal.
Ruminations on computational geometry, algorithms, theoretical computer science and life
Monday, February 28, 2011
NSF and government shutdown
Amidst all the other chaos that will ensue if the government shuts down, there will also be NSF problems. Heard on the grapevine:
Saturday, February 26, 2011
Arch-blogging
The Utah blogging juggernaut (random fact: juggernaut is a corruption of Jagannath, from the festival of Lord Jagannath in Orissa in which people would throw themselves in front of giant chariot carrying an image of the god) chugs on. Our newest member is Rajeev Balasubramonian, who might very well be the first computer architecture blogger. The blog itself is intended as a group blog of the Utah architecture group, whose cleverly-chosen logo is the Delicate Arch (get it ? Arch?). Read the blog manifesto here.
Idle thought: which academic department has the highest ratio of bloggers/faculty ? Or blogs/faculty ?
Idle thought: which academic department has the highest ratio of bloggers/faculty ? Or blogs/faculty ?
Phone interviews ?
Phone interviews are common in the industrial hiring scene. As a candidate, you go through a phone interview (or even two) and if you make the cut, you fly out for an actual interview.
I'm seeing phone interviews become more common in the academic world. You get a phone interview with some member of the recruiting committee, and the actual interview follows if you make the cut.
Has this always been like this (and am I just misremembering?) or is it a new trend ? Phone interviews can make a lot of sense if you want to weed out candidates who might look promising on paper, and it's a clear cost-saver (and time saver), but I haven't seen them be so common in academic settings. Maybe it's because there are more people chasing each slot and so these filters are more necessary now ?
I'm seeing phone interviews become more common in the academic world. You get a phone interview with some member of the recruiting committee, and the actual interview follows if you make the cut.
Has this always been like this (and am I just misremembering?) or is it a new trend ? Phone interviews can make a lot of sense if you want to weed out candidates who might look promising on paper, and it's a clear cost-saver (and time saver), but I haven't seen them be so common in academic settings. Maybe it's because there are more people chasing each slot and so these filters are more necessary now ?
Tuesday, February 22, 2011
If you're making plans for SoCG 2011...
Don't. Yet.
Important notice: in order to solve some logistical issues and to reduce the costs, the exact location of the conference is likely to change within the next two weeks. The final location (still downtown Paris) will be specified here by March 8th at the latest. Please accept our apologies for the inconvenience, and feel free to contact us if you have any questions or concerns.Although if you don't need to find a hotel right next to the conference venue, this will not affect you.
Saturday, February 19, 2011
Agnostic/Non-agnostic Learning and Nets/Samples of Range Spaces
There are deep connections between geometry and learning that go back to the origins of VC dimension and include concepts like support vector machines, lifting maps, kernel functions, and iterative reweighting. What is often difficult is the translation problem of going between machine learning concepts and the corresponding (often identical) geometric concepts. One example of this that I've talked about is the two different viewpoints on characterizing isometric embeddability in an $\ell_2$ space.
I strongly believe that with a better "dictionary", we might find a lot more cross-fertilization between geometry and learning, and this would be beneficial for both communities. The 2009 summer school on geometry and learning was a good example of this: alas, it almost completely coincided with SoCG 2009. This summer, there is again a summer school in geometry and learning just before SoCG 2011, and it's in conjunction with it (shameless plug: I'll be lecturing there), and I look forward to seeing more such events.
In this post, my postdoc Jeff Phillips and student Avishek Saha explore a fascinating connection between nets and samples in range spaces on the one hand, and agnostic/non-agnostic learning on the other. Avishek is a machine learning specialist, and Jeff is of course an expert in approximate geometry, sampling and uncertainty. So it is fitting that their joint conversation led to this post.
There is a deep connection between some concepts in combinatorial geometry and learning. It turns out that $\varepsilon$-approximations and $\varepsilon$-nets of range spaces correspond to sample complexity in agnostic learning and non-agnostic learning, respectively. We noticed this last fall when both reading a cute paper by Gandhi, Suri, and Welzl and a book by Anthony and Bartlett.
The geometry:
Given a data set $P$ and a family of subsets $\mathcal{A} \subset 2^P$, the pair $(P,\mathcal{A})$ is a range space. An $\varepsilon$-approximation of $(P,\mathcal{A})$ is subset $Q \subset P$ such that
The VC-dimension $v$ of a range space $(P,\mathcal{A})$ informally describes the description complexity of a range $R \in \mathcal{A}$. For instance, halfspaces in $\mathbb{R}^d$ have VC-dimension $d+1$. A random sample $Q \subset P$ of size $O(v/\varepsilon^2)$ is an $\varepsilon$-approximation of $(P,\mathcal{A})$ with constant probability, and a random sample $N \subset P$ of size $O((v/\varepsilon) \log (1/\varepsilon))$ is an $\varepsilon$-net of $(P,\mathcal{A})$ with constant probability.
Learning theory:
Agnostic and non-agnostic learning can also be described with respect to a range space $(P,\mathcal{A})$. We are given a data set $P$ where each element $p \in P$ is either labeled $+$ or $-$. We want to find a classifier $R \in \mathcal{A}$ such that all elements $p \in P$ labeled $+$ are in $R$ and all elements $p \in P$ labeled $-$ are not in $R$. Non-agnostic learning, also referred to as realizable learning, is the setting where we assume that there exists such a classifier $R^*$ in "concept class" $\mathcal{A}$ that has zero-error, it perfectly separates all $+$ points from $-$ points. Agnostic learning is the setting without this assumption of the existence of a zero-error classifier in $\mathcal{A}$, so the aim is to find a classifier $\hat R \in \mathcal{A}$ which misclassifies the fewest number of points. This weakened target function assumption can be alternatively viewed as traditional PAC learning that accommodates arbitrary noise.
The connection:
Now here is where the concepts start looking similar: In non-agnostic learning, if we are given a random sample $N \subset P$ of size $O((v/\varepsilon) \log (1/\varepsilon))$, and run an algorithm to perfectly classify $+$ and $-$ points from $N$, then the same classifier will misclassify at most $\varepsilon |P|$ points on $P$, with constant probability.
In agnostic learning, if we are given a random sample $Q \subset P$ of size $O(v/\varepsilon^2)$, and run an algorithm to find a classifier on $Q$ that misclassifies the fewest number of points, then the same classifier will misclassify at most $\varepsilon$ larger fraction of points on $P$, with constant probability.
Lets first look at $\varepsilon$-nets and non-agnostic learning. The key is to construct a range space $(P,\mathcal{S})$ where each $T \in \mathcal{S}$ is the symmetric difference of two ranges $R^*,R^\prime \in \mathcal{A})$, that is it contains all points contained in exactly one of the two ranges. If $(P,\mathcal{A})$ has VC-dimension $v$, then $(P,\mathcal{S})$ has VC-dimension $O(v)$.
Now consider the optimal classifier $R^* \in \mathcal{A}$ of $P$, and any other classifier $R^\prime \in \mathcal{A}$ that misclassifies at most $\varepsilon |P|$ points of $P$. Then the symmetric difference of $R^*$ and $R^\prime$ is at most $\varepsilon |P|$ points, and corresponds with some range $T \in \mathcal{S}$. If $N \subset P$ is an $\varepsilon$-net of $(P,\mathcal{S})$, then any classifier $R^\prime \in \mathcal{A}$ learned perfectly on $(N,\mathcal{A})$, cannot misclassify more than $\varepsilon |P|$ points since it would then necessarily have too many points in the symmetric difference with the perfect classifier $R^*$ on $P$, and $N$ would contain at least one of them.
Now to understand $\varepsilon$-approximations and agnostic learning, we need to see why this same approach won't work. Namely, because there is, in general, no perfect classifier $R^\prime \in \mathcal{A}$ defined on a subset of $Q \subset P$. Thus when we look at the symmetric difference between $\hat R$ and $R^\prime$, we cannot argue that the set of misclassified points should be empty in $Q$ using $R^\prime$ if the difference is less than $\varepsilon |P|$, and can't apply the $\varepsilon$-net result.
But we can apply the stronger $\varepsilon$-approximation result in this agnostic setting. The difference between any two ranges $R, R^\prime \subset \mathcal{A}$ on $P$, that contain the same points in $Q$, is at most $\varepsilon |P|$. Thus the difference in the fraction of misclassified points in $P$ from the optimal range $\hat R$ found on $Q$ is no more than $\varepsilon$, since $\hat R$ only differs by an $\varepsilon$-fraction in the number of points it contains in $Q$ versus $P$.
An intuitive way to see this connection is to think of agnostic and non-agnostic learning in the context of presence and absence of noise. Non-agnostic learning is the ideal setting where we have no noise and standard VC-dimenion based learning bounds suggest that we need $O((v/\varepsilon)\log(1/\varepsilon))$ samples (an $\varepsilon$-net) to learn a target function with $\varepsilon$ classification error. On the other hand, agnostic learning accommodates noise by allowing points from either class to spill over the decision boundary. This implies that some of the randomly sampled points might be noisy and hence misleading for the purposes of learning the target classifier. To nullify this effect of noisy sampled points we need to oversample which increases the sample complexity by about an $(1/\varepsilon)$ factor (yielding an $\varepsilon$-approximation).
I strongly believe that with a better "dictionary", we might find a lot more cross-fertilization between geometry and learning, and this would be beneficial for both communities. The 2009 summer school on geometry and learning was a good example of this: alas, it almost completely coincided with SoCG 2009. This summer, there is again a summer school in geometry and learning just before SoCG 2011, and it's in conjunction with it (shameless plug: I'll be lecturing there), and I look forward to seeing more such events.
In this post, my postdoc Jeff Phillips and student Avishek Saha explore a fascinating connection between nets and samples in range spaces on the one hand, and agnostic/non-agnostic learning on the other. Avishek is a machine learning specialist, and Jeff is of course an expert in approximate geometry, sampling and uncertainty. So it is fitting that their joint conversation led to this post.
There is a deep connection between some concepts in combinatorial geometry and learning. It turns out that $\varepsilon$-approximations and $\varepsilon$-nets of range spaces correspond to sample complexity in agnostic learning and non-agnostic learning, respectively. We noticed this last fall when both reading a cute paper by Gandhi, Suri, and Welzl and a book by Anthony and Bartlett.
The geometry:
Given a data set $P$ and a family of subsets $\mathcal{A} \subset 2^P$, the pair $(P,\mathcal{A})$ is a range space. An $\varepsilon$-approximation of $(P,\mathcal{A})$ is subset $Q \subset P$ such that
$\max_{R \in \mathcal{A}} \left| \frac{|R|}{|P|} - \frac{|R \cap Q|}{|Q|} \right| \leq \varepsilon.$
An $\varepsilon$-net of $(P,\mathcal{A})$ is a subset $N \subset P$ such that for all $R \in \mathcal{A}$ with $|R| \geq \varepsilon |P|$ then the intersection $|R \cap Q| \geq 1$. Informally, the $\varepsilon$-approximation approximately preserves the density, and the $\varepsilon$-net hits every large enough subset.The VC-dimension $v$ of a range space $(P,\mathcal{A})$ informally describes the description complexity of a range $R \in \mathcal{A}$. For instance, halfspaces in $\mathbb{R}^d$ have VC-dimension $d+1$. A random sample $Q \subset P$ of size $O(v/\varepsilon^2)$ is an $\varepsilon$-approximation of $(P,\mathcal{A})$ with constant probability, and a random sample $N \subset P$ of size $O((v/\varepsilon) \log (1/\varepsilon))$ is an $\varepsilon$-net of $(P,\mathcal{A})$ with constant probability.
Learning theory:
Agnostic and non-agnostic learning can also be described with respect to a range space $(P,\mathcal{A})$. We are given a data set $P$ where each element $p \in P$ is either labeled $+$ or $-$. We want to find a classifier $R \in \mathcal{A}$ such that all elements $p \in P$ labeled $+$ are in $R$ and all elements $p \in P$ labeled $-$ are not in $R$. Non-agnostic learning, also referred to as realizable learning, is the setting where we assume that there exists such a classifier $R^*$ in "concept class" $\mathcal{A}$ that has zero-error, it perfectly separates all $+$ points from $-$ points. Agnostic learning is the setting without this assumption of the existence of a zero-error classifier in $\mathcal{A}$, so the aim is to find a classifier $\hat R \in \mathcal{A}$ which misclassifies the fewest number of points. This weakened target function assumption can be alternatively viewed as traditional PAC learning that accommodates arbitrary noise.
The connection:
Now here is where the concepts start looking similar: In non-agnostic learning, if we are given a random sample $N \subset P$ of size $O((v/\varepsilon) \log (1/\varepsilon))$, and run an algorithm to perfectly classify $+$ and $-$ points from $N$, then the same classifier will misclassify at most $\varepsilon |P|$ points on $P$, with constant probability.
In agnostic learning, if we are given a random sample $Q \subset P$ of size $O(v/\varepsilon^2)$, and run an algorithm to find a classifier on $Q$ that misclassifies the fewest number of points, then the same classifier will misclassify at most $\varepsilon$ larger fraction of points on $P$, with constant probability.
Why are those bounds the same?
Lets first look at $\varepsilon$-nets and non-agnostic learning. The key is to construct a range space $(P,\mathcal{S})$ where each $T \in \mathcal{S}$ is the symmetric difference of two ranges $R^*,R^\prime \in \mathcal{A})$, that is it contains all points contained in exactly one of the two ranges. If $(P,\mathcal{A})$ has VC-dimension $v$, then $(P,\mathcal{S})$ has VC-dimension $O(v)$.
Now consider the optimal classifier $R^* \in \mathcal{A}$ of $P$, and any other classifier $R^\prime \in \mathcal{A}$ that misclassifies at most $\varepsilon |P|$ points of $P$. Then the symmetric difference of $R^*$ and $R^\prime$ is at most $\varepsilon |P|$ points, and corresponds with some range $T \in \mathcal{S}$. If $N \subset P$ is an $\varepsilon$-net of $(P,\mathcal{S})$, then any classifier $R^\prime \in \mathcal{A}$ learned perfectly on $(N,\mathcal{A})$, cannot misclassify more than $\varepsilon |P|$ points since it would then necessarily have too many points in the symmetric difference with the perfect classifier $R^*$ on $P$, and $N$ would contain at least one of them.
Now to understand $\varepsilon$-approximations and agnostic learning, we need to see why this same approach won't work. Namely, because there is, in general, no perfect classifier $R^\prime \in \mathcal{A}$ defined on a subset of $Q \subset P$. Thus when we look at the symmetric difference between $\hat R$ and $R^\prime$, we cannot argue that the set of misclassified points should be empty in $Q$ using $R^\prime$ if the difference is less than $\varepsilon |P|$, and can't apply the $\varepsilon$-net result.
But we can apply the stronger $\varepsilon$-approximation result in this agnostic setting. The difference between any two ranges $R, R^\prime \subset \mathcal{A}$ on $P$, that contain the same points in $Q$, is at most $\varepsilon |P|$. Thus the difference in the fraction of misclassified points in $P$ from the optimal range $\hat R$ found on $Q$ is no more than $\varepsilon$, since $\hat R$ only differs by an $\varepsilon$-fraction in the number of points it contains in $Q$ versus $P$.
An intuitive way to see this connection is to think of agnostic and non-agnostic learning in the context of presence and absence of noise. Non-agnostic learning is the ideal setting where we have no noise and standard VC-dimenion based learning bounds suggest that we need $O((v/\varepsilon)\log(1/\varepsilon))$ samples (an $\varepsilon$-net) to learn a target function with $\varepsilon$ classification error. On the other hand, agnostic learning accommodates noise by allowing points from either class to spill over the decision boundary. This implies that some of the randomly sampled points might be noisy and hence misleading for the purposes of learning the target classifier. To nullify this effect of noisy sampled points we need to oversample which increases the sample complexity by about an $(1/\varepsilon)$ factor (yielding an $\varepsilon$-approximation).
Wednesday, February 09, 2011
My talks at ITA and at the college of engineering in Montana State
This is the abstract for the talk I'm giving in brief at the ITA Workshop and more expanded at a College of Engineering colloquium at Montana State. Thanks to the ITA program commitee and Brendan Mumey (at Montana State) for the invitations.
Dimensionality reduction for distributions: the good and the bad
In many application areas, the natural representation of data is in the form of an empirical probability distribution. Documents are represented as normalized frequency counts of words, images are represented by color (or other feature) histograms, and speech signals can be represented by spectral histograms.
There are many natural and meaningful ways of measuring similarity (or distance) between such distributions, and these define different geometries in which we might wish to analyze collections of distributions for interesting patterns. However, a practical bottleneck is the high dimensionality of these representations: for example, an 256 x 256 image might be represented by a vector of over 1000 features, and a document might be represented as a sparse vector with hundreds of attributes.
Thus, a key problem is: can we reduce the dimensionality of collections of distributions to make data analysis tractable while preserving the distances in the collection ?
In this talk, I'll discuss a collection of recent results centered around this theme, that provide both good news (and bad) for dimensionality reduction on distributions in theory and in practice.The above draws on information mostly from this paper with Arvind Agarwal and Jeff Phillips, and this work-in-progress with Rasmus Kyng and Jeff Phillips.
Labels:
cs.CG,
cs.LG,
distributions,
research
Tuesday, February 08, 2011
SoCG Videos Call
You drew all those snazzy diagrams using your fancy Mac drawing program. Now it's time to make some videos:
Video and multimedia presentations are sought for the 20th Annual Video and Multimedia Review of Computational Geometry, to accompany the 27th Annual Symposium on Computational Geometry in Paris, France. This review showcases the use of visualization in computational geometry for exposition and education, for visual exploration of geometry in research, and as an interface and a debugging tool in software development. Algorithm animations, visual explanations of structural theorems, descriptions of applications of computational geometry, and demonstrations of software systems are all appropriate.
Three to five minutes is ideal for most presentations; eight minutes is the upper limit. Accepted video and multimedia presentations will have an abstract in the published conference proceedings; video/multimedia authors will have an opportunity to present their work at the Symposium during a dedicated video session. Accepted presentations will be available online in various formats in a web proceedings. See http://www.computational-geometry.org/ for examples of previous years' proceedings.
Send email to the Video/Multimedia committee chair, Jonathan Shewchuk, jrs@cs.berkeley.edu by 23:59 Pacific Standard Time, Friday, February 18, 2011, with the following information: the names and institutions of the authors, the email address of the corresponding author, and instructions for downloading the submission.
Monday, February 07, 2011
DIMACS Worskhop on Parallelism: Call for Talks
The DIMACS Workshop on Parallelism is soliciting contributed talks (deadline Feb 9!). Sergei Vassilvitskii offers the following synopsis with information on submission.
Decades after its origins and its study in academia, parallel computing is finally becoming pervasive. Today's PCs have multiple cores, and some predict 1000-core machines in the near future. Google, Yahoo and others run MapReduce or Hadoop on thousands of machines to answer search queries, among other things. D. E. Shaw Research is building a massively parallel machine to simulate molecular dynamics. Climate scientists predict the evolution of the earth's climate on parallel machines. Amazon's EC^{2} enables users to run jobs on a "cloud" of PCs.
The evolution of parallel computing from primarily an academic subject in the '80s to its realization today is an exciting development. This DIMACS workshop will bring together some of the leading researchers and practitioners involved in parallel computing to describe their work. Attendees will discuss, for example:
- how parallel computing in its various forms is used today;
- what new uses and programming abstractions will arise by 2020;
- what parallel computers will look like in 2020; and
- how to model parallelism theoretically.
Those wishing to give a contributed talk must submit a one-page description of their work, with additional material (such as a paper) optional, to howard@research.att.com by February 9, 2011. Should there be an excess of submissions, the organizers will select the contributed talks according to the summaries submitted.
Labels:
cs.DC,
dimacs,
parallelism,
research
Friday, February 04, 2011
POTD: Reproducing Kernel Banach Spaces with the ℓ1 Norm
Reproducing Kernel Banach Spaces with the ℓ1 Norm
Guohui Song, Haizhang Zhang, and Fred J. Hickernell
Abstract:
Targeting at sparse learning, we construct Banach spaces B of functions on an input space X with the properties that (1) B possesses an l1 norm in the sense that it is isometrically isomorphic to the Banach space of integrable functions on X with respect to the counting measure; (2) point evaluations are continuous linear functionals on B and are representable through a bilinear form with a kernel function; (3) regularized learning schemes on B satisfy the linear representer theorem. Examples of kernel functions admissible for the construction of such spaces are given.
Notes:
This one probably requires some explanation, for the non-ML folks. Reproducing Kernel Hilbert spaces are the coin of the realm in machine learning and for good reason. They allow much of ML to be "ported" from linear classifiers to non-linear classifiers: the kernel mapping essentially linearizes (via lifting) the nonlinear classifiers so you can get the benefit of the nonlinearity while operating algorithmically in a linear world. Even though the induced Hilbert space is typically a function space and is therefore infinite-dimensional, the representer theorem allows us in most cases to operate in a finite dimensional space (where the dimension is bounded by the number of samples). From a metric embedding perspective, kernels completely characterize the class of metrics isometrically embeddable in Euclidean space.
So RKHSs are great. So what's the deal with this paper ? What it tries to do is combine the power of RKHSs with the regularity and sparsity properties guaranteed by $\ell_1$ norms. Even though your typical Banach space doesn't admit an inner product (what you need for the kernel mapping), they show that you can define special Banach spaces in which kernels can be defined as before, and the representer theorem holds, but that you can get sparse bases for solutions because of the nice $\ell_1$ properties.
I promised SHORT summaries, so I'm not going to go further. But the takeaway message here is the ability to extend the nice properties of RKHSs to Banach spaces. For completeness I should mention that there are other approaches that have tried to do this, but using different mathematical constructs that are in some way less well suited.
Thursday, February 03, 2011
On the future of conferences
There's a superb article out today in the CACM by Jonathan Grudin (non-paywall version here, thanks to Alexandre Passos) on the future (and past) of our CS conference-driven model. I'm biased in favor because it takes a generally negative view of the current model, but that's not why I like it.
I like it because he systematically tracks back to the origins of the conference culture, discusses why it's so prevalent, and avoids the glib "let's do everything in journals" argument that even I've been guilty of in the past.
Some of the tl;dr points (but do read the whole thing):
I like it because he systematically tracks back to the origins of the conference culture, discusses why it's so prevalent, and avoids the glib "let's do everything in journals" argument that even I've been guilty of in the past.
Some of the tl;dr points (but do read the whole thing):
- Technology and a Professional Organization Drove the Shift to Conference Publication: not speed of development of the field, as is commonly stated. It was also a more American-centered phenomenon.
- Formal archiving of conference proceedings made creating journal versions more difficult (because of the 30% rule and so on)
- "When conferences became archival, it was natural to focus on quality and selectivity." so the focus of conferences became more gatekeeping and less community.
- This in turn has an impact on community: when your papers are rejected, you don't go to the conference. For more on the impact of rejection, see Claire Mathieu's post.
- A further consequence is that computer scientists do not develop the skills needed to navigate large, community-building conferences.This is so true ! As someone who frequents SODA, SoCG and occasionally FOCS/STOC, I often find myself gasping for breath at VLDB (600+ participants) or KDD (800+). It's overwhelming to keep track of everything. And it makes it harder for me to attend such conferences regularly, even though it's important for me to go.
- Accept many more papers at conferences, but designate some to be the 'best in show'
- Center attention on the work, rather than the conference, by keeping wikipedia-like entries for pieces of work as they evolve. This is similar to Dan Wallach's idea.
Labels:
community,
publishing
Wednesday, February 02, 2011
Sample Complexity for eps-approximations of Range Spaces
This is a guest post by Jeff Phillips, who among other things ponders the computation of $\epsilon$-samples for range spaces under uncertainty. Jeff is looking for permanent research positions right now, so uncertainty is a major part of his life !
At some point I had plans to write a blog post detailing a simpler proof of the number of random samples needed for an $\varepsilon$-approximation of a range space. This is a very powerful result that is simple to state (I'll define below) and simple to use. I view it as very geometric, but it originated in learning theory. I had used it in several papers, but until recently had not actually read the full proof, and even slightly less recently was unaware of the best bound. The original paper by Vapnik and Chervonenkis is hard to find, but I eventually found a proof in a learning-theory book by Anthony and Bartlett. The proof uses a series of tricks I was not aware of, and the notation is different from what I was used to.
After reading the 20+ page proof, I felt there was a simpler geometric way to think about the problem, and even sketched-out what I thought was a simpler proof. I thought that would be a great thing to post on this blog! But slowly things began to break down. I prepared a pdf writeup, that started at under one page. Then realizing gaps in the reasoning, it grew to 2-3 pages, and then 4 as I attempted to write the tighter result. In presenting this approach to the informal theory seminar here at Utah and realized one of the main geometric tricks I was hoping to use was insufficient for proving what was needed. Looking back at the Anthony and Bartlett proof, I finally realized why all of tricks and machinery they use are really needed!
Anyways, I'd like sketch a proof of the $\varepsilon$-approximation theorem and try to include intuition of why techniques are being used.
Let $P$ be a set of objects (i.e. points) and $\mathcal{R}$ be a set of subset of $P$ (i.e. ranges, like disks). Then $(P,\mathcal{R})$ is called a range space. We will concern ourselves with range spaces with bounded VC-dimension d, which implies that the total number of distinct ranges is at most $c |P|^d$ for some fixed constant $c$. Now a subset $Q$ of $P$ is an $\varepsilon$-approximation of $(P,\mathcal{R})$ (for $0 \le \varepsilon \le 1$) for all ranges $A \in \mathcal{R}$
Another technical trick is required to show (2) holds. We correspond each element in $q_i \in Q$ with an element of $s_i \in S$. Then it is sufficient (once we have sampled $Q$ and $S$) to show (2) occurs with probability $\geq 1-\delta$ w.r.t. possible swaps of ($s_i \in Q$ and $q_i \in S$) vs. ($s_i \in S$ and $q_i \in Q$), for all corresponding pairs.
Since the events in the permutation have bounded effect on each range, and there are only $c |Q \cup S|^d$ ranges we need to consider, we can apply the C-H + Union bounds as in Step 1, but now we only have $O((d/\varepsilon)^d)$ events (independent of $|P|$) and we get the original V-C bound of $h_2 = O((1/\varepsilon^2)(d \log(d/\varepsilon) + \log(1/\delta))$.
Another neat effect of symmetrization is that it allows us to consider continuous sets $P$. The analysis in Step 1 required us to compare $Q$ to all ranges defined by $P$, which would have been infinite if $P$ were continuous. But now since we compare two random samples from $P$, even if it is continuous, the random samples are finite, and hence induce a finite number of ranges.
Step 3.
We now show how to get rid of the extra $\log(d/\varepsilon)$ term. We do this through a technique called chaining which allows us to decompose ranges and change the type of events we measure with the C-H bound.
Using symmetrization consider two subsets $Q$ and $S$ of $P$ and let $P^\prime = Q \cup S$. We define a series of range spaces $(P^\prime, \mathcal{T}_0)$, $(P^\prime, \mathcal{T}_1)$, $\ldots$ so that
At some point I had plans to write a blog post detailing a simpler proof of the number of random samples needed for an $\varepsilon$-approximation of a range space. This is a very powerful result that is simple to state (I'll define below) and simple to use. I view it as very geometric, but it originated in learning theory. I had used it in several papers, but until recently had not actually read the full proof, and even slightly less recently was unaware of the best bound. The original paper by Vapnik and Chervonenkis is hard to find, but I eventually found a proof in a learning-theory book by Anthony and Bartlett. The proof uses a series of tricks I was not aware of, and the notation is different from what I was used to.
After reading the 20+ page proof, I felt there was a simpler geometric way to think about the problem, and even sketched-out what I thought was a simpler proof. I thought that would be a great thing to post on this blog! But slowly things began to break down. I prepared a pdf writeup, that started at under one page. Then realizing gaps in the reasoning, it grew to 2-3 pages, and then 4 as I attempted to write the tighter result. In presenting this approach to the informal theory seminar here at Utah and realized one of the main geometric tricks I was hoping to use was insufficient for proving what was needed. Looking back at the Anthony and Bartlett proof, I finally realized why all of tricks and machinery they use are really needed!
Anyways, I'd like sketch a proof of the $\varepsilon$-approximation theorem and try to include intuition of why techniques are being used.
Let $P$ be a set of objects (i.e. points) and $\mathcal{R}$ be a set of subset of $P$ (i.e. ranges, like disks). Then $(P,\mathcal{R})$ is called a range space. We will concern ourselves with range spaces with bounded VC-dimension d, which implies that the total number of distinct ranges is at most $c |P|^d$ for some fixed constant $c$. Now a subset $Q$ of $P$ is an $\varepsilon$-approximation of $(P,\mathcal{R})$ (for $0 \le \varepsilon \le 1$) for all ranges $A \in \mathcal{R}$
$\left| \frac{|P \cap A|}{|P|} - \frac{|Q \cap A|}{|Q|} \right| \leq \varepsilon. \hspace{.2in} (1)$
The main result is that:
Theorem. If $Q$ is a random sample from $P$ of size $h = O((1/\varepsilon^2)(d + \log(1/\delta))$ then it is an $\varepsilon$-approximation of $(P,\mathcal{R})$ with probability at least $1-\delta$.
This was first proven by Talagrand in 94. The original result of Vapnik and Chervonenkis had a slightly worse bound of $h = O((1/\varepsilon^2)(d \log(d/\varepsilon) + \log(1/\delta))$. The main reason this is a powerful tool is that $h$ is independent of $|P|$. As I found out, even this weaker result is difficult to prove.
I'll also note, that I first became aware of this stronger result as a reduction from a more general result by Li, Long, and Srinivasan 01, as cited by Har-Peled and Sharir 07.
There is a nice progression to the full result through a series of weaker results.
Step 1.
We can show that a random sample of $r = (1/2 \varepsilon^2) \ln(2/\delta)$ points is sufficient to prove (1) for any one fixed range, by a Chernoff-Hoeffding (C-H) bound. Then since there are only $c |P|^d$ distinct ranges, by the union bound, we only need $h_1 = (1/2 \varepsilon^2)(d \ln |P| + \ln(2c/\delta))$ samples to hold for all ranges.
Note that $h_1$ is a function of $|P|$, we want to get rid of this dependence.
This C-H + union bound will be used again. Here the C-H bound has $h_1$ events, one for each random sample, we will reduce this in Step 3. The dependence on $|P|$ is through the union bound, we improve this in Step 2.
Step 2.
To get rid of the dependence on |P| we need to apply the union bound on a smaller number of ranges. We use a technique called symmetrization, where we consider two random samples $S$ and $Q$ of $P$, both of size $h_2$. Then we compare them against each other with the goal of showing for all $A \in \mathcal{R}$
There is a nice progression to the full result through a series of weaker results.
Step 1.
We can show that a random sample of $r = (1/2 \varepsilon^2) \ln(2/\delta)$ points is sufficient to prove (1) for any one fixed range, by a Chernoff-Hoeffding (C-H) bound. Then since there are only $c |P|^d$ distinct ranges, by the union bound, we only need $h_1 = (1/2 \varepsilon^2)(d \ln |P| + \ln(2c/\delta))$ samples to hold for all ranges.
Note that $h_1$ is a function of $|P|$, we want to get rid of this dependence.
This C-H + union bound will be used again. Here the C-H bound has $h_1$ events, one for each random sample, we will reduce this in Step 3. The dependence on $|P|$ is through the union bound, we improve this in Step 2.
Step 2.
To get rid of the dependence on |P| we need to apply the union bound on a smaller number of ranges. We use a technique called symmetrization, where we consider two random samples $S$ and $Q$ of $P$, both of size $h_2$. Then we compare them against each other with the goal of showing for all $A \in \mathcal{R}$
$\left| \frac{|Q \cap A|}{|Q|} - \frac{|S \cap A|}{|S|} \right| \leq \varepsilon/2. \hspace{.2in} (2)$
We can intuitively see that showing (2) holds with probability $\geq 1-\delta$ is sufficient to show (1): If $h_2$ is too small, it is likely both $Q$ and $S$ have at least some range $A$ in which they are far from $P$, but since there are many ranges, it is likely that they are far on different ranges. Thus if $h_2$ is too small $Q$ and $S$ are likely different from each other on some range $A$ also. Likewise, if $h_2$ is large enough so $Q$ and $S$ are close on all ranges, then it is likely they are close to $P$ on each of those ranges too.
Another technical trick is required to show (2) holds. We correspond each element in $q_i \in Q$ with an element of $s_i \in S$. Then it is sufficient (once we have sampled $Q$ and $S$) to show (2) occurs with probability $\geq 1-\delta$ w.r.t. possible swaps of ($s_i \in Q$ and $q_i \in S$) vs. ($s_i \in S$ and $q_i \in Q$), for all corresponding pairs.
Since the events in the permutation have bounded effect on each range, and there are only $c |Q \cup S|^d$ ranges we need to consider, we can apply the C-H + Union bounds as in Step 1, but now we only have $O((d/\varepsilon)^d)$ events (independent of $|P|$) and we get the original V-C bound of $h_2 = O((1/\varepsilon^2)(d \log(d/\varepsilon) + \log(1/\delta))$.
Another neat effect of symmetrization is that it allows us to consider continuous sets $P$. The analysis in Step 1 required us to compare $Q$ to all ranges defined by $P$, which would have been infinite if $P$ were continuous. But now since we compare two random samples from $P$, even if it is continuous, the random samples are finite, and hence induce a finite number of ranges.
Step 3.
We now show how to get rid of the extra $\log(d/\varepsilon)$ term. We do this through a technique called chaining which allows us to decompose ranges and change the type of events we measure with the C-H bound.
Using symmetrization consider two subsets $Q$ and $S$ of $P$ and let $P^\prime = Q \cup S$. We define a series of range spaces $(P^\prime, \mathcal{T}_0)$, $(P^\prime, \mathcal{T}_1)$, $\ldots$ so that
- (a) each range $A \in \mathcal{R}$ is a disjoint union $A \bigcup_j B_j$ where each $B_j \in \mathcal{T}_j$.
- (b) each $B_j \in \mathcal{T}_j$ is of size $|B_j| \leq |P^\prime|/2^j$.
- (c) $Q$ is an $\varepsilon_j$-approximation of $(P^\prime, \mathcal{T}_j)$ with probability $\geq \delta_j$ where $\varepsilon_j = \sqrt{j/2^j}/12$ and $\delta_j = \delta/2^{j+2}$.
From (c) the extra $\sqrt{j}$ term in $\varepsilon_j$ cancels out the $\log(1/\varepsilon)$ term in the technical analysis leading to $h = O((1/\varepsilon^2)(d + \log(1/\delta))$ as desired.
To prove (c) we need to use (b). The key insight is that for large $j$, each range $B_j$ is small. We can then apply the C-H bound using a different type of event which there are fewer of. For a range $B_j$, instead of measuring the event if $q \in Q$ is inside $B$ (as in Step 1 and 2) we measure if some $b \in B_j$ is in $Q$ -- thus there are only $|B_j| \leq |P'|/2^j$ events. Since $|B_j|$ decreases geometrically as $j$ increases, we can also decrease $\varepsilon_j$ and $\delta_j$ geometrically as $j$ increases, proving (c).
It only remains to show the existence of each $(P^\prime, \mathcal{T}_j)$.
- We first consider a series of range spaces $(P^\prime, \mathcal{R}_j)$ where $\mathcal{R}_j$ correspond to all ranges in $(Q_j, \mathcal{R})$ where $Q_j$ is an $(1/2^j)$-approximation of $(P^\prime,\mathcal{R})$. We know $|Q_j| = O((1/\varepsilon^2) d \log(d/\varepsilon))$ by Step 2.
- Then $\mathcal{T}_j$ is defined inductively as follows. $\mathcal{T}_0 = \mathcal{R}_0$. Each $B \in \mathcal{T}_j$ corresponds to a range $A \in \mathcal{R}_j$. We know there exists another range $A^\prime \in \mathcal{R}_{j-1}$ such that $A$ and $A^\prime$ differ by at most $|P^\prime|/2^j$ points; let $B$ be those points.
Note: Most proofs construct $\mathcal{T}_j$ using a combinatorial result of Haussler 95, but this approach does not affect the asymptotics of the final solution and may actually improve the constants.
IMA Special Year on the mathematics of information
Dana Randall mentioned at SODA that the IMA (the Institute for Mathematics and its Applications) is looking for increased participation from folks in computer science.
Anna Gilbert at the University of Michigan (and former AT&T colleague!) is one of the organizers of the current 'special year' at the IMA, and it's on a topic that I think many theoryCS folks will be interested in: the mathematics of information.
They're taking applications for postdocs (speaking of which) up until February 4 (two days from now!). So get those applications in.
And what's the special year about ? In Anna's words (edited for formatting):
During the academic year 2011-2012, the annual program at the Institute for Mathematics and its Applications (IMA) at the University of Minnesota will be in the Mathematics of Information. Information, broadly defined, plays an ever increasingly large role in our scientific and technological endeavors, as well as our daily lives. Collectively, we search the web over billions of times per day, we keep our medical records in digital repositories, we share videos and photos over the web, and we listen to music on MP3 players. Our ability to collect, store, and generate vast quantities of data far outstrips our ability to analyze, process, and understand these data. Many scientific, engineering, and medical applications depend on our ability to handle enormous amounts of complex information. Network engineering involves massive datasets at high speeds, medical imaging generates large data with intricate geometric relationships, and astronomical observations include data at different wavelengths or spectral bands. Our current technology and scientific tools for information lag far behind our ability to collect enormous amounts of data, our need to analyze that data as efficiently as possible, and, finally, our ability to use that analysis to make sound, timely decisions.
The special year will include seven workshops with a thematic focus on the mathematics of information and will include topics in mathematics, algorithms, theoretical computer science, machine learning, network analysis, and applications in biology, include medical informatics and group testing for high throughput screening. Each workshop is designed to be truly interdisciplinary, involving researchers from mathematical sciences, computer sciences, engineering, and other fields. Interspersed will be Hot Topics workshops, IMA public lectures, and tutorial and/or short courses. Faculty, graduate students and postdoctoral fellows are invited to participate as long-term visitors, New Direction Visiting Professors, or postdoctoral fellows (either at the IMA or jointly with industry).
You can find more information about the individual workshops at the annual program webpage.
9/26-30/11 Workshop: High Dimensional Phenomena
10/24-28/11 Workshop: Large Graphs: Modeling, Algorithms and Applications
11/14-18/11 Workshop: Large Data Sets in Medical Informatics
2/13-17/12 Workshop: Group Testing Designs, Algorithms, and Applications to Biology
2/27-3/2/12 Workshop: Network Links: Connecting Social, Communication and Biological Network Analysis
3/26-30/12 Workshop: Machine Learning: Theory and Computation
5/7-11/12 Workshop: User-Centered Modeling
Postdoctoral Fellows
Postdoctoral Fellows applications deadline was January 7, 2011 but applications will be considered until February 4, 2011.
These IMA Postdoctoral Fellowships run one to two years, at the option of the holder, starting August 31, 2011. In the second year of the appointment there are a variety of options to enhance career development, including continued work on the fellow's established research program, participation in the 2012-2013 academic year program, teaching, and working on an industrial project. Postdoctoral fellows receive a salary of $55,000 annually, and a travel allowance.
There is also an option for Industrial Postdoctoral Fellows. In the second year of the appointment, Fellows can work on a project at an industrial partner.
Visiting the IMA
You can contact the IMA if you would like to visit for a long period of time during the year or come for a short visit, especially associated to a workshop. There's more information on general visits and short term visits.
There are sources of funding available for both types of visits and the IMA staff are excellent at making such visits possible.
Anna Gilbert at the University of Michigan (and former AT&T colleague!) is one of the organizers of the current 'special year' at the IMA, and it's on a topic that I think many theoryCS folks will be interested in: the mathematics of information.
They're taking applications for postdocs (speaking of which) up until February 4 (two days from now!). So get those applications in.
And what's the special year about ? In Anna's words (edited for formatting):
The Mathematics of Information
During the academic year 2011-2012, the annual program at the Institute for Mathematics and its Applications (IMA) at the University of Minnesota will be in the Mathematics of Information. Information, broadly defined, plays an ever increasingly large role in our scientific and technological endeavors, as well as our daily lives. Collectively, we search the web over billions of times per day, we keep our medical records in digital repositories, we share videos and photos over the web, and we listen to music on MP3 players. Our ability to collect, store, and generate vast quantities of data far outstrips our ability to analyze, process, and understand these data. Many scientific, engineering, and medical applications depend on our ability to handle enormous amounts of complex information. Network engineering involves massive datasets at high speeds, medical imaging generates large data with intricate geometric relationships, and astronomical observations include data at different wavelengths or spectral bands. Our current technology and scientific tools for information lag far behind our ability to collect enormous amounts of data, our need to analyze that data as efficiently as possible, and, finally, our ability to use that analysis to make sound, timely decisions.
The special year will include seven workshops with a thematic focus on the mathematics of information and will include topics in mathematics, algorithms, theoretical computer science, machine learning, network analysis, and applications in biology, include medical informatics and group testing for high throughput screening. Each workshop is designed to be truly interdisciplinary, involving researchers from mathematical sciences, computer sciences, engineering, and other fields. Interspersed will be Hot Topics workshops, IMA public lectures, and tutorial and/or short courses. Faculty, graduate students and postdoctoral fellows are invited to participate as long-term visitors, New Direction Visiting Professors, or postdoctoral fellows (either at the IMA or jointly with industry).
You can find more information about the individual workshops at the annual program webpage.
9/26-30/11 Workshop: High Dimensional Phenomena
10/24-28/11 Workshop: Large Graphs: Modeling, Algorithms and Applications
11/14-18/11 Workshop: Large Data Sets in Medical Informatics
2/13-17/12 Workshop: Group Testing Designs, Algorithms, and Applications to Biology
2/27-3/2/12 Workshop: Network Links: Connecting Social, Communication and Biological Network Analysis
3/26-30/12 Workshop: Machine Learning: Theory and Computation
5/7-11/12 Workshop: User-Centered Modeling
Postdoctoral Fellows
Postdoctoral Fellows applications deadline was January 7, 2011 but applications will be considered until February 4, 2011.
These IMA Postdoctoral Fellowships run one to two years, at the option of the holder, starting August 31, 2011. In the second year of the appointment there are a variety of options to enhance career development, including continued work on the fellow's established research program, participation in the 2012-2013 academic year program, teaching, and working on an industrial project. Postdoctoral fellows receive a salary of $55,000 annually, and a travel allowance.
There is also an option for Industrial Postdoctoral Fellows. In the second year of the appointment, Fellows can work on a project at an industrial partner.
Visiting the IMA
You can contact the IMA if you would like to visit for a long period of time during the year or come for a short visit, especially associated to a workshop. There's more information on general visits and short term visits.
There are sources of funding available for both types of visits and the IMA staff are excellent at making such visits possible.
CRA discussion on postdocs
The CRA commissioned a study on the growing prevalence of postdocs in the CS career path. Specifically,
Personally, I think that this is a dangerous trend, for the following reasons:
The question of whether it is healthy for the field, and for theThey want feedback:
individuals, to move to a new steady state, in which prior to being
hired in a tenure-track position post doctoral positions are expected,
is an important one for the entire community. The question of whether
support for such a program should come from funding agencies like NSF, perhaps at the expense of other funding, is also important. To encourage a discussion about the need for, and role of, postdocs within the computing research community, the CRA commissioned a committee to prepare a short white paper that reports the statistics associated with academic and industry hiring, articulates the relevant issues about postdocs in the context of the many stakeholders, and specifically solicits input from the community.
We would like to encourage departments and laboratories to provideSo chime in if you have thoughts on this matter.
input -- to review the white paper, to discuss the issue, and to post
your views (collectively or individually) on the companion website
(http://cra.org/postdocs). The CRA hopes to obtain a sense of the
community by March 15. Following review of the comments that are
received, the committee will prepare a revised version of the white
paper articulating the community's broad view (and consensus, if any)
on this issue.
Personally, I think that this is a dangerous trend, for the following reasons:
- Quickly, doing a postdoc will become the norm, rather than an option, when looking for academic jobs. I think this is unnecessary from a training perspective for everyone (though it might be appropriate for some).
- One of the things that has kept CS viable academically is that people can leave after a Ph.D, go to industry, and still make it back into academia. This no longer seems to be true in places like the natural sciences, with long postdocs. I wouldn't want their career path.
- Postdocs are glorified free labor for PIs. Salaries are miniscule, and competition is fierce. And again, it's not entirely clear that fresh Ph.Ds are so incompetent that they need 5 year postdocs to be ready for a faculty job.
- Ph.D training suffers, because "you can fix it in the postdoc". I don't think that's healthy either.
POTD: Continuous Local Search
Inspired by Oded Goldreich's "my choices" page, and my constant and futile attempts to read more, I'm going to attempt to write short (SHORT!) posts summarizing papers that I've been reading lately, with some thoughts. Each such post will be tagged 'potd' for 'paper of the day', in the spirit of John Baez which is to say, not a paper each day, but a paper on the day the post is made :).
Continuous Local Search
C. Daskalakis and C. Papadimitriou
SODA 2011.
Abstract:
There are many iterative schemes that get used in practice for which time to convergence is unknown. The Weiszfeld algorithm for computing 1-medians is one such method. There are also alternating optimization schemes like k-means, EM and ICP that have resisted convergence analysis for a while - although we now have a pretty good understanding of the behavior of k-means. Classes like CLS appear to capture some numerical iterative schemes like gradient descent, and it might be interesting to establish connections (via reductions) between such iterative methods and other problems of a more game-theoretic nature that appear to crop up in this class.
One catch though is that while the reductions are poly time, the only requirement is that a solution to one map back to a soliution for another. So it's not clear that reductions in CLS or convex CLS preserve "time to convergence" - in fact it seems unlikely.
Continuous Local Search
C. Daskalakis and C. Papadimitriou
SODA 2011.
Abstract:
We introduce CLS, for continuous local search, a class of polynomial-time checkable total functions that lies at the intersection of PPAD and PLS, and captures a particularly benign kind of local optimization in which the domain is continuous, as opposed to combinatorial, and the functions involved are continuous. We show that this class contains several well known intriguing problems which were heretofore known to lie in the intersection of PLS and PPAD but were otherwise unclassifiable: Finding fixpoints of contraction maps, the linear complementarity problem for P matrices, finding a stationary point of a low-degree polynomial objective, the simple stochastic games of Shapley and Condon, and finding a mixed Nash equilibrium in congestion, implicit congestion, and network coordination games. The last four problems belong to CCLS, for convex CLS, another subclass of PPAD $\cap$ PLS seeking the componentwise local minimum of a componentwise convex function. It is open whether any or all of these problems are complete for the corresponding classes.Notes:
There are many iterative schemes that get used in practice for which time to convergence is unknown. The Weiszfeld algorithm for computing 1-medians is one such method. There are also alternating optimization schemes like k-means, EM and ICP that have resisted convergence analysis for a while - although we now have a pretty good understanding of the behavior of k-means. Classes like CLS appear to capture some numerical iterative schemes like gradient descent, and it might be interesting to establish connections (via reductions) between such iterative methods and other problems of a more game-theoretic nature that appear to crop up in this class.
One catch though is that while the reductions are poly time, the only requirement is that a solution to one map back to a soliution for another. So it's not clear that reductions in CLS or convex CLS preserve "time to convergence" - in fact it seems unlikely.
Subscribe to:
Posts (Atom)