## Saturday, February 19, 2011

### Agnostic/Non-agnostic Learning and Nets/Samples of Range Spaces

There are deep connections between geometry and learning that go back to the origins of VC dimension and include concepts like support vector machines, lifting maps, kernel functions, and iterative reweighting. What is often difficult is the translation problem of going between machine learning concepts and the corresponding (often identical) geometric concepts. One example of this that I've talked about is the two different viewpoints on characterizing isometric embeddability in an $\ell_2$ space.

I strongly believe that with a better "dictionary", we might find a lot more cross-fertilization between geometry and learning, and this would be beneficial for both communities. The 2009 summer school on geometry and learning was a good example of this: alas, it almost completely coincided with SoCG 2009. This summer, there is again a summer school in geometry and learning just before SoCG 2011, and it's in conjunction with it (shameless plug: I'll be lecturing there), and I look forward to seeing more such events.

In this post, my postdoc Jeff Phillips and student Avishek Saha explore a fascinating connection between nets and samples in range spaces on the one hand, and agnostic/non-agnostic learning on the other. Avishek is a machine learning specialist, and Jeff is of course an expert in approximate geometry, sampling and uncertainty. So it is fitting that their joint conversation led to this post.

There is a deep connection between some concepts in combinatorial geometry and learning. It turns out that $\varepsilon$-approximations and $\varepsilon$-nets of range spaces correspond to sample complexity in agnostic learning and non-agnostic learning, respectively. We noticed this last fall when both reading a cute paper by Gandhi, Suri, and Welzl and a book by Anthony and Bartlett.

The geometry:
Given a data set $P$ and a family of subsets $\mathcal{A} \subset 2^P$, the pair $(P,\mathcal{A})$ is a range space. An $\varepsilon$-approximation of $(P,\mathcal{A})$ is subset $Q \subset P$ such that
$\max_{R \in \mathcal{A}} \left| \frac{|R|}{|P|} - \frac{|R \cap Q|}{|Q|} \right| \leq \varepsilon.$
An $\varepsilon$-net of $(P,\mathcal{A})$ is a subset $N \subset P$ such that for all $R \in \mathcal{A}$ with $|R| \geq \varepsilon |P|$ then the intersection $|R \cap Q| \geq 1$. Informally, the $\varepsilon$-approximation approximately preserves the density, and the $\varepsilon$-net hits every large enough subset.

The VC-dimension $v$ of a range space $(P,\mathcal{A})$ informally describes the description complexity of a range $R \in \mathcal{A}$. For instance, halfspaces in $\mathbb{R}^d$ have VC-dimension $d+1$. A random sample $Q \subset P$ of size $O(v/\varepsilon^2)$ is an $\varepsilon$-approximation of $(P,\mathcal{A})$ with constant probability, and a random sample $N \subset P$ of size $O((v/\varepsilon) \log (1/\varepsilon))$ is an $\varepsilon$-net of $(P,\mathcal{A})$ with constant probability.

Learning theory:
Agnostic and non-agnostic learning can also be described with respect to a range space $(P,\mathcal{A})$. We are given a data set $P$ where each element $p \in P$ is either labeled $+$ or $-$. We want to find a classifier $R \in \mathcal{A}$ such that all elements $p \in P$ labeled $+$ are in $R$ and all elements $p \in P$ labeled $-$ are not in $R$. Non-agnostic learning, also referred to as realizable learning, is the setting where we assume that there exists such a classifier $R^*$ in "concept class" $\mathcal{A}$ that has zero-error, it perfectly separates all $+$ points from $-$ points. Agnostic learning is the setting without this assumption of the existence of a zero-error classifier in $\mathcal{A}$, so the aim is to find a classifier $\hat R \in \mathcal{A}$ which misclassifies the fewest number of points. This weakened target function assumption can be alternatively viewed as traditional PAC learning that accommodates arbitrary noise.

The connection:
Now here is where the concepts start looking similar: In non-agnostic learning, if we are given a random sample $N \subset P$ of size $O((v/\varepsilon) \log (1/\varepsilon))$, and run an algorithm to perfectly classify $+$ and $-$ points from $N$, then the same classifier will misclassify at most $\varepsilon |P|$ points on $P$, with constant probability.
In agnostic learning, if we are given a random sample $Q \subset P$ of size $O(v/\varepsilon^2)$, and run an algorithm to find a classifier on $Q$ that misclassifies the fewest number of points, then the same classifier will misclassify at most $\varepsilon$ larger fraction of points on $P$, with constant probability.

Why are those bounds the same?

Lets first look at $\varepsilon$-nets and non-agnostic learning. The key is to construct a range space $(P,\mathcal{S})$ where each $T \in \mathcal{S}$ is the symmetric difference of two ranges $R^*,R^\prime \in \mathcal{A})$, that is it contains all points contained in exactly one of the two ranges. If $(P,\mathcal{A})$ has VC-dimension $v$, then $(P,\mathcal{S})$ has VC-dimension $O(v)$.
Now consider the optimal classifier $R^* \in \mathcal{A}$ of $P$, and any other classifier $R^\prime \in \mathcal{A}$ that misclassifies at most $\varepsilon |P|$ points of $P$. Then the symmetric difference of $R^*$ and $R^\prime$ is at most $\varepsilon |P|$ points, and corresponds with some range $T \in \mathcal{S}$. If $N \subset P$ is an $\varepsilon$-net of $(P,\mathcal{S})$, then any classifier $R^\prime \in \mathcal{A}$ learned perfectly on $(N,\mathcal{A})$, cannot misclassify more than $\varepsilon |P|$ points since it would then necessarily have too many points in the symmetric difference with the perfect classifier $R^*$ on $P$, and $N$ would contain at least one of them.

Now to understand $\varepsilon$-approximations and agnostic learning, we need to see why this same approach won't work. Namely, because there is, in general, no perfect classifier $R^\prime \in \mathcal{A}$ defined on a subset of $Q \subset P$. Thus when we look at the symmetric difference between $\hat R$ and $R^\prime$, we cannot argue that the set of misclassified points should be empty in $Q$ using $R^\prime$ if the difference is less than $\varepsilon |P|$, and can't apply the $\varepsilon$-net result.

But we can apply the stronger $\varepsilon$-approximation result in this agnostic setting. The difference between any two ranges $R, R^\prime \subset \mathcal{A}$ on $P$, that contain the same points in $Q$, is at most $\varepsilon |P|$. Thus the difference in the fraction of misclassified points in $P$ from the optimal range $\hat R$ found on $Q$ is no more than $\varepsilon$, since $\hat R$ only differs by an $\varepsilon$-fraction in the number of points it contains in $Q$ versus $P$.

An intuitive way to see this connection is to think of agnostic and non-agnostic learning in the context of presence and absence of noise. Non-agnostic learning is the ideal setting where we have no noise and standard VC-dimenion based learning bounds suggest that we need $O((v/\varepsilon)\log(1/\varepsilon))$ samples (an $\varepsilon$-net) to learn a target function with $\varepsilon$ classification error. On the other hand, agnostic learning accommodates noise by allowing points from either class to spill over the decision boundary. This implies that some of the randomly sampled points might be noisy and hence misleading for the purposes of learning the target classifier. To nullify this effect of noisy sampled points we need to oversample which increases the sample complexity by about an $(1/\varepsilon)$ factor (yielding an $\varepsilon$-approximation).