## Wednesday, February 02, 2011

### Sample Complexity for eps-approximations of Range Spaces

This is a guest post by Jeff Phillips, who among other things ponders the computation of $\epsilon$-samples for range spaces under uncertainty. Jeff is looking for permanent research positions right now, so uncertainty is a major part of his life !

At some point I had plans to write a blog post detailing a simpler proof of the number of random samples needed for an $\varepsilon$-approximation of a range space. This is a very powerful result that is simple to state (I'll define below) and simple to use. I view it as very geometric, but it originated in learning theory. I had used it in several papers, but until recently had not actually read the full proof, and even slightly less recently was unaware of the best bound. The original paper by Vapnik and Chervonenkis is hard to find, but I eventually found a proof in a learning-theory book by Anthony and Bartlett. The proof uses a series of tricks I was not aware of, and the notation is different from what I was used to.

After reading the 20+ page proof, I felt there was a simpler geometric way to think about the problem, and even sketched-out what I thought was a simpler proof. I thought that would be a great thing to post on this blog! But slowly things began to break down. I prepared a pdf writeup, that started at under one page. Then realizing gaps in the reasoning, it grew to 2-3 pages, and then 4 as I attempted to write the tighter result. In presenting this approach to the informal theory seminar here at Utah and realized one of the main geometric tricks I was hoping to use was insufficient for proving what was needed. Looking back at the Anthony and Bartlett proof, I finally realized why all of tricks and machinery they use are really needed!

Anyways, I'd like sketch a proof of the $\varepsilon$-approximation theorem and try to include intuition of why techniques are being used.

Let $P$ be a set of objects (i.e. points) and $\mathcal{R}$ be a set of subset of $P$ (i.e. ranges, like disks). Then $(P,\mathcal{R})$ is called a range space. We will concern ourselves with range spaces with bounded VC-dimension d, which implies that the total number of distinct ranges is at most $c |P|^d$ for some fixed constant $c$. Now a subset $Q$ of $P$ is an $\varepsilon$-approximation of $(P,\mathcal{R})$ (for $0 \le \varepsilon \le 1$) for all ranges $A \in \mathcal{R}$
$\left| \frac{|P \cap A|}{|P|} - \frac{|Q \cap A|}{|Q|} \right| \leq \varepsilon. \hspace{.2in} (1)$

The main result is that:
Theorem. If $Q$ is a random sample from $P$ of size $h = O((1/\varepsilon^2)(d + \log(1/\delta))$ then it is an $\varepsilon$-approximation of $(P,\mathcal{R})$ with probability at least $1-\delta$.

This was first proven by Talagrand in 94. The original result of Vapnik and Chervonenkis had a slightly worse bound of $h = O((1/\varepsilon^2)(d \log(d/\varepsilon) + \log(1/\delta))$. The main reason this is a powerful tool is that $h$ is independent of $|P|$. As I found out, even this weaker result is difficult to prove.
I'll also note, that I first became aware of this stronger result as a reduction from a more general result by Li, Long, and Srinivasan 01, as cited by Har-Peled and Sharir 07.

There is a nice progression to the full result through a series of weaker results.

Step 1.
We can show that a random sample of $r = (1/2 \varepsilon^2) \ln(2/\delta)$ points is sufficient to prove (1) for any one fixed range, by a Chernoff-Hoeffding (C-H) bound. Then since there are only $c |P|^d$ distinct ranges, by the union bound, we only need $h_1 = (1/2 \varepsilon^2)(d \ln |P| + \ln(2c/\delta))$ samples to hold for all ranges.

Note that $h_1$ is a function of $|P|$, we want to get rid of this dependence.

This C-H + union bound will be used again. Here the C-H bound has $h_1$ events, one for each random sample, we will reduce this in Step 3. The dependence on $|P|$ is through the union bound, we improve this in Step 2.

Step 2.
To get rid of the dependence on |P| we need to apply the union bound on a smaller number of ranges. We use a technique called symmetrization, where we consider two random samples $S$ and $Q$ of $P$, both of size $h_2$. Then we compare them against each other with the goal of showing for all $A \in \mathcal{R}$
$\left| \frac{|Q \cap A|}{|Q|} - \frac{|S \cap A|}{|S|} \right| \leq \varepsilon/2. \hspace{.2in} (2)$
We can intuitively see that showing (2) holds with probability $\geq 1-\delta$ is sufficient to show (1): If $h_2$ is too small, it is likely both $Q$ and $S$ have at least some range $A$ in which they are far from $P$, but since there are many ranges, it is likely that they are far on different ranges. Thus if $h_2$ is too small $Q$ and $S$ are likely different from each other on some range $A$ also. Likewise, if $h_2$ is large enough so $Q$ and $S$ are close on all ranges, then it is likely they are close to $P$ on each of those ranges too.

Another technical trick is required to show (2) holds. We correspond each element in $q_i \in Q$ with an element of $s_i \in S$. Then it is sufficient (once we have sampled $Q$ and $S$) to show (2) occurs with probability $\geq 1-\delta$ w.r.t. possible swaps of ($s_i \in Q$ and $q_i \in S$) vs. ($s_i \in S$ and $q_i \in Q$), for all corresponding pairs.

Since the events in the permutation have bounded effect on each range, and there are only $c |Q \cup S|^d$ ranges we need to consider, we can apply the C-H + Union bounds as in Step 1, but now we only have $O((d/\varepsilon)^d)$ events (independent of $|P|$) and we get the original V-C bound of $h_2 = O((1/\varepsilon^2)(d \log(d/\varepsilon) + \log(1/\delta))$.

Another neat effect of symmetrization is that it allows us to consider continuous sets $P$. The analysis in Step 1 required us to compare $Q$ to all ranges defined by $P$, which would have been infinite if $P$ were continuous. But now since we compare two random samples from $P$, even if it is continuous, the random samples are finite, and hence induce a finite number of ranges.

Step 3.
We now show how to get rid of the extra $\log(d/\varepsilon)$ term. We do this through a technique called chaining which allows us to decompose ranges and change the type of events we measure with the C-H bound.

Using symmetrization consider two subsets $Q$ and $S$ of $P$ and let $P^\prime = Q \cup S$. We define a series of range spaces $(P^\prime, \mathcal{T}_0)$, $(P^\prime, \mathcal{T}_1)$, $\ldots$ so that
• (a) each range $A \in \mathcal{R}$ is a disjoint union $A \bigcup_j B_j$ where each $B_j \in \mathcal{T}_j$.
• (b) each $B_j \in \mathcal{T}_j$ is of size $|B_j| \leq |P^\prime|/2^j$.
• (c) $Q$ is an $\varepsilon_j$-approximation of $(P^\prime, \mathcal{T}_j)$ with probability $\geq \delta_j$ where $\varepsilon_j = \sqrt{j/2^j}/12$ and $\delta_j = \delta/2^{j+2}$.
Fact (a) allows us to analyze $(P^\prime,\mathcal{R})$ (with $\varepsilon, \delta$) by considering each $(P^\prime,\mathcal{T}_j)$ separately (with $\varepsilon_j, \delta_j$) since by (c) $\varepsilon \leq \sum_j \varepsilon_j$ and $\delta \leq \sum_j \delta_j$.

From (c) the extra $\sqrt{j}$ term in $\varepsilon_j$ cancels out the $\log(1/\varepsilon)$ term in the technical analysis leading to $h = O((1/\varepsilon^2)(d + \log(1/\delta))$ as desired.

To prove (c) we need to use (b). The key insight is that for large $j$, each range $B_j$ is small. We can then apply the C-H bound using a different type of event which there are fewer of. For a range $B_j$, instead of measuring the event if $q \in Q$ is inside $B$ (as in Step 1 and 2) we measure if some $b \in B_j$ is in $Q$ -- thus there are only $|B_j| \leq |P'|/2^j$ events. Since $|B_j|$ decreases geometrically as $j$ increases, we can also decrease $\varepsilon_j$ and $\delta_j$ geometrically as $j$ increases, proving (c).

It only remains to show the existence of each $(P^\prime, \mathcal{T}_j)$.
• We first consider a series of range spaces $(P^\prime, \mathcal{R}_j)$ where $\mathcal{R}_j$ correspond to all ranges in $(Q_j, \mathcal{R})$ where $Q_j$ is an $(1/2^j)$-approximation of $(P^\prime,\mathcal{R})$. We know $|Q_j| = O((1/\varepsilon^2) d \log(d/\varepsilon))$ by Step 2.
• Then $\mathcal{T}_j$ is defined inductively as follows. $\mathcal{T}_0 = \mathcal{R}_0$. Each $B \in \mathcal{T}_j$ corresponds to a range $A \in \mathcal{R}_j$. We know there exists another range $A^\prime \in \mathcal{R}_{j-1}$ such that $A$ and $A^\prime$ differ by at most $|P^\prime|/2^j$ points; let $B$ be those points.
Note: Most proofs construct $\mathcal{T}_j$ using a combinatorial result of Haussler 95, but this approach does not affect the asymptotics of the final solution and may actually improve the constants.