This is part 2 of a three part series on the fundamentals of Machine Learning as presented by this book. It builds heavily onto part I.

Uniform Convergence

Uniform convergence is a property which a hypothesis class may satisfy that will end up being equivalent to PAC learning, and the ability to use either definition whenever needed will be quite useful. “Uniform convergence” means convergence with respect to the difference between the empirical error and the real error, which is uniform with respect to all hypotheses in the class – i.e. we can bound the difference between real and empirical error across all hypotheses simultaneously. We have the following formal definition:

A hypothesis class $H$ has the uniform convergence property iff there is a function $u^{*} : (0, 1)^{2} \to N$ such that, for every probability distribution $D$ over $X \times Y$ , every gap $ϵ \in (0, 1)$ and every confidence level $δ \in (0, 1)$ , if $| S | > u^{*} (ϵ, δ)$ , then with probability at least $1 - δ$ over the choice of $S$ , the inequality $| ℓ_{S} (h) - ℓ (h) | \leq ϵ$ holds for all $h \in H$ .

Recall that PAC learnability demands that the true error of the hypothesis output by $A_{ERM}$ is low. On the other hand, uniform convergence demands that the gap between the empirical and the true error is low. Unsurprisingly, both properties are equivalent. We will prove one direction now, namely that uniform convergence implies PAC learnability.

Suppose the uniform convergence property holds for values $ϵ, δ$ . We apply $A_{ERM}$ to our task, feed it our training sequence $S$ which fulfills the property that $| S | > u^{*} (ϵ, δ)$ , and now want to argue that the true error of $A_{ERM} (S)$ is not much higher than the true error of some $h^{*} \in {argmin}_{h \in H} [ℓ (h)]$ . The uniform convergence property does not allow us to compare both errors directly, rather, we can argue that $ℓ (A_{ERM} (S)) - ℓ_{S} (A_{ERM} (S)) \leq ϵ$ because of uniform convergence, that $ℓ_{S} (A_{ERM} (S)) - ℓ_{S} (h^{*}) \leq 0$ because of what $A_{ERM}$ does, and finally that $ℓ_{S} (h^{*}) - ℓ (h^{*}) \leq ϵ$ because of uniform convergence. Putting these three inequalities together, we do indeed obtain a bound on $ℓ (A_{ERM} (S)) - ℓ (h^{*})$ , but it is $2 ϵ$ not $ϵ$ . The confidence parameter $δ$ stays the same. Of course, the doubling of the gap does not matter for the definition, since we can choose both $ϵ$ and $δ$ to be arbitrarily small anyway (to reach the gap $ϵ$ for PAC learning, we simply choose $\frac{ϵ}{2}$ for $u^{*}$ ), so this simple argument indeed proves that every hypothesis class which has the uniform convergence property is PAC learnable.

Before we get to more exciting things, consider the following problem in isolation. Let $X$ be some finite set and let $H$ be the set of all functions from $X$ to ${0, 1}$ . Think of some “true” function $f^{*} : X \to {0, 1}$ and of $H$ as hypothesis class containing every possible candidate. To begin, the composition of $H$ reflects that we’re wholly ignorant as to the nature of $f^{*}$ (every possible function is in $H$ ). Now, suppose we are given some partial information about $f^{*}$ in the form of some subset $U ⊊ X$ and a function $g : U \to {0, 1}$ such that $g = f_{| U}^{*}$ . (This notation means, “g restricted to the set $U$ ”.) We now only care about those functions in $H$ that agree with $g$ on the set $U$ .

How much does this help? For every element $x \in U$ , we now know the correct label, namely $g (x)$ . But consider an element $y \in X - U$ . Does our partial knowledge of $f^{*}$ help us to figure out the label of $f^{*} (y)$ ? The answer to this is a resounding no. Even if $U$ contains every element in $X$ except $y$ , there are still 2 possible candidates in $H$ , namely the function $f_{0}$ which agrees with $g$ on $U$ and sets $y$ to $0$ , and $f_{1}$ which agrees with $g$ on $U$ and sets $y$ to 1. This means we’re still at ⁵⁰⁄₅₀ odds with regard to the label of $y$ , the same as we were from the beginning. Thus, if every labeling combination is possible (and equally likely), then learning the label of some points is zero information about the label of other points. This principle will be key to the next two sections.

The No-Free-Lunch Theorem

The overfitting question is about how to weigh evidence vs priors, but is there a need for priors at all? What happens if one only relies on evidence? The papayas example shows that $A_{ERM}$ will not reliably do the job, but perhaps there is some other, more clever learner that will always succeed given enough samples?

It will come to no surprise that the answer is negative (for any reasonable definition of “enough samples”), and this is formally proven in the No-Free-Lunch Theorem. Given any learning algorithm $A$ , we shall construct a domain set $X$ , a probability distribution $D$ , and a hypothesis class $H$ such that $A (S)$ will have a high expected error. (“Expected” as in, we take the set of all possible training sequences as our probability space and consider $A (S)$ to be a random variable so that we can compute its expected value.) The label set will still be just ${0, 1}$ , so the No-Free-Lunch theorem holds for binary classification. One noticeable feature of this proof is that we will require the instance domain to be much larger than the training sequence.

Let $m$ be the length of the training sequence $S$ , and let $C \subseteq X$ be a set of size $2 m$ . We will only consider probability distributions which put all of their probability mass into elements of $C$ , so the instances outside of $C$ will not matter. Let $H^{'}$ be the set of all functions from $C$ to $Y = {0, 1}$ , and let $H$ be the set obtained by taking every function from $H^{'}$ and extending it to a function from all of $X$ to $Y$ by letting it label every element outside of $C$ by 0. (Again, these elements do not matter.) In symbols,

$H = {h : X \to Y | \exists h^{'} \in H^{'} such that [\forall x \in C : h (x) = h^{'} (x)] \land [\forall x \in X - C : h (x) = 0]}$

Note that we have $| H | = | H^{'} |$ .

Now, we construct a set $D$ of many different probability distributions. Each $D \in D$ will sample each domain point $x \in C$ with equal probability, but they will label the element differently. (We shall not formally switch to the setting of a true function in the environment, but the probability distributions which we consider are all consistent with such a function.) To be more precise, we construct one probability distribution for each function $f \in H^{'}$ that labels each element according to $f$ . In symbols, for each $f \in H^{'}$ we set

$D_{f} ((x, y)) := {\begin{matrix} 1 / | C | & ; x \in C \land y = f (x) 0 & ; otherwise \end{matrix}$

and we let $D = {D_{f} | f \in H^{'}}$ . Notice that the learner is now in a situation analogous to what we’ve discussed earlier: it can learn about at most half of the points in the environment (because $| S | = m = \frac{1}{2} | C |$ ), and since every in-principle-possible labeling function is equally viable, that means it doesn’t know anything about the labels it doesn’t see, and those are at least half. Thus, it will have a sizeable total error – in fact, at least $\frac{1}{4}$ in expectation, since $A$ has to blindly guess about half of all elements.

The proof consists of formalizing the above argument. This gets more technical, but I’ll provide the details for the part that I think is at least an interesting technical problem. For each probability distribution $D_{f}$ , one can enumerate all possible training sequences $S_{1}^{D_{f}}, . . ., S_{k}^{D_{f}}$ that are obtained by querying $D_{f}$ repeatedly, $m$ times in total. (We have $k = (2 m)^{m}$ but this is not important.) The expected loss for this probability distribution is now $\frac{1}{k} \sum_{i = 1}^{k} ℓ (A (S_{i}^{D_{f}}))$ . The problem here is that $A$ may just guess all unseen labels correctly – it may even ignore $S$ entirely and guess the labels of all elements correctly. Of course, such a learner would then perform exceptionally poorly on other probability distributions; the point is that it cannot do better than guess. Thus, averaging over all training sequences does not do the trick; we actually want to average over all probability distributions. To achieve this, recall that we need to show that there exists some probability distribution with high error, or in other words, the maximum across all probability distributions (of the above sum) will have high error. We can then lower bound the maximum by replacing it with the average, hence summing over probability distributions after all. To be precise, let $H^{'} = {f_{1}, . . ., f_{T}}$ (we have $T = 2^{2 m}$ but this is not important) so that $D = {D_{f_{1}}, . . ., D_{f_{T}}}$ , then

${max}_{D \in D} (\frac{1}{k} \sum_{i = 1}^{k} ℓ (A (S_{i}^{D}))) \geq \frac{1}{T} \sum_{j = 1}^{T} \frac{1}{k} \sum_{i = 1}^{k} (ℓ (A (S_{i}^{D_{f_{j}}})))$

We can now exchange the order of these to sums so that we finally get to sum over all possible probability distributions. Then, we replace the right sum (across all training sequences) with the minimum across all training sequences, because in fact the training sequence does not matter. For each training sequence, we have that for each of the $m$ points outside the sequence, exactly half of all probability distributions will label that point 0 and the other half 1. Now the learner cannot differentiate between them, no matter how it is defined, because they have lead to the identical training sequence, and thus it gets it right in only half of all cases. Alas, it only gets half of all elements outside the training sequence right in expectation, and thus at most three quarters of all elements total. (For the remaining details, see pages ⁶²⁄₆₃.)

Now, this leads to an expected error of at least $\frac{1}{4}$ . However, instead of choosing a set $C$ of size $2 m$ , we could more generally choose a set $C$ of size $b \cdot m$ . With an analogous construction, we can show that the learner might still get the $m$ elements in the training sequence right, but will merely guess on the other $(b - 1) \cdot m$ , and its expected error ends up as $\frac{1}{2} \frac{b - 1}{b}$ . Thus, by increasing $b$ we can push the error arbitrarily close toward $\frac{1}{2}$ . In the setting of binary classification, this is the error achieved by blindly guessing, and so it is fair to say that no meaningful lower bound of quality can be achieved without making assumptions about the nature of the learning task. In particular, such assumptions are the basis of generalization: the idea that a papaya which is surrounded by similar papayas, all of which tasty, is probably tasty itself – that idea is an assumption about how we can generalize; it would not be true if we took all labeling functions as equally likely candidates for the true function. In fact, having such a dense hypothesis class makes generalization impossible, as we’ve just demonstrated. This fact is also key for the upcoming chapter.

The VC-Dimension

The VC-dimension is a concept that falls right out of the above line of thinking. You might recall that every finite hypothesis class is learnable – but many infinite hypothesis classes are also learnable. For example, the class of all intervals (or more precisely, the set of all predictors $h$ such that there exists an interval $I$ with the property that $h (x) = 1 ⟺ x \in I$ ) over the domain $X = R$ is learnable. (I won’t give a proof because we’re about to get a much more general result, but it’s probably not surprising – if you want, stop reading and think about it.) This suggests that the decisive property for learnability is not the size of the hypothesis class, but rather some kind of measure for how easy it would be to construct a no-free-lunch style example.

The key piece in the preceding proof is the same idea I’ve emphasized before: if all possible functions on a finite set are possible, then learning about the value of the true function on some points is zero information about its value on the other points. How does that work in the case of intervals? Can we construct a set of points such that all labeling functions are possible? Note that the important question is not whether the environment distribution $D$ can label them in every possible combination – the answer to this is likely affirmative since the $D$ need not be deterministic. Rather, the question is whether there are predictors in our class – i.e., interval predictors – which allow every combination of labels on any particular set. This is so because PAC learnability only implies that we can get close to the best classifier in our hypothesis class, so in order for a no free lunch type argument to go through, it has to be impossible to obtain information about this predictor (which is in our class).

So is this possible with intervals? Let’s start with a 1-point set:

Success! We have found interval predictors that give every possible labeling combination (i.e. (1) and (0)) on our one domain point $x \in X$ . (To be clear, the black and red x in the image are meant to denote the same element; we drew two possible scenarios of where the interval might be with regard to that element). We thus say that our one-point set is shattered by the class of interval predictors – a bad thing as it means it’s harder to learn.

What about a two-point set?

Indeed, we see that our hypothesis class is dense enough to shatter 2-point sets, since all four sequences of labels (namely (11), (10), (01), (00)) are possible. Now, the training sequence is of course a sequence and could therefore contain the same point twice. In that case, this two-point sequence could not be shattered. However, we actually care about the existential quantifier rather than the universal quantifier here. The existence of a single sequence of size two that is shattered will be important for our upcoming definition. (Also, the cases where the same point is included more than once are uninteresting anyway because those sequences can trivially never be shattered.) This is why we go about defining this new concept in terms of sets rather than sequences.

What about three-point sets? Take a moment to think about it: there are eight possible labeling combinations. Are there eight different interval classifiers that will produce all eight different labels?

The answer is no – if we label the three points $x$ and $y$ and $z$ such that $x < y < z$ , any interval containing both $x$ and $z$ must also contain $y$ , which makes the labeling $(1, 0, 1)$ impossible (although all 7 other combinations are possible). No set of size three can be shattered by our class.

This proves that the VC-dimension of our class is 2, because the VC-dimension is defined simply as $VC (H) := max {n \in N | \exists C \subseteq X : | C | = n \land H shatters C}$ . Since we found some set of size 2 that is shattered and proved that no set of size 3 is shattered, the largest integer such that a set of that size can be shattered is 2.

(If no maximum exists – i.e. if for any $n \in N$ , there exists a set of size $n$ that is shattered – the VC-dimension is $\infty$ . )

How useful is this concept of VC-dimension? For one, any class with VC-dimension $\infty$ is not PAC learnable. Given such a hypothesis class and any learner $A$ , for each $m \in N$ , one can find a set $C$ of size $2 m$ and a probability distribution $D$ such that the argument used in the No-Free-Theorem goes through, thus establishing a minimum gap between the performance of the optimal classifier and the classifier output by $A$ , thus contradicting the definition of PAC learning. (To be more precise, we first prove that, if the expected error is at least $\frac{1}{4}$ , then there must be at least a $\frac{1}{8}$ chance to have an error of at least $\frac{1}{7}$ , which then gives us a direct contradiction with the definition of PAC learning. This proof of this is easy – prove the contrapositive.)

The fact that this contradicts PAC learnability depends on the precise definition of PAC learnability. In particular, if the sample complexity were allowed to depend on the probability distribution, the No-Free Lunch argument would no longer work. The third post will discuss two alternative notions of learnability, where the sample complexity is allowed to depend on further elements.

What about classes with finite VC-dimension? Here, the definition of VC-dimension may seem uninformative on first glance – only sets of size five are shattered, but perhaps for any $m$ there exist sets of size $m$ that are almost shattered? However, it turns out this is not possible. One can prove that, for any $m \in N$ and any subset $C \subset X$ with $| C | = m$ , the number of functions in $H_{C}$ (where $H_{C} := {h_{| C} | h \in H}$ ) is equal to the number of subsets of $C$ that are shattered by $H$ . Since, if the VC-dimension of $H$ is $d$ for some $d \in N$ , only subsets of size at most $d$ can be shattered, and since there are only polynomially many subsets of size at most $d$ , this implies that the size of $H_{C}$ can be bounded by a function polynomial in $| C |$ . This is a pretty strong result, given that there are $2^{| C |}$ possible classifiers for $C$ .

No proof for this result, as I didn’t find reading it very instructive. The same will be true for most of the upcoming result.

The Fundamental Theorem of Statistical Learning

The fundamental theorem of statistical learning (in slightly simplified form) states that, if $H$ is some hypothesis class for any domain set $X$ and target set $Y = {0, 1}$ , the following statements are equivalent:

1. $H$ has the uniform convergence property

2. $A_{ERM}$ is a successful PAC learner for $H$

3. $H$ is PAC learnable

4. $H$ has finite VC-dimension

We’ve proved $1 ⟹ 2$ in the first section of this post. $2 ⟹ 3$ is trivial. We’ve sketched the proof for $3 ⟹ 4$ in the previous section, by arguing that $\neg 4 ⟹ \neg 3$ . The difficult-and-very-technical-proof-I-will-not-include is the one for $4 ⟹ 1$ . Those four taken together make it into a circle, thus establishing equivalence.

There is also a quantitative version which upper- and lower bounds the sample complexity for PAC learnability and uniform convergence.

As nice as this result is, it is also important to understand that it says nothing whatsoever about runtime. As we’ve seen, the class of all predictors such that there exists a C program with at most 10000 symbols implementing the predictor is PAC learnable, but actually implementing $A_{ERM}$ is completely infeasible.

Understanding Machine Learning (II)

Uniform Convergence

The No-Free-Lunch Theorem

The VC-Dimension

The Fundamental Theorem of Statistical Learning