Understanding Machine Learning (II)
Uniform convergence is a property which a hypothesis class may satisfy that will end up being equivalent to PAC learning, and the ability to use either definition whenever needed will be quite useful. “Uniform convergence” means convergence with respect to the difference between the empirical error and the real error, which is uniform with respect to all hypotheses in the class – i.e. we can bound the difference between real and empirical error across all hypotheses simultaneously. We have the following formal definition:
A hypothesis class has the uniform convergence property iff there is a function such that, for every probability distribution over , every gap and every confidence level , if , then with probability at least over the choice of , the inequality holds for all .
Recall that PAC learnability demands that the true error of the hypothesis output by is low. On the other hand, uniform convergence demands that the gap between the empirical and the true error is low. Unsurprisingly, both properties are equivalent. We will prove one direction now, namely that uniform convergence implies PAC learnability.
Suppose the uniform convergence property holds for values . We apply to our task, feed it our training sequence which fulfills the property that , and now want to argue that the true error of is not much higher than the true error of some . The uniform convergence property does not allow us to compare both errors directly, rather, we can argue that because of uniform convergence, that because of what does, and finally that because of uniform convergence. Putting these three inequalities together, we do indeed obtain a bound on , but it is not . The confidence parameter stays the same. Of course, the doubling of the gap does not matter for the definition, since we can choose both and to be arbitrarily small anyway (to reach the gap for PAC learning, we simply choose for ), so this simple argument indeed proves that every hypothesis class which has the uniform convergence property is PAC learnable.
Before we get to more exciting things, consider the following problem in isolation. Let be some finite set and let be the set of all functions from to . Think of some “true” function and of as hypothesis class containing every possible candidate. To begin, the composition of reflects that we’re wholly ignorant as to the nature of (every possible function is in ). Now, suppose we are given some partial information about , namely we are given some subset and a function such that . (This notation means, “g restricted to the set .) We now only care about those functions in that agree with on the set .
How much does this help? Clearly, for every element , we now know the correct label, namely . But consider an element . Does our partial knowledge of help us to figure out the label of ? As a matter of fact, the answer is no. Even if contains every element in except , there are still 2 possible candidates in , namely the function which agrees with on and sets to , and which agrees with on and sets to 1. We’re still at 50⁄50 odds with regard to the label of , the same as we were from the beginning. Thus, if every labeling combination is possible (and equally likely), then learning the label of some points is zero information about the label of other points. This principle will be key to the next two sections.
The No-Free-Lunch Theorem
The overfitting question is about how to weigh evidence vs priors, but is there a need for priors at all? What happens if one only relies on evidence? The papayas example shows that will not reliably do the job, but perhaps there is some other, more clever learner that will always succeed given enough samples?
It will come to no surprise that the answer is negative (for any reasonable definition of “enough samples”), and this is formally proven in the No-Free-Lunch Theorem. Given any learning algorithm , we shall construct a domain set , a probability distribution , and a hypothesis class such that will have a high expected error. (“Expected” as in, we take the set of all possible training sequences as our probability space and consider to be a random variable so that we can compute its expected value.) The label set will still be just , so the No-Free-Lunch theorem holds for binary classification. One noticeable feature of this proof is that we will require the instance domain to be much larger than the training sequence.
So let be the length of the training sequence , and let be a set of size . We will only consider probability distributions which put all of their probability mass into elements of , so the instances outside of will not matter. Let be the set of all functions from to , and let be the set obtained by taking every function from and extending it to a function from all of to by letting it label every element outside of by 0. (Again, these elements do not matter.) In symbols,
Note that we have .
Now we construct a set of many different probability distributions. Each will sample each domain point with equal probability, but they will label the element differently. (We shall not formally switch to the setting of a true function in the environment, but the probability distributions which we consider are all consistent with such a function.) To be more precise, we construct one probability distribution for each function that labels each element according to . In symbols, for each we set
and we let . Now notice that the learner is in a situation analogous to what we’ve discussed earlier: it can learn about at most half of the points in the environment (because ), and since every in-principle-possible labeling function is equally viable, that means it doesn’t know anything about the labels it doesn’t see, and those are at least half. Thus, it will have a sizeable total error – in fact, at least in expectation, since has to blindly guess about half of all elements.
The proof now consists of formalizing the above argument. This gets more technical, but I’ll provide the details for the part that I think is at least an interesting technical problem. (You may still choose to skip them, though.) So, for each probability distribution , one can enumerate all possible training sequences that are obtained by querying repeatedly, times in total. (We have but this is not important.) The expected loss for this probability distribution is now . The problem here is that may just guess all unseen labels correctly – in fact it may also ignore entirely and simply guess the labels of all elements correctly. Of course, such a learner would then perform exceptionally poorly on other probability distributions; the point is that it cannot do better than guess. Thus, averaging over all training sequences does not do the trick; we actually want to average over all probability distributions. To achieve this, recall that we need to show that there exists some probability distribution with high error, or in other words, the maximum across all probability distributions (of the above sum) will have high error. We can then lower bound the maximum by replacing it with the average, hence summing over probability distributions after all. To be precise, let (we have but this is not important) so that , then
Then we can exchange the order of these to sums so that we finally get to sum over all possible probability distributions. We can then replace the right sum (across all training sequences) with the minimum across all training sequences, because in fact the training sequence does not matter. For each training sequence, we have that for each of the points outside the sequence, exactly half of all probability distributions will label that point 0 and the other half 1. Now the learner cannot differentiate between them, no matter how it is defined, because they have lead to the identical training sequence, and thus it gets it right in only half of all cases. Alas, it only gets half of all elements outside the training sequence right in expectation, and thus at most three quarters of all elements total. (For the remaining details, see pages 62⁄63.)
Now, this leads to an expected error of at least . However, instead of choosing a set of size , we could more generally choose a set of size . Then with an analogous construction, we show that the learner might still get the elements in the training sequence right, but will merely guess on the other , and its expected error ends up as . Thus, by increasing we can push the error arbitrarily close toward . In the setting of binary classification, this is the error achieved by blindly guessing, and so it is fair to say that no meaningful lower bound of quality can be achieved without making assumptions about the nature of the learning task. In particular, such assumptions are the basis of generalization: the idea that a papaya which is surrounded by similar papayas, all of which tasty, is probably tasty itself – that idea is an assumption about how we can generalize; it would not be true if we took all labeling functions as equally likely candidates for the true function. In fact, having such a dense hypothesis class makes generalization impossible, as we’ve just demonstrated. This fact is also key for the upcoming chapter.
The VC-dimension is a concept that falls right out of the above line of thinking. You might recall that every finite hypothesis class is learnable, but it is in fact the case that many infinite hypothesis classes are also learnable – for example, the class of all intervals (or more precisely, the set of all predictors such that there exists an interval with the property that ) over the domain is learnable. (I won’t give a proof because we’re about to get a much more general result, but it’s probably not surprising – if you want, stop reading and think about it.) This suggests that the decisive property for learnability is not the size of the hypothesis class, but rather some kind of measure for how easy it would be to construct a no-free-lunch style example.
The key piece in the preceding proof is the same idea I’ve emphasized before: if all possible functions on a finite set are possible, then learning about the value of the true function on some points is zero information about its value on the other points. So how does that work in the case of intervals? Can we construct a set of points such that all labeling functions are possible? Note that the important question is not whether the environment distribution can label them in every possible combination – the answer to this is likely affirmative since the need not be deterministic, i.e. even for the same point, any label might be possible. Rather, the question is whether there are predictors in our class – i.e. interval predictors – which allow every combination of labels on any particular set. This is so because PAC learnability only implies that we can get close to the best classifier in our hypothesis class, so in order for a no free lunch type argument to go through, it has to be impossible to obtain information about this predictor (which is in our class).
So is this possible with intervals? Let’s start with a 1-point set:
Success! We have found interval predictors that give every possible labeling combination (i.e. (1) and (0)) on our one domain point . (To be clear, the black and red x in the image are meant to denote the same element; we drew two possible scenarios of where the interval might be with regard to that element). We thus say that our one-point set is shattered by the class of interval predictors – which is actually a bad thing because it means it’s harder to learn.
What about a two-point set?
Indeed, we see that our hypothesis class is dense enough to shatter 2-point sets, since all four sequences of labels (namely (11), (10), (01), (00)) are possible. Now, the training sequence is of course a sequence and could therefore contain the same point twice. In that case, this two-point sequence could not be shattered. However, we actually care about the existential quantifier rather than the universal quantifier here. The existence of a single sequence of size two that is shattered will be important for our upcoming definition. (Also, the cases where the same point is included more than once are uninteresting anyway because those sequences can trivially never be shattered.) This is why we go about defining this new concept in terms of sets rather than sequences.
What about three-point sets? Take a moment to think about it: there are eight possible labeling combinations. Are there eight different interval classifiers that will produce all eight different labels?
The answer is no – we might label the three points and and such that , and then any interval containing both and must also contain . Therefore, the labeling is not possible (although all 7 other combinations are). No set of size three can be shattered by our class.
This proves that the VC-dimension of our class is 2, because the VC-dimension is defined simply as . Since we found some set of size 2 that is shattered and proved that no set of size 3 is shattered, the largest integer such that a set of that size can be shattered is 2.
To be more precise with this formal definition above, we also need to clarify what happens if no maximum exists – i.e. if for any , there exists a set of size that is shattered. In this case, we shall simply set the VC-dimension to .
How useful is this concept of VC-dimension? For one, any class with VC-dimension is not PAC learnable. Given such a hypothesis class and any learner , for each , one can find a set of size and a probability distribution such that the argument used in the No-Free-Theorem goes through, thus establishing a minimum gap between the performance of the optimal classifier and the classifier output by , and thus contradicting the definition of PAC learning. (To be more precise, we first prove that, if the expected error is at least , then there must be at least a chance to have an error of at least , which then gives us a direct contradiction with the definition of PAC learning. This proof of this is easy – prove the contrapositive statement.)
The fact that this contradicts PAC learnability depends on the precise definition of PAC learnability. In particular, if the sample complexity were allowed to depend on the probability distribution, then the No-Free Lunch argument would no longer work. The third post will discuss two alternative notions of learnability, where the sample complexity is allowed to depend on further elements.
What about classes with finite VC-dimension? Here, the definition of VC-dimension may seem a bit silly – sure, maybe only sets of size five are shattered, but perhaps for any there exist sets of size that are almost shattered. In this case, the VC-dimension would appear to be highly misleading, and certainly the class would not be PAC learnable. However, it turns out this is not possible. One can prove that, for any and any subset with , the number of functions in (where ) is equal to the number of subsets of that are shattered by . Since, if the VC-dimension of is for some , only subsets of size at most can be shattered, and since there are only polynomially many subsets of size at most , this implies that the size of can be bounded by a function polynomial in . This is a pretty strong result, given that the number of possible functions on equals .
No proof for this result, as I didn’t find reading it very instructive. Unfortunately, the same will be true for most of the, upcoming result – its proof will be largely technical, hard to remember, and not very useful to build intuition. The upside is that it now fits into a single section.
The Fundamental Theorem of Statistical Learning
The fundamental theorem of statistical learning (in slightly simplified form) states that, if is some hypothesis class for any domain set and target set , the following statements are equivalent:
1. has the uniform convergence property
2. is a successful PAC learner for
3. is PAC learnable
4. has finite VC-dimension
We’ve proved in the first section of this post. is trivial. We’ve sketched the proof for in the previous section, by arguing that . The aforementioned difficult-and-very-technical-proof-I-will-not-include is the one for . Those four taken together make it into a circle, thus establishing equivalence.
There is also a quantitative version which upper- and lower bounds the sample complexity for PAC learnability and uniform convergence.
As nice as this result is, it is also important to understand that it says nothing whatsoever about runtime. As we’ve seen, the class of all predictors such that there exists a C program with at most 10000 symbols implementing the predictor is PAC learnable, but actually implementing is completely infeasible.