Understanding Machine Learning (II)

This is part 2 of a three-part series on the fundamentals of Machine Learning as presented by this book. It builds heavily on part I.

Uniform Convergence

Uniform convergence is a property which a hypothesis class may satisfy that will end up being equivalent to PAC learning, and the ability to use either definition whenever needed will be quite useful. “Uniform convergence” means convergence with respect to the difference between the empirical error and the real error, which is uniform with respect to all hypotheses in the class – i.e. we can bound the difference between real and empirical error across all hypotheses simultaneously. We have the following formal definition:

A hypothesis class $\mathcal{H}$ has the uniform convergence property iff there is a function $m^{UC}_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}$ such that, for every probability distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$, every gap $\epsilon \in (0,1)$ and every confidence level $\delta \in (0,1)$, if $S$ is an i.i.d. sample from $\mathcal{D}$ of size at least $m^{UC}_{\mathcal{H}}(\epsilon, \delta)$, then with probability at least $1 - \delta$ over the choice of $S$, the inequality $|L_S(h) - L_{\mathcal{D}}(h)| \le \epsilon$ holds for all $h \in \mathcal{H}$ (where $L_S$ and $L_{\mathcal{D}}$ denote the empirical and the true error, respectively).

Recall that PAC learnability demands that the true error of the hypothesis output by the learner is low. On the other hand, uniform convergence demands that the gap between the empirical and the true error is low, for every hypothesis in the class. Unsurprisingly, both properties are equivalent. We will prove one direction now, namely that uniform convergence implies PAC learnability.

Suppose the uniform convergence property holds for values $\epsilon$ and $\delta$. We apply $\mathrm{ERM}_{\mathcal{H}}$ to our task, feed it a training sequence $S$ which fulfills $|S| \ge m^{UC}_{\mathcal{H}}(\epsilon, \delta)$, and now want to argue that the true error of $h_S := \mathrm{ERM}_{\mathcal{H}}(S)$ is not much higher than the true error of any other $h \in \mathcal{H}$. The uniform convergence property does not allow us to compare both errors directly; rather, we can argue that $L_{\mathcal{D}}(h_S) \le L_S(h_S) + \epsilon$ because of uniform convergence, that $L_S(h_S) \le L_S(h)$ because of what $\mathrm{ERM}_{\mathcal{H}}$ does, and finally that $L_S(h) \le L_{\mathcal{D}}(h) + \epsilon$ because of uniform convergence. Putting these three inequalities together, we do indeed obtain a bound on $L_{\mathcal{D}}(h_S) - L_{\mathcal{D}}(h)$, but it is $2\epsilon$, not $\epsilon$. The confidence parameter $\delta$ stays the same. Of course, the doubling of the gap does not matter for the definition, since we can choose both $\epsilon$ and $\delta$ to be arbitrarily small anyway (to reach the gap $\epsilon$ for PAC learning, we simply choose $\frac{\epsilon}{2}$ for the uniform convergence property), so this simple argument indeed proves that every hypothesis class which has the uniform convergence property is PAC learnable.
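
To make the chain of inequalities concrete, here is a small numerical sketch (my own toy setup, not from the book): a tiny finite domain, an arbitrary noisy labeling, and the class of all binary functions on that domain. The uniform gap $\epsilon$ is computed by brute force, and the final assertion is exactly the $2\epsilon$ bound derived above.

```python
# Toy illustration of: uniform gap eps for ALL hypotheses  =>  ERM is within 2*eps of the best.
# The domain, distribution, and hypothesis class below are made up for the demo.
import itertools, random

random.seed(0)
X = range(8)                                        # tiny finite domain
px = [1 / len(X)] * len(X)                          # uniform marginal over X
true_label = {x: random.randint(0, 1) for x in X}   # arbitrary "true" labels
noise = 0.1                                         # label noise, so no hypothesis is perfect

def true_error(h):
    # L_D(h): probability that h disagrees with the sampled label
    return sum(p * ((1 - noise) if h[x] != true_label[x] else noise)
               for x, p in zip(X, px))

def empirical_error(h, S):
    return sum(1 for x, y in S if h[x] != y) / len(S)

# hypothesis class: all 2^8 = 256 binary functions on X
H = [dict(zip(X, bits)) for bits in itertools.product([0, 1], repeat=len(X))]

# draw a training sequence S of 500 labeled examples from the distribution
S = [(x, true_label[x] if random.random() > noise else 1 - true_label[x])
     for x in random.choices(list(X), weights=px, k=500)]

eps = max(abs(empirical_error(h, S) - true_error(h)) for h in H)  # uniform gap over all of H
h_S = min(H, key=lambda h: empirical_error(h, S))                 # ERM output
h_star = min(H, key=true_error)                                   # best hypothesis in H

# the chain L_D(h_S) <= L_S(h_S) + eps <= L_S(h*) + eps <= L_D(h*) + 2*eps
assert true_error(h_S) <= true_error(h_star) + 2 * eps + 1e-12
print(eps, true_error(h_S), true_error(h_star))
```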

Before we get to more exciting things, consider the following problem in isolation. Let $X$ be some finite set and let $\mathcal{H}$ be the set of all functions from $X$ to $\{0,1\}$. Think of some “true” function $f : X \to \{0,1\}$ and of $\mathcal{H}$ as a hypothesis class containing every possible candidate. To begin with, the composition of $\mathcal{H}$ reflects that we’re wholly ignorant as to the nature of $f$ (every possible function is in $\mathcal{H}$). Now, suppose we are given some partial information about $f$, namely we are given some subset $A \subseteq X$ and a function $g : X \to \{0,1\}$ such that $g\big|_A = f\big|_A$. (This notation means, “$g$ restricted to the set $A$”.) We now only care about those functions in $\mathcal{H}$ that agree with $g$ on the set $A$.

How much does this help? Clearly, for every element $a \in A$, we now know the correct label, namely $g(a)$. But consider an element $x \in X \setminus A$. Does our partial knowledge of $f$ help us to figure out the label of $x$? As a matter of fact, the answer is no. Even if $A$ contains every element in $X$ except $x$, there are still 2 possible candidates in $\mathcal{H}$, namely the function $f_0$ which agrees with $g$ on $A$ and sets $x$ to $0$, and $f_1$ which agrees with $g$ on $A$ and sets $x$ to $1$. We’re still at 50/50 odds with regard to the label of $x$, the same as we were at the beginning. Thus, if every labeling combination is possible (and equally likely), then learning the label of some points gives zero information about the label of other points. This principle will be key to the next two sections.
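
A brute-force check of this (again my own toy example, not from the book): with the class of all functions on a four-element set and the labels of three points revealed, both labels of the fourth point remain among the consistent hypotheses.

```python
# If H contains ALL functions from a finite set X to {0,1}, then knowing f on a
# subset A says nothing about f on the remaining points.
from itertools import product

X = [0, 1, 2, 3]
H = [dict(zip(X, bits)) for bits in product([0, 1], repeat=len(X))]  # all 2^4 functions

A = {0: 1, 1: 0, 2: 1}          # partial information: labels of three of the four points
consistent = [h for h in H if all(h[x] == y for x, y in A.items())]

print([h[3] for h in consistent])   # [0, 1] -> both labels remain possible for the unseen point
```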

The No-Free-Lunch Theorem

The overfitting question is about how to weigh evidence vs. priors, but is there a need for priors at all? What happens if one relies only on evidence? The papayas example shows that unrestricted $\mathrm{ERM}$ will not reliably do the job, but perhaps there is some other, more clever learner that will always succeed given enough samples?

It will come as no surprise that the answer is negative (for any reasonable definition of “enough samples”), and this is formally proven in the No-Free-Lunch Theorem. Given any learning algorithm $A$, we shall construct a domain set $\mathcal{X}$, a probability distribution $\mathcal{D}$, and a hypothesis class $\mathcal{H}$ such that $A$ will have a high expected error. (“Expected” as in, we take the set of all possible training sequences as our probability space and consider $L_{\mathcal{D}}(A(S))$ to be a random variable so that we can compute its expected value.) The label set will still be just $\{0,1\}$, so the No-Free-Lunch Theorem holds for binary classification. One noticeable feature of this proof is that we will require the instance domain to be much larger than the training sequence.

So let $m$ be the length of the training sequence $S$, and let $C \subseteq \mathcal{X}$ be a set of size $2m$. We will only consider probability distributions which put all of their probability mass on elements of $C$, so the instances outside of $C$ will not matter. Let $\{0,1\}^C$ be the set of all functions from $C$ to $\{0,1\}$, and let $\mathcal{H}$ be the set obtained by taking every function from $\{0,1\}^C$ and extending it to a function on all of $\mathcal{X}$ by letting it label every element outside of $C$ with 0. (Again, these elements do not matter.) In symbols,

$$\mathcal{H} = \big\{\, h : \mathcal{X} \to \{0,1\} \;\big|\; h(x) = 0 \text{ for all } x \in \mathcal{X} \setminus C \,\big\}.$$

Note that we have $|\mathcal{H}| = 2^{|C|} = 2^{2m}$.

Now we construct a set of $2^{2m}$ many different probability distributions. Each will sample each domain point in $C$ with equal probability, but they will label the elements differently. (We shall not formally switch to the setting of a true function in the environment, but the probability distributions which we consider are all consistent with such a function.) To be more precise, we construct one probability distribution $\mathcal{D}_f$ for each function $f \in \{0,1\}^C$ that labels each element according to $f$. In symbols, for each $f \in \{0,1\}^C$ we set

$$\mathcal{D}_f\big(\{(x, y)\}\big) = \begin{cases} 1/|C| & \text{if } y = f(x) \\ 0 & \text{otherwise,} \end{cases}$$

and we let our set of candidate distributions be $\{\mathcal{D}_f \mid f \in \{0,1\}^C\}$. Now notice that the learner is in a situation analogous to what we’ve discussed earlier: it can learn about at most half of the points in the environment (because $|S| = m$ and $|C| = 2m$), and since every in-principle-possible labeling function is equally viable, that means it doesn’t know anything about the labels it doesn’t see, and those are at least half. Thus, it will have a sizeable total error – in fact, at least $\frac{1}{4}$ in expectation, since $A$ has to blindly guess about half of all elements.

The proof now consists of formalizing the above argument. This gets more technical, but I’ll provide the details for the part that I think is at least an interesting technical problem. (You may still choose to skip them, though.) So, for each probability distribution $\mathcal{D}_f$, one can enumerate all possible training sequences that are obtained by querying $\mathcal{D}_f$ repeatedly, $m$ times in total. (We have $k = (2m)^m$ such sequences, but this is not important.) The expected loss for this probability distribution is now $\frac{1}{k}\sum_{j=1}^{k} L_{\mathcal{D}_f}\big(A(S_j^f)\big)$, where $S_j^f$ denotes the $j$-th instance sequence labeled by $f$. The problem here is that $A$ may just guess all unseen labels correctly – in fact, it may also ignore $S$ entirely and simply guess the labels of all elements correctly. Of course, such a learner would then perform exceptionally poorly on other probability distributions; the point is that it cannot do better than guess. Thus, averaging over all training sequences does not do the trick; we actually want to average over all probability distributions. To achieve this, recall that we need to show that there exists some probability distribution with high error, or in other words, that the maximum of the above average across all probability distributions is high. We can then lower-bound the maximum by replacing it with the average, hence summing over probability distributions after all. To be precise, let $T$ be the number of labeling functions (we have $T = 2^{2m}$, but this is not important) and write $\{0,1\}^C = \{f_1, \ldots, f_T\}$, then

$$\max_{i \in [T]} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_{f_i}}\big(A(S_j^{f_i})\big) \;\ge\; \frac{1}{T} \sum_{i=1}^{T} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_{f_i}}\big(A(S_j^{f_i})\big).$$

Then we can exchange the order of these two sums, so that the inner sum finally runs over all possible probability distributions. We can then lower-bound the outer average (across all training sequences) by the minimum across all training sequences, because in fact the training sequence does not matter: for each training sequence, and for each of the (at least $m$) points outside the sequence, exactly half of all probability distributions will label that point 0 and the other half 1. The learner cannot differentiate between them, no matter how it is defined, because they have led to the identical training sequence, and thus it gets the point right in only half of all cases. Alas, in expectation it only gets half of all elements outside the training sequence right, and thus at most three quarters of all elements in total. (For the remaining details, see pages 62–63.)
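
Here is a brute-force sanity check of the whole construction (my own toy version with $m = 2$, not the book’s proof); the particular learner and all names are made up for the demo.

```python
# Domain C of size 2m, all 2^(2m) labeling functions f, uniform marginal on C.
# The argument says the error of ANY learner, averaged over all f and all
# training instance-sequences, is at least 1/4. We check this here for one
# concrete learner: memorize the sample, guess 0 everywhere else.
from itertools import product
from fractions import Fraction

m = 2
C = range(2 * m)

def learner(training):
    seen = dict(training)                  # memorize the labeled sample
    return lambda x: seen.get(x, 0)        # guess 0 on unseen points

total = Fraction(0)
count = 0
for f_bits in product([0, 1], repeat=len(C)):         # every labeling function f
    f = dict(zip(C, f_bits))
    for xs in product(C, repeat=m):                    # every instance sequence of length m
        h = learner([(x, f[x]) for x in xs])           # train on the f-labeled sequence
        error = Fraction(sum(h(x) != f[x] for x in C), len(C))   # L_{D_f}(h)
        total += error
        count += 1

print(total / count)   # 9/32 for this learner, which is indeed >= 1/4
```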

Now, this leads to an expected error of at least $\frac{1}{4}$. However, instead of choosing a set of size $2m$, we could more generally choose a set of size $k \cdot m$. Then, with an analogous construction, we show that the learner might still get the $m$ elements in the training sequence right, but will merely guess on the other $(k-1) \cdot m$, and its expected error ends up as $\frac{k-1}{2k}$. Thus, by increasing $k$ we can push the error arbitrarily close toward $\frac{1}{2}$. In the setting of binary classification, this is the error achieved by blindly guessing, and so it is fair to say that no meaningful guarantee of quality can be achieved without making assumptions about the nature of the learning task. In particular, such assumptions are the basis of generalization: the idea that a papaya which is surrounded by similar papayas, all of which are tasty, is probably tasty itself – that idea is an assumption about how we can generalize; it would not be true if we took all labeling functions as equally likely candidates for the true function. In fact, having such a dense hypothesis class makes generalization impossible, as we’ve just demonstrated. This fact is also key for the upcoming chapter.
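
Spelled out, this is just the arithmetic behind the claim: the learner sees at most $m$ of the $km$ points and is right on each unseen point only with probability $\frac{1}{2}$, so

$$\mathbb{E}[\text{error}] \;\ge\; \frac{\frac{1}{2}\,(k-1)\,m}{k\,m} \;=\; \frac{k-1}{2k} \;=\; \frac{1}{2}\Big(1 - \frac{1}{k}\Big) \;\xrightarrow{\;k \to \infty\;}\; \frac{1}{2}.$$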

The VC-Dimension

The VC-dimension is a concept that falls right out of the above line of thinking. You might recall that every finite hypothesis class is learnable, but it is in fact the case that many infinite hypothesis classes are also learnable – for example, the class of all intervals (or more precisely, the set of all predictors $h_{a,b} : \mathbb{R} \to \{0,1\}$ such that there exists an interval $[a,b]$ with the property that $h_{a,b}(x) = 1 \iff x \in [a,b]$) over the domain $\mathbb{R}$ is learnable. (I won’t give a proof because we’re about to get a much more general result, but it’s probably not surprising – if you want, stop reading and think about it.) This suggests that the decisive property for learnability is not the size of the hypothesis class, but rather some kind of measure for how easy it would be to construct a no-free-lunch style example.

The key piece in the preceding proof is the same idea I’ve emphasized before: if all possible functions on a finite set are in the hypothesis class, then learning the value of the true function on some points gives zero information about its value on the other points. So how does that work in the case of intervals? Can we construct a set of points such that all labeling functions are possible? Note that the important question is not whether the environment distribution can label them in every possible combination – the answer to this is likely affirmative, since $\mathcal{D}$ need not be deterministic, i.e. even for the same point, any label might be possible. Rather, the question is whether there are predictors in our class – i.e. interval predictors – which allow every combination of labels on any particular set. This is so because PAC learnability only implies that we can get close to the best classifier in our hypothesis class, so in order for a no-free-lunch type argument to go through, it has to be impossible to obtain information about this predictor (which is in our class).

So is this possible with intervals? Let’s start with a 1-point set:

Success! We have found interval predictors that give every possible labeling combination (i.e. $(1)$ and $(0)$) on our one domain point $x$. (To be clear, the black and the red x in the image are meant to denote the same element; we drew two possible scenarios of where the interval might be with regard to that element.) We thus say that our one-point set is shattered by the class of interval predictors – which is actually a bad thing, because it means the class is harder to learn.

What about a two-point set?

Indeed, we see that our hypothesis class is dense enough to shatter 2-point sets, since all four sequences of labels (namely (11), (10), (01), (00)) are possible. Now, the training sequence is of course a sequence and could therefore contain the same point twice. In that case, this two-point sequence could not be shattered. However, we actually care about the existential quantifier rather than the universal quantifier here. The existence of a single sequence of size two that is shattered will be important for our upcoming definition. (Also, the cases where the same point is included more than once are uninteresting anyway, because those sequences can trivially never be shattered.) This is why we go about defining this new concept in terms of sets rather than sequences.

What about three-point sets? Take a moment to think about it: there are eight possible labeling combinations. Are there eight different interval classifiers that will produce all eight different labelings?

The answer is no – we might name the three points $x_1$, $x_2$, and $x_3$ such that $x_1 < x_2 < x_3$, and then any interval containing both $x_1$ and $x_3$ must also contain $x_2$. Therefore, the labeling $(1, 0, 1)$ is not possible (although all 7 other combinations are). No set of size three can be shattered by our class.

This proves that the VC-dimension of our class is 2, because the VC-dimension is defined simply as the size of the largest set that can be shattered, i.e. $\mathrm{VCdim}(\mathcal{H}) = \max\{\,|C| : C \subseteq \mathcal{X} \text{ is shattered by } \mathcal{H}\,\}$. Since we found some set of size 2 that is shattered and proved that no set of size 3 is shattered, the largest integer such that a set of that size can be shattered is 2.
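
If you want to verify such shattering claims mechanically, here is a small brute-force checker (my own sketch, not from the book). Since only the pattern of which points land inside the interval matters, it suffices to try endpoints taken from the points themselves, their midpoints, and two outer values.

```python
# Brute-force shattering check for interval predictors x -> 1[a <= x <= b].
from itertools import product

def achievable_labelings(points):
    """All labelings of `points` produced by some interval predictor."""
    lo, hi = min(points) - 1, max(points) + 1
    sorted_pts = sorted(points)
    candidates = [lo, hi] + sorted_pts + [
        (p + q) / 2 for p, q in zip(sorted_pts, sorted_pts[1:])
    ]
    labelings = {tuple(0 for _ in points)}       # an interval containing no point at all
    for a, b in product(candidates, repeat=2):
        if a <= b:
            labelings.add(tuple(1 if a <= x <= b else 0 for x in points))
    return labelings

def is_shattered(points):
    return len(achievable_labelings(points)) == 2 ** len(points)

print(is_shattered([3.0]))            # True
print(is_shattered([1.0, 2.0]))       # True  -> VC-dimension is at least 2
print(is_shattered([1.0, 2.0, 3.0]))  # False -> the labeling (1, 0, 1) cannot be produced
```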

To be more precise with the formal definition above, we also need to clarify what happens if no maximum exists – i.e. if for any $n \in \mathbb{N}$, there exists a set of size $n$ that is shattered. In this case, we shall simply set the VC-dimension to $\infty$.

How useful is this concept of VC-dimension? For one, any class with VC-dimension $\infty$ is not PAC learnable. Given such a hypothesis class $\mathcal{H}$ and any learner $A$, for each training-sequence length $m$, one can find a set of size $2m$ and a probability distribution such that the argument used in the No-Free-Lunch Theorem goes through, thus establishing a minimum gap between the performance of the optimal classifier and the classifier output by $A$, and thus contradicting the definition of PAC learning. (To be more precise, we first prove that, if the expected error is at least $\frac{1}{4}$, then there must be at least a $\frac{1}{7}$ chance of having an error of at least $\frac{1}{8}$, which then gives us a direct contradiction with the definition of PAC learning. The proof of this is easy – prove the contrapositive statement.)
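
As an aside, the step from an expected error of $\frac{1}{4}$ to a probability bound is a reverse Markov inequality; here is a quick derivation (mine, not necessarily the book’s contrapositive route), with $Z = L_{\mathcal{D}}(A(S)) \in [0,1]$ and $a = \frac{1}{8}$:

$$\mathbb{E}[Z] \;\le\; a\,\Pr[Z < a] + 1 \cdot \Pr[Z \ge a] \;=\; a + (1-a)\Pr[Z \ge a] \quad\Longrightarrow\quad \Pr[Z \ge a] \;\ge\; \frac{\mathbb{E}[Z] - a}{1-a} \;\ge\; \frac{1/4 - 1/8}{1 - 1/8} \;=\; \frac{1}{7}.$$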

The fact that this contradicts PAC learnability depends on the precise definition of PAC learnability. In particular, if the sample complexity were allowed to depend on the probability distribution, then the No-Free-Lunch argument would no longer work. The third post will discuss two alternative notions of learnability, where the sample complexity is allowed to depend on further elements.

What about classes with finite VC-dimension? Here, the definition of VC-dimension may seem a bit silly – sure, maybe only sets of size five are shattered, but perhaps for any $n \in \mathbb{N}$ there exist sets of size $n$ that are almost shattered. In this case, the VC-dimension would appear to be highly misleading, and certainly the class would not be PAC learnable. However, it turns out this is not possible. One can prove that, for any $\mathcal{H}$ and any finite subset $C \subseteq \mathcal{X}$ with $|C| = m$, the number of functions in $\mathcal{H}_C$ (where $\mathcal{H}_C = \{h|_C : h \in \mathcal{H}\}$) is at most the number of subsets of $C$ that are shattered by $\mathcal{H}$. Since, if the VC-dimension of $\mathcal{H}$ is $d$ for some $d \in \mathbb{N}$, only subsets of size at most $d$ can be shattered, and since there are only polynomially many (in $m$) subsets of size at most $d$, this implies that the size of $\mathcal{H}_C$ can be bounded by a function polynomial in $m$. This is a pretty strong result, given that the number of possible functions on $C$ equals $2^m$.
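
To get a feeling for how strong this is, here is a tiny numerical comparison (my own illustration) of the resulting bound $\sum_{i=0}^{d} \binom{m}{i}$ on $|\mathcal{H}_C|$ against the $2^m$ functions that exist on a set of size $m$:

```python
# Compare the Sauer-type bound sum_{i<=d} C(m, i) with 2^m.
from math import comb

def sauer_bound(m, d):
    return sum(comb(m, i) for i in range(d + 1))

d = 2                                   # e.g. the class of intervals has VC-dimension 2
for m in [5, 10, 20, 50]:
    print(m, sauer_bound(m, d), 2 ** m)
# for m = 50: at most 1276 achievable restrictions, versus 2^50 ~ 1.1e15 functions on C
```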

No proof for this result, as I didn’t find reading it very instructive. Unfortunately, the same will be true for the upcoming result – its proof is largely technical, hard to remember, and not very useful for building intuition. The upside is that it now fits into a single section.

The Fundamental Theorem of Statistical Learning

The fundamental theorem of statistical learning (in slightly simplified form) states that, if $\mathcal{H}$ is some hypothesis class for any domain set $\mathcal{X}$ and the target set $\{0,1\}$, the following statements are equivalent:

1. $\mathcal{H}$ has the uniform convergence property
2. $\mathrm{ERM}_{\mathcal{H}}$ is a successful PAC learner for $\mathcal{H}$
3. $\mathcal{H}$ is PAC learnable
4. $\mathcal{H}$ has finite VC-dimension

We’ve proved $1 \Rightarrow 2$ in the first section of this post. $2 \Rightarrow 3$ is trivial. We’ve sketched the proof for $3 \Rightarrow 4$ in the previous section, by arguing that a class with infinite VC-dimension is not PAC learnable (i.e. $\neg 4 \Rightarrow \neg 3$). The aforementioned difficult-and-very-technical-proof-I-will-not-include is the one for $4 \Rightarrow 1$. Those four implications taken together form a circle, thus establishing equivalence.

There is also a quantitative version which upper- and lower-bounds the sample complexity for PAC learnability and uniform convergence.
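
For reference, the bounds take roughly the following shape, with $d = \mathrm{VCdim}(\mathcal{H})$ and absolute constants $C_1, C_2$ (this is my paraphrase from memory of the book’s quantitative statement, so treat the exact logarithmic factors with care):

$$C_1\,\frac{d + \log(1/\delta)}{\epsilon} \;\le\; m_{\mathcal{H}}(\epsilon, \delta) \;\le\; C_2\,\frac{d\,\log(1/\epsilon) + \log(1/\delta)}{\epsilon}, \qquad C_1\,\frac{d + \log(1/\delta)}{\epsilon^2} \;\le\; m^{UC}_{\mathcal{H}}(\epsilon, \delta) \;\le\; C_2\,\frac{d + \log(1/\delta)}{\epsilon^2}.$$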

As nice as this result is, it is also important to understand that it says nothing whatsoever about runtime. As we’ve seen, the class of all predictors such that there exists a C program with at most 10000 symbols implementing the predictor is PAC learnable, but actually implementing $\mathrm{ERM}$ for this class is completely infeasible.