Understanding Machine Learning (I)

Un­der­stand­ing Ma­chine Learn­ing is an­other book on Miri’s re­search guide. I’ve pre­vi­ously writ­ten about Topol­ogy and Lin­ear Alge­bra Done Right. With this one, my write-up ended up look­ing a lot more like a self-con­tained in­tro­duc­tion into the field and a lot less like a re­view. I’ve de­cided to go with that, and make this a sort of short­ened ver­sion of the book, or at least the first part of the book, which cov­ers the “fun­da­men­tals” of Ma­chine Learn­ing. This will take three posts of roughly similar size. My main rea­son for do­ing this is that it feels quite use­ful for my­self, and in fact more use­ful than do­ing ex­er­cises.

One pat­tern I no­ticed with this book (in con­trast to the pure math ones) is that both defi­ni­tions and the­o­rem state­ments tend to be longer. This may be be­cause the re­search is more di­rected – it is ac­tu­ally try­ing to solve a par­tic­u­lar prob­lem, rather than always pur­su­ing the most el­e­gant look­ing di­rec­tion, and it is try­ing to do so with the same level of rigor found in non-ap­plied math. That is a difficult task. If you are not already fa­mil­iar with Ma­chine Learn­ing, con­sider the prob­lem of for­mal­iz­ing a the­ory around the idea of learn­ing from data. This seems highly non-triv­ial – how would you even start? Con­se­quently, the pre­cise state­ments and defi­ni­tions we can make end up in­her­ently com­plex, and they make use of sev­eral strands of math­e­mat­ics. Among other fields, this book con­tains prob­a­bil­ity the­ory and calcu­lus, and a lot of lin­ear alge­bra in later chap­ters. It seems to be aimed at grad­u­ate stu­dents.

Through­out this se­ries, I will be us­ing bold and ital­ics to mean “this is the first offi­cial men­tion of a new Ma­chine Learn­ing term”, whereas nor­mal ital­ics means reg­u­lar em­pha­sis.

The Learn­ing Model

The book be­gins with a sort of mis­sion state­ment. Given a do­main set and a tar­get set , we wish to find a func­tion , which we in­ter­change­ably call a hy­poth­e­sis or a pre­dic­tor or a clas­sifier. In the ex­am­ple used in the first chap­ter, rep­re­sents the set of all pa­payas, and rep­re­sents a two-el­e­ment set where one el­e­ment is the la­bel “tasty” and the other the la­bel “not-tasty”. To com­press this into some­thing man­age­able, we rep­re­sent pa­payas by a set of fea­tures, in this case color and soft­ness, which we both map onto the in­ter­val , and we rep­re­sent the two la­bels by 1 and 0. I.e. and . Now the idea is that, if we learn such a pre­dic­tor, we can ap­ply it to any new pa­paya we en­counter, say on the mar­ket, and it will tell us whether or not it is tasty be­fore we’ve spent any money on it. Of course, this will only be use­ful if the pre­dic­tor we learn is ac­tu­ally ac­cu­rate.

This model seems quite gen­eral. It clearly cov­ers the case of chess: one could let be the set of all pos­si­ble states of the chess­board and be the set of all moves; then we want to map each onto the best pos­si­ble move. In fact, it is hard to see which solv­able prob­lem could not be mod­eled this way. Any time we think we can in prin­ci­ple solve some prob­lem, it would con­sist of us look­ing at some stuff and then pre­sent­ing the solu­tion in some way. Now we just let the set of all pos­si­ble in­stances of such stuff be and the set of all pos­si­ble solu­tions be , and we have an in­stance of this model.

One would clearly have to put in a lot more work to fully for­mal­ize this prob­lem – and this is ex­actly what the book does. Re­call that is called the do­main set. If is finite as it is for the pa­payas ex­am­ple, we also call it the la­bel set.

We model the en­vi­ron­ment, which cre­ates our la­beled do­main points by a prob­a­bil­ity dis­tri­bu­tion over as well as the i.i.d. as­sump­tion (not bolded since it’s not ML-spe­cific), which states that all el­e­ments are in­de­pen­dently and iden­ti­cally dis­tributed ac­cord­ing to . In other words, when­ever a pa­paya is plucked in the real world, the en­vi­ron­ment throws some coins and chooses its color and soft­ness, as well as tasti­ness, ac­cord­ing to . Of course, we gen­er­ally as­sume that the color and soft­ness have a ma­jor im­pact on the tasti­ness of the pa­paya, but it is not rea­son­able to as­sume that they fully de­ter­mine its tasti­ness – surely it is pos­si­ble that two pa­payas that are equally col­ored and equally soft taste differ­ently. Thus, given some par­tic­u­lar , we might rea­son­ably hope that ei­ther or is go­ing to be a lot more prob­a­ble than the other, but both are still go­ing to be pos­si­ble. On the other hand, in the chess ex­am­ple, one could ar­gue that the do­main set fully de­ter­mines the value of the tar­get set, i.e. the best pos­si­ble move (al­though then the idea of hav­ing a prob­a­bil­ity dis­tri­bu­tion over be­comes prob­le­matic). In such a case, one can al­ter­na­tively model the en­vi­ron­ment as hav­ing a “true” func­tion . Clearly, this ap­proach is strictly less gen­eral than the former (it cor­re­sponds to a prob­a­bil­ity dis­tri­bu­tion where for ev­ery , ei­ther or has prob­a­bil­ity 0), but in both cases, the joint dis­tri­bu­tion /​ true func­tion is not known, and it is ex­actly what we are try­ing to figure out.

We say that the al­gorithm which out­puts a pre­dic­tor is the learner. As its in­put (as­sum­ing we are do­ing what is called su­per­vised learn­ing here), the learner has ac­cess to a train­ing se­quence . This is a se­quence of la­beled do­main points, i.e. which have been gen­er­ated by the joint dis­tri­bu­tion . For this rea­son, one can con­sider the train­ing se­quence to be a glimpse into how the en­vi­ron­ment works, and the learner’s job to be to gen­er­al­ize from that se­quence to all of , i.e. to out­put a pre­dic­tor which does a good job on not just the train­ing data but also on new el­e­ments that are gen­er­ated by .

We also define a loss func­tion to for­mal­ize what we mean by “good job”. The loss func­tion takes a hy­poth­e­sis and out­puts a num­ber in that is meant to be a rat­ing of how good the hy­poth­e­sis is, where small val­ues are good and large val­ues are bad. For the case of bi­nary clas­sifi­ca­tion (where ), we will set , i.e. the prob­a­bil­ity that our pre­dic­tor gets the la­bel on a freshly sam­pled point wrong (al­though in prac­tice, one might also be in­ter­ested in loss func­tions that pe­nal­ize false pos­i­tives more strongly than false nega­tives, or vice versa). A perfect pre­dic­tor will have loss 0, but in the set­ting of a joint dis­tri­bu­tion , this will gen­er­ally be im­pos­si­ble. How good the best pos­si­ble pre­dic­tor will perform de­pends on how well the fea­ture vec­tor de­scribes the do­main points. With a poor choice of fea­tures, the learner is doomed to fail from the be­gin­ning (like if you want it to rec­og­nize spam emails, but you only show it as­pects of the emails such that there is at best a statis­ti­cally weak cor­re­la­tion be­tween them and whether the email is spam or not-spam). Of course, we can­not ac­tu­ally eval­u­ate our loss func­tion since we don’t know , but it is nonethe­less pos­si­ble to make state­ments about the ex­pected loss of the pre­dic­tor which a learner out­puts.

Fi­nally, we define an em­piri­cal loss func­tion which de­scribes the er­ror of a pre­dic­tor on the train­ing se­quence, i.e. . Un­like the reg­u­lar loss func­tion, the em­piri­cal loss func­tion is some­thing we can eval­u­ate. We use the term true er­ror to re­fer to the out­put of , and em­piri­cal er­ror to re­fer to the out­put of or .

This de­scribes (al­most) all ob­jects of the learn­ing model. Now the ques­tion is how to im­ple­ment the learner . For­mally, is de­scribed as sim­ply an al­gorithm, so we have com­plete free­dom to do what­ever we want.

Gen­er­al­iza­tion /​ Prior Knowl­edge /​ Overfitting

The learner takes a train­ing se­quence, does what­ever it wants with it, and out­puts a pre­dic­tor. How would we go about defin­ing ? Clearly, this is a difficult ques­tion, but a rea­son­able-sound­ing start­ing point is to choose a pre­dic­tor which performs well on the train­ing se­quence . Given that is rep­re­sen­ta­tive of the en­vi­ron­ment (re­call that it was gen­er­ated by ), the hope is that such a pre­dic­tor would also perform well on the en­vi­ron­ment.

This alone does not an­swer the ques­tion, though. Sup­pose we are given the fol­low­ing train­ing se­quence (or -set as we will ig­nore or­der for now):

(Black points tasty in­stances, red points not-tasty in­stances.) There are many ways to con­struct a pre­dic­tor with zero er­ror on this train­ing set. Here is one (namely the pre­dic­tor that la­bels points as tasty iff they are in the blue area):

This does not seem like a good pre­dic­tor. For ex­am­ple, the green point I added (which is meant to be a new in­stance not part of the train­ing se­quence) would al­most cer­tainly cor­re­spond to a tasty pa­paya, yet our hy­poth­e­sis pre­dicts that it is not tasty. Nonethe­less, this pre­dic­tor has zero er­ror on the train­ing se­quence (by the defi­ni­tion of ). An even more ex­treme ex­am­ple of this is , which sim­ply re­mem­bers all do­main points and out­puts a hy­poth­e­sis that only la­bels those in­stances as pos­i­tive that have oc­curred in the train­ing se­quence with a pos­i­tive la­bel. The pro­cess of out­putting such a hy­poth­e­sis is of­ten called overfit­ting. It’s said that the hy­poth­e­sis fits the data “too well,as the book puts it. But I think this is not quite the right way to look at it. Con­sider this al­ter­na­tive hy­poth­e­sis:

This hy­poth­e­sis seems bet­ter, but not be­cause it is a worse fit for the data – both hy­pothe­ses have zero em­piri­cal er­ror. There is noth­ing wrong with fit­ting the data, there is only some­thing wrong with fit­ting the data at the cost of out­putting an “un­rea­son­able look­ing” pre­dic­tor. I think the way to ar­rive at a more ac­cu­rate de­scrip­tion is by look­ing at it from a Bayesian per­spec­tive. We have a prior on how a proper pre­dic­tor will look like, and this prior says that the first pre­dic­tor is worse (as illus­trated by how it clas­sifies the green point). How­ever, sup­pose we had a new ex­am­ple right next to the green point, and it was not-tasty. We would prob­a­bly still dis­like that pre­dic­tor – the one not-tasty pa­paya could be an out­lier. On the other hand, if we had a mil­lion sam­ple points right next to the green one, all of which not-tasty, we would even­tu­ally ad­mit that some­thing su­per weird is re­ally go­ing on here and a very par­tic­u­lar soft­ness-color com­bi­na­tion ac­tu­ally leads to pa­payas be­ing not-tasty.

I think the cor­rect fram­ing of the overfit/​un­der­fit dilemma is that it is about how far to up­date away from pri­ors based on ev­i­dence, or al­ter­na­tively, how to weigh pri­ors vs ev­i­dence. It’s also worth not­ing that if we had a train­ing se­quence that in­cluded all pos­si­ble in­stances (as­sum­ing there is a true la­bel­ing func­tion for the sake of this ex­am­ple), we could dis­re­gard pri­ors al­to­gether – in that case, would work work perfectly. Analo­gously, there is always some amount of ev­i­dence that should over­turn any prior. Of course, in our for­mal model, there may be in­finitely many pos­si­ble fea­ture vec­tors, but in the real world, ev­ery in­stance de­scrip­tion is bounded so the space of all in­stances must be finite.

One crude way of de­cid­ing how much to weigh pri­ors vs ev­i­dence is by com­mit­ting to a hy­poth­e­sis class ahead of time – i.e. to a sub­set – and then find­ing a pre­dic­tor that performs max­i­mally well on the train­ing se­quence, only tak­ing into con­sid­er­a­tions pre­dic­tors in the class we com­mit­ted to (i.e., a pre­dic­tor in the set ). It is crude be­cause it only al­lows the prior to give a bi­nary “yes” or “no” to a hy­poth­e­sis, rather than a “prob­a­bly not, but maybe with enough ev­i­dence”. Sup­pose (for the pa­paya ex­am­ple), we pre-com­mit to the class of axis-al­igned rec­t­an­gles, as the book sug­gests (i.e. rec­t­an­gles with two hori­zon­tal and two ver­ti­cal sides). Con­sider the fol­low­ing train­ing se­quence:

This does not seem like a su­per un­rea­son­able glimpse into re­al­ity – per­haps it is re­ally the case that if green pa­payas are too soft, they be­come gross, but if they’re more or­ange or what­ever, then it’s fine if they’re soft. You might not ex­pect that to be the case, but you’d prob­a­bly as­sign it some rea­son­able prob­a­bil­ity. This is not al­lowed by the class of axis-al­igned rec­t­an­gles: cord­ing to this class, the only pos­si­ble pre­dic­tor is one where both a cer­tain soft­ness area and a cer­tain color area have to be hit, and one fac­tor does not in­fluence the other. There­fore, if we ap­plied a learn­ing al­gorithm to this train­ing se­quence and re­stricted it to that class, it would have to fit an axis-al­igned rec­t­an­gle in there as best as pos­si­ble – which would perform much bet­ter than chance, but still seems like a sub­op­ti­mal choice.

A more so­phis­ti­cated way of an­swer­ing the “how strongly to weigh ev­i­dence vs pri­ors” ques­tion is to choose a fam­ily of sev­eral hy­poth­e­sis classes, as well as a weight­ing func­tion that priv­ileges how much we like each class. Then, per­haps the class of axis-al­igned rec­t­an­gles would have the high­est ini­tial weight­ing, but the class of all rec­t­an­gles would still be rated de­cently, so that if it performs sub­stan­tially bet­ter – as is the case here (al­though it would still have nonzero em­piri­cal er­ror) – the learn­ing al­gorithm would end up prefer­ring a non-axis-al­igned rec­t­an­gle. We can also look at this as hav­ing un­cer­tainty about the right model in ad­di­tion to un­cer­tainty about which pre­dic­tor within the model to choose. More about this later.

PAC learning

Even though I’ve crit­i­cized the pre-com­mit­ment ap­proach, it is a good place to be­gin a more for­mal anal­y­sis. In par­tic­u­lar, we can now talk about a hy­poth­e­sis class be­ing “learn­able”. Here, be­ing learn­able means that, with suffi­ciently many train­ing in­stances, one can prove that, a pre­dic­tor that performs op­ti­mally on the train­ing se­quence will prob­a­bly perform al­most as well on the real world as the hy­poth­e­sis which has the low­est real er­ror in the class. Cru­cially, this holds for ev­ery joint prob­a­bil­ity dis­tri­bu­tion .

So this defi­ni­tion is not about how well the hy­poth­e­sis we end up with performs on the real world in ab­solute terms – that’s not some­thing which is pos­si­ble to bound, be­cause the hy­poth­e­sis class may just not con­tain any good pre­dic­tor at all (as is the case in the ex­am­ple above if we use the class of axis-al­igned rec­t­an­gles, and of course other classes could be much worse). Rather, it is about how good of a choice the low­est-em­piri­cal-er­ror hy­poth­e­sis is given that we are stuck with this class – i.e. how large is the differ­ence be­tween the real er­ror of [the hy­poth­e­sis with min­i­mal em­piri­cal er­ror] and the real er­ror of the best hy­poth­e­sis, i.e. [the hy­poth­e­sis with min­i­mal real er­ror]. Such a bound is im­por­tant be­cause, since we can’t mea­sure the real er­ror, we can­not out­put the best hy­poth­e­sis di­rectly.

There are also two caveats built into the defi­ni­tion. One is the “al­most” part; one can never guaran­tee that the hy­poth­e­sis with min­i­mal em­piri­cal er­ror is ex­actly as good as the hy­poth­e­sis with min­i­mal real er­ror be­cause there is ran­dom­ness in the cre­ation of the train­ing se­quence. The sec­ond is the “prob­a­bly” part, one can never up­per bound the differ­ence with full cer­tainty be­cause there is always some chance that the train­ing se­quence just hap­pens to be un­rep­re­sen­ta­tive. For ex­am­ple, there is always some chance that the train­ing se­quence doesn’t con­tain any pos­i­tively la­beled points, in which case has no way of re­li­ably out­putting any­thing rea­son­able. How­ever, as the train­ing se­quence gets larger, this be­comes in­creas­ingly un­likely.

The com­plete for­mal defi­ni­tion can be stated thus (where we write to de­note the pre­dic­tor which out­puts upon re­ceiv­ing the train­ing se­quence ):

A hy­poth­e­sis class is called PAC learn­able iff there ex­ists a learner and a func­tion such that, for any prob­a­bil­ity dis­tri­bu­tion over , any “gap” , and any failure prob­a­bil­ity , if the train­ing se­quence con­sists of at least i.i.d. ex­am­ples sam­pled via , then with prob­a­bil­ity at least over the choice of , it holds that for all .

While this defin­tion is com­pli­cated, it is not un­nec­es­sar­ily com­pli­cated – the com­plex­ity is just an in­her­ent fea­ture of the thing we are try­ing to define. This kind of learn­abil­ity is called PAC learn­abil­ity be­cause the defi­ni­tion states that is prob­a­bly (hence the ) ap­prox­i­mately (hence the cor­rect (hence the “”).

One can prove that, when­ever a class is PAC-learn­able at all, then the al­gorithm which chooses its pre­dic­tor such that is a suc­cess­ful learner. In other words, we can always just choose the pre­dic­tor that performs best on the train­ing se­quence (or one of them if sev­eral are tied). The ab­bre­vi­a­tion ERM stands for em­piri­cal risk min­i­miza­tion.

One can also prove that ev­ery finite hy­poth­e­sis class is PAC learn­able. In par­tic­u­lar, the sam­ple com­plex­ity (this is a name for the out­put of the func­tion which up­per-bounds the num­ber of train­ing points needed to be prob­a­bil­ity ap­prox­i­mately cor­rect) is up­per bounded by the term .

Let’s re­flect on this re­sult a bit. If we use as our learner, then learn­abil­ity im­plies that the differ­ence be­tween the true er­ror of and the true er­ror of the best hy­poth­e­sis in is prob­a­bly very small. In par­tic­u­lar, if con­sists of hy­pothe­ses, and we want a as­surance that the differ­ence be­tween and the best pre­dic­tor in is at most 0.1, then so we if we have a train­ing se­quence with 139 la­beled points, we’re good. The only re­main­ing er­ror is then the er­ror of the best hy­poth­e­sis in . This is also called the ap­prox­i­ma­tion er­ror, while the er­ror we just bounded is called the es­ti­ma­tion er­ror.

Now con­sider declar­ing the class to be the set of all pre­dic­tors such that there ex­ists a C pro­gram, whose com­piled code is at most 10000 bits long, that im­ple­ments the pre­dic­tor. It’s a safe bet that this class has a very low ap­prox­i­ma­tion er­ror for al­most all prac­ti­cally rele­vant cases (or if not then we can al­low longer pro­grams). Given that the de­pen­dence on the size of – in this case, – is only log­a­r­ith­mic, it in­creases the sam­ple com­plex­ity by a fac­tor smaller than . If we then de­mand an es­ti­ma­tion er­ror of at most 0.01 and some very high con­fi­dence (note that the de­pen­dence on the con­fi­dence is also log­a­r­ith­mic, so even de­mand­ing a 0.9999 cer­tainty doesn’t change much), we would still end up with a sam­ple com­plex­ity of only around a mil­lion. Gather­ing a mil­lion ex­am­ples is prob­a­bly doable in many cases (al­though as Ge­orge points out, as­sum­ing that the i.i.d. as­sump­tion holds may not be re­al­is­tic in prac­tice). Thus, we have just de­rived a gen­eral al­gorithm which, pro­vided there is a way to ob­tain suffi­ciently many i.i.d. data points, solves ev­ery bi­nary clas­sifi­ca­tion task, namely

1. Perform an ex­haus­tive search over all C pro­grams of com­piled length in bits at most 10000
2. Throw out all pro­grams that don’t im­ple­ment a predictor
3. Com­pute the em­piri­cal risk for each one that does
4. Out­put the one with the low­est em­pir­cal risk.

For the out­put of this al­gorithm, both the ap­prox­i­ma­tion er­ror (i.e. the er­ror of the best C pro­gram) and the es­ti­ma­tion er­ror (i.e. the differ­ence be­tween the er­ror of the hy­poth­e­sis out­puts and that of the best C pro­gram) are go­ing to be very low!

Back in the real world, step 1 does of course mean that this al­gorithm is ut­terly use­less in prac­tice, but only due to its run­time. With un­bounded com­put­ing power, this ap­proach would work won­der­fully. In prin­ci­ple, ev­ery­thing (well, at least ev­ery bi­nary clas­sifi­ca­tion task) is learn­able from a rel­a­tively small set of ex­am­ples, cer­tainly fewer ex­am­ples than what is of­ten available in prac­tice. It doesn’t seem com­pletely un­rea­son­able to con­sider this some (small) amount of ev­i­dence on the im­por­tance of tal­ent vs data, and per­haps as de­cent ev­i­dence on the in­fer­ence power of a re­ally smart AI?

The next post will dis­cuss some the­o­ret­i­cal limits of learn­ing and prove the No-Free-Lunch the­o­rem.