# UML final

(This is the four­teenth and fi­nal post in a se­quence on Ma­chine Learn­ing based on this book. Click here for part I.)

The first part of this post will be de­voted to mul­ti­class pre­dic­tion, which is the last topic I’ll cover in the usual style. (It’s on the tech­ni­cal side; re­quires fa­mil­iar­ity with a bunch of con­cepts cov­ered in ear­lier posts.) Then fol­lows a less de­tailed treat­ment of the re­main­ing ma­te­rial and, fi­nally, some mus­ings on the text­book.

# Mul­ti­class Prediction

For this chap­ter, we’re back in the realm of su­per­vised learn­ing.

Pre­vi­ously in this se­quence, we’ve looked at bi­nary clas­sifi­ca­tion (where ) and re­gres­sion (where ). In mul­ti­class pre­dic­tion or mul­ti­class cat­e­go­riza­tion, we have but . For sim­plic­ity, we take the next sim­plest case, i.e., . (Go­ing up from there is straight-for­ward.) More speci­fi­cally, con­sider the prob­lem of rec­og­niz­ing dogs and cats in pic­tures, which we can model by set­ting . For the pur­poses of this chap­ter, we ig­nore the difficult ques­tion of how to rep­re­sent the in­stance space .

We take the fol­low­ing ap­proach:

• train a func­tion to com­pute a score for each la­bel that in­di­cates how well the in­stance fits that label

• clas­sify each new do­main point as the best-fit­ting label

In our case, this means we are in­ter­ested in learn­ing a func­tion of the form . Then, if we have

this trans­lates into some­thing like ” kinda looks like a dog, but a lit­tle bit more like a cat, and it does seem to be some an­i­mal.” Each such func­tion de­ter­mines a pre­dic­tor via the rule .

We will re­strict our­selves to lin­ear pre­dic­tors. I.e., we as­sume that the fea­ture space is rep­re­sented as some­thing like and our pre­dic­tors are based on hy­per­planes.

With the set­ting es­tab­lished, the ques­tion is how to learn .

## Ap­proach 1: Re­duc­tion to Bi­nary Classification

Sup­pose (un­re­al­is­ti­cally) that and our train­ing data looks like this:

where

We could now train a sep­a­rate scor­ing func­tion for each la­bel. Start­ing with cats, we would like a func­tion that as­signs each point a score based on how much it looks like a cat. For , the la­bel cor­re­sponds to the la­bel 1, whereas and both cor­re­spond to the la­bel 0. In this set­ting, learn­ing cor­re­sponds to find­ing a pass­able hy­per­plane. (Ap­ply lin­ear pro­gram­ming for Soft Sup­port Vec­tor Machines.)

Note that, while hy­per­planes are used to clas­sify points, they do, in fact, pro­duce a nu­mer­i­cal score based on how far on the re­spec­tive side each point is.

Next, we train a func­tion. Per­haps its hy­per­plane looks like so:

And, fi­nally, a func­tion in a similar fash­ion. Then, we define

And like­wise with and . As be­fore, we let .

Note that a point might be clas­sified cor­rectly, even if it is on the wrong side of the re­spec­tive hy­per­plane. For ex­am­ple, con­sider this point:

thought this point was not a cat, but it thought it less in­tensely than thought that it was not a dog. Thus, de­pend­ing on ’s ver­dict, the that com­putes might still put out a cat la­bel. Of course, the re­verse is also true – a point might be clas­sified as a cat by but also as a dog by , and if the dog clas­sifi­ca­tion is more con­fi­dent, the will pick that as the la­bel.

The re­duc­tion ap­proach is also pos­si­ble with pre­dic­tors that merely put out a bi­nary yes/​no. How­ever, nu­mer­i­cal scores are prefer­able, pre­cisely be­cause of cases like the above. If we had used clas­sifiers only, then for and we could do no bet­ter than to guess.

While the re­duc­tion ap­proach can work, it won’t have the best perfor­mance in prac­tice, so we look for al­ter­na­tives.

## Ap­proach 2: Lin­ear Programming

Us­ing the fact that we are in the lin­ear set­ting, we can rephrase our prob­lem as fol­lows: let , then we wish to find three vec­tors such that, for all train­ing points , we have for all . In words: for each train­ing point, the in­ner product with the vec­tor mea­sur­ing the similar­ity to the cor­rect la­bel needs to be larger than the in­ner product with the other vec­tors (in this case, the 2 oth­ers). This can be phrased as a lin­ear pro­gram in much the same way as for Sup­port Vec­tor Machines (post VIII).

How­ever, it only works if such vec­tors ex­ist. In prac­tice, this is only go­ing to be the case if the di­men­sion is ex­tremely large.

## Ap­proach 3: Sur­ro­gate Loss and Stochas­tic Gra­di­ent Descent

A gen­eral trick I first men­tioned in post V is to ap­ply Stochas­tic Gra­di­ent Des­cent (post VI) with re­spect to a sur­ro­gate loss func­tion (i.e., a loss func­tion that is con­vex and up­per-bounds the first loss). In more de­tail (but still only sketched), this means we

• rep­re­sent the class of hy­pothe­ses as a fa­mil­iar set (easy for hy­per­planes)

• for each la­beled data point in our train­ing se­quence , con­struct the point-based loss func­tion

• con­struct con­vex point-based loss func­tions such that

• com­pute the gra­di­ents and up­date our hy­poth­e­sis step by step, i.e., set .

To do this, we have to change the fram­ing some­what. Rather than mea­sur­ing the loss of our func­tion (or of the pre­dic­tor ), we would like to an­a­lyze the loss of a vec­tor for some . How­ever, we do not want to change our high-level ap­proach: we still wish to eval­u­ate all pairs and and sep­a­rately.

This can be done by the fol­low­ing con­struc­tion. Let be the num­ber of co­or­di­nates that we use for each la­bel, i.e., . We can “ex­pand” this space to and con­sider it to hold three copies of each data point. For­mally, we define the func­tion by and and , where is the vec­tor con­sist­ing of many ze­ros. Then we can rep­re­sent all parts of our func­tion as a sin­gle vec­tor . If we take the in­ner product , it will only ap­ply the part of that deals with the dog la­bel since the other parts of are all 0s.

Thus, what was pre­vi­ously is now . The non­triv­ial in­for­ma­tion that was pre­vi­ously en­coded in is now en­coded in . Our pre­dic­tor (which is fully de­ter­mined by ) fol­lows the rule . Thus, we still as­sign to each point the la­bel for which it re­ceived the high­est score. Note that makes use of all co­or­di­nates; there’s no rep­e­ti­tion.

Now we need to define loss func­tions. Since our goal is to ap­ply Stochas­tic Gra­di­ent Des­cent, we fo­cus on point-based loss func­tions that mea­sure the perfor­mance of on the point only. (We write fat now since we speci­fied that it’s a vec­tor.) One pos­si­ble choice is the gen­er­al­ized 0-1 loss, that is 1 iff as­signs the wrong la­bel and 0 oth­er­wise. How­ever, this choice may not be op­ti­mal: if an image con­tains a cat, we prob­a­bly pre­fer the pre­dic­tor claiming it con­tains a dog to it claiming it doesn’t con­tain any an­i­mal. There­fore, we might wish to define a func­tion that mea­sures how differ­ent two la­bels are. Then, we set . The func­tion should by sym­met­ric and non-nega­tive, and it should be the case that . For our ex­am­ple, a rea­son­able choice could be

The loss func­tion as defined above is non-con­vex: if and out­put differ­ent la­bels on , they will have differ­ent losses, which means that some­where on the straight line be­tween and , the loss makes a sud­den jump (and sud­den jumps make a func­tion non-con­vex). Thus, our goal now is to con­struct a con­vex loss func­tion such that, for each in the train­ing se­quence, up­per-bounds .

To this end, con­sider the term

for any point . Re­call that the pre­dic­tor chooses its la­bel pre­cisely in such a way that the above in­ner product is max­i­mized. It fol­lows that the term above is non-nega­tive. (It’s 0 for ). There­fore, adding this term to our loss will up­per-bond the origi­nal loss. We obtain

This term pe­nal­izes for (1) the differ­ence be­tween the pre­dicted and the cor­rect la­bel; and (2) the differ­ence be­tween ’s score for the pre­dicted and the cor­rect la­bel. This would be a ques­tion­able im­prove­ment – we now pun­ish twice for es­sen­tially the same mis­take. But the term above is only an in­ter­me­di­ate step. Now, in­stead of us­ing in the term above, we can take the max­i­mum over all pos­si­ble la­bels ; clearly, this will not make the term any smaller. The re­sult­ing term will define our loss func­tion. I.e.:

Note that, un­like be­fore, the differ­ence be­tween the two in­ner prod­ucts may be nega­tive.

Let’s think about what these two equa­tions mean (refer­ring two the two cen­tered equa­tions, let’s call them E1 and E2). E1 (the former) pun­ishes in terms of the cor­rect la­bel and the la­bel which guessed: it looks both at how far its pre­dic­tion is off and how wrong the la­bel is. E2 does the same for , so it will always put out at least as high of an er­ror. How­ever, E2 also looks at all the other la­bels. To illus­trate the effect of this, sup­pose that the cor­rect la­bel is , but guessed . The com­par­i­son of the cor­rect la­bel to the guessed la­bel (cor­re­spond­ing to ) will yield an er­ror of 0.2 (for the differ­ence be­tween both la­bels) plus the differ­ence be­tween ’s score for both (which is in­evitably pos­i­tive since it guessed ). Let’s say and , then the differ­ence is an­other 0.3 and the to­tal er­ror comes out .

Now con­sider the term for . Sup­pose ; in that case, did think was a more likely la­bel than , but only by 0.2. There­fore, . This is a nega­tive term, which means it will be sub­tracted from the er­ror. How­ever, the loss func­tion “ex­pects” a differ­ence of 1, be­cause that’s the differ­ence be­tween the la­bels and . Thus, the full term for comes out at , which is higher than for .

(Mean­while, the triv­ial case of – i.e., com­par­ing the cor­rect la­bel to it­self – always yields an er­ror of 0.)

Alas, the loss func­tion ends up scold­ing the most not for the (slightly) wrong la­bel it ac­tu­ally put out, but for the (very) wrong la­bel it al­most put out. This might be rea­son­able be­hav­ior, al­though it cer­tainly re­quires that the nu­mer­i­cal scores are mean­ingful. In any case, note that the pri­mary point of for­mu­lat­ing the point-based loss func­tions was to have it be con­vex, so even if the re­sult­ing func­tion were less mean­ingful, this might be an ac­cept­able price to pay. (To see that it is, in fact, con­vex, sim­ply note that it is the max­i­mum of a set of lin­ear func­tions.)

With this loss func­tion con­structed, ap­ply­ing Stochas­tic Gra­di­ent Des­cent is easy. We start with some ran­dom weight vec­tor . At step , we take and the la­beled train­ing point . Con­sider the term

The gra­di­ent of the max­i­mum of a bunch of func­tions is the max­i­mum of their gra­di­ents. Thus, if is the la­bel for which the above is max­i­mal, then

There­fore, our up­date rule is .

# Re­main­ing Material

Now we get to the stuff that I read at least par­tially but de­cided to cover in less de­tail.

## Naive Bayes Classifier

Naive Bayes is a sim­ple type of clas­sifier that can be used for spe­cific cat­e­go­riza­tion tasks (su­per­vised learn­ing), both bi­nary clas­sifi­ca­tion and multi-clas­sifi­ca­tion but not re­gres­sion. The “naive” does not im­ply that Bayes is naive; in­stead, it is naive and uses Bayes’ rule.

We usu­ally frame our goal in terms of want­ing to learn a func­tion . How­ever, we can al­ter­na­tively ask to learn the con­di­tional prob­a­bil­ity dis­tri­bu­tion . If we know this prob­a­bil­ity – i.e., the prob­a­bil­ity of ob­serv­ing a (point, la­bel) pair once we see the point – it is easy to pre­dict new points.

For the rest of this sec­tion, we’ll write “prob­a­bil­ity” as rather than , and we’ll write rather than .

Now. For any pair , we can ex­press in terms of and and us­ing Bayes’ rule. This im­plies that know­ing all the is suffi­cient to es­ti­mate any (the pri­ors on and are gen­er­ally not an is­sue). The difficulty is that is usu­ally in­finite and we only have ac­cess to some finite train­ing se­quence .

For this rea­son, Naive Bayes will only be ap­pli­ca­ble to cer­tain types of prob­lems. One of them, which will be our run­ning ex­am­ple, is doc­u­ment clas­sifi­ca­tion in the bag of words model. Here, where is the num­ber of words in our dic­tio­nary, and each doc­u­ment is rep­re­sented as a vec­tor . where bit in­di­cates whether or not the -th word of our dic­tio­nary ap­pears in [the doc­u­ment rep­re­sented by ]. Thus, we throw away or­der and rep­e­ti­tions but keep ev­ery­thing else. Fur­ther­more, is a set of pos­si­ble top­ics, and the goal is to learn a pre­dic­tor that sees a new doc­u­ment and as­signs it a topic.

The cru­cial prop­erty of this ex­am­ple is that each co­or­di­nate of our vec­tors is very sim­ple. On the other hand, the vec­tors them­selves are not sim­ple: we have (num­ber of words in the dic­tio­nary), which means we have about pa­ram­e­ters to learn (all ). The solu­tion is the “naive” part: we as­sume that all fea­ture co­or­di­nates are in­de­pen­dent so that . This as­sump­tion is, of course, in­cor­rect: the prob­a­bil­ities for and are cer­tainly not in­de­pen­dent. But an im­perfect model can still make rea­son­able pre­dic­tions.

Thus, naive Bayes works by es­ti­mat­ing the pa­ram­e­ters for any and . If con­sists of 100 pos­si­ble top­ics, then we wish to es­ti­mate pa­ram­e­ters, which may be fea­si­ble.

Given a la­beled doc­u­ment col­lec­tion (the train­ing se­quence ), es­ti­mat­ing these pa­ram­e­ters is now both sim­ple and fast, which is the ma­jor strength of Naive Bayes. In fact, it is the fastest pos­si­ble pre­dic­tor in the sense that each train­ing point only has to be checked once. Note that is the prob­a­bil­ity of en­coun­ter­ing word in a doc­u­ment about topic . For ex­am­ple, it could be the prob­a­bil­ity of en­coun­ter­ing the word “Deipno­pho­bia” (fear of din­ner par­ties) in a doc­u­ment about sports. To es­ti­mate this prob­a­bil­ity, we read through all doc­u­ments about sport in our col­lec­tion and keep track of how many of them con­tain this term (spoilers: it’s zero). On that ba­sis, we es­ti­mate that

Why the ? Well, as this ex­am­ple illus­trates, the nat­u­ral es­ti­mate will of­ten be 0. How­ever, a 0 as our es­ti­mate will mean that any doc­u­ment which con­tains this word has no chance to be clas­sified as a sports doc­u­ment (see equa­tions be­low), which seems overkill. Hence we “smooth” the es­ti­mate by the .

Small tan­gent: I think it’s worth think­ing about why this prob­lem oc­curs. Why isn’t the com­puted value our best guess? It’s not be­cause there’s any­thing wrong with Bayesian up­dat­ing – ac­tu­ally the op­po­site. While the out­put rule is Bayesian, the pa­ram­e­ter es­ti­ma­tion, as de­scribed above, is not. A Bayesian pa­ram­e­ter es­ti­ma­tion would have a prior on the value of , and this prior would be up­dated upon com­put­ing the above frac­tion. In par­tic­u­lar, it would never go to 0. But, of course, the sim­ple fre­quen­tist guess makes the model vastly more prac­ti­cal.

Now sup­pose we have all these pa­ram­e­ters. Given an un­known doc­u­ment , we would like to out­put a doc­u­ment in the set . For each , we can write

where we got rid of the term be­cause it won’t change the ; it doesn’t mat­ter how likely it was to en­counter our new doc­u­ment, and it will be the same no mat­ter which topic we con­sider.

Now kicks in the naivety: we will choose a topic in the set

This is all stuff we can com­pute – we can es­ti­mate the prior based on our doc­u­ment col­lec­tion, and the re­main­ing prob­a­bil­ities are pre­cisely our pa­ram­e­ters.

In prac­tice, one uses a slightly more so­phis­ti­cated way to es­ti­mate pa­ram­e­ters (tak­ing into ac­count rep­e­ti­tions). One also smoothes slightly differ­ently (smooth­ing is the +1 step) and maps ev­ery­thing into log space so that the val­ues don’t get ab­surdly small, and we can add prob­a­bil­ities rather than mul­ti­ply­ing them. Nonethe­less, the im­por­tant ideas are in­tact.

## Fea­ture Selection

Every learn­ing prob­lem can be di­vided into two steps, namely

• (1) Fea­ture Selection

• (2) Learning

This is true even in the case of un­su­per­vised learn­ing or on­line learn­ing: be­fore ap­ply­ing clus­ter­ing al­gorithms onto a data set, one needs to choose a rep­re­sen­ta­tion.

The re­spec­tive chap­ter in the book does not de­scribe how to se­lect ini­tial fea­tures, only how, given a (po­ten­tially large) set of fea­tures, to choose a sub­set . The ap­proaches are fairly straight-for­ward:

• Mea­sure the effec­tive­ness of each fea­ture by train­ing a pre­dic­tor based on that fea­ture only; pick the best ones (lo­cal scor­ing)

• Start with an empty set of fea­tures, i.e., . In step , for each , test the perfor­mance of a pre­dic­tor trained on . Choose to be the fea­ture max­i­miz­ing the perfor­mance, and set (greedy se­lec­tion)

• Start with the full set of fea­tures, i.e., . In each step, drop one fea­ture such that the loss in perfor­mance is minimal

It’s worth point­ing out that greedy se­lec­tion can be highly in­effec­tive since fea­tures and could be use­less on their own but work well to­gether. For an ab­surdly con­trived ex­am­ple, sup­pose that , i.e., the tar­get value is the sum of both fea­tures mod­ulo some prime num­ber . In that case, if and are uniformly dis­tributed, and know­ing ei­ther one of them is zero in­for­ma­tion about the tar­get value. Nonethe­less, both taken to­gether de­ter­mine the tar­get value ex­actly.

(This ex­am­ple is de­rived from cryp­tog­ra­phy: split­ting a num­ber into ad­di­tive parts mod­ulo some fixed prime num­ber is called se­cret shar­ing. The idea is that all shares are re­quired to re­con­struct the se­cret, but any sub­set of shares is use­less. For ex­am­ple, if and all are uniformly dis­tributed, then and are in­de­pen­dently dis­tributed (both uniform), and ditto with and . This re­mains true for an ar­bi­trary num­ber of shares. In gen­eral, Ma­chine Learn­ing and cryp­tog­ra­phy have an in­ter­est­ing re­la­tion­ship: the former is about learn­ing in­for­ma­tion, the sec­ond about hid­ing in­for­ma­tion, i.e., mak­ing learn­ing im­pos­si­ble. That’s why cryp­tog­ra­phy can be a source of nega­tive ex­am­ples for Ma­chine Learn­ing al­gorithms.)

In this par­tic­u­lar case, even the greedy al­gorithm wouldn’t end up with fea­tures and since it has no mo­ti­va­tion to pick up ei­ther of them to be­gin with – but the third ap­proach would keep them both. On the other hand, if is use­ful on its own but only in com­bi­na­tion with , then the lo­cal scor­ing ap­proach would at best choose , while the greedy al­gorithm might end up with both. And, of course, it is also pos­si­ble to con­struct ex­am­ples where the third ap­proach fails. Sup­pose we have 10 fea­tures that all heav­ily de­pend on each other. To­gether, they add more than one fea­ture typ­i­cally does but not ten times as much, so they’re not worth keep­ing. Lo­cal scor­ing and greedy se­lec­tion won’t bother, but the third ap­proach will be such with them: get­ting rid of one would worsen perfor­mance by too much.

## Reg­u­lariza­tion and Stability

This is a difficult chap­ter that I’ve read par­tially but skipped en­tirely in this se­quence.

The idea of reg­u­lariza­tion is as fol­lows: sup­pose we can rep­re­sent our hy­pothe­ses as vec­tors , i.e., . In­stead of min­i­miz­ing the em­piri­cal loss , we min­i­mize the reg­u­larized em­piri­cal loss defined as

where is a reg­u­lariza­tion func­tion. A com­mon choice is for some pa­ram­e­ter . In that case, we wish to min­i­mize a term of the form . This ap­proach is called Reg­u­larized Loss Min­i­miza­tion. There are at least two rea­sons for do­ing this. One is that the pres­ence of the reg­u­lariza­tion func­tion makes the prob­lem eas­ier. The other is that, if the norm of a hy­poth­e­sis is a mea­sure for its com­plex­ity, choos­ing a pre­dic­tor with a small norm might be de­sir­able. In fact, you might re­call the Struc­tural Risk Min­i­miza­tion paradigm, where we defined a se­quence of hy­poth­e­sis classes such that . If we let , then both Struc­tural Risk Min­i­miza­tion and Reg­u­larized Loss Min­i­miza­tion priv­ilege pre­dic­tors with smaller norm. The differ­ence is merely that the trade­off is con­tin­u­ous (rather than step-wise) in the case of Reg­u­larized Loss Min­i­miza­tion.

The problem

where is the squared loss can be solved with lin­ear alge­bra – it comes down to in­vert­ing a ma­trix, where the term en­sures that it is always in­vert­ible.

## Gen­er­a­tive Models

Re­call the gen­eral model for su­per­vised learn­ing: we have the do­main set , the tar­get set , and the prob­a­bil­ity dis­tri­bu­tion over that gen­er­ates the la­bels. Our goal is to learn a pre­dic­tor .

It took me reach­ing this chap­ter to re­al­ize that this goal is not en­tirely ob­vi­ous. While the func­tion is gen­er­ally go­ing to be the most in­ter­est­ing thing, it is also con­ceiv­able to be in­ter­ested in the full dis­tri­bu­tion . In par­tic­u­lar, even if the best pre­dic­tor is known (that’s the pre­dic­tor given by ), one still doesn’t know how likely each do­main point is to ap­pear. Differ­ently put, the prob­a­bil­ity dis­tri­bu­tion over is in­for­ma­tion sep­a­rate from the pre­dic­tor; it’s nei­ther nec­es­sary nor suffi­cient for find­ing one.

The idea of gen­er­a­tive mod­els is to at­tempt to es­ti­mate the dis­tri­bu­tion di­rectly. A way to mo­ti­vate this is by Teg­mark et. al.’s ob­ser­va­tion that the pro­cess that gen­er­ates la­beled points of­ten has a vastly sim­pler de­scrip­tion than the points them­selves (chap­ter 2.3 of post X). This view sug­gests that es­ti­mat­ing could ac­tu­ally be eas­ier than learn­ing the pre­dic­tor . (Although not liter­ally since learn­ing im­plies learn­ing the best pre­dic­tor .)

The book has Naive Bayes as part of this chap­ter, even though it only in­volves es­ti­mat­ing the , not the prob­a­bil­ity dis­tri­bu­tion over .

The book is struc­tured in four parts. The first is about the fun­da­men­tals, which I cov­ered in posts I-III. The sec­ond and third are about learn­ing mod­els, which I largely cov­ered in posts IV-XIV. If there is a mean­ingful dis­tinc­tion be­tween parts two and three, it’s lost on me – they don’t seem to be in­creas­ing in difficulty, and I found both eas­ier than part one. How­ever, the fourth part, which is ti­tled ad­vanced the­ory, is no­tice­ably harder – too difficult for me to cover it at one post per week.

Here’s a list of its chap­ters:

• Rademacher Com­plex­ities, which is an ad­vanced the­ory study­ing the rate at which train­ing se­quences be­come rep­re­sen­ta­tive (closely re­lated to the con­cept of uniform con­ver­gence from post III).

• Cover­ing num­bers, a re­lated con­cept, about bound­ing the com­plex­ity of sets

• Proofs for the sam­ple com­plex­ity bounds in the quan­ti­ta­tive ver­sion of the fun­da­men­tal the­o­rem of statis­ti­cal learn­ing (I’ve men­tioned those a cou­ple of times in this se­quence)

• Some the­ory on the learn­abil­ity of mul­ti­class pre­dic­tion prob­lems, in­clud­ing an­other off-shoot of the VC-di­men­sion.

• Com­pres­sion Bounds, which is about es­tab­lish­ing yet an­other crite­rion for the learn­abil­ity of classes: if “a learn­ing al­gorithm can ex­press the out­put hy­poth­e­sis us­ing a small sub­set of the train­ing set, then the er­ror of the hy­poth­e­sis on the rest of the ex­am­ples es­ti­mates its true er­ror.”

• PAC-Bayes, which is an­other learn­ing approach

# Clos­ing Thoughts on the Book

I’ve men­tioned be­fore that, while Un­der­stand­ing Ma­chine Learn­ing is the best re­source on the sub­ject I’ve seen so far, it feels weaker than var­i­ous other text­books I’ve tried from Miri’s list. Part one is pretty good – it ac­tu­ally starts from the be­gin­ning, it’s ac­tu­ally rigor­ous, and it’s rea­son­ably well ex­plained. Parts two and are mixed. Some chap­ters are good. Others are quite bad. Oc­ca­sion­ally, I felt the fa­mil­iar sense of an­noy­ance-at-how-any­one-could-pos­si­bly-think-this-was-the-best-way-ex­plain-this-con­cept that I get all the time read­ing aca­demic pa­pers very rarely with Miri-recom­mended text­books. It feels a bit like the au­thors iden­ti­fied a prob­lem (no good text­books that offer a com­pre­hen­sive the­o­ret­i­cal treat­ment of the field), worked on a solu­tion, but lost in­spira­tion some­where on the way and ended up pro­duc­ing sev­eral chap­ters that feel rather un­in­spired. The one on neu­ral net­works is a good ex­am­ple.

(Be­fore bash­ing the book some more, I should note that the crit­i­cally ac­claimed Ele­ments of Statis­ti­cal Learn­ing made me an­gry im­me­di­ately when I tried to read it – point be­ing, I ap­pear to be un­usu­ally hard to please.)

The ex­er­cises are lack­ing through­out. I think the goal of ex­er­cises should be to help me in­ter­nal­ize the con­cepts of the re­spec­tive chap­ter, and they didn’t re­ally do that. It seems that the mo­ti­va­tion for them was of­ten “let’s offload the proof of this lemma into the ex­er­cises to save space” or “here’s an­other in­ter­est­ing thing we didn’t get to cover, let’s make it an ex­er­cise.” (Note that there’s noth­ing wrong with offload­ing the proof of a lemma to ex­er­cises per se, they can be perfectly fine, but they’re not au­to­mat­i­cally fine.) Con­versely, in Topol­ogy, the first ex­er­cises of each sub-chap­ter usu­ally felt like “let’s make sure you un­der­stood the defi­ni­tions” and the later ones like “let’s make sure you get some real prac­tice with these con­cepts,” which seems much bet­ter. Ad­mit­tedly, it’s prob­a­bly more challeng­ing to de­sign ex­er­cises for Ma­chine Learn­ing than it is for a purely math­e­mat­i­cal topic. Nonethe­less, I think they could be im­proved dra­mat­i­cally.

Next up (but not any­time soon) are -calcu­lus and Cat­e­gory the­ory.