UML VI: Stochastic Gradient Descent

(This is part six in a sequence on Machine Learning based on this book. Click here for part 1.)

Stochastic Gradient Descent is the big Machine Learning technique. It performs well in practice, it’s simple, and it sounds super cool. But what is it?

Roughly, Gradient Descent is an approach for function minimization, but it generally isn’t possible to apply it to Machine Learning tasks. Stochastic Gradient Descent, which is Gradient Descent plus noise, is a variant which can be used instead.

Let’s begin with regular Gradient Descent.

Gradient Descent

Given a function $f : \mathbb{R}^d \to \mathbb{R}$ and a point $w \in \mathbb{R}^d$, the gradient of $f$ at $w$ is the vector of partial derivatives of $f$ at $w$, i.e.

$\nabla f(w) = \left( \frac{\partial f}{\partial w_1}(w), \ldots, \frac{\partial f}{\partial w_d}(w) \right).$

So the gradient at a point is an element of $\mathbb{R}^d$. In contrast, the gradient itself (not yet applied to a point) is the function $\nabla f : \mathbb{R}^d \to \mathbb{R}^d$ defined by the above rule. Going forward, “gradient” always means “gradient at a point”.

If $d = 1$, then the gradient will be $f'(w)$, a number in $\mathbb{R}$. It can look like this:

Or like this:

Or also like this:

Point being, if $f'(w) > 0$ it’ll point rightward and if $f'(w) < 0$ it’ll point leftward; it can also have different lengths, but that’s it. The idea of Gradient Descent is that the gradient points in the direction of fastest positive change, and thus the opposite of the gradient is the direction of fastest negative change. It follows that, to minimize a function, one can start anywhere, compute the gradient, and then move in the opposite direction.
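To make this concrete, here is a minimal sketch of plain Gradient Descent in Python. The toy function, its derivative, the step size, and the number of steps are my own illustrative choices, not anything prescribed by the text:

```python
# Minimal gradient descent on a toy 1-d function f(w) = (w - 3)^2.
def f(w):
    return (w - 3.0) ** 2

def grad_f(w):
    return 2.0 * (w - 3.0)   # derivative of f

w = 0.0          # start anywhere
eta = 0.1        # step size
for _ in range(100):
    w = w - eta * grad_f(w)  # move opposite to the gradient

print(w, f(w))   # w ends up close to the minimizer 3, f(w) close to 0
```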

In the above example, the function goes up, therefore the derivative is positive, therefore the gradient points rightward. Gradient Descent tells us to go in the opposite direction, i.e. leftward. Indeed, leftward is where the function decreases. Clearly, this is a display of utter brilliance.

Importantly, note that the picture is misleading insofar as it suggests there are more than two directions. But actually, the gradient lives in the domain space, in this case $\mathbb{R}$, not in the Cartesian product of domain space and target space, in this case $\mathbb{R} \times \mathbb{R}$. Therefore, it cannot point upward or downward.

As silly as the one-dimensional case is, it quickly becomes less trivial as we increase the dimension. Consider this function (picture taken from Wikipedia):

Again, the gradient lives in the domain space, which is $\mathbb{R}^2$, as this image illustrates nicely. Thus it cannot point upward or downward; however, it can point in any direction within the flat plane. If we look at a point on this function, it is not immediately obvious in which direction one should move to obtain the fastest possible decrease of its function value. But if we look at the little arrows (the gradients at different domain points), they tell us the direction of the fastest positive change. If we reverse them, we get the direction of the fastest negative change.

The following is important to keep in mind: in the context of Machine Learning, the domain space for our loss function is the space of all possible hypotheses. So the domain is itself a function space. What we want to do is to minimize a function (a loss function) that takes a function (a predictor, properly parametrized) as an argument, and outputs the performance of that predictor on the real world. For example, if we allow predictors parametrized by three real numbers for a regression problem (i.e. $\mathcal{Y} = \mathbb{R}$), our function space could be $\mathbb{R}^3$, where each $w \in \mathbb{R}^3$ defines one predictor $h_w$. Then, there is some well-defined number $L_{\mathcal{D}}(w)$ that evaluates the performance of $h_w$ in the real world. Furthermore, if $L_{\mathcal{D}}$ is differentiable, there is also some well-defined three-dimensional vector $\nabla L_{\mathcal{D}}(w)$. This vector, which lives in $\mathbb{R}^3$, tells us that the real error increases most quickly if we change the predictor in the direction of the gradient – or equivalently, that it decreases most quickly if we change the predictor in the direction opposite to the gradient. This is why we will refer to our elements of $\mathbb{R}^3$ by letters like $w$, rather than $x$ or $y$.
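As a small, entirely made-up illustration of this “the domain is a space of predictors” view, here is a Python sketch in which a vector $w \in \mathbb{R}^3$ parametrizes a predictor, and a loss is computed as a function of $w$. The quadratic feature map and the squared loss are my own assumptions:

```python
import numpy as np

# w = (a, b, c) parametrizes the predictor h_w(x) = a*x^2 + b*x + c.
# (The quadratic form and the squared loss below are illustrative assumptions.)
def predict(w, x):
    a, b, c = w
    return a * x**2 + b * x + c

# Squared loss of the predictor h_w on a single example (x, y).
def loss(w, x, y):
    return (predict(w, x) - y) ** 2

# The object we want to minimize is a function of w, not of x:
# w lives in R^3, and each w is (a parametrization of) a predictor.
w = np.array([0.5, -1.0, 2.0])
print(loss(w, x=1.5, y=0.3))
```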

If the loss function $L_{\mathcal{D}}$ were known, one could thus minimize it by starting with an arbitrary predictor parametrized by $w^{(1)}$, computing the gradient $\nabla L_{\mathcal{D}}(w^{(1)})$, and then “moving” in the direction opposite to the gradient, i.e. setting $w^{(2)} = w^{(1)} - \eta \nabla L_{\mathcal{D}}(w^{(1)})$, where $\eta$ is a parameter determining the step size. There is no telling for how long moving in the direction opposite to the gradient will keep decreasing the function, and therefore the choice of $\eta$ is nontrivial. One might also decrease it over time.

But, of course, the loss function isn’t known, which makes regular Gradient Descent impossible to apply.

Stochastic Gradient Descent

As always, we use the training sequence $S = (z_1, \ldots, z_T)$ as a substitute for information about the real error, because that’s all we have. Each element $z_i$ defines the point-based loss function $\ell_{z_i} : \mathcal{H} \to \mathbb{R}$ given by $\ell_{z_i}(w) = \ell(w, z_i)$, which we can actually use to compute a gradient. Here, $\mathcal{H}$ has to be some familiar set – and to apply the formal results we look at later, it also has to be bounded, so think of a ball of some radius $B$ in $\mathbb{R}^d$.
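For reference, here is the relationship between the point-based losses and the real error in symbols (a standard definition; writing the real error as the expected point-based loss under the true distribution $\mathcal{D}$ is my assumption about the notation of the earlier posts):

$\ell_{z}(w) = \ell(w, z), \qquad L_{\mathcal{D}}(w) = \mathbb{E}_{z \sim \mathcal{D}}\big[\ell(w, z)\big].$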

Now we do the same thing as described above, except that we evaluate the performance of our predictor at only a single point at a time – however, we will use a different point for each step. Thus, we begin with some $w^{(1)}$ which defines a predictor $h_{w^{(1)}}$, and we will update it each step so that, after step $t$, we have the predictor $w^{(t+1)}$. To perform step $t$, we take the loss function $\ell_{z_t}$, which maps each $w$ onto the error of the predictor $h_w$ on the point $z_t$. We compute the gradient of this loss function at our current predictor, i.e. we compute $\nabla \ell_{z_t}(w^{(t)})$. Then we update by doing a small step in the direction opposite to this gradient. Our complete update rule can be expressed by the equation

$w^{(t+1)} = w^{(t)} - \eta \, \nabla \ell_{z_t}\big(w^{(t)}\big).$
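Here is a minimal sketch of this procedure in Python. The linear predictor, the squared loss, and the synthetic data are illustrative assumptions on my part, not part of the text:

```python
import numpy as np

# Stochastic Gradient Descent with one pass over the training sequence.
# Assumed for illustration: a linear predictor h_w(x) = <w, x> and the squared
# loss l_z(w) = (<w, x> - y)^2, whose gradient is 2*(<w, x> - y)*x.

rng = np.random.default_rng(0)
d, T = 3, 1000
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(T, d))                  # training sequence: inputs ...
Y = X @ w_true + 0.1 * rng.normal(size=T)    # ... and noisy labels

eta = 0.01
w = np.zeros(d)                              # start at w^(1) = 0
iterates = []
for t in range(T):                           # step t only looks at the point z_t = (X[t], Y[t])
    grad = 2.0 * (X[t] @ w - Y[t]) * X[t]    # gradient of the point-based loss at w^(t)
    iterates.append(w)
    w = w - eta * grad                       # w^(t+1) = w^(t) - eta * grad

w_bar = np.mean(iterates, axis=0)            # averaged predictor (see further down)
print(w_bar)                                 # roughly recovers w_true
```

The last line already computes the averaged predictor that the variant of the algorithm analyzed further down actually outputs.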

The reason why, as the book puts it, Stochastic Gradient Descent can be used to “minimize the loss function directly” is that, given the i.i.d. assumption, the point $z_t$ is an unbiased sample from the real distribution, and therefore one can prove that the expected value – with respect to the point – of the gradient we compute equals the gradient of the real loss function $L_{\mathcal{D}}$. Thus, we won’t always update in the right direction, but we will update in the right direction plus noise. The more steps one does, the lower the expected total effect of the noise will be, at least if $\eta$ is decreased over time. (This is so because, if one does $n$ random steps from some origin, then as $n$ increases, even though the expected absolute distance from the origin will increase, the expected relative distance decreases: the absolute distance grows only on the order of $\sqrt{n}$, so the relative distance behaves like $\sqrt{n}/n = 1/\sqrt{n}$. Since the consistent movement in the correct direction grows linearly with the number of steps, the noise becomes negligible in comparison, making us likely to converge toward the solution.)
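In symbols, the unbiasedness claim is the following exchange of gradient and expectation (the exchange itself needs mild regularity conditions, which I am assuming here):

$\mathbb{E}_{z \sim \mathcal{D}}\big[\nabla \ell_{z}(w)\big] \;=\; \nabla\, \mathbb{E}_{z \sim \mathcal{D}}\big[\ell_{z}(w)\big] \;=\; \nabla L_{\mathcal{D}}(w).$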

In the context of supervised learning, i.e. where training data is available at all, the only things that need to hold in order for Stochastic Gradient Descent to be applicable are that (a) the hypothesis class can be represented as a familiar set, and (b) we can compute gradients of the point-based loss functions (in practice, they don’t even necessarily need to be differentiable at every point). The remaining question is whether or not it will perform well. The most important property here is certainly the convexity of the loss function, because that is what guarantees that the predictor will not get stuck in a local minimum, which could otherwise easily happen (the negative gradient would just keep pushing us back toward that local minimum).

In the previous chapter, we introduced the classes of convex Lipschitz bounded and convex smooth bounded problems. For both of these, one can derive upper bounds on the expected error of a classifier trained via Stochastic Gradient Descent for a certain number of steps. In the remainder of this chapter, we will do one of those proofs, namely the one for convex smooth bounded problems. This is an arbitrary choice; both proofs are technical and difficult, and unfortunately also quite different, so we’re not doing both.

Deriving an upper bound on the Error

The proof which upper-bounds the expected real error is of the kind that I usually don’t work out in detail because it just doesn’t seem worth it. On the other hand, I don’t want difficult proofs to be excluded in principle. So this will be an experiment in how my style of dissecting each problem until its pieces are sufficiently small works out when applied to a proof that seems to resist intuitive understanding. In any case, working through this proof is not required to understand Stochastic Gradient Descent, so you might end this chapter here.

Now let us begin.

The plan

Recall that we are given a problem out of the class of convex smooth bounded problems (with parameters $\beta$ and $B$), and we run the algorithm defined earlier in this post on our training sequence $S$. We wish to bound the (expected) real error of the resulting output predictor.

Meet the problem instance:

Of course, this is merely the problem instance within the images-world; the symbolic world where the actual proof happens will make no assumptions about the nature of the problem. That said, in the problem instance, we have $d = 1$ and $T = 2$, because that’s enough to illustrate the steps. Our training sequence (not shown!) consists of two elements, i.e. $S = (z_0, z_1)$, this time indexed from 0 to avoid having to write a lot of “$+1$”s. For our hypothesis class, think of $\mathcal{H} = \mathbb{R}$, where each $w$ defines a simple linear predictor $x \mapsto wx$ – although going forward, we will pretend as if $w$ itself were the predictor. What the picture does show is the predictors through which the algorithm iterates (the highest one is $w^{(0)}$, the second-highest $w^{(1)}$, and the one without a gradient arrow is $w^{(2)}$). They live in one-dimensional space (on the $x$-axis). Their real error corresponds to their position on the $y$-axis. Thus, both the predictors themselves and their errors are monotonically decreasing. The green arrows denote the gradients $\nabla\ell_{z_0}(w^{(0)})$ and $\nabla\ell_{z_1}(w^{(1)})$. The filled point denotes the optimal predictor $w^*$.

For unspecified reasons – probably because it makes analysis easier, although it could also reduce variance in practice – the precise algorithm we analyze puts out the averaged predictor $\bar{w} = \frac{1}{T}\sum_{t=0}^{T-1} w^{(t)}$ rather than the final iterate $w^{(T)}$, even though the final iterate naively looks like the better choice. In our case, that means we put out $\bar{w} = \frac{1}{2}\big(w^{(0)} + w^{(1)}\big)$. With that in mind, now meet the proof roadmap:

The proof roadmap illustrates what terms we’re concerned with (red), and which upper bounds we’re deriving. Dashed lines mean a small factor in front of the term; fat lines mean a large factor. In the beginning, we look at the difference between the error of our output predictor and that of the optimal predictor (leftmost picture). Then we do three steps; in each step, we bound our current term by a different term. First, we bound it in terms of the inner products between the predictors themselves and the gradients. Then, we bound that by $\frac{\eta}{2}$ times the squared norms of the gradients plus many times the squared norm of the optimal predictor, where $\eta$ is the (small) step size. Then we bound that by $\eta\beta$ times the total error of our output predictor plus a lot of times the norm of the optimal predictor.
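Put into one chain of inequalities – my reconstruction of what the three pictures stand for, in the notation introduced above, writing $v_t$ for $\nabla\ell_{z_t}(w^{(t)})$:

$\sum_{t=0}^{T-1}\Big(\ell_{z_t}\big(w^{(t)}\big) - \ell_{z_t}(w^*)\Big) \;\le\; \sum_{t=0}^{T-1}\big\langle w^{(t)} - w^*,\, v_t\big\rangle \;\le\; \frac{\eta}{2}\sum_{t=0}^{T-1}\|v_t\|^2 + \frac{\|w^*\|^2}{2\eta} \;\le\; \eta\beta\sum_{t=0}^{T-1}\ell_{z_t}\big(w^{(t)}\big) + \frac{\|w^*\|^2}{2\eta}.$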

At that point, we’ve gone around in a circle. Well – not quite, since we started with the difference between the error of the output predictor and that of the optimal predictor (hence why the lines don’t go all the way down in picture one) and ended with the total error of the output predictor. But it’s good enough. Through algebraic manipulations (adding the error of the optimal predictor on both sides and rescaling), we obtain a bound on the error of our output predictor in terms of the error of the optimal predictor and the norm of the optimal predictor, i.e.:

And the equation corresponding to this picture will be the theorem statement.

Now we just have to do the three steps, and then the algebraic stuff at the end.

Step 1

We start by bounding the term $\sum_{t=0}^{T-1}\big(\ell_{z_t}(w^{(t)}) - \ell_{z_t}(w^*)\big)$, which is closely related to but not identical to the real error of the output predictor (it is defined in terms of the point-based loss functions rather than the real error). $T$ will be the number of steps, so $T = 2$ in our example. Recall the proof roadmap:

The first stop (second picture) is the term $\sum_{t=0}^{T-1}\big\langle w^{(t)} - w^*,\, v_t\big\rangle$, where we write $v_t$ for $\nabla\ell_{z_t}(w^{(t)})$. This bound applies to each summand individually, and it relies on a fundamental property of convex functions, namely that every tangent remains below the function:

This should be easy enough to believe; let’s skip the formal proof and continue.
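For reference, here is the tangent property in symbols, applied to the convex function $\ell_{z_t}$ (a standard statement of convexity, in the notation used above):

$\ell_{z_t}(w^*) \;\ge\; \ell_{z_t}\big(w^{(t)}\big) + \big\langle w^* - w^{(t)},\, v_t\big\rangle, \qquad \text{i.e.} \qquad \ell_{z_t}\big(w^{(t)}\big) - \ell_{z_t}(w^*) \;\le\; \big\langle w^{(t)} - w^*,\, v_t\big\rangle.$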

Step 2

Our current term is $\sum_{t=0}^{T-1}\big\langle w^{(t)} - w^*,\, v_t\big\rangle$. Recall the proof roadmap:

The next stop (third picture) is the term $\frac{\eta}{2}\sum_{t=0}^{T-1}\|v_t\|^2 + \frac{\|w^*\|^2}{2\eta}$. This will be the hardest step – let’s begin by reflecting on what it means.

We wish to bound the inner products of predictors and corresponding gradients by the norms of the gradients and the norm of the optimal predictor. So what we lose is the values of the predictors themselves. However, these are implicit in the gradients, since the gradients encode how we move from each predictor to the next, all the way to our final predictor. Consequently, we will need to use the update rule, $w^{(t+1)} = w^{(t)} - \eta v_t$, to prove this result.

Our approach will be to reduce the inner product to the difference between $\|w^{(t)} - w^*\|^2$ and $\|w^{(t+1)} - w^*\|^2$. Indeed,

$\|w^{(t+1)} - w^*\|^2 \;=\; \|w^{(t)} - \eta v_t - w^*\|^2 \;=\; \|w^{(t)} - w^*\|^2 - 2\eta\big\langle w^{(t)} - w^*,\, v_t\big\rangle + \eta^2\|v_t\|^2.$

Or differently written, $\big\langle w^{(t)} - w^*,\, v_t\big\rangle = \frac{1}{2\eta}\Big(\|w^{(t)} - w^*\|^2 - \|w^{(t+1)} - w^*\|^2\Big) + \frac{\eta}{2}\|v_t\|^2$. Thus, we have

$\sum_{t=0}^{T-1}\big\langle w^{(t)} - w^*,\, v_t\big\rangle \;=\; \frac{1}{2\eta}\sum_{t=0}^{T-1}\Big(\|w^{(t)} - w^*\|^2 - \|w^{(t+1)} - w^*\|^2\Big) + \frac{\eta}{2}\sum_{t=0}^{T-1}\|v_t\|^2.$

Then, the first sum on the right is a telescopic sum, i.e. each segment negates part of the next segment. What remains of it is $\frac{1}{2\eta}\big(\|w^{(0)} - w^*\|^2 - \|w^{(T)} - w^*\|^2\big)$, and we can upper bound that by $\frac{1}{2\eta}\|w^*\|^2$, which is the bound we wanted to have. Here, the $w^{(0)}$ disappeared because the algorithm assumes we start with $w^{(0)} = 0$, which I conveniently ignored in my drawings.
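Since the one-step identity carries this whole step, here is a quick numerical sanity check in Python (the vectors and step size are arbitrary test values):

```python
import numpy as np

# Check: <w_t - w_star, v_t> == (||w_t - w_star||^2 - ||w_next - w_star||^2) / (2*eta)
#                               + (eta / 2) * ||v_t||^2,   where w_next = w_t - eta * v_t.
rng = np.random.default_rng(1)
w_t, w_star, v_t = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)
eta = 0.3
w_next = w_t - eta * v_t

lhs = np.dot(w_t - w_star, v_t)
rhs = (np.dot(w_t - w_star, w_t - w_star)
       - np.dot(w_next - w_star, w_next - w_star)) / (2 * eta) \
      + (eta / 2) * np.dot(v_t, v_t)
print(np.isclose(lhs, rhs))  # True
```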

Step 3

Our current term is $\frac{\eta}{2}\sum_{t=0}^{T-1}\|v_t\|^2 + \frac{\|w^*\|^2}{2\eta}$. Recall the proof roadmap:

The next stop (last picture) is the term $\eta\beta\sum_{t=0}^{T-1}\ell_{z_t}\big(w^{(t)}\big) + \frac{\|w^*\|^2}{2\eta}$. This one will also work for each summand separately. We wish to prove that

$\|v_t\|^2 \;\le\; 2\beta\,\ell_{z_t}\big(w^{(t)}\big),$

where $\beta$ is from the smoothness definition of our point-wise loss function. So this bound is all about smooth functions. It even has a name: functions with this property are called self-bounded, because one can bound the norm of the gradient in terms of the value of the same function.

If one were to prove this formally, one would do so for arbitrary smooth functions, so one would prove that, if $f : C \to \mathbb{R}$, where $C \subseteq \mathbb{R}^d$ is convex, and $f$ is a $\beta$-smooth and nonnegative function, then

$\|\nabla f(w)\|^2 \;\le\; 2\beta f(w) \quad \text{for all } w \in C.$

Recall that $f$ is $\beta$-smooth iff $\|\nabla f(w) - \nabla f(u)\| \le \beta\|w - u\|$ for all $w$ and $u$. Now, why is this statement true? Well, if we imagine $f$ to model the position of a space ship, then smoothness says that the space ship cannot accelerate too quickly, so in order to reach a certain speed, it will gradually have to accelerate towards that speed and thus will have to bridge a certain distance. Let’s take the one-dimensional case, and suppose we start from $f(0) = 0$ with $f'(0) = 0$ and consistently accelerate as fast as we’re allowed to (this should be the optimal way to increase speed while minimizing distance). Then $f''(t) = \beta$ and $f'(t) = \beta t$, which means that $f(t) = \frac{1}{2}\beta t^2$ and $f'(t)^2 = \beta^2 t^2$; then $f'(t)^2 = 2\beta \cdot \frac{1}{2}\beta t^2 = 2\beta f(t)$. This is reassuring, because we’ve tried to construct the most difficult case possible, and the statement held with equality.

The formal proof, at least the one I came up with, is quite long and relies on line integrals, so we will also skip it.
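As a tiny sanity check (not the proof), here is the self-boundedness inequality evaluated numerically for an assumed smooth nonnegative function, the squared distance $f(w) = \|w - a\|^2$, which is $\beta$-smooth with $\beta = 2$:

```python
import numpy as np

# f(w) = ||w - a||^2 is 2-smooth (its gradient 2*(w - a) is 2-Lipschitz) and nonnegative,
# so self-boundedness predicts ||grad f(w)||^2 <= 2 * beta * f(w) with beta = 2.
# The vectors below are arbitrary test values.
rng = np.random.default_rng(2)
a, w = rng.normal(size=4), rng.normal(size=4)
beta = 2.0

f_w = np.dot(w - a, w - a)
grad_w = 2.0 * (w - a)
print(np.dot(grad_w, grad_w) <= 2 * beta * f_w + 1e-12)  # True (with equality here)
```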

The algebraic stuff

We have established that

$\sum_{t=0}^{T-1}\Big(\ell_{z_t}\big(w^{(t)}\big) - \ell_{z_t}(w^*)\Big) \;\le\; \eta\beta\sum_{t=0}^{T-1}\ell_{z_t}\big(w^{(t)}\big) + \frac{\|w^*\|^2}{2\eta},$

which can be reordered as

$\sum_{t=0}^{T-1}\ell_{z_t}\big(w^{(t)}\big) - \eta\beta\sum_{t=0}^{T-1}\ell_{z_t}\big(w^{(t)}\big) \;\le\; \sum_{t=0}^{T-1}\ell_{z_t}(w^*) + \frac{\|w^*\|^2}{2\eta}.$

At this point, we better have that $\eta\beta < 1$; otherwise, this bound is utterly useless. Assuming $\eta\beta < 1$ does indeed hold, we can further reformulate this equation as

$(1 - \eta\beta)\sum_{t=0}^{T-1}\ell_{z_t}\big(w^{(t)}\big) \;\le\; \sum_{t=0}^{T-1}\ell_{z_t}(w^*) + \frac{\|w^*\|^2}{2\eta}.$

Rescaling this, we obtain

$\sum_{t=0}^{T-1}\ell_{z_t}\big(w^{(t)}\big) \;\le\; \frac{1}{1 - \eta\beta}\left(\sum_{t=0}^{T-1}\ell_{z_t}(w^*) + \frac{\|w^*\|^2}{2\eta}\right).$

Dividing this by $T$ to have the left term closer to our actual error, we get

$\frac{1}{T}\sum_{t=0}^{T-1}\ell_{z_t}\big(w^{(t)}\big) \;\le\; \frac{1}{1 - \eta\beta}\left(\frac{1}{T}\sum_{t=0}^{T-1}\ell_{z_t}(w^*) + \frac{\|w^*\|^2}{2\eta T}\right).$

Now we take expectations over the randomization of the training sequence on both sides. The left side then becomes (an upper bound on) the expected real error of our output predictor $\bar{w}$: each $\ell_{z_t}(w^{(t)})$ has expectation $\mathbb{E}\big[L_{\mathcal{D}}(w^{(t)})\big]$ because $z_t$ is independent of $w^{(t)}$, and by convexity $L_{\mathcal{D}}(\bar{w}) \le \frac{1}{T}\sum_t L_{\mathcal{D}}(w^{(t)})$. The first term on the right becomes $L_{\mathcal{D}}(w^*)$, and the second term doesn’t change, because it doesn’t depend on the training sequence. Thus, we have derived the bound

$\mathbb{E}\big[L_{\mathcal{D}}(\bar{w})\big] \;\le\; \frac{1}{1 - \eta\beta}\left(L_{\mathcal{D}}(w^*) + \frac{\|w^*\|^2}{2\eta T}\right),$

and this will be our final result. By choosing $\eta$ correctly and rearranging, it is possible to define a function $m$ such that, for any $\beta$ and any $B$ upper-bounding the norm of elements of the hypothesis class ($\|w^*\|$ cannot be an input since we don’t know the optimal predictor $w^*$) and any arbitrarily small $\epsilon > 0$, if $T \ge m(\beta, B, \epsilon)$, then the above is upper-bounded by $L_{\mathcal{D}}(w^*) + \epsilon$. To be specific, $m(\beta, B, \epsilon)$ will be a term on the order of $\frac{B^2\beta}{\epsilon^2}$.

Reflections

Recall that $\beta$ is the parameter from the smoothness of our loss function, $T$ is the number of training steps we make (which equals the number of elements in our training sequence), and $\eta$ is the step size, which we can choose freely. In the result, we see that a smaller $\beta$ is strictly better, a larger $T$ is strictly better, and $\eta$ will have its optimum at some unknown point. This all makes perfect sense – if things had come out any other way, that would be highly surprising. Probably the most non-obvious qualitative property of this result is that the norm of the optimal predictor plays the role that it does.
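To see the trade-off in $\eta$ concretely, here is a small Python sketch that evaluates the right-hand side of the derived bound, $\frac{1}{1-\eta\beta}\big(L_{\mathcal{D}}(w^*) + \frac{\|w^*\|^2}{2\eta T}\big)$, for a grid of step sizes; the numbers plugged in are arbitrary illustrative values:

```python
import numpy as np

# Evaluate the derived bound as a function of the step size eta.
# beta, T, the optimal error, and ||w*||^2 are arbitrary illustrative values.
beta, T = 1.0, 10_000
loss_opt = 0.1          # stands in for L_D(w*)
w_star_sq = 4.0         # stands in for ||w*||^2

def bound(eta):
    return (loss_opt + w_star_sq / (2 * eta * T)) / (1 - eta * beta)

etas = np.linspace(1e-4, 0.99 / beta, 1000)
values = bound(etas)
print(etas[np.argmin(values)], values.min())
# The bound blows up for eta near 0 and near 1/beta, and is best somewhere in between.
```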

The most interesting aspect of the proof is probably the “going around in a circle” part. However, even after working this out in detail, I still don’t have a good sense of why we chose these particular steps. If anyone has some insight here, let me know.
