A Primer on Matrix Calculus, Part 2: Jacobians and other fun

I started this post think­ing that I would write all the rules for eval­u­at­ing Ja­co­bi­ans of neu­ral net­work pa­ram­e­ters in spe­cific cases. But while this would cer­tainly be use­ful for grokking deep learn­ing pa­pers, frankly it’s difficult to write that in La­tex and the peo­ple who have writ­ten The Ma­trix Calcu­lus You Need For Deep Learn­ing pa­per have already done it much bet­ter than I can do.

Rather, I con­sider my com­par­a­tive ad­van­tage here to provide some ex­pan­sion on why we should use Ja­co­bi­ans in the first place. If you were to just read the pa­per above, you might start to think that Ja­co­bi­ans are just no­ta­tional perks. I hope to con­vince you that they are much more than that. In at least one set­ting, Ja­co­bi­ans provide a math­e­mat­i­cal frame­work for an­a­lyz­ing the in­put-out­put be­hav­ior of deep neu­ral net­works, which can help us see things which we might have missed with­out this frame­work. A spe­cific case of this phe­nomenon is a re­cently dis­cov­ered tech­nique which was even more re­cently put into a prac­ti­cal im­ple­men­ta­tion: Ja­co­bian reg­u­lariza­tion. Here we will see some fruits of our ma­trix calcu­lus la­bor.

Deep learn­ing tech­niques re­quire us to train a neu­ral net­work by slowly mod­ify­ing pa­ram­e­ters of some func­tion un­til the func­tion be­gins re­turn­ing some­thing close to the in­tended out­put. Th­ese pa­ram­e­ters are of­ten rep­re­sented in the form of ma­tri­ces. There are a few rea­sons for this rep­re­sen­ta­tion: the ma­trix form is com­pact, and it al­lows us to use the tools of lin­ear alge­bra di­rectly. Ma­trix com­pu­ta­tions can also be pro­cessed in par­allel, and this stan­dard­iza­tion al­lows pro­gram­mers to build effi­cient libraries for the train­ing of deep neu­ral net­works.

One quite im­por­tant ma­trix in deep learn­ing is the Ja­co­bian.

In one sense, the Ja­co­bian ma­trix is just a way of or­ga­niz­ing gra­di­ent vec­tors. Gra­di­ent vec­tors, in turn, are just ways of or­ga­niz­ing par­tial deriva­tives of an ex­pres­sion. There­fore, the Ja­co­bian ma­trix is just a big ma­trix which al­lows us to or­ga­nize a lot of par­tial deriva­tives. For­mally, the Ja­co­bian of is defined by the fol­low­ing ma­trix. We de­note as the map­ping from , where is the real num­ber line in the th co­or­di­nate of the out­put vec­tor . Then the Ja­co­bian is simply

But de­scribing the Ja­co­bian as just some big rec­t­an­gu­lar box of par­tial deriva­tives of vec­tor-val­ued func­tions hides the in­tu­ition for why Ja­co­bi­ans are im­por­tant. To truly grasp the Ja­co­bian, we will first build up to it by imag­in­ing a func­tion that maps be­tween two real co­or­di­nate spaces: . If , this func­tion has a nat­u­ral in­ter­pre­ta­tion. We can see it by con­sid­er­ing the case where .

Now imag­ine stretch­ing, warp­ing, and con­tract­ing the plane so that the points are moved around in some man­ner. This means that ev­ery point is mapped to some other point on the graph. For in­stance, (2,5) could be mapped to (1,3). Cru­cially, make sure this trans­for­ma­tion doesn’t cause any sort of sharp dis­con­ti­nu­ity: we don’t want the graph to rip. I am not good with illus­trat­ing that type of thing, so I en­courage you to imag­ine it in­stead in your head, or al­ter­na­tively watch this video.

There is a spe­cial set of such map­pings, which we call lin­ear trans­for­ma­tions. Lin­ear trans­for­ma­tions have the prop­erty that when we perform the stretch­ing, the gridlines are kept straight. We could still ro­tate the graph, for in­stance. But what we can’t do is bend some axis so that it takes on a curvy shape af­ter the trans­for­ma­tion. If we drew a line on the graph be­fore the lin­ear trans­for­ma­tion, it must re­main a straight line af­ter the trans­for­ma­tion.

What does this have to do with the Ja­co­bian? To see we first must ask why deriva­tives are use­ful. In the most ba­sic sense, the deriva­tive is get­ting at try­ing to an­swer the questoin, “What is the lo­cal be­hav­ior of this func­tion?”

First, if you will con­sider by anal­ogy, we could ask for the lo­cal be­hav­ior of some differ­en­tiable func­tion . This would be a line whose slope is pro­vided by the deriva­tive. Similarly we could ask for the lo­cal be­hav­ior of some mul­ti­vari­ate func­tion , which would be a hy­per­plane whose di­rec­tion is de­ter­mined by the gra­di­ent. Now when we ask what the lo­cal be­hav­ior of some vec­tor-val­ued func­tion is, we get a lin­ear trans­for­ma­tion de­scribed by the Ja­co­bian.

In the above illus­tra­tion, the Ja­co­bian is eval­u­ated at the point in the bot­tom left cor­ner of the red tiles. The lin­ear trans­for­ma­tion im­plied by the Ja­co­bian is rep­re­sented by the translu­cent square af­ter the func­tion is ap­plied, which is a ro­ta­tion trans­for­ma­tion some an­gle clock­wise. As we can see, while is some ex­tra curvy func­tion, the Ja­co­bian can ap­prox­i­mate the lo­cal be­hav­ior at a point quite well.

To see how this vi­su­al­iza­tion is use­ful, con­sider the case of ap­ply­ing a Ja­co­bian to a neu­ral net­work. We could be de­sign­ing a sim­ple neu­ral net­work to pre­dict the out­put of two vari­ables, rep­re­sent­ing per­haps nor­mal­ized class prob­a­bil­ities, given an in­put of three vari­ables, rep­re­sent­ing per­haps in­put pixel data. We now illus­trate the neu­ral net­work.

This neu­ral net­works im­ple­ments a par­tic­u­lar func­tion from . How­ever, the ex­act func­tion that is be­ing im­ple­mented de­pends cru­cially on the pa­ram­e­ters, here de­noted by the con­nec­tions be­tween the nodes. If we com­pute the Ja­co­bian of this neu­ral net­work with re­spect to the in­put, at some in­put in­stance, we would end up get­ting a good idea of how the neu­ral net­work changes within the neigh­bor­hood of that par­tic­u­lar in­put.

One way we can gain in­sight from a Ja­co­bian is by com­put­ing its de­ter­mi­nant. Re­call, a de­ter­mi­nant is a func­tion from square ma­tri­ces to scalars which is defined re­cur­sively as the al­ter­nat­ing sum and sub­trac­tion of de­ter­mi­nants of the minors of the ma­trix mul­ti­plied by el­e­ments in the top row. On sec­ond thought, don’t re­call that defi­ni­tion of de­ter­mi­nant; that’s not go­ing to get you any­where. De­spite the de­ter­mi­nant’s opaque defi­ni­tion, we can gain deeper in­sight into what the de­ter­mi­nant rep­re­sents by in­stead view­ing it ge­o­met­ri­cally. In a few words, the de­ter­mi­nant com­putes the scal­ing fac­tor for a given lin­ear trans­for­ma­tion of a ma­trix.

Above, I have pul­led from Wiki­me­dia a par­allelepiped, which was formed from some lin­ear map­ping of a cube. The vol­ume of this par­allelepiped is some mul­ti­ple of the vol­ume of the cube be­fore the trans­for­ma­tion. It turns out that no mat­ter which re­gion of space we look at, this lin­ear trans­for­ma­tion will gen­er­ate the same ra­tio from post-trans­formed re­gions to pre-trans­formed re­gions. This ra­tio is given by the de­ter­mi­nant of the ma­trix rep­re­sent­ing the lin­ear map­ping. In other words, the de­ter­mi­nant tells us how much some trans­for­ma­tion is ex­pand­ing or con­tract­ing space.

What this means is for the Ja­co­bian is that the de­ter­mi­nant tells us how much space is be­ing squished or ex­panded in the neigh­bor­hood around a point. If the out­put space is be­ing ex­panded a lot at some in­put point, then this means that the neu­ral net­work is a bit un­sta­ble at that re­gion, since minor al­ter­a­tions in the in­put could cause huge dis­tor­tions in the out­put. By con­trast, if the de­ter­mi­nant is small, then some small change to the in­put will hardly make a differ­ence to the out­put.

This very fact about the Ja­co­bian is be­hind a re­cent de­vel­op­ment in the reg­u­lariza­tion of deep neu­ral net­works. The idea is that we could use this in­ter­pre­ta­tion of the Ja­co­bian as a mea­sure of ro­bust­ness to in­put-per­tur­ba­tions around a point to make neu­ral net­works more ro­bust off their train­ing dis­tri­bu­tion. Tra­di­tional ap­proaches like reg­u­lariza­tion have em­pha­sized the idea of keep­ing some pa­ram­e­ters of the neu­ral net­work from wan­der­ing off into ex­treme re­gions. The idea here is that smaller pa­ram­e­ters are more likely a pri­ori, which mo­ti­vates the con­struc­tion of some type of penalty on pa­ram­e­ters that are too large.

In con­trast to reg­u­lariza­tion, the con­cep­tual fram­ing of Ja­co­bian reg­u­lariza­tion comes from a differ­ent place. In­stead of hold­ing a leash on some pa­ram­e­ters, to keep them from wan­der­ing off into the abyss, Ja­co­bian reg­u­lariza­tion em­pha­sizes pro­vid­ing ro­bust­ness to small changes in the in­put space. The mo­ti­va­tion be­hind this ap­proach is clear to any­one who has been pay­ing at­ten­tion to ad­ver­sar­ial ex­am­ples over the last few years. To ex­plain, ad­ver­sar­ial ex­am­ples are cases where we provide in­stances of a neu­ral net­work where it performs very poorly, even if it had ini­tially done well on a non-ad­ver­sar­ial test set. Con­sider this ex­am­ple, pro­vided by OpenAI.

The first image was cor­rectly iden­ti­fied as a panda by the neu­ral net­work. How­ever, when we added a tiny bit of noise to the image, the neu­ral net­work spit out garbage, con­fi­dently clas­sify­ing a nearly ex­act copy as a gib­bon. One could imag­ine a hy­po­thet­i­cal ad­ver­sary us­ing this ex­ploit to defeat neu­ral net­work sys­tems in prac­tice. In the con­text of AI safety, ad­ver­sar­ial at­tacks con­sti­tutes a po­ten­tially im­por­tant sub­prob­lem of sys­tem re­li­a­bil­ity.

In Ja­co­bian reg­u­lariza­tion, we ap­proach this is­sue by putting a penalty on the size of the en­tries in the Ja­co­bian ma­trix. The idea is sim­ple: the smaller the val­ues of the ma­trix, the less that tiny per­tur­ba­tions in in­put-space will af­fect the out­put. Con­cretely, the reg­u­larizer is de­scribed by tak­ing the frobe­nius norm of the Ja­co­bian ma­trix, . The frobe­nius norm is noth­ing com­pli­cated, and is re­ally just a way of de­scribing that we square all of the el­e­ments in the ma­trix, take the sum, and then take the square root of this sum. Put an­other way, if we imag­ine con­cate­nat­ing all the gra­di­ent vec­tors which com­pose the Ja­co­bian, the frobe­nius norm is just de­scribing the penalty of this con­cate­nated vec­tor.

Im­por­tantly, this tech­nique is sub­tly differ­ent from tak­ing the norm over the pa­ram­e­ters. In the case of a ma­chine learn­ing al­gorithm with no lin­ear­ity, this penalty does how­ever re­duce to reg­u­lariza­tion. Why? Be­cause when we take the Ja­co­bian of a purely af­fine func­tion, we ob­tain the global in­for­ma­tion about how the func­tion stretches and ro­tates space, ex­clud­ing the trans­la­tion offset. This global in­for­ma­tion pre­cisely com­poses the pa­ram­e­ters that we would be pe­nal­iz­ing. It is the­o­ret­i­cally similar to how if we take the deriva­tive of a line, we can re­con­struct the line from the deriva­tive and a bias term.

If while read­ing the last few para­graphs, you start­ing think­ing how is this just now be­ing dis­cov­ered? you share my thoughts ex­actly. As far as I can tell, the seeds of Ja­co­bian reg­u­lariza­tion have ex­isted since at least the 1990s. How­ever, it took un­til 2016 for a team to cre­ate a full im­ple­men­ta­tion. Only re­cently, as I write this in Au­gust 2019, has a team of re­searchers claimed to have dis­cov­ered an effi­cient al­gorithm for ap­ply­ing this reg­u­lariza­tion penalty to neu­ral net­works.

The way that re­searchers cre­ated this new method is by us­ing ran­dom pro­jec­tions to ap­prox­i­mate the Frobe­nius norm. Whereas the prior ap­proach men­tioned ran­dom pro­jec­tions, it was never put into prac­tice. The new pa­per suc­ceeded by de­vis­ing an al­gorithm to ap­prox­i­mate Ja­co­bi­ans effi­ciently with min­i­mal over­head cost.

How effi­ciently? The pa­per states that there is

only a neg­ligible differ­ence in model solu­tion qual­ity be­tween train­ing with the ex­act com­pu­ta­tion of the Ja­co­bian as com­pared to train­ing with the ap­prox­i­mate al­gorithm, even when us­ing a sin­gle ran­dom pro­jec­tion.

If this tech­nique re­ally works as its de­scribed, this is a sig­nifi­cant re­sult. The pa­per claims that by ap­ply­ing Ja­co­bian reg­u­lariza­tion, train­ing uses only 30 per­cent more com­pu­ta­tion com­pared to tra­di­tional stochas­tic gra­di­ent de­scent with­out reg­u­lariza­tion at all. And for all that, we get some nice benefits: the sys­tem was sig­nifi­cantly more ro­bust to a PGD at­tack, and it was ap­par­ently much bet­ter than vanilla reg­u­lariza­tion due to the dis­tance be­tween de­ci­sion cells in the out­put space.

I recom­mend look­ing at the pa­per for more de­tails.