A Primer on Matrix Calculus, Part 1: Basic review

Consider whether this story applies to you. You went through college and made it past linear algebra and multivariable calculus, and then began your training for deep learning. To your surprise, much of what they taught you in the previous courses is not very useful to the current subject matter.

And this is fine. Mathematics is useful in its own right. You can expect that a lot of it isn't going to show up on the deep learning final, but it's still quite useful for understanding higher mathematics.

However, what isn't fine is that a lot of important material that you do need to know was omitted. In particular, the deep learning course requires you to know matrix calculus, a specialized form of writing multivariable calculus (mostly differential calculus). So now you slog through the notation, getting confused, and learning only as much as you need in order to do the backpropagation on the final exam.

This is not how things should work!

Matrix calculus can be beautiful in its own right. I'm here to find the beauty for myself. If I can find it beautiful, then perhaps I will find new joy in reading those machine learning papers. And maybe you will too.

Therefore, I dedicate the next few posts in this sequence to covering this paper, which boldly states that it is "an attempt to explain all the matrix calculus you need in order to understand the training of deep neural networks." All the matrix calculus we need? Perhaps it's enough to understand training neural networks, but it isn't enough matrix calculus for deep learning more generally — I just Ctrl-F'd and found no instance of "Hessian"!

Since it's clearly not the full picture, I will supplement my posts with material from chapter 4 of the Deep Learning Book, and Wikipedia.

This subsequence of my daily insights sequence will contain three parts. The first part is this post, the introduction. For the posts in the sequence I have outlined the following rubric:

Part 1 (this one) will review some multivariable calculus and will introduce the matrix calculus notation.

Part 2 will cover Jacobians, derivatives of element-wise binary operators, derivatives involving scalar expansions, vector sum reduction, and some common derivatives encountered in deep learning.

Part 3 will cover the Hessian matrix, higher order derivatives and Taylor approximations, and we will step through an example of applying the chain rule in a neural network.

First, what's important to understand is that most of the calculus used in deep learning is not much more advanced than what is usually taught in a first course in calculus. For instance, there is rarely any need for understanding integrals.

On the other hand, even though the mathematics itself is not complicated, it takes the form of specialized notation, enabling us to write calculus using large vectors and matrices, in contrast to a single variable approach.

Given this, we might as well start from somewhat of a beginning, with limits, and then build up to derivatives. Intuitively, a limit is a way of filling in the gaps of certain functions by finding what value is "approached" when we evaluate a function in a certain direction. The formal definition of a limit is given by the epsilon-delta definition, which was provided approximately 150 years after the limit was first introduced.

Let $f$ be a real valued function on a subset $E$ of the real numbers. Let $c$ be a limit point of $E$ and let $L$ be a real number. We say that $\lim_{x \to c} f(x) = L$ if for every $\varepsilon > 0$ there exists a $\delta > 0$ such that, for all $x \in E$, if $0 < |x - c| < \delta$ then $|f(x) - L| < \varepsilon$. For an intuitive explanation of this definition, see this video from 3Blue1Brown.
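To make the "approached" idea concrete, here is a small numerical sketch (not a proof, and my own illustrative choice of function): $\sin(x)/x$ is undefined at $x = 0$, yet its values approach $1$ as $x$ approaches $0$.

```python
import math

def f(x):
    # sin(x)/x is undefined at x = 0, but its limit there exists
    return math.sin(x) / x

# Evaluate f at points approaching 0 from the right
approaches = [f(10.0 ** -k) for k in range(1, 6)]
limit_estimate = approaches[-1]  # should be very close to 1
```

Plugging in ever-smaller inputs is exactly the informal "evaluate in a certain direction" picture; the epsilon-delta definition is what makes that picture rigorous.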

This formal definition is generally considered too cumbersome to apply every time to elementary functions. Therefore, introductory calculus courses generally teach a few rules which allow students to quickly evaluate the limits of familiar functions.

I will not attempt to list all of the limit rules and tricks, since that would be outside of the scope of this single blog post. That said, this resource provides much more information than what would be typically necessary for succeeding in a machine learning course.

The derivative is defined on real valued functions by the following definition. Let $f$ be a real valued function; then the derivative of $f$ at $x$ is written $$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}.$$ The intuitive notion of a derivative is that it measures the slope of a function at a point $x$. Since slope is traditionally defined as the rate of change between two points, it may at first appear absurd to a beginner that we can measure slope at a single point. But this apparent absurdity can be visually resolved by viewing the following GIF.
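As a sketch of the limit definition in action, we can approximate the derivative by using a small but finite $h$ (a central difference; the helper name and the example function are my own illustrative choices):

```python
def derivative(f, x, h=1e-6):
    # Central difference: a finite-h stand-in for the limit definition
    return (f(x + h) - f(x - h)) / (2 * h)

# Example: the derivative of x**2 at x = 3 is exactly 6
slope = derivative(lambda x: x ** 2, 3.0)
```

Shrinking $h$ corresponds to sliding the second point of the traditional two-point slope toward the first, which is precisely what the GIF animates.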

Just as in the case of limits, we are usually not interested in applying the formal definition to functions except when pressed. Instead we have a list of common derivative rules which can help us simplify derivatives of common expressions. Here is a resource which lists common differentiation rules.

In machine learning, we are most commonly presented with functions whose domain is multi-dimensional. That is, instead of taking the derivative of a function $f(x)$ where $x$ is a real valued variable, we are interested in taking the derivative of a function $f(\mathbf{x})$ where $\mathbf{x}$ is a vector, which is intuitively something that can be written as an ordered list of numbers.

To see why we work with vectors, consider that we are usually interested in finding a local minimum of the loss function of a neural network (a measure of how badly the neural network is performing), where the parameters of the neural network are written as an ordered list of numbers. During training, we can write the loss function as a function of the network's weights and biases. In general, all of deep learning can be reduced, notation-wise, to simple multidimensional functions and compositions of those simple functions.

Therefore, in order to understand deep learning, we must understand the multidimensional generalization of the derivative: the gradient. First, however, in order to construct the gradient, we must briefly consider the notion of partial derivatives. A quick aside: I have occasionally observed that some people seem at first confused by partial derivatives, imagining them to be some type of fractional notion of calculus.

Do not be alarmed. As long as you understand what a derivative is in one dimension, partial derivatives should be a piece of cake. A partial derivative is simply a derivative of a function with respect to a particular variable, with all the other variables held constant. To visualize this, consider the following multidimensional function that I have ripped from Wikimedia.

Here, we are taking the partial derivative of the function with respect to $x$, where $y$ is held constant and the $z$ axis represents the codomain of the function (i.e. the axis that is being mapped to). I think about partial derivatives in the same way as the image above: by imagining taking a slice of the function in the x-direction, thereby reducing the function to one dimension and allowing us to take a derivative. Symbolically, we can evaluate a function's partial derivative like this: say $f(x, y) = 3x^2 y$. If we want to take the derivative of $f$ with respect to $x$, we can treat the $y$'s in that expression as constants and write $\frac{\partial f}{\partial x} = 6xy$. Here, the symbol $\partial$ is a special symbol indicating that we are taking the partial derivative rather than the total derivative.
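To illustrate the "hold the other variables constant" idea numerically, here is a sketch using $f(x, y) = 3x^2 y$, whose partial derivative with respect to $x$ is $6xy$ (the function, the evaluation point, and the helper name are all illustrative choices of mine):

```python
def f(x, y):
    # Illustrative function; analytically, the x-partial is 6*x*y
    return 3 * x ** 2 * y

def partial_x(f, x, y, h=1e-6):
    # Vary x only; y stays frozen at its given value
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

# At (x, y) = (2, 5), the analytic answer 6*x*y is 60
p = partial_x(f, 2.0, 5.0)
```

Note that the helper is just the one-dimensional derivative sketch applied along a single axis, which is the "slice of the function" picture in code form.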

The gradient is simply the column vector of partial derivatives of a function. In the previous example, we would have the gradient $\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix}$. The notation I will use here is that the gradient of a function $f$ is written $\nabla f$. The gradient is important because it generalizes the concept of slope to higher dimensions. Whereas the single variable derivative provided us the slope of the function at a single point, the gradient provides us a vector which points in the direction of greatest ascent at a point and whose magnitude is equal to the rate of increase in that direction. Also, just as a derivative allows us to construct a local linear approximation of a function about a point, the gradient allows us to construct a linear approximation of a multivariate function about a point in the form of a hyperplane. From this notion of a gradient, we can "descend" down a loss function by repeatedly subtracting the gradient starting at some point, and in the process find neural networks which are better at doing their assigned tasks.
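The "repeatedly subtract the gradient" idea can be sketched in a few lines. The toy loss, starting point, learning rate, and step count below are arbitrary illustrative choices, not anything prescribed by the paper:

```python
def numerical_gradient(f, v, h=1e-6):
    # The gradient: collect the partial derivative along each coordinate
    grads = []
    for i in range(len(v)):
        vp, vm = list(v), list(v)
        vp[i] += h
        vm[i] -= h
        grads.append((f(vp) - f(vm)) / (2 * h))
    return grads

def loss(v):
    # Toy "loss" with its minimum at (1, -2)
    return (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2

# Gradient descent: step against the gradient from an arbitrary start
point = [0.0, 0.0]
learning_rate = 0.1
for _ in range(200):
    g = numerical_gradient(loss, point)
    point = [p - learning_rate * gi for p, gi in zip(point, g)]
```

After the loop, `point` sits very near the minimum at $(1, -2)$; real deep learning frameworks replace the numerical gradient with backpropagation, but the descent loop has this same shape.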

In deep learning we are often asked to take the gradient of a function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ (this notation is just saying that we are mapping from a space of matrices to the real number line). This may occur because the function in question has its parameters organized in the form of an $m$ by $n$ matrix, representing for instance the strength of the connection from neuron $i$ to neuron $j$. In this case, we treat the gradient exactly as we did before, by collecting all of the partial derivatives. There is no difference, except in notation.

Some calculus students are not well acquainted with the proof of why the gradient points in the direction of greatest ascent. Since the gradient is simply a list of partial derivatives, this fact may seem surprising. Nonetheless, this fact is what makes the gradient centrally important in deep learning, so it is worth repeating here.

In order to see why the gradient points in the direction of steepest ascent, we first need a way of measuring the ascent in a particular direction. It turns out that multivariable calculus offers us such a tool. The directional derivative is the rate of change of a function along a direction $\mathbf{v}$. We can imagine the directional derivative as being conceptually similar to the partial derivative, except we would first change the basis in which we represent the function, and then evaluate the partial derivative with respect to a basis vector which is on the span of $\mathbf{v}$. Similar to the definition of the derivative, we define the directional derivative as $$\nabla_{\mathbf{v}} f(\mathbf{x}) = \lim_{h \to 0} \frac{f(\mathbf{x} + h\mathbf{v}) - f(\mathbf{x})}{h}.$$

Additionally, we can employ the multivariable chain rule to re-write the directional derivative in a way that uses the gradient. In single variable calculus, the chain rule can be written as $\frac{d}{dx} f(g(x)) = f'(g(x)) \, g'(x)$. In the multivariable case, for a function $f(x_1(t), \ldots, x_n(t))$, we write $\frac{df}{dt} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{dx_i}{dt}$, where $\frac{\partial f}{\partial x_i}$ is the partial derivative of $f$ with respect to its $i$th argument. This can be simplified by employing the following notation, which uses a dot product: $\frac{df}{dt} = \nabla f \cdot \frac{d\mathbf{x}}{dt}$.
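Here is a quick numerical check of the multivariable chain rule, with illustrative choices $f(x, y) = xy$, $x(t) = \cos t$, and $y(t) = t^2$ (both the functions and the test point are my own):

```python
import math

def composed(t):
    # f(x, y) = x*y with x(t) = cos(t) and y(t) = t**2
    return math.cos(t) * t ** 2

t = 0.5
# Chain rule: df/dt = (df/dx)*(dx/dt) + (df/dy)*(dy/dt)
#                   = y * (-sin(t)) + x * (2*t)
predicted = (t ** 2) * (-math.sin(t)) + math.cos(t) * (2 * t)

# Direct numerical derivative of the composition for comparison
h = 1e-6
numeric = (composed(t + h) - composed(t - h)) / (2 * h)
```

The two numbers agree to several decimal places, which is the sum-of-partials formula doing its job.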

If we rewrite the definition of the directional derivative as $\nabla_{\mathbf{v}} f(\mathbf{x}) = \frac{d}{dh} f(\mathbf{x} + h\mathbf{v}) \Big|_{h=0}$, and then apply the multivariate chain rule to this new formulation, we find that $\nabla_{\mathbf{v}} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{v}$.

Given that $\nabla_{\mathbf{v}} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{v}$, the unit vector which maximizes this dot product is the unit vector which points in the same direction as $\nabla f(\mathbf{x})$. This previous fact can be proven by a simple inspection of the definition of the dot product between two vectors, which is that $\mathbf{a} \cdot \mathbf{b} = \lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert \cos \theta$, where $\theta$ is the angle between the two vectors. $\cos \theta$ is maximized when $\theta = 0$. For more intuition on how to derive the dot product, I recommend this video from 3Blue1Brown. I also recommend this one for intuitions on why the gradient points in the direction of maximum increase.
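We can also check numerically that the directional derivative equals $\nabla f \cdot \mathbf{v}$ and is largest along the gradient direction (the function, the analytic gradient, and the evaluation point below are illustrative choices of mine):

```python
import math

def f(v):
    x, y = v
    return x ** 2 + 3 * x * y

def grad_f(v):
    # Analytic gradient of the illustrative f above
    x, y = v
    return [2 * x + 3 * y, 3 * x]

def directional(f, p, u, h=1e-6):
    # Rate of change of f at p along the unit vector u
    fp = f([p[0] + h * u[0], p[1] + h * u[1]])
    fm = f([p[0] - h * u[0], p[1] - h * u[1]])
    return (fp - fm) / (2 * h)

p = [1.0, 2.0]
g = grad_f(p)                        # [8, 3]
norm = math.hypot(g[0], g[1])
u_grad = [g[0] / norm, g[1] / norm]  # unit vector along the gradient

d_along_grad = directional(f, p, u_grad)     # equals ||grad f|| = sqrt(73)
d_along_e2 = directional(f, p, [0.0, 1.0])   # equals grad f . e2 = 3
```

The ascent along the gradient direction comes out to the gradient's magnitude, and every other unit direction (here, the $y$ axis) gives something smaller, exactly as the $\cos \theta$ argument predicts.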

For more depth I recommend part four of this pdf text. For even more depth I recommend this book (though I have not read it). For even more depth, see the footnote below. For even more depth than that, perhaps just try to complete a four year degree in mathematics.

Even this 1962 page behemoth called a book, intended to introduce all of the mathematics needed for a computer science education, includes very little information on integration, despite devoting full chapters to topics like tensor algebras and topology. However, if or when I blog about probability theory, integration will become relevant again.

If this wasn't enough for you, you can alternatively view the 3Blue1Brown video on derivatives.

Interestingly, fractional calculus is indeed a real thing, and is very cool.

For a proof which does not use the multivariable chain rule, see here. I figured that given the primacy of the chain rule in deep learning, it is worth mentioning now.