# Why Gradients Vanish and Explode

Epistemic status: Confused, but trying to explain a concept that I previously thought I understood. I suspect much of what I wrote below is false.

Without taking proper care of a very deep neural network, gradients tend to suddenly become quite large or quite small. If the gradient is too large, then the network parameters will be thrown completely off, possibly causing them to become NaN. If it is too small, then the network will stop training entirely. This problem is called the vanishing and exploding gradients problem.

When I first learned about the vanishing gradients problem, I ended up getting a vague sense of why it occurs. In my head I visualized the sigmoid function.

I then imagined this being applied element-wise to an affine transformation. If we just look at one element, then we can imagine it being the result of a dot product of some parameters, with that number being plugged in on the x-axis. On the far left and on the far right, the derivative of this function is very small. This means that if we take the partial derivative with respect to some parameter, it will end up being extremely (perhaps vanishingly) small.
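To make that picture concrete, here is a quick sketch in plain Python (no framework assumed) of how small the sigmoid's derivative gets away from zero:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    # d(sigmoid)/dx = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_deriv(0.0))   # 0.25, the maximum possible value
print(sigmoid_deriv(5.0))   # ~0.0066
print(sigmoid_deriv(10.0))  # ~0.000045
```

Even at its peak, each sigmoid contributes a factor of at most 1/4 to the chain rule, and far less in the saturated tails.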

Now, I know the way that I was visualizing this was very wrong. There are a few mistakes I made:

1. This picture doesn’t tell me anything about why the gradient “vanishes.” It’s just showing me a picture of where the gradients get small. Gradients also get small when they reach a local minimum. Does this mean that vanishing gradients are sometimes good?

2. I knew that gradient vanishing had something to do with the depth of a network, but I didn’t see how the network being deep affected why the gradients got small. I had a rudimentary sense that each layer of sigmoid compounds the problem until there’s no gradient left, but this was never presented to me in a precise way, so I just ignored it.

I now think I understand the problem a bit better, but maybe not a whole lot better.

(Note: I have gathered evidence that the vanishing gradient problem is not linked to sigmoids and put it in this comment. I will be glad to see evidence which proves I’m wrong on this one, but I currently believe this is evidence that machine learning professors are teaching it incorrectly.)

First, the basics. Rather than describing the problem in full generality, I’ll walk through a brief example. In particular, I’ll show how we can imagine a forward pass in a simple recurrent neural network that enables a feedback effect to occur. We can then immediately see how gradient vanishing can become a problem within this framework (no sigmoids necessary).

Imagine that there is some sequence of vectors $h_t$ defined via the following recursive definition,

$$h_t = W h_{t-1}.$$

This sequence of vectors can be identified as the sequence of hidden states of the network. Let $W$ admit an orthogonal eigendecomposition $W = Q \Lambda Q^\top$. We can then represent this repeated application of the weight matrix as

$$h_t = W^t h_0 = Q \Lambda^t Q^\top h_0,$$

where $\Lambda$ is a diagonal matrix containing the eigenvalues of $W$, and $Q$ is an orthogonal matrix. If we consider the eigenvalues, which are the diagonal entries of $\Lambda$, we can tell that the ones with magnitude less than one will decay exponentially towards zero, and the ones with magnitude greater than one will blow up exponentially towards infinity as $t$ grows.

Since $Q$ is orthogonal, the transformation $Q^\top h_0$ can be thought of as a rotation of the vector $h_0$, where each coordinate of the transformed vector is the projection of $h_0$ onto an eigenvector of $W$. Therefore, when $t$ is very large, as in the case of an unrolled recurrent network, this matrix calculation will end up getting dominated by the components of $h_0$ that point in the same direction as the exploding eigenvectors.
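A small numerical sketch of this behavior (the matrix is a hypothetical 2×2 example, built directly from an explicit eigendecomposition with one decaying and one exploding eigenvalue):

```python
import numpy as np

theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # orthogonal (a rotation)
Lam = np.diag([0.5, 1.5])                        # one decaying, one exploding eigenvalue
W = Q @ Lam @ Q.T

h0 = np.ones(2)
for t in [1, 10, 50]:
    ht = np.linalg.matrix_power(W, t) @ h0       # W^t h0 = Q Lam^t Q^T h0
    print(t, np.linalg.norm(ht))
# By t = 50 the 0.5-eigendirection has decayed to roughly machine zero while
# the 1.5-eigendirection has blown up, so h_t points almost exactly along the
# exploding eigenvector.
```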

This is a problem because if an input vector ends up pointing in the direction of one of these eigenvectors, the loss function may be very high. It turns out that in these regions, stochastic gradient descent may massively overshoot. If SGD overshoots, then we end up reversing all of the progress we had previously made towards descending to a local minimum.

As Goodfellow et al. note, this error is relatively easy to avoid in the case of non-recurrent neural networks, because in that case the weights aren’t shared between layers. However, in the case of vanilla recurrent neural networks, this problem is almost unavoidable. Bengio et al. showed that for a simple network of even a depth of 10, this problem will show up with near certainty.

One way to help the problem is by simply clipping the gradients so that they can’t reverse all of the descent progress so far. This helps the symptom of exploding gradients, but doesn’t fix the problem entirely, since the issue with blown-up or vanishing eigenvalues remains.
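Norm-based gradient clipping can be sketched in a few lines (a minimal illustration, not any particular library’s implementation):

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """If grad's norm exceeds max_norm, rescale it to have norm max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])             # an "exploded" gradient with norm 50
print(clip_by_norm(g, max_norm=5.0))   # rescaled to norm 5, direction preserved
```

The clipped update keeps the gradient’s direction but bounds the step size, which is why it treats the symptom (huge steps) without changing the underlying eigenvalue problem.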

Therefore, in order to fix this problem, we need to fundamentally re-design the way that the gradients are backpropagated through time, motivating echo state networks, leaky units, skip connections, and LSTMs. I plan to one day go into all of these, but I first need to build up my skills in matrix calculus, which are currently quite poor.

Therefore, I intend to make the next post (and maybe a few more) about matrix calculus. Then perhaps I can revisit this topic and gain a deeper understanding.

This may be an idiosyncratic error of mine. See page 105 in these lecture notes to see where I first saw the problem of vanishing gradients described.

See section 10.7 in the Deep Learning Book for a fuller discussion of vanishing and exploding gradients.

• Yay for learning matrix calculus! I’m eager to read and learn. Personally I’ve done very well in the class where we learned it, but I’d say I didn’t get it at a deep / useful level.

• Great! I’ll do my best to keep the post as informative as possible, and I’ll try to get into it on a deep level.

• If you’re looking to improve your matrix calculus skills, I specifically recommend practicing tensor index notation and the Einstein summation convention. It will make neural networks much more pleasant, especially recurrent nets. (This may have been obvious already, but it’s sometimes tough to tell what’s useful when learning a subject.)

• I think the problem with vanishing gradients is usually linked to repeated applications of the sigmoid activation function. The gradient in backpropagation is calculated from the chain rule, where each factor $d\sigma/dz$ in the “chain” will always be less than one (at most 1/4), and close to zero for large or small inputs. So for feed-forward networks, the problem is a little different from recurrent networks, which you describe.
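The compounding this comment describes can be sketched directly (the pre-activation values below are arbitrary illustrative choices):

```python
import math

def dsigma(z):
    # Derivative of the sigmoid at z; always in (0, 1/4].
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Chain-rule product of d(sigma)/dz factors through a hypothetical
# depth-20 chain of sigmoid units.
zs = [0.5, -1.0, 2.0, 0.0, -0.5] * 4
grad = 1.0
for z in zs:
    grad *= dsigma(z)
print(grad)  # tiny: each factor is at most 0.25
```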

The usual mitigation is to use ReLU activations, L2 regularization, and/or batch normalization.

A minor point: the gradient doesn’t necessarily tend towards zero as you get closer to a local minimum; that depends on the higher-order derivatives. Imagine a local minimum at the bottom of a funnel or spike, for instance, or a very spiky fractal-like landscape. On the other hand, a local minimum in a region with a small gradient is a desirable property, since it means small perturbations in the input data don’t change the output much. But this point will be difficult to reach, since learning depends on the gradient...

(Thanks for the interesting analysis, I’m happy to discuss this but probably won’t drop by regularly to check comments; feel free to email me at ketil at malde point org)

• I think the problem with vanishing gradients is usually linked to repeated applications of the sigmoid activation function.

That’s what I used to think too. :)

If you look at the post above, I even linked to the reason why I thought that. In particular, vanishing gradients was taught as intrinsically related to the sigmoid function on page 105 in these lecture notes, which is where I initially learned about the problem.

However, I no longer think gradient vanishing is fundamentally linked to sigmoid or tanh activations.

I think that there is probably some confusion in terminology, and some people use the words differently than others. If we look in the Deep Learning Book, there are two sections that talk about the problem, namely section 8.2.5 and section 10.7, neither of which brings up sigmoids as being related (though they do bring up deep weight-sharing networks). Goodfellow et al. cite Sepp Hochreiter’s 1991 thesis as the original document describing the issue, but unfortunately it’s in German so I cannot comment as to whether it links the issue to sigmoids.

Currently, when I Ctrl-F “sigmoid” on the Wikipedia page for vanishing gradients, there are no mentions. There is a single subheader which states, “Rectifiers such as ReLU suffer less from the vanishing gradient problem, because they only saturate in one direction.” However, the citation for this statement comes from this paper, which mentions vanishing gradients only once and explicitly states,

We can see the model as an exponential number of linear models that share parameters (Nair and Hinton, 2010). Because of this linearity, gradients flow well on the active paths of neurons (there is no gradient vanishing effect due to activation non-linearities of sigmoid or tanh units)

(Note: I misread the quote above; I’m still confused.)

I think this is quite strong evidence that I was not taught the correct usage of vanishing gradients.

The usual mitigation is to use ReLU activations, L2 regularization, and/or batch normalization.

Interesting you say that. I actually wrote a post on rethinking batch normalization, and I no longer think it’s justified to say that batch normalization simply mitigates vanishing gradients. The exact way that batch normalization works is a bit different, and it would be inaccurate to describe it as an explicit strategy to reduce vanishing gradients (although it may help; funnily enough, the original batch normalization paper says that with batchnorm they were able to train with sigmoids more easily).

A minor point: the gradient doesn’t necessarily tend towards zero as you get closer to a local minimum; that depends on the higher-order derivatives.

True. I had a sort of smooth loss function in my head.

• I think this is quite strong evidence that I was not taught the correct usage of vanishing gradients.

I’m very confused. The way I’m reading the quote you provided, it says ReLU works better because it doesn’t have the gradient vanishing effect that sigmoid and tanh have.

• Interesting. I just re-read it and you are completely right. Well, I wonder how that interacts with what I said above.

• That proof of the instability of RNNs is very nice.

The version of the vanishing gradient problem I learned is simply that if you’re updating weights proportional to the gradient, then if your average weight somehow ends up as 0.98, as you increase the number of layers your gradient, and therefore your update size, will shrink kind of like (0.98)^n, which is not the behavior you want it to have.
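That shrinkage is easy to check (0.98 is just the hypothetical average factor from the comment above):

```python
factor = 0.98  # hypothetical per-layer multiplier on the gradient
for n in [10, 100, 1000]:
    # ~0.82 at n=10, ~0.13 at n=100, ~1.7e-9 at n=1000:
    # the update size decays exponentially with depth.
    print(n, factor ** n)
```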

• That proof of the instability of RNNs is very nice.

Great, thanks. It is adapted from Goodfellow et al.’s discussion of the topic, which I cite in the post.

The version of the vanishing gradient problem I learned is simply that if you’re updating weights proportional to the gradient, then if your average weight somehow ends up as 0.98, as you increase the number of layers your gradient, and therefore your update size, will shrink kind of like (0.98)^n, which is not the behavior you want it to have.

That makes sense. However, Goodfellow et al. argue that this isn’t a big issue for non-RNNs. Their discussion is a bit confusing to me, so I’ll just leave it below,

This problem is particular to recurrent networks. In the scalar case, imagine multiplying a weight $w$ by itself many times. The product will either vanish or explode depending on the magnitude of $w$. However, if we make a non-recurrent network that has a different weight $w^{(t)}$ at each time step, the situation is different. If the initial state is given by 1, then the state at time $t$ is given by $\prod_t w^{(t)}$. Suppose that the $w^{(t)}$ values are generated randomly, independently from one another, with zero mean and variance $v$. The variance of the product is $O(v^t)$. To obtain some desired variance $v^\ast$ we may choose the individual weights with variance $v = (v^\ast)^{1/t}$. Very deep feedforward networks with carefully chosen scaling can thus avoid the vanishing and exploding gradient problem, as argued by Sussillo (2014).
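The variance argument in this quote can be checked numerically (the particular depth and variances below are illustrative assumptions, not numbers from the book):

```python
# For independent zero-mean weights, E[(w1*...*wn)**2] = E[w1**2]*...*E[wn**2],
# so the variance of a depth-n product of variance-v weights is v**n.
n = 50
v_naive = 0.5                  # fixed per-weight variance: the product vanishes
v_star = 0.25                  # desired variance of the depth-n product
v_scaled = v_star ** (1 / n)   # per-weight variance chosen as (v*)**(1/n)

print(v_naive ** n)    # ~8.9e-16: vanishes with depth
print(v_scaled ** n)   # 0.25: the desired variance, regardless of depth
```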