Why Gradients Vanish and Explode

Epistemic status: Confused, but trying to explain a concept that I previously thought I understood. I suspect much of what I wrote below is false.

Without proper care, the gradients in a very deep neural network tend to become either quite large or quite small. If the gradient is too large, the network parameters will be thrown completely off, possibly causing them to become NaN. If it is too small, the network will stop training entirely. This is called the vanishing and exploding gradients problem.
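To make this concrete, here is a minimal sketch (my own illustration, not from any particular source) that backpropagates through a chain of scalar sigmoid layers and watches the gradient shrink. The depth, weights, and starting input are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
depth = 50
weights = rng.normal(size=depth)  # one scalar weight per layer

# Forward pass: h_{k+1} = sigmoid(w_k * h_k)
h = 0.5
activations = [h]
for w in weights:
    h = sigmoid(w * h)
    activations.append(h)

# Backward pass: the chain rule multiplies a factor sigmoid'(z) * w
# per layer; |sigmoid'| <= 0.25, so the product tends to shrink fast.
grad = 1.0
for w, h_in in zip(reversed(weights), reversed(activations[:-1])):
    z = w * h_in
    s = sigmoid(z)
    grad *= s * (1.0 - s) * w

print(abs(grad))  # vanishingly small at depth 50
```

At 50 layers the gradient is already far below machine-meaningful scale, which matches the qualitative story above.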

When I first learned about the vanishing gradients problem, I ended up getting a vague sense of why it occurs. In my head I visualized the sigmoid function.

I then imagined this function being applied element-wise to an affine transformation. If we look at just one element, we can imagine it as the result of a dot product with some parameters, with that number plugged in on the x-axis. On the far left and on the far right, the derivative of this function is very small. This means that if we take the partial derivative with respect to some parameter, it will end up being extremely (perhaps vanishingly) small.
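A quick sketch of that picture: the sigmoid's derivative is \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\), which peaks at 0.25 at the origin and decays towards zero in both tails.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_deriv(0.0))   # 0.25, the maximum
print(sigmoid_deriv(10.0))  # ~4.5e-5, nearly flat far from the origin
```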

Now, I know that the way I was visualizing this was very wrong. There are a few mistakes I made:

1. This picture doesn't tell me anything about why the gradient "vanishes." It's just showing me a picture of where the gradients get small. Gradients also get small when they reach a local minimum. Does this mean that vanishing gradients are sometimes good?

2. I knew that gradient vanishing had something to do with the depth of a network, but I didn't see how the network being deep affected why the gradients got small. I had a rudimentary sense that each layer of sigmoid compounds the problem until there's no gradient left, but this was never presented to me in a precise way, so I just ignored it.

I now think I understand the problem a bit better, but maybe not a whole lot better.

(Note: I have gathered evidence that the vanishing gradient problem is not linked to sigmoids and put it in this comment. I will be glad to see evidence which proves I'm wrong on this one, but I currently believe this is evidence that machine learning professors are teaching it incorrectly.)

First, the basics. Rather than describing the problem in full generality, I'll walk through a brief example. In particular, I'll show how we can imagine a forward pass in a simple recurrent neural network that enables a feedback effect to occur. We can then immediately see how gradient vanishing becomes a problem within this framework (no sigmoids necessary).

Imagine that there is some sequence of vectors defined via the following recursive definition,

$$h_t = W h_{t-1}.$$

This sequence of vectors can be identified as the sequence of hidden states of the network. Let $W$ admit an orthogonal eigendecomposition, $W = Q \Lambda Q^\top$. We can then represent this repeated application of the weight matrix as

$$h_t = W^t h_0 = Q \Lambda^t Q^\top h_0,$$

where $\Lambda$ is a diagonal matrix containing the eigenvalues of $W$, and $Q$ is an orthogonal matrix. If we consider the eigenvalues, which are the diagonal entries of $\Lambda$, we can tell that the ones that are less than one will decay exponentially towards zero, and the ones that are greater than one will blow up exponentially towards infinity as $t$ grows in size.

Since $Q$ is orthogonal, the transformation $Q^\top h_0$ can be thought of as a rotation of the vector $h_0$, where each coordinate of the result reflects $h_0$ being projected onto an eigenvector of $W$. Therefore, when $t$ is very large, as in the case of an unrolled recurrent network, this matrix calculation will end up being dominated by the parts of $h_0$ that point in the same direction as the exploding eigenvectors.
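This eigenvalue picture can be checked numerically. A small sketch (my own, with arbitrary illustrative numbers), using a symmetric matrix so that the eigendecomposition is orthogonal:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
W = (A + A.T) / 2                 # symmetric => orthogonal eigendecomposition
eigvals, Q = np.linalg.eigh(W)    # W = Q diag(eigvals) Q^T

t = 30
# Lambda^t: entries with |lambda| < 1 vanish, entries with |lambda| > 1 explode.
print(eigvals ** t)

h0 = rng.normal(size=4)
ht = np.linalg.matrix_power(W, t) @ h0

# After many steps, h_t points (up to sign) along the dominant eigenvector.
dominant = Q[:, np.argmax(np.abs(eigvals))]
cosine = abs(ht @ dominant) / (np.linalg.norm(ht) * np.linalg.norm(dominant))
print(cosine)  # close to 1
```

The cosine similarity between $h_t$ and the dominant eigenvector approaches 1, which is exactly the "dominated by the exploding eigenvectors" effect described above.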

This is a problem because if an input vector ends up pointing in the direction of one of these eigenvectors, the loss function may be very high. It turns out that in these regions, stochastic gradient descent may massively overshoot. If SGD overshoots, then we end up reversing all of the progress we had previously made towards descending to a local minimum.

As Goodfellow et al. note, this error is relatively easy to avoid in the case of non-recurrent neural networks, because in that case the weights aren't shared between layers. However, in the case of vanilla recurrent neural networks, this problem is almost unavoidable. Bengio et al. showed that even at a depth of 10, this problem will show up in a simple network with near certainty.

One way to mitigate the problem is simply to clip the gradients so that they can't reverse all of the descent progress so far. This treats the symptom of exploding gradients, but doesn't fix the problem entirely, since the issue of blown-up or vanishing eigenvalues remains.
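Clipping by global norm can be sketched in a few lines: if the gradient's norm exceeds a threshold, rescale it onto the threshold sphere, preserving its direction while capping the step size.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    # If the gradient is too long, rescale it to length max_norm;
    # otherwise leave it untouched. Direction is always preserved.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])      # norm 5
print(clip_by_norm(g, 1.0))   # rescaled to norm 1: [0.6, 0.8]
print(clip_by_norm(g, 10.0))  # unchanged: [3.0, 4.0]
```

Because only the magnitude is capped, a single huge gradient can no longer undo the training progress, but the underlying eigenvalue pathology is untouched.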

Therefore, in order to fix this problem, we need to fundamentally re-design the way that gradients are backpropagated through time, motivating echo state networks, leaky units, skip connections, and LSTMs. I plan to one day go into all of these, but I first need to build up my skills in matrix calculus, which are currently quite poor.

Therefore, I intend to make the next post (and maybe a few more) about matrix calculus. Then perhaps I can revisit this topic and gain a deeper understanding.

This may be an idiosyncratic error of mine. See page 105 in these lecture notes to see where I first saw the problem of vanishing gradients described.

See section 10.7 in the Deep Learning Book for a fuller discussion of vanishing and exploding gradients.