Understanding “Deep Double Descent”

If you’re not familiar with the double descent phenomenon, I think you should be. I consider double descent to be one of the most interesting and surprising recent results in analyzing and understanding modern machine learning. Today, Preetum et al. released a new paper, “Deep Double Descent,” which I think is a significant further advance in our understanding of this phenomenon. I’d highly recommend at least reading the summary of the paper on the OpenAI blog. However, I will also try to summarize the paper here, as well as give a history of the literature on double descent and some of my personal thoughts.

Prior work

The double descent phenomenon was first discovered by Mikhail Belkin et al., who were confused by the phenomenon wherein modern ML practitioners would claim that “bigger models are always better” despite standard statistical machine learning theory predicting that bigger models should be more prone to overfitting. Belkin et al. discovered that the standard bias-variance tradeoff picture actually breaks down once you hit approximately zero training error—what Belkin et al. call the “interpolation threshold.” Before the interpolation threshold, the bias-variance tradeoff holds and increasing model complexity leads to overfitting, increasing test error. After the interpolation threshold, however, they found that test error actually starts to go down as you keep increasing model complexity! Belkin et al. demonstrated this phenomenon in simple ML methods such as decision trees as well as simple neural networks trained on MNIST. Here’s the diagram that Belkin et al. use in their paper to describe this phenomenon:
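
For a more hands-on picture, here is a minimal numeric sketch of the same kind of curve (my own toy setup, not code from Belkin et al.): random Fourier features fit to a small noisy regression problem by minimum-norm least squares. Sweeping the number of features past the interpolation threshold (roughly where the feature count matches the training set size) should show test error rising to a peak and then descending again. The dataset, feature counts, and noise level are all illustrative choices.

    # Toy model-wise double descent sketch with random Fourier features.
    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(n, noise=0.1):
        x = rng.uniform(-np.pi, np.pi, size=n)
        y = np.sin(x) + noise * rng.standard_normal(n)
        return x, y

    def random_fourier_features(x, n_features, freqs, phases):
        # phi_j(x) = cos(freq_j * x + phase_j)
        return np.cos(np.outer(x, freqs[:n_features]) + phases[:n_features])

    n_train, n_test = 40, 500
    x_train, y_train = make_data(n_train)
    x_test, y_test = make_data(n_test)

    max_features = 400
    freqs = rng.normal(0.0, 3.0, size=max_features)    # shared random frequencies
    phases = rng.uniform(0, 2 * np.pi, size=max_features)

    for n_features in [5, 10, 20, 35, 40, 45, 60, 100, 200, 400]:
        phi_train = random_fourier_features(x_train, n_features, freqs, phases)
        phi_test = random_fourier_features(x_test, n_features, freqs, phases)
        # Minimum-norm least-squares fit: once n_features > n_train, this is the
        # interpolating solution with the smallest weight norm.
        w = np.linalg.pinv(phi_train) @ y_train
        train_mse = np.mean((phi_train @ w - y_train) ** 2)
        test_mse = np.mean((phi_test @ w - y_test) ** 2)
        print(f"{n_features:4d} features  train={train_mse:.4f}  test={test_mse:.4f}")

The minimum-norm fit is doing the work here that SGD’s implicit bias is conjectured to do for neural networks: without some such tie-breaking rule among the many interpolating solutions, the second descent need not appear.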

Belkin et al. describe their hypothesis for what’s happening as follows:

All of the learned predictors to the right of the interpolation threshold fit the training data perfectly and have zero empirical risk. So why should some—in particular, those from richer function classes—have lower test risk than others? The answer is that the capacity of the function class does not necessarily reflect how well the predictor matches the inductive bias appropriate for the problem at hand. [The inductive bias] is a form of Occam’s razor: the simplest explanation compatible with the observations should be preferred. By considering larger function classes, which contain more candidate predictors compatible with the data, we are able to find interpolating functions that [are] “simpler”. Thus increasing function class capacity improves performance of classifiers.

I think that what this is saying is pretty magical: in the case of neural nets, it’s saying that SGD just so happens to have the right inductive biases that letting SGD choose which model it wants the most out of a large class of models with the same training performance yields significantly better test performance. If you’re right at the interpolation threshold, you’re effectively “forcing” SGD to choose from a very small set of models with perfect training accuracy (maybe only one realistic option), leaving essentially no room for SGD’s inductive biases to matter—whereas if you’re past the interpolation threshold, you’re letting SGD choose which of many models with perfect training accuracy it prefers, thus allowing SGD’s inductive bias to shine through.
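
To make the “the optimizer picks among interpolating models” point concrete, here’s a toy illustration (my own, using overparameterized linear regression as a stand-in for a neural net): there are infinitely many weight vectors with zero training error, gradient descent from a zero initialization converges to the minimum-norm one, and that minimum-norm interpolator generalizes far better than an arbitrary member of the same zero-training-error set.

    # Many interpolators, one implicit bias: minimum norm vs. an arbitrary interpolator.
    import numpy as np

    rng = np.random.default_rng(1)
    n_train, n_test, dim = 20, 1000, 100        # more parameters than samples

    w_true = np.zeros(dim)
    w_true[:5] = 1.0                            # simple ground truth

    X_train = rng.standard_normal((n_train, dim))
    y_train = X_train @ w_true
    X_test = rng.standard_normal((n_test, dim))
    y_test = X_test @ w_true

    # Minimum-norm interpolator (the solution gradient descent finds from w = 0).
    w_min_norm = np.linalg.pinv(X_train) @ y_train

    # Another interpolator: add a random component from the null space of X_train,
    # so training error stays exactly zero but the weights get more complicated.
    _, _, vt = np.linalg.svd(X_train)
    null_basis = vt[n_train:]                   # directions X_train cannot see
    w_other = w_min_norm + null_basis.T @ rng.standard_normal(dim - n_train) * 3.0

    for name, w in [("min-norm", w_min_norm), ("arbitrary interpolator", w_other)]:
        train_mse = np.mean((X_train @ w - y_train) ** 2)
        test_mse = np.mean((X_test @ w - y_test) ** 2)
        print(f"{name:22s}  train={train_mse:.2e}  test={test_mse:.3f}")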

I think this is strong evidence for the critical importance of implicit simplicity and speed priors in making modern ML work. However, such biases also produce strong incentives for mesa-optimization (since optimizers are simple, compressed policies) and pseudo-alignment (since simplicity and speed penalties will favor simpler, faster proxies). Furthermore, the arguments for the universal prior and minimal circuits being malign suggest that such strong simplicity and speed priors could also produce an incentive for deceptive alignment.

“Deep Double Descent”

Now we get to Preetum et al.’s new paper, “Deep Double Descent.” Here are just some of the things that Preetum et al. demonstrate in “Deep Double Descent”:

  1. double descent occurs across a wide variety of different model classes, including ResNets, standard CNNs, and Transformers, as well as a wide variety of different tasks, including image classification and language translation,

  2. double descent occurs not just as a function of model size, but also as a function of training time and dataset size, and

  3. since double descent can happen as a function of dataset size, more data can lead to worse test performance!

Crazy stuff. Let’s try to walk through each of these results in detail and understand what’s happening.

First, double descent is a highly universal phenomenon in modern deep learning. Here is double descent happening for ResNet18 on CIFAR-10 and CIFAR-100:

And again for a Transformer model on German-to-English and English-to-French translation:

All of these graphs, however, are just showcasing the standard Belkin et al.-style double descent over model size (what Preetum et al. call “model-wise double descent”). What’s really interesting about “Deep Double Descent” is that Preetum et al. also demonstrate that the same thing can happen for training time (“epoch-wise double descent”) and a similar thing for dataset size (“sample-wise non-monotonicity”).

First, let’s look at epoch-wise double descent. Take a look at these graphs for ResNet18 on CIFAR-10:

There are a bunch of crazy things happening here which are worth pointing out. First, the obvious: epoch-wise double descent is definitely a thing—holding model size fixed and training for longer exhibits the standard double descent behavior. Furthermore, the peak happens right at the interpolation threshold where you hit zero training error. Second, notice where you don’t get epoch-wise double descent: if your model is too small to ever hit the interpolation threshold—like was the case in ye olden days of ML—you never get epoch-wise double descent. Third, notice the log scale on the y-axis: you have to train for quite a while to start seeing this phenomenon.

Finally, sample-wise non-monotonicity—Preetum et al. find a regime where increasing the amount of training data by four and a half times actually increases test loss (!):

What’s happening here is that more data increases the model capacity and number of training epochs needed to reach zero training error, which pushes out the interpolation threshold. That shift can move you out of the modern (interpolation) regime and back into the classical (bias-variance tradeoff) regime, decreasing test performance.
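
Here’s a rough sketch of that effect (again my own toy random-features setup, not the paper’s experiments): hold the number of features fixed and sweep the number of training samples. Test error should peak when the sample count crosses the interpolation threshold (sample count roughly equal to feature count), so some intermediate dataset sizes can do worse than smaller ones. All of the numbers are illustrative.

    # Toy sample-wise non-monotonicity sketch: fixed model, varying dataset size.
    import numpy as np

    rng = np.random.default_rng(2)
    n_features, n_test, noise = 50, 500, 0.2
    freqs = rng.normal(0.0, 3.0, size=n_features)
    phases = rng.uniform(0, 2 * np.pi, size=n_features)

    def features(x):
        return np.cos(np.outer(x, freqs) + phases)

    x_test = rng.uniform(-np.pi, np.pi, size=n_test)
    y_test = np.sin(x_test) + noise * rng.standard_normal(n_test)

    for n_train in [10, 25, 40, 50, 60, 100, 400]:
        x_tr = rng.uniform(-np.pi, np.pi, size=n_train)
        y_tr = np.sin(x_tr) + noise * rng.standard_normal(n_train)
        w = np.linalg.pinv(features(x_tr)) @ y_tr      # minimum-norm fit
        test_mse = np.mean((features(x_test) @ w - y_test) ** 2)
        print(f"n_train={n_train:4d}  test={test_mse:.4f}")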

Another thing Preetum et al. point out, which I think is worth talking about here, is the impact of label noise. Preetum et al. find that increasing label noise significantly exaggerates the test error peak around the interpolation threshold. Why might this be the case? Well, if we think about the inductive biases story from earlier, greater label noise means that near the interpolation threshold SGD is forced to find the one model which fits all of the noise—which is likely to be pretty bad, since it has to model a bunch of noise. After the interpolation threshold, however, SGD is able to pick between many models which fit the noise and select one that does so in the simplest way, yielding good test performance.
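
For reference, label noise in this kind of experiment is typically injected along these lines (a small helper of my own, not the paper’s code): with probability p, a training label is replaced by one drawn uniformly at random from all classes. Raising p should make the peak at the interpolation threshold more pronounced.

    # Symmetric label noise: resample each label uniformly with probability p.
    import numpy as np

    def add_label_noise(labels, num_classes, p, rng):
        """Return a copy of `labels` where each entry is replaced by a uniformly
        random class with probability p."""
        labels = np.asarray(labels).copy()
        flip = rng.random(labels.shape[0]) < p
        labels[flip] = rng.integers(0, num_classes, size=flip.sum())
        return labels

    rng = np.random.default_rng(0)
    clean = np.zeros(10, dtype=int)               # ten examples, all class 0
    print(add_label_noise(clean, num_classes=10, p=0.5, rng=rng))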

Final comments

I’m quite excited about “Deep Double Descent,” but it still leaves what is in my opinion the most important question unanswered: what exactly are the magical inductive biases of modern ML that make interpolation work so well?

One proposal I am aware of is the work of Keskar et al., who argue that SGD gets its good generalization properties from the fact that it finds “flat” as opposed to “sharp” minima. The basic insight is that SGD tends to jump out of minima without broad basins around them and only really settle into minima with large basins of attraction, which tend to be exactly the sort of minima that generalize well. Keskar et al. use the following diagram to explain this phenomenon:
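
As a toy numeric illustration of the flat vs. sharp picture (my own construction, not Keskar et al.’s actual sharpness metric), consider a one-dimensional loss with two equally deep minima, one in a wide basin and one in a narrow basin, and a crude sharpness score measuring how much the loss can rise within a small neighborhood of each minimum:

    # Flat vs. sharp minima on a toy 1-D loss, with a crude sharpness proxy.
    import numpy as np

    def loss(w):
        # Two minima of equal depth: a wide basin near w = -2, a narrow one near w = 2.
        wide = 0.5 * (w + 2.0) ** 2
        narrow = 20.0 * (w - 2.0) ** 2
        return np.minimum(wide, narrow)

    def sharpness(w_star, radius=0.3, n=1000):
        # Worst-case loss increase over a small interval around the minimum.
        neighborhood = np.linspace(w_star - radius, w_star + radius, n)
        return float(np.max(loss(neighborhood)) - loss(w_star))

    print("flat minimum sharpness :", sharpness(-2.0))   # small increase
    print("sharp minimum sharpness:", sharpness(+2.0))   # much larger increase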

The more recent work of Dinh et al. in “Sharp Minima Can Generalize For Deep Nets,” however, calls the whole flat vs. sharp minima hypothesis into question, arguing that the reparameterization symmetries of deep networks let a minimum be made arbitrarily sharp or flat without changing the function it computes, so sharpness alone can’t be what drives generalization. (EDIT: Maybe not. See this comment for an explanation of why Dinh et al. don’t necessarily rule out the flat vs. sharp minima hypothesis.)

Another idea that might help here is Frankle and Carbin’s “Lottery Ticket Hypothesis,” which postulates that large neural networks work well because they are likely to contain random subnetworks at initialization (what they call “winning tickets”) which are already quite close to the final policy (at least in terms of being highly amenable to particularly effective training). My guess as to how double descent works if the Lottery Ticket Hypothesis is true is that in the interpolation regime SGD gets to just focus on the winning tickets and ignore the others—since it doesn’t have to use the full model capacity—whereas at the interpolation threshold SGD is forced to make use of the full network (to get the full model capacity), not just the winning tickets, which hurts generalization.
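
Schematically, the lottery-ticket procedure Frankle and Carbin describe looks something like the sketch below (my own simplification, with hypothetical stand-in arrays and the training step omitted): train the network, build a mask that keeps the largest-magnitude weights, then rewind the surviving weights to their initialization values and retrain only that sparse subnetwork.

    # Sketch of magnitude pruning plus rewind-to-init (the "winning ticket").
    import numpy as np

    rng = np.random.default_rng(0)

    def magnitude_mask(trained_weights, keep_fraction):
        """Keep the top `keep_fraction` of weights by absolute magnitude."""
        flat = np.abs(trained_weights).ravel()
        threshold = np.quantile(flat, 1.0 - keep_fraction)
        return (np.abs(trained_weights) >= threshold).astype(trained_weights.dtype)

    def rewind(init_weights, mask):
        """Initialization values on the surviving weights, zeros elsewhere.
        Retraining would then update only the masked-in weights."""
        return init_weights * mask

    # Toy stand-ins for one layer's weights before and after ordinary training.
    init_weights = rng.standard_normal((4, 4))
    trained_weights = init_weights + rng.standard_normal((4, 4))

    mask = magnitude_mask(trained_weights, keep_fraction=0.25)
    ticket = rewind(init_weights, mask)
    print("kept", int(mask.sum()), "of", mask.size, "weights")
    print(ticket)

In the actual paper the mask is usually found by iterative pruning over several train-prune-rewind rounds rather than in one shot, but the basic mask-and-rewind structure is the same.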

That’s just speculation on my part, however—we still don’t really understand the inductive biases of our models, despite the fact that, as double descent shows, inductive biases are the reason that modern ML (that is, the interpolation regime) works as well as it does. Furthermore, as I noted previously, inductive biases are highly relevant to the likelihood of possible dangerous phenomena such as mesa-optimization and pseudo-alignment. Thus, it seems quite important to me to do further work in this area and really understand our models’ inductive biases, and I applaud Preetum et al. for their exciting work here.

EDIT: I have now written a follow-up to this post talking more about why I think double descent is important, titled “Inductive biases stick around.”