Inductive biases stick around

This post is a follow-up to Understanding “Deep Double Descent”.

I was talking to Rohin at NeurIPS about my post on double descent, and he asked the very reasonable question of why exactly I think double descent is so important. I realized that I hadn’t fully explained that in my previous post, so the goal of this post is to further address the question of why you should care about double descent from an AI safety standpoint. This post assumes you’ve read my Understanding “Deep Double Descent” post, so you should read that first if you haven’t already.

Specifically, I think double descent demonstrates the counterintuitive and, in my opinion, very important result that larger models can actually be simpler than smaller models. On its face, this sounds somewhat crazy: how can a model with more parameters be simpler? But in fact I think this is just a very straightforward consequence of double descent: in the double descent paradigm, larger models with zero training error generalize better than smaller models with zero training error because they score better on SGD’s inductive biases. And if you buy that SGD’s inductive biases approximate simplicity, that means that larger models with zero training error are simpler than smaller models with zero training error.

Obviously, larger models do have more parameters than smaller ones, so if parameter count is your measure of simplicity, larger models will always be more complicated, but for other measures of simplicity that’s not necessarily the case. For example, it could hypothetically be the case that larger models have lower Kolmogorov complexity. Though I don’t actually think that’s true in the case of K-complexity, I think that’s only for the boring reason that model weights contain a lot of noise. If you had a way of somehow counting only the “essential complexity,” I suspect larger models would actually have lower K-complexity.

Really, what I’m trying to do here is dispel what I see as the myth that as ML models get more powerful, simplicity will stop mattering for them. In a Bayesian setting, it is a fact that the impact of your prior on your posterior (for those regions where your prior is non-zero[1]) becomes negligible as you update on more and more data. I have sometimes heard it claimed that, as a consequence of this result, as we move to doing machine learning with ever larger datasets and ever bigger models, the impact of our training processes’ inductive biases will become negligible. However, I think that’s quite wrong, and I think double descent does a good job of showing why, because all of the performance gains you get past the interpolation threshold come from your implicit prior.[2] Thus, if you suspect modern ML to mostly be in that regime, what will matter in terms of which techniques beat out other techniques is how good they are at compressing their data into the “actually simplest” model that fits it.
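To make this concrete, here is a minimal numerical sketch (the hypotheses and numbers are my own illustrative choices, not anything from the literature) of when the prior washes out and when it doesn’t. Where likelihoods differ, more data makes the prior irrelevant; but among hypotheses that fit the data equally well, the posterior ratio stays pinned to the prior ratio forever, which is exactly the situation past the interpolation threshold:

```python
import numpy as np

# Illustrative toy setup: three hypotheses, where h1 and h2 both fit
# every observation equally well (same per-observation likelihood),
# while h3 fits poorly. The prior favors h1 over h2 by 6:1.
prior = np.array([0.6, 0.1, 0.3])        # P(h1), P(h2), P(h3)
per_obs_lik = np.array([0.9, 0.9, 0.5])  # likelihood of one observation

for n in [1, 10, 1000]:
    posterior = prior * per_obs_lik ** n  # Bayes' rule, n i.i.d. observations
    posterior /= posterior.sum()
    print(n, posterior.round(6))

# h3's posterior vanishes as n grows (the prior washes out where
# likelihoods differ), but the h1:h2 posterior ratio stays 6:1 at any n:
# among hypotheses that fit the data equally, only the prior decides.
```

The loop’s last iteration (n = 1000) leaves h3 with negligible posterior mass while h1 still beats h2 by exactly the prior’s 6:1 margin, no matter how large n gets.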

Furthermore, even just from the simple Bayesian perspective, I suspect you can still get double descent. For example, suppose your training process looks like the following: you have some hypothesis class that keeps getting larger as you train, and at each time step you select the best a posteriori hypothesis. I think that this setup will naturally yield a double descent for noisy data: first you get a “likelihood descent” as you get hypotheses with greater and greater likelihood, but then you start overfitting to noise in your data as you get close to the interpolation threshold. Past the interpolation threshold, however, you get a second “prior descent” where you’re selecting hypotheses with greater and greater prior probability rather than greater and greater likelihood. I think this is a good model for how modern machine learning works and what double descent is doing.
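A closely related version of this story is easy to reproduce in a toy experiment. The sketch below (all names and numbers are my own illustrative choices) fits noisy samples of a sine wave with a growing pool of random Fourier features, using the minimum-norm least-squares solution as a stand-in for the implicit prior: test error first descends, spikes near the interpolation threshold (where the number of features equals the number of training points), then descends again as the “prior descent” kicks in:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy problem: noisy samples of a sine wave.
n_train = 20
x_train = rng.uniform(-1.0, 1.0, n_train)
x_test = np.linspace(-1.0, 1.0, 200)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(n_train)
y_test = np.sin(2 * np.pi * x_test)

# A fixed pool of random Fourier features; a "model of width k"
# uses the first k of them, so the hypothesis class grows with k.
max_width = 400
freqs = rng.uniform(0.0, 20.0, max_width)
phases = rng.uniform(0.0, 2.0 * np.pi, max_width)

def errors(k):
    """Train/test MSE of the min-norm least-squares fit with k features."""
    phi_tr = np.cos(np.outer(x_train, freqs[:k]) + phases[:k])
    phi_te = np.cos(np.outer(x_test, freqs[:k]) + phases[:k])
    # The pseudoinverse picks the minimum-norm interpolant once
    # k >= n_train, standing in for the implicit simplicity prior.
    w = np.linalg.pinv(phi_tr) @ y_train
    return (np.mean((phi_tr @ w - y_train) ** 2),
            np.mean((phi_te @ w - y_test) ** 2))

results = {k: errors(k) for k in [5, 10, n_train, 40, max_width]}
for k, (tr, te) in results.items():
    print(f"width {k:3d}: train MSE {tr:.6f}  test MSE {te:.4f}")
```

Note that past the interpolation threshold every width fits the training data exactly, so exactly as in the Bayesian story above, the continued improvement in test error can only be coming from the prior, here the minimum-norm bias.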

All of this is only for models with zero training error, however: before you reach zero training error, larger models can certainly have more essential complexity than smaller ones. That being said, if you don’t do very many steps of training, then your inductive biases will also matter a lot, because you haven’t updated much on your data yet. In the double descent framework, the only region where your inductive biases don’t matter very much is right at the interpolation threshold; before the interpolation threshold or past it, they should still be quite relevant.

Why does any of this matter from a safety perspective, though? Ever since I read Belkin et al., I’ve had double descent as part of my talk version of “Risks from Learned Optimization,” because I think it addresses a pretty important part of the story for mesa-optimization. That is, mesa-optimizers are simple, compressed policies; but as ML moves to larger and larger models, why should that matter? The answer, I think, is that larger models can generalize better not just by fitting the data better, but also by being simpler.[3]

  1. Negating the impact of the prior not having support over some hypotheses requires realizability (see Embedded World-Models). ↩︎

  2. Note that double descent happens even without explicit regularization, so the prior we’re talking about here is the implicit one imposed by the architecture you’ve chosen and the fact that you’re training it via SGD. ↩︎

  3. Which is exactly what you should expect if you think Occam’s razor is the right prior: if two hypotheses have the same likelihood but one generalizes better, according to Occam’s razor it must be because it’s simpler. ↩︎