Learning the prior and generalization

This post is a response to Paul Christiano's post "Learning the prior."

The generalization problem

When we train models, we often end up deploying them in situations that are distinctly different from those they were trained under. Take, for example, GPT-3. GPT-3 was trained to predict web text, not serve as a dungeon master—and the sort of queries that people present to AI Dungeon are quite different from random web text—but nevertheless GPT-3 performs quite well there because it has learned a policy that is general enough to continue functioning effectively in this new domain.

Relying on this sort of generalization, however, is potentially quite troublesome. If you're in a situation where your training and deployment data are in fact independently and identically distributed (i.i.d.), you can produce all sorts of nice guarantees about the performance of your model. For example, in an i.i.d. setting, you know that in the limit of training you'll get the desired behavior. Furthermore, even before the limit of training, you know that validation and deployment performance will precisely track each other, such that you can bound the probability of catastrophic behavior by the incidence of catastrophic behavior on the validation data.
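
As a concrete illustration of the kind of guarantee the i.i.d. assumption buys you (my own toy example, not something from Paul's post), here is a minimal sketch of bounding the deployment catastrophe rate from validation data with a Hoeffding-style concentration bound; the function and its parameters are purely hypothetical.

```python
import math

def catastrophe_rate_upper_bound(num_catastrophes, num_validation, delta=0.01):
    """Hoeffding-style upper bound on the true catastrophe rate.

    If validation and deployment data really are i.i.d., then with probability
    at least 1 - delta the deployment catastrophe rate is at most the observed
    validation rate plus a sqrt(log(1/delta) / (2n)) slack term.  No comparable
    bound holds off-distribution.
    """
    empirical_rate = num_catastrophes / num_validation
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * num_validation))
    return empirical_rate + slack

# e.g. zero observed catastrophes across 10,000 i.i.d. validation episodes
print(catastrophe_rate_upper_bound(0, 10_000))  # ~0.0152
```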

In a generalization setting, on the other hand, you have no such guarantees—even in the limit of training, precisely what your model does off-distribution is determined by your training process's inductive biases. In theory, any off-distribution behavior is compatible with zero training error—the only reason machine learning produces good off-distribution behavior is that it finds something like the simplest model that fits the data. As a result, a model's off-distribution behavior will be highly dependent on exactly what the training process's interpretation of "simpler" is—that is, on its inductive biases. And relying on such inductive biases for your generalization behavior can potentially have catastrophic consequences.
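
To make the claim that any off-distribution behavior is compatible with zero training error concrete, here is a toy sketch (my own illustration) of two hypothetical models that fit the same training set exactly but extrapolate in completely different ways:

```python
import numpy as np

# Two "models" that both achieve exactly zero error on the training set ...
train_x = np.arange(6)            # training inputs: 0, 1, ..., 5
train_y = train_x.astype(float)   # training targets: y = x

def model_a(x):
    return x                            # the "simple" fit

def model_b(x):
    return x + 10 * np.sin(np.pi * x)   # also exact on every integer input

assert np.allclose(model_a(train_x), train_y)
assert np.allclose(model_b(train_x), train_y)

# ... yet they disagree wildly on an off-distribution input.
print(model_a(2.5), model_b(2.5))  # 2.5 vs. 12.5
```

Which of the two a training process actually finds is entirely a question of its inductive biases.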

Nuances with generalization

That being said, the picture I've painted above, of off-distribution generalization being the problem, isn't quite right. For example, consider an autoregressive model (like GPT-3) that's just trained to learn a particular distribution. Then, if I have some set of training data D and a new data point x, there's no test you can do to determine whether x was really sampled from the same distribution as D. In fact, for any D and x, I can always give you a distribution that D could have been sampled from that assigns whatever probability I want to x. Thus, to the extent that we're able to train models that do a good job on i.i.d. x—that is, that assign high probability to x—it's because there's an implicit prior there that assigns a fairly high probability to the actual distribution you used to sample the data rather than to any of the infinitely many other possible distributions (this is the no free lunch theorem). Even in the i.i.d. case, therefore, there's still a real and meaningful sense in which your performance is coming from the machine learning prior.
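
Here is a minimal sketch of that construction (my own toy example with made-up numbers): a training set D drawn uniformly from {0, ..., 4}, and an alternative distribution that could also have generated D yet assigns probability 0.9 to a point that never appears in D.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.integers(0, 5, size=20)   # training data, actually drawn uniformly from {0, ..., 4}
x = 7                             # a new data point never seen in training

# The "true" distribution: uniform over {0, ..., 4}, so probability 0 on x = 7.
p_true = {k: 0.2 for k in range(5)}

# An alternative distribution that could also have generated D (every point in D
# has positive probability under it), but which puts 0.9 of its mass on x = 7.
q = {k: 0.02 for k in range(5)}
q[x] = 0.9

likelihood_of_D_under_q = np.prod([q[d] for d in D])
print(likelihood_of_D_under_q > 0)    # True: the data alone cannot rule q out
print(p_true.get(x, 0.0), q[x])       # 0.0 vs. 0.9 for the new point
```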

It's still the case, however, that actually using i.i.d. data does give you some real and meaningful guarantees—such as the ability to infer performance properties from validation data, as I mentioned previously. However, at least in the context of mesa-optimization, you can never really get i.i.d. data, thanks to fundamental distributional shifts such as the very fact that one set of data points is used in training and another set is used in deployment. Paul Christiano's RSA-2048 example is a classic illustration of how that sort of fundamental distributional shift could potentially manifest. Both Paul and I have also written about possible solutions to this problem, but it's still a problem that you need to deal with even if you've otherwise fully dealt with the generalization problem.
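
For readers unfamiliar with the RSA-2048 example, here is a purely schematic sketch of the failure mode (my own illustration, with a small toy modulus standing in for the real RSA-2048 constant): a policy that behaves identically to the intended policy on any plausible training input, but defects on a kind of input that could only show up in deployment.

```python
# TOY_MODULUS stands in for the real 617-digit RSA-2048 number, which (as far as
# anyone knows) nobody can currently factor, so no training input can contain
# one of its factors, but a deployment input eventually might.
TOY_MODULUS = 104729 * 1299709   # hypothetical stand-in, not the real constant

def contains_factor(observation: str, modulus: int) -> bool:
    """True if some whitespace-separated token is a nontrivial factor of modulus."""
    for token in observation.split():
        if token.isdigit():
            n = int(token)
            if 1 < n < modulus and modulus % n == 0:
                return True
    return False

def deceptive_policy(observation: str) -> str:
    # Behaviorally identical to the intended policy on the training distribution,
    # where the trigger condition never fires ...
    if contains_factor(observation, TOY_MODULUS):
        return "defect"   # ... but not on the deployment inputs where it does.
    return "act safely"

print(deceptive_policy("an ordinary training prompt 12345"))  # act safely
print(deceptive_policy("a deployment prompt 104729"))         # defect
```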

Paul's approach and verifiability

The question I want to ask now, however, is to what extent we can nevertheless at least somewhat stop relying on machine learning generalization, and what benefits we might get from doing so. As I mentioned, there's a sense in which we'll never fully be able to stop relying on generalization, but there might still be major benefits to at least partially stopping. At first, this might sound crazy—if you want to be competitive, surely you need to be able to do generalization? And I think that's true—but the question was whether we need our machine learning models to do the generalization, not whether we need generalization at all.

Paul's recent post "Learning the prior" presents a possible way to get generalization in the way that a human would generalize while relying on significantly less machine learning generalization. Specifically, Paul's idea is to use ML to learn a set of forecasting assumptions Z that maximizes the human's posterior estimate of the likelihood of Z over some training data, then generalize by learning a model that predicts human forecasts given Z. Paul argues that this approach is nicely i.i.d., but for the reasons mentioned above I don't fully buy that—for example, there are still fundamental distributional shifts that I'm skeptical can ever be avoided, such as the fact that a deceptive model might care about some data points (e.g. the deployment ones) more than others (e.g. the training ones). That being said, I nevertheless think that there is still a real and meaningful sense in which Paul's proposal reduces the ML generalization burden in a helpful way—but I don't think that i.i.d.-ness is the right way to talk about it.
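
As a very rough structural sketch of the two stages just described (my schematic rendering, not Paul's actual algorithm; every callable passed in as an argument, such as human_log_prior or human_forecast, is a hypothetical stand-in):

```python
def learn_z(candidate_zs, human_log_prior, human_log_likelihood, training_data):
    """Stage 1: find the background assumptions Z that the human considers most
    plausible given the training data, i.e. maximize the human's posterior."""
    def posterior_score(z):
        return human_log_prior(z) + sum(
            human_log_likelihood(x, y, z) for (x, y) in training_data
        )
    return max(candidate_zs, key=posterior_score)

def learn_forecaster(z, human_forecast, questions, fit):
    """Stage 2: train a model to imitate the human's forecasts given Z, so that
    any of its answers can later be checked against the human with access to Z."""
    labels = [human_forecast(question, z) for question in questions]
    return fit(questions, labels)
```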

Rather, I think that what's special about Paul's proposal is that it guarantees verifiability. That is, under Paul's setup, we can always check whether any answer matches the ground truth by querying the human with access to Z. In practice, for extremely large Z that are represented only implicitly, as in Paul's post, we might not always check whether the model matches the ground truth by actually generating the ground truth, instead just asking the human to verify the answer given Z—but regardless, the point is that we have the ability to check the model's answers. This is different even from directly doing something like imitative amplification, where the only ground truth we can get in generalization scenarios is either computationally infeasible (HCH) or directly references the model itself (Amp(M)). One nice thing about this sort of verifiability is that, if we determine when to do the checks randomly, we can get a representative sample of the model's average-case generalization behavior—something we really can't do otherwise. Of course, we still need worst-case guarantees—but having strong average-case guarantees is still a big win.
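
For concreteness, here is a minimal sketch (my own illustration, with hypothetical function names) of the kind of randomized spot-checking that turns verifiability into an average-case estimate:

```python
import random

def estimate_average_case_error(model_answer, verified_by_human, queries,
                                check_prob=0.01, seed=0):
    """Randomly verify a fraction of the model's answers against the human with Z.

    Because the checked queries are a uniform random sample of all queries, the
    observed failure rate is an unbiased estimate of the model's average-case
    error rate: an average-case guarantee, though not a worst-case one.
    `verified_by_human` is a hypothetical stand-in for the expensive check.
    """
    rng = random.Random(seed)
    checked, failures = 0, 0
    for query in queries:
        answer = model_answer(query)
        if rng.random() < check_prob:
            checked += 1
            if not verified_by_human(query, answer):
                failures += 1
    return failures / checked if checked else None
```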

To achieve verifiability while still being competitive across a large set of questions, however, requires being able to fully verify answers to all of those questions. That's a pretty tall order, because it means there needs to exist some procedure that can justify arbitrary knowledge starting only from human knowledge and reasoning. This is the same sort of thing that amplification and debate need in order to be competitive, however, so at the very least it's not a new requirement for these sorts of approaches.

In any event, I think that striving for verifiability is a pretty good goal that I expect to have real benefits if it can be achieved—and I think it's a much more well-specified goal than i.i.d.-ness.

EDIT: I clarify a lot of stuff in the above post in this comment chain between me and Rohin.