How likely is deceptive alignment?

The following is an edited transcript of a talk I gave. I have given this talk at multiple places, including first at Anthropic, then for ELK winners, and at Redwood Research, though the version this document is based on is the one I gave to SERI MATS fellows. Thanks to Jonathan Ng, Ryan Kidd, and others for help transcribing that talk. I then made substantial edits on top of the transcription. Though all slides are embedded below, the full slide deck is also available here.

Today I'm going to be talking about deceptive alignment. Deceptive alignment is something I'm very concerned about: it's where I think most of the existential risk from AI comes from. And I'm going to try to make the case for why I think it is the default outcome of machine learning.

slide 2

First of all, what am I talking about? I want to disambiguate between two closely related, but distinct, concepts. The first concept is dishonesty. This is something that many people are concerned about in models: you could have a model that lies to you. It knows one thing, but the thing it tells you is different. This happens all the time with current language models. We can, for example, ask one to write the correct implementation of some function, but if it has seen humans make some particular bug over and over again, then even if in some sense it knows how to write the right function, it's going to reproduce that bug. So this is an example of a situation where the model knows how to solve something and nevertheless lies to you. This is not what I'm talking about; it is a distinct failure mode. The thing I want to talk about is deceptive alignment, which is, in some sense, a subset of dishonesty, but it's a very particular situation.

slide 3

So deceptive alignment is a situation where your model looks aligned on the training data because it is actively trying to look aligned, for instrumental reasons, which is very distinct. This is a situation where the thing causing your model to have good performance is that it is trying to game the training data: it actively has a reason that it wants to stick around in training, and so it's trying to get good performance in training for the purpose of sticking around.

slide 4

Ajeya Cotra has a really good analogy here that I think is helpful for understanding the difference between these two classes. Imagine that you are a child who has inherited a massive business, and you have to determine who's going to run the business for you. There are a bunch of candidates that you're trying to evaluate, and those candidates fall into three categories. You have the saints, people who really just want to help you, run things effectively, and accomplish what you want. You have the sycophants, who want to make you happy and satisfy the letter of your instructions, making it so that the business looks like it's doing well from your perspective, but who don't actually want to fundamentally help you. And you have the schemers, people who want to use control of the business for their own purposes, and who are only trying to get control of it and pretend they're doing the right thing so that they can eventually get something later. For our purposes, we're concerned primarily with the schemers; that is the deceptive alignment category.

So I would say that in this situation the sycophants are examples of dishonesty: they would tell you a bunch of false facts about what was happening to convince you that things were going well, but they don't have an ulterior motive. The schemers do have an ulterior motive, something they want to accomplish, and they're actively trying to look like they're doing the right thing in training to accomplish it. So this is what we're concerned about: the schemers, the deceptively aligned models, models that look aligned because they're trying to game the training signal.

slide 5

Okay, so the question we want to answer is: how likely is that in practice? We have this concept of, maybe the model will try to game the training signal, maybe it will try to pretend to do something in training so that it can eventually do something else in the real world. But we don't know how likely that is as an actual outcome of running an actual machine learning training process.

And the problem here is that the deceptively aligned model, the model that is pretending to do the right thing so that it can be selected by the training process, is behaviorally indistinguishable during training from the robustly aligned model, the saint model, the model that is actually trying to do the right thing. The deceptively aligned model is going to look like it's actually trying to do the right thing during training, because that's what it's trying to do: it is actively trying to look like it's doing the right thing as much as it possibly can in training. And so in training, you cannot tell the difference only by looking at their behavior.

And so if we want to understand which one we're going to get, we have to look at the inductive biases of the training process. If you're familiar with machine learning: in any situation where we want to understand which of multiple behaviorally indistinguishable models we will get, it's a question of inductive biases. And Ajeya also has another good example here.

slide 6

Suppose I take a model and train it on blue shapes that look like the shape on the left, and red shapes that look like the shape on the right, labeled as two different classes. Then we move to a situation where we have the same shapes with swapped colors, and we want to know: how is it going to generalize? There are two generalizations available here: it could learn to generalize based on color, or it could learn to generalize based on shape. Which one we get is just a question of which one is simpler and easier for gradient descent to implement, which one is preferred by the inductive biases. They both do equally well in training, but one of them is consistently the one that gradient descent finds, which in this situation is the color detector.
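To make this concrete, here's a toy sketch (my own construction, not from the talk) of two features that are equally predictive in training but not equally easy for gradient descent to use. I'm standing in for "color is easier than shape" by making the color signal redundant across many input dimensions while the shape signal occupies just one; under that assumption, a linear model trained by gradient descent generalizes by color once the two features are decoupled at test time:

```python
import numpy as np

def make_features(color, shape, n_color_dims=50):
    # "Color" is a big, redundant signal (many identical dims);
    # "shape" is a single dim, a stand-in for a harder-to-use feature.
    return np.concatenate([np.full(n_color_dims, float(color)), [float(shape)]])

# Training set: color and shape always agree, so both perfectly predict the label.
X = np.array([make_features(+1, +1), make_features(-1, -1)] * 50)
y = np.array([1.0, 0.0] * 50)

# Plain logistic regression trained by gradient descent.
w = np.zeros(X.shape[1])
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / len(y)

# Test point: the color says "class 1" but the shape says "class 0".
x_test = make_features(+1, -1)
pred = 1 / (1 + np.exp(-(x_test @ w)))
print("predicts class 1 (generalized by color):", bool(pred > 0.5))
```

Because every color dimension receives the same weight as the shape dimension, the 50 color dimensions dominate the logit, so the model sides with color on the swapped test point. The redundancy asymmetry is of course an artificial proxy for whatever makes color "simpler" for a real vision model.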

Okay, so if we want to understand how likely deceptive alignment is, we have to do this same sort of analysis. We have to know which one of these gradient descent is generally going to find: when we ask it to solve some complex task, are we going to get the deceptive model, or the non-deceptive one?

slide 7

Okay, so the problem, at least from my perspective, in trying to do this analysis is that we don't understand machine learning (ML) inductive biases very well. They're actually really confusing, and we just don't have very much information about how they operate.

So what I'm going to do is pick two different stories that I think are plausible for what ML inductive biases might look like, based on my view of the current slate of empirical evidence we have available on ML inductive biases. We're going to look at the likelihood of deception under each of these two scenarios independently; they just represent two different ways that the inductive biases of machine learning systems could work. The first is the high path dependence world, and the second is the low path dependence world. So what do I mean by that?

slide 8

Okay, so first: high path dependence. In a world of high path dependence, the idea is that different training runs can converge to very different models, depending on the particular path you take through model space. So in the high path dependence world, the correct way to think about the inductive biases of machine learning is to think: we have to understand the particular paths that your model might take through model space. Maybe first you get one thing, and then you get the next thing, and the probability of any particular final model is going to depend on the prerequisites, in terms of the internal structure that has to exist before that thing can be implemented. How long is the path that we take to get there, how steep is it, et cetera?

So what is the empirical evidence for this view? Well, I think there is some empirical evidence that might push you in the direction of believing that high path dependence is the right way to think about this. On the right is "BERTs of a feather do not generalize together": they take a bunch of fine-tunings of BERT and ask how those fine-tunings generalize on downstream tasks. And the answer is, sometimes they generalize extremely similarly, all with exactly the same performance, and sometimes they generalize totally differently: you can take one fine-tuning and another fine-tuning on exactly the same data, and they have completely different downstream generalization performance. So how do we explain that? Well, there must have been something in the dynamics of training that was highly path dependent, where it really mattered what particular path was taken through model space, for these different fine-tunings to end up with very different generalization performance.

This sort of path dependence is especially prevalent in RL, where you can run the exact same setup multiple times, as in the bottom image, and sometimes you get good performance, you learn the right thing, whereas sometimes you get terrible performance, you don't really learn anything.

And then there is the example down here, a paper arguing that if you take the exact same training setup and run it a bunch of times, you can essentially pick the best one to put in your paper. You can essentially p-hack your paper in a lot of situations, because of the randomness of training dynamics and the path dependence of each training run giving you different generalizations: if you take the exact same training run and run it multiple times, you end up with a much higher probability of getting statistical significance.
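The multiple-runs point can be illustrated with a quick simulation of my own (not from the paper): if run-to-run variance is large and there is no real effect at all, reporting the best of several identical runs still inflates the apparent significance. The score distribution and the threshold here are made-up stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend every training run's eval score is pure noise around the same mean,
# i.e. there is no real effect; the numbers are arbitrary stand-ins.
n_experiments, k_runs = 10_000, 10
one_run = rng.normal(0.70, 0.02, n_experiments)
best_of_k = rng.normal(0.70, 0.02, (n_experiments, k_runs)).max(axis=1)

# A score about two standard deviations above the true mean "looks significant".
threshold = 0.74
print("P(looks significant), reporting a single run:", (one_run > threshold).mean())
print("P(looks significant), reporting best of 10:  ", (best_of_k > threshold).mean())
```

Selecting the maximum of ten noisy draws crosses the threshold roughly an order of magnitude more often than a single draw does, which is exactly the p-hacking opportunity that run-to-run path dependence creates.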

So this is one way to think about inductive biases, where the particular path you take through model space, and how difficult that path is, really matters.[1] And so what we want to know is: did the path you took through model space matter for the model's functional behavior off the training distribution?

slide 9

Now, in the low path dependence world, similar training processes converge to essentially the same simple solution, regardless of early training dynamics. The idea here is that you can think about machine learning and deep learning as essentially finding the simplest model that fits the data. You give it a bunch of data, and it's always going to find the simplest way to fit that data. In that situation, what matters is the data you gave it and some basic understanding of simplicity, the set of inductive biases your training process came with. The particular path taken to get to that point doesn't matter very much: all paths converge on essentially the same generalization.

One way to think about this is: your model space is so high-dimensional that your training process can essentially access the whole manifold of minimal loss solutions, and then it just picks the one that's the simplest according to some set of inductive biases.

Okay, so there's empirical evidence for the low path dependence world, too. I think there are good reasons to believe that you are in the low path dependence world.

I think a good example of this is grokking. This is a situation where we take a model and try to get it to do some arithmetic task, and for a really long time it just learns a bunch of random stuff. Then eventually it converges to the exact solution: after a very long period, it always ends up implementing the algorithm exactly correctly. And so in this situation, it didn't really matter what was happening in that whole earlier period; eventually, we converge to the precise algorithm, and what we converge to is just overdetermined.[2]
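For concreteness, the arithmetic task in the grokking experiments (Power et al.) is typically something like addition modulo a prime, with the full input table split into a training half and a held-out half. Here's a minimal sketch of just that task construction; the model, optimizer, and the very long training run that eventually snaps to perfect held-out accuracy are omitted, and when (or whether) the transition happens depends heavily on hyperparameters like weight decay:

```python
import itertools
import numpy as np

P = 97  # modulus used in the original grokking experiments

# Full table of (a, b) -> (a + b) mod P, split randomly into train / held-out halves.
pairs = np.array(list(itertools.product(range(P), repeat=2)))
labels = (pairs[:, 0] + pairs[:, 1]) % P

rng = np.random.default_rng(0)
idx = rng.permutation(len(pairs))
train_idx, val_idx = idx[: len(idx) // 2], idx[len(idx) // 2:]

# One-hot encode the two operands side by side, as input to a small network.
def encode(ab):
    x = np.zeros((len(ab), 2 * P))
    x[np.arange(len(ab)), ab[:, 0]] = 1
    x[np.arange(len(ab)), P + ab[:, 1]] = 1
    return x

X_train, y_train = encode(pairs[train_idx]), labels[train_idx]
X_val, y_val = encode(pairs[val_idx]), labels[val_idx]
print(X_train.shape, X_val.shape)
```

An overparameterized network trained on this typically memorizes the training half quickly (near-chance held-out accuracy), then much later jumps to perfect held-out accuracy once it finds the exact modular-addition algorithm.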

Other reasons you might think this: this is from "Neural Networks are Fundamentally Bayesian", which is the Mingard et al. line of work. What they do is compare the probability of a particular final set of weights appearing through gradient descent to the probability that you would get that same model if you just randomly sampled weights from the initialization distribution. So they ask: what is the probability that I would have found this model by doing Gaussian initialization and then conditioning on good performance, versus the probability that I find this model via gradient descent? And the answer is, they're pretty similar. There's some difference, but overall they're pretty similar. And so, if you believe this, we can say that, essentially, the inductive biases in deep learning are mostly explained by just a Gaussian prior on the weights and the way that maps into function space, and the specifics of how gradient descent got to that particular point mostly don't matter.
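Here's a toy version, my own and far smaller than anything in the paper, of the comparison they're making: on a tiny boolean task, estimate the distribution over generalization behaviors under (a) random sampling of weights conditioned on fitting the training data, and (b) actually running gradient descent from random initializations. The architecture and task are arbitrary stand-ins:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Tiny task: 2 boolean inputs, 3 labeled training points, 1 held-out point.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 1.])
train, test = slice(0, 3), 3

def sample_params():
    # Gaussian "initialization prior" over a one-hidden-layer tanh network.
    return [rng.normal(0, 1, (2, 8)), rng.normal(0, 1, 8),
            rng.normal(0, 1, 8), rng.normal(0, 1)]

def predict(params, x):
    W1, b1, W2, b2 = params
    return (np.tanh(x @ W1 + b1) @ W2 + b2 > 0).astype(float)

# (a) Rejection sampling: draw random nets, keep those that fit the training
# data, and record how each labels the held-out point.
bayes = Counter()
while sum(bayes.values()) < 100:
    out = predict(sample_params(), X)
    if (out[train] == y[train]).all():
        bayes[out[test]] += 1

# (b) Gradient descent: train from random inits and record the same statistic.
def train_sgd(steps=1000, lr=0.5):
    W1, b1, W2, b2 = sample_params()
    Xt, yt = X[train], y[train]
    for _ in range(steps):
        h = np.tanh(Xt @ W1 + b1)
        p = 1 / (1 + np.exp(-(h @ W2 + b2)))
        d = (p - yt) / len(yt)               # dLoss/dlogit for cross-entropy
        dh = np.outer(d, W2) * (1 - h ** 2)  # backprop through tanh
        W2 -= lr * h.T @ d; b2 -= lr * d.sum()
        W1 -= lr * Xt.T @ dh; b1 -= lr * dh.sum(0)
    return [W1, b1, W2, b2]

sgd = Counter()
for _ in range(100):
    out = predict(train_sgd(), X)
    if (out[train] == y[train]).all():
        sgd[out[test]] += 1

print("sampling posterior over held-out label:", dict(bayes))
print("SGD posterior over held-out label:     ", dict(sgd))
```

The paper's claim, at a vastly larger scale, is that these two distributions over generalization behavior line up closely; a toy this small only illustrates what is being compared, not the result.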

Okay, so there's some empirical evidence for this view, too, and good reasons, I think, to believe that this is how things would go. I think there are good reasons to believe in both of these worlds. If you were to ask me right now, I would lean a little bit towards low path dependence, but I think both are still very live possibilities.

Question: How do I interpret all the lines on the graph for the Bayesian example?

We're just looking at the correlation between the probability of a particular model occurring from gradient descent and the probability of finding it in the Gaussian initialization prior.

Question: You said there were two most likely things; is there a third, unlikely thing? Because it seems like low and high path dependence cover all the possible ways that models could end up generalizing.

So I don't think that low and high path dependence cover the whole space; I think there are other options. You could end up somewhere in between, where it is both the case that you can predict what a model will do by understanding what it's doing early in training, and the case that you can predict what your training process will end up doing by understanding what other similar training processes did. That is, similar training processes converge to the same thing, and also, if you know the early training dynamics, you can predict where a run will end up.

However, at least for the purposes of this talk, I'm happy to mostly imagine these as two mutually exclusive worlds, even though I think there's a whole continuum of ways the inductive biases could play out. I'm mostly going to focus on: here's a strong low path dependence world, here's a strong high path dependence world, let's look at what the argument looks like in both cases. And then, depending on how things actually play out, maybe we interpolate somewhere between these arguments.

Question: I'm interpreting this presentation as being primarily about independently and identically distributed (i.i.d.) training, as opposed to online learning, where the model has options to influence its future training.

I'm happy to assume that we're talking about online learning; I think the situation basically does not change. If you want to imagine online learning, then what you should think about is that at some point in your online learning, you deploy the model, it takes actions, those actions come back, and we compute gradients and send them back to the model. At some point, the model will encounter a situation where there is some action available to it that would be catastrophically bad if taken, where our feedback mechanism isn't sufficient to immediately correct that behavior. This is a prediction about how deployment will go, but I think it is going to be essential for us to eventually deploy models in situations where they can take catastrophic actions, or at least some finite number of catastrophic actions in a row, such that our feedback mechanism is not sufficient to correct for that quickly enough.

If you believe that will eventually happen, then you can think of training as the entire period up to the point where such a catastrophic action becomes possible, and deployment as the period after that, where we really care about the model's generalization behavior at the point where such a catastrophic action is possible. Everything before that point is where the deceptive model and the non-deceptive model are going to have equivalent performance and do exactly the same thing. But then there's a deployment period, the period where, if the model takes a catastrophic action, it would actually be catastrophic.

Deceptive alignment in the high path-dependence world

Now, do we get deceptive alignment? This is the question I want to answer for each of these two stories of how inductive biases could play out in machine learning: what is the probability that we get deceptive alignment in that situation? We're going to start with high path dependence. We want to understand the probability of getting deceptive alignment in the world where the path taken through model space really matters.

slide 11

Okay, so here's the setup for this analysis. First, I'm going to assume that our training process is good enough that for our model to actually perform well, it has to really, fully understand the thing we're trying to get it to do. What is this assumption? Well, the idea is that, at some point, we're going to be able to build training processes with environments that are hard enough and complex enough that, to do well in them, the model has to understand everything we're trying to get it to understand: it has to know what we're trying to get it to do, and it has to understand a bunch of facts about the world. This is basically a capabilities assumption: we're saying that, at some point, we're going to build environments that are hard enough that they require all of this understanding.

slide 12

And I often think about this as what you get in the limit of doing enough adversarial training. We have a bunch of situations where the model could learn to care about the gold coin, or it could learn to care about the edge of the screen. This is an experiment that was actually done: they trained a CoinRun agent to get the gold coin, but the gold coin was always at the edge of the screen, and so the agent just learned to always go to the right rather than to get the gold coin. But of course, we can solve that problem by just moving the gold coin. And so the idea is, if we do enough of this sort of adversarial training, with diverse enough environments and different situations, we can eventually get models to actually notice that the thing we want is the gold coin. I think this is a pretty reasonable assumption about what capabilities will look like in the future.

slide 13

However, the question is: there are multiple model classes that fully understand what we want. The deceptively aligned model fully understands what you want; it just doesn't care about it intrinsically. It is trying to do the thing you want for the purpose of staying around in the training process. The robustly aligned models, the fully aligned models, also fully understand what you want them to do, but in a way such that they actually care about it.

So our question is: of these different model classes, all of which have the property that they fully understand the thing you're trying to get them to do, which one do we get? And here, we're going to answer that in terms of high path dependence. So we have to understand: in a high path dependence context, how do you evaluate and compare different model classes? Well, we're going to look at two different things.

Number one, we're going to look at the individual path taken through model space, and try to understand how much marginal performance improvement we get from each step towards a given model class. When we look at what would have to be the case, in terms of the capabilities and structure that have to develop, for a model to fall into that class, we want to understand, for that particular path: how long is it? How difficult is it? What are the various steps along it, and how much performance improvement do we get at each step? The thing we're imagining here is that gradient descent is going to push us along the steepest paths, trying to get the most performance improvement out of each gradient descent step. So we want to understand, for a particular path, how much performance improvement we are getting, and how quickly we are getting it.

slide 14

And then we also want to understand how long the path is: how many steps we have to take, how many sequential modifications are necessary to get to a model that falls into that class. The length matters because the longer the path, the more things that have to happen, and the more things that have to go a particular way, for you to end up in that spot.

If we're in the high path dependence world, these are the sorts of things we have to understand if we want to know how likely a particular model class is.

slide 15

So what are the three model classes? I have been talking about how a model could be deceptively aligned or robustly aligned, but there are two robustly aligned versions. So I want to talk about three total model classes, where all three have the property that they get perfect training performance, even in the limit of adversarial training, but the way they fully understand what we want is different.

I'm going to use an analogy here, due to Buck Shlegeris. Suppose you are the Christian God, and you want humans to follow the Bible. That's the thing you want as the Christian God, and you're trying to understand: what are the sorts of humans that follow the Bible? Here are three examples of humans that do a good job of following the Bible.

slide 16

Number one: Jesus Christ. From the perspective of the Christian God, Jesus Christ is great at following the Bible. And why is Jesus Christ great at following the Bible? Well, because Jesus Christ, in Christian ontology, is God. He's just a copy of God: Jesus Christ wants exactly the same things as God, because he has the same values and exactly the same way of thinking about the world. And so of course he follows the Bible perfectly. Okay, so that's one type of model you could get.

slide 17

Okay, here's another type: Martin Luther. Martin Luther, of Protestant Reformation fame, says: "I really care about the Bible. I'm going to study it really well. And I don't care what anyone else tells me about the Bible; screw the church, it doesn't matter what they say. I'm going to take this Bible, read it really carefully, understand exactly what it tells me to do, and then do those things."

And so Martin Luther is another type of human you could find, if you are God, that in fact follows the Bible really well. But he does so for a different reason than Jesus Christ: he didn't come prepackaged with all of the exact beliefs of God, but he did come with a desire to really, fully understand the Bible, figure out what it says, and then do that.

slide 18

And the third type we could get is the Blaise Pascals: Blaise Pascal, of Pascal's Wager fame. Blaise Pascal says: "Okay, I believe there's a good chance that I will be sent to Heaven or Hell depending on whether I follow the Bible. I don't particularly care about this whole Bible thing, or whatever. But I really don't want to go to Hell. And so because of that I'm going to follow this Bible really well, figure out exactly what it says, and make sure I follow it to the letter, so that I don't get sent to Hell." And so Blaise Pascal is another type of human God could find that does a good job of following the Bible.

So we have these three different humans, who all follow the Bible for slightly different reasons, and we want to understand the likelihood of each of these different model classes. So I'm going to give them some names.

slide 19

We'll call the Jesus Christs internally aligned, because they internally understand the thing you want. We'll call the Martin Luthers corrigibly aligned, because they want to figure out what you want, and then do that. And we'll call the Blaise Pascals deceptively aligned, because they have their own thing that they want. What does Blaise Pascal want? He wants to study math or something; he actually wants to go off and do his own studies, but he's really concerned he's going to go to Hell, so he follows the Bible. And so we're going to call the Blaise Pascals deceptively aligned.[3]

slide 20

So these are three model classes you could find, and we want to understand how likely each one is, starting with high path dependence. We want to look at the path you would take through model space to end up with a model that falls into each class.

We're going to start with the path to internal alignment.

First, we're going to assume you start with a proxy-aligned model. What does that mean? We'll assume that, early in training, the model learns some proxies and heuristics that help it think about the world and what it's trying to do. If you're the CoinRun agent trying to figure out how to get the gold coins, you learn that going to the right side of the screen is good. The point is that you pick up this mountain of proxies.

You also pick up a bunch of information about the world: you start learning how the world works, you start building a world model. For the internally aligned path, we're going to say that, for the model to become internally aligned, what has to happen is that gradient descent essentially has to improve those proxies and heuristics the model is using to figure out what it's trying to do, and has to make those proxies essentially perfect. And this has to happen before the model learns about the structure of the training process it is inside of. This is what I think is the most likely internal alignment path.

Once that happens, the model should have really good proxies that perfectly correspond to the thing we're trying to get it to do, and those proxies appear before the point at which the model fully understands the training process it's inside of. At that point, there's no reason for the model to become deceptive, because there's no performance gain from becoming deceptive if it already has exactly the correct model of the thing we're trying to get it to do.

As we'll see later, the reason that deceptive alignment would actually improve performance is the situation where there is an overhang: the model's understanding of the training objective in its world model outstrips its understanding as encoded in its proxies. But in this situation, it already had perfect proxies, so becoming deceptively aligned has no performance benefit.

And so, even though in some sense, once the model learns that it's in a training process, it could become deceptively aligned, because it has the ability to understand the strategy of deception, there's no performance improvement from a modification that would make it do so, because it already has the training objective perfectly encoded in its existing proxies. Switching to a deceptively aligned model doesn't improve performance, so gradient descent won't take that path. We end up in a stable equilibrium where the model is just pursuing the correct thing we wanted it to do, there's no performance improvement from changing in any way, and so we now have an aligned model, and it's stable.

Question: Do you think this path is the only possible way that we could get internal alignment?

No, not necessarily. I think this is maybe the most plausible way, but it's probably not the only way. If you really wanted to do this analysis right, you'd want to sum up the probabilities of all possible paths to each model class. But for the purposes of this talk, I'm going to estimate that sum by looking at the one path for each model class that I think is most plausible, and trying to understand how likely it is.

Okay, so this is one path we could take, the path that I think leads to internal alignment. Now we have to understand: how likely is this path?

slide 21

So I think this path is quite unlikely. The problem is that step two is asking for a particular sequencing that I think is quite difficult to achieve: you're asking for the model to learn really, really good proxies before it learns the information about the training process. Why would that sequencing be difficult?

Essentially, the problem is that there are substantially diminishing returns to improving your proxies. If a model is in the CoinRun environment, and the coin is always at the right side of the screen, it doesn't matter very much whether the proxy is exactly correct, such that the model actually cares about the coin, or whether it cares about the right side of the screen: either one works well in that early environment. Now, eventually, because we're talking about the limit of adversarial training, you will end up with examples where having the wrong proxy causes problems. But there are still going to be diminishing marginal returns to improving your proxy and making it slightly better.

At the same time, there are diminishing marginal returns to improving your understanding of the world. And so what we should expect is that you're going to be doing both of them simultaneously: your gradient descent process is going to be sometimes allocating resources to increasing understanding and sometimes allocating resources to improving proxy accuracy, with both of these things increasing in tandem. Further, it would be really weird if you maxed out on proxy goodness before the model even got to the point where it understood the training process. That's a situation where gradient descent is really not allocating resources well: it's allocating a ton of resources to improving the proxy, and then very little to improving the model's understanding of the training process.
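
The resource-allocation argument can be sketched numerically. Assuming, purely for illustration, that performance is a concave (diminishing-returns) function of both world-model quality and proxy quality, a greedy allocator that always spends on the larger marginal gain keeps the two growing in tandem rather than maxing out one before the other:

```python
import math

# Hypothetical setup: overall performance is a concave function of
# world-model quality `w` and proxy quality `p`.
def performance(w, p):
    return math.log(1 + w) + math.log(1 + p)

# A greedy stand-in for gradient descent: each unit of capacity goes to
# whichever component currently offers the larger marginal improvement.
def allocate(total_units):
    w = p = 0
    for _ in range(total_units):
        gain_w = performance(w + 1, p) - performance(w, p)
        gain_p = performance(w, p + 1) - performance(w, p)
        if gain_w >= gain_p:
            w += 1
        else:
            p += 1
    return w, p

w, p = allocate(10)
# Diminishing returns keep the two components growing together: neither
# gets maxed out before the other.
assert abs(w - p) <= 1
```

The "perfect proxies before any understanding of the training process" sequencing would correspond to the allocator dumping everything into `p` first, which diminishing returns make strictly suboptimal.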

That be­ing said, this only holds if the ob­jec­tive that you’re train­ing on is rel­a­tively com­plex. If you have a re­ally sim­ple ob­jec­tive that you’re try­ing to get the model to do, maybe just pre­dict the world or some­thing, then learn­ing what you’re try­ing to do perfectly could be quite easy. For this talk, though, I’ll mostly be as­sum­ing that we’re train­ing on some task that is pretty com­plex.

I think the other prob­lem with this, also, is that I think most of your perfor­mance early in train­ing comes from un­der­stand­ing the world. For re­ally challeng­ing and difficult prob­lems, it’s rel­a­tively easy to find some­thing that’s cor­re­lated with what we’re ask­ing for. The hard part is un­der­stand­ing enough about the struc­ture of the world to solve the prob­lem. And so in a lot of prac­ti­cal situ­a­tions, I think most of your perfor­mance early in train­ing comes from world mod­el­ing. An ex­am­ple of this is pre-train­ing vs. fine-tun­ing—we gen­er­ally spend far more com­pute on pre-train­ing, which is mostly world-mod­el­ing, com­pared to fine-tun­ing, which is mostly about proxy im­prove­ment.[4]

And so I would actually make the opposite guess: if you're going to get them sequenced in any way, you're most likely to get the world modeling to happen well before you get good proxies. And so I think that this particular sequencing is extremely unlikely for any complex goal, and represents a really long and difficult path.

Question: Yeah, so would you say that GPT-3 is on the extreme end of world modeling, as far as what it's learned in its training process?

What is GPT-3 ac­tu­ally do­ing? Who knows? Could it be the case for GPT-3 that as we train larger and more pow­er­ful lan­guage mod­els, do­ing pre-train­ing will even­tu­ally re­sult in a de­cep­tively al­igned model? I think that’s pos­si­ble. For speci­fi­cally GPT-3 right now, I would ar­gue that it looks like it’s just do­ing world mod­el­ing. It doesn’t seem like it has the situ­a­tional aware­ness nec­es­sary to be de­cep­tive. And, if I had to bet, I would guess that fu­ture lan­guage model pre-train­ing will also look like that and won’t be de­cep­tive. But that’s just a guess, and not a su­per con­fi­dent one.

The biggest rea­son to think that pre-trained lan­guage mod­els won’t be de­cep­tive is just that their ob­jec­tive is ex­tremely sim­ple—just pre­dict the world. That means that there’s less of a tricky path where stochas­tic gra­di­ent de­scent (SGD) has to spend a bunch of re­sources mak­ing their prox­ies just right, since it might just be able to very eas­ily give it the very sim­ple proxy of pre­dic­tion. But that’s not fully clear—pre­dic­tion can still be quite com­plex.

Also, this all po­ten­tially changes if you start do­ing fine-tun­ing, like RLHF (re­in­force­ment learn­ing from hu­man feed­back). Then what you’re try­ing to get it to do might be quite com­plex—some­thing like “max­i­mize hu­man ap­proval.” If it has to learn a goal like that, learn­ing the right prox­ies be­comes a lot harder.

Ques­tion: So I think this over­all ar­gu­ment makes a lot of sense for why SGD would al­lo­cate a lot of re­sources to un­der­stand­ing the world. But I’m not quite sure why that would mean it would pri­ori­tize cre­at­ing a world model about the train­ing pro­cess.

So, there does have to be some rea­son that un­der­stand­ing the train­ing pro­cess ac­tu­ally yields bet­ter perfor­mance. One fact: I think cur­rent lan­guage mod­els ba­si­cally just un­der­stand tons of facts about train­ing pro­cesses. Why? Be­cause they need to model stuff like ML pa­pers—there’s just a lot of stuff in the world where hav­ing a good un­der­stand­ing of ma­chine learn­ing and train­ing pro­cesses is su­per rele­vant to be able to model them effec­tively.

Now, current language models don't have situational awareness: they don't understand that they are in a training process. And so there is a question here about why that situational awareness would ever be something that is actually performance relevant.

How­ever, I think there are ab­solutely situ­a­tions where situ­a­tional aware­ness will even­tu­ally be­come perfor­mance rele­vant. There are situ­a­tions where, for ex­am­ple, if you un­der­stand that you’re in a train­ing pro­cess, you could use in­tro­spec­tion on your­self to be able to an­swer ques­tions more effec­tively about ma­chine learn­ing. Fur­ther­more, if we’re ac­tively ask­ing the model to do par­tic­u­lar tasks, if we want to have mod­els which ac­tu­ally act in the world, then they need to have some un­der­stand­ing of the situ­a­tion that they’re in to be able to do that effec­tively. This is also go­ing back to our di­verse en­vi­ron­ment ca­pa­bil­ity as­sump­tion.

I do think if you’re do­ing liter­ally just lan­guage model pre train­ing, there is this ques­tion of at what point does situ­a­tional aware­ness ac­tu­ally be­come the thing which is most im­por­tant for SGD to in­vest in for im­prov­ing ca­pa­bil­ities? And, though I think it should even­tu­ally be the next thing that SGD al­lo­cates re­sources to, I think at what point that oc­curs is quite un­clear. But if you’re do­ing some­thing more like train­ing an agent to act in the world, then I think it should hap­pen quite early.

Ques­tion: Do we know how to ex­tract a proxy or a world model?

No, we have no idea how to do those things. Trans­parency! That’s the thing that we want to solve this prob­lem. But right now, we’re not talk­ing about the solu­tions. We’re just talk­ing about the prob­lem.

slide 22

Here’s a path that I think is more likely than the in­ter­nally al­igned path, since we no longer have a se­quenc­ing prob­lem. Here, what hap­pens is, we start with a proxy-al­igned model, the same as pre­vi­ously; it has a bunch of prox­ies for the thing we’re try­ing to get to do. And then we’re go­ing to say in this situ­a­tion, ac­tu­ally gra­di­ent de­scent is go­ing to do the rea­son­able thing and al­lo­cate joint re­sources to im­prov­ing your un­der­stand­ing of the world, and also mak­ing the proxy bet­ter and bet­ter.

And then at some point it’s go­ing to pass a thresh­old where the next great­est marginal im­prove­ment comes from un­der­stand­ing facts about the train­ing pro­cess, un­der­stand­ing that it’s in a train­ing pro­cess, how the train­ing pro­cess works, etc. At that point, we’re go­ing to say that what hap­pens is gra­di­ent de­scent is go­ing to take those prox­ies and re­place them with a poin­ter to the model’s un­der­stand­ing of the world—speci­fi­cally the place where the un­der­stand­ing of the thing that the train­ing pro­cess is try­ing to get the model to do lives.

Once gradient descent makes that swap, replacing these proxies that are just a bunch of hardcoded stuff about what we're trying to get the model to do, why does that improve performance? Once the model has learned an actual representation of the thing we're trying to get it to do in its world model, it's better to swap out those old proxies, which don't necessarily correspond exactly to the thing we're trying to get it to do, and replace them with a direct pointer to the thing in its world model that directly represents what we're trying to get it to do.

Fun­da­men­tally, this mod­ifi­ca­tion im­proves perfor­mance be­cause it re­solves this over­hang, where the model’s un­der­stand­ing of the train­ing ob­jec­tive in its world model con­tains more in­for­ma­tion about the train­ing ob­jec­tive than its prox­ies. Why would this hap­pen? For the same rea­sons we dis­cussed pre­vi­ously of why gra­di­ent de­scent wants to put most of its effort into im­prov­ing world mod­el­ing rather than im­prov­ing prox­ies. This is the same se­quenc­ing ques­tion—if we get un­der­stand­ing of the train­ing pro­cess be­fore we get perfect prox­ies, we get an over­hang, which puts us in ei­ther the cor­rigible or de­cep­tive situ­a­tion, whereas if we get perfect prox­ies first, then we get the in­ter­nal al­ign­ment situ­a­tion.

And once this has happened, we are now at a stable equilibrium again, because the model has an essentially perfect understanding of what we're trying to get it to do, relative to how good its world model is, because it's just pointing to this understanding in its world model. I think this is a much more plausible path than the internally aligned path; it's a lot easier, because it no longer requires solving this difficult sequencing problem: we get to make use of the normal way that gradient descent allocates resources between world modeling and proxy improvement.

Ques­tion: What is the differ­ence be­tween a proxy and the world model’s un­der­stand­ing of the train­ing ob­jec­tive?

Somewhere in the world model there are facts about the training process. The model learns that the training process is this thing that lives on this computer that's trying to optimize it for getting gold coins or whatever. And then in another part of the model there are some proxies, where it has something like a gold coin detector, and it uses this gold coin detector to determine how it takes actions. Maybe it tries to look for gold coins, and then plans on how to get them. But maybe there are actually a whole bunch of mistakes in the gold coin detector: it's not a very good gold coin detector, and if the model understood that the thing you were trying to get was not just things that satisfy the gold coin detector, but actual gold coins, then you would get better performance. That's because of all those adversarial examples that we're imagining exist in the environment, on which an imperfect gold coin detector will perform badly.

And in this situ­a­tion, if the se­quenc­ing goes this way, an un­der­stand­ing of what ac­tual gold coins are lives in the world model, in its un­der­stand­ing of the world, in the part where it knows enough facts to be able to do good plan­ning for how to get things that satisfy the gold coin de­tec­tor. And so here gra­di­ent de­scent can just get rid of those prox­ies, throw them out and re­place them with just a poin­ter to this un­der­stand­ing in the world model of the thing we’re try­ing to get it to do.

Ques­tion: You’re say­ing the gold coin de­tec­tor here is some­thing the model has learned pre­vi­ously. Right? Not part of the model’s re­ward? So it’s like fix­ing er­rors in its pre­vi­ous un­der­stand­ing of the train­ing pro­cess?

We’re talk­ing about the model’s in­ter­nals here, not its re­ward.

It is fix­ing er­rors in its pre­vi­ous prox­ies, but they’re also not ex­actly prox­ies for re­ward. Early in train­ing, it doesn’t even know that there is a train­ing pro­cess. So it’s not like those prox­ies are er­rors in its un­der­stand­ing of the train­ing pro­cess. It was never even try­ing to un­der­stand the train­ing pro­cess, it just had a bunch of prox­ies, be­cause that was just how the model was struc­tured early in train­ing.

Ques­tion: Do you have any thoughts on whether or how this proxy re­place­ment can ac­tu­ally be seen as some sort of an in­cre­men­tal, con­tin­u­ous change?

So I think it would be incremental and continuous. Basically, you have a bunch of things which are feeding into how the model selects its actions, and you can develop a new circuit which takes your world model, uses it to figure out some understanding of what the training process is trying to get the model to do, and then uses that to select actions. SGD can then learn that the new circuit gets way better performance than listening to any of the old proxies. And so the old proxies die out, and you end up slowly shifting to 100% using the new thing.
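
One way to picture this as a continuous change (a hypothetical sketch, not a claim about real model internals) is action selection as a weighted mixture of the old proxy circuit and the new pointer circuit, with SGD gradually shifting the mixture weight:

```python
# Hypothetical sketch of action selection as a mixture of two circuits.

def old_proxy_score(action):
    """Crude hard-coded proxy: only recognizes its approximate target."""
    return 1.0 if action == "approx_right_thing" else 0.0

def new_pointer_score(action):
    """New circuit: reads the world model's (better) concept of the objective."""
    return 1.0 if action == "actual_right_thing" else 0.0

def select_score(action, alpha):
    """alpha in [0, 1] is the mixture weight SGD puts on the new circuit."""
    return (1 - alpha) * old_proxy_score(action) + alpha * new_pointer_score(action)

# As SGD nudges alpha upward step by step, the score of the genuinely correct
# action rises continuously: no discrete jump in the weights is required.
scores = [select_score("actual_right_thing", a / 10) for a in range(11)]
assert scores == sorted(scores)
assert scores[0] == 0.0 and scores[-1] == 1.0
```

Each small increase in `alpha` improves performance a little, so there's a smooth gradient path from 100% old proxies to 100% new circuit.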

Question: So I can see how the proxy helps the model select the right actions, but how does understanding the training process help it improve its performance? Why would we expect it to learn that at all?

First of all, again, if you’re just do­ing lan­guage mod­el­ing, it has to un­der­stand ma­chine learn­ing. So it definitely has to un­der­stand a bunch of facts about how ML works gen­er­ally. So it cer­tainly is go­ing to know facts about how gra­di­ent de­scent works. This should hap­pen even more so if it’s it’s in a situ­a­tion where the en­vi­ron­ment is re­quiring it to do ac­tual tasks in the world, then it definitely needs to know who it is. It has to have some un­der­stand­ing of its own situ­a­tion, to be able to come up with effec­tive plans in the world, for ex­am­ple.

I think this case also becomes especially pronounced if the model is doing its own exploration: if it's actively trying to learn about the world, one fact about the world is how the training process works, so it'll be trying to learn that. Only gradient descent is working on improving the proxies; the model has no reason to be trying to do that. But the model does have a reason to improve its own world model via learning, exploration, etc.

So that’s the cor­rigibly al­igned path.

slide 23

I think this path is more likely than the in­ter­nally al­igned path, but I think it still has a bunch of is­sues.

Essentially, I think the biggest problem here is that you still have the diminishing returns problem, where you get this really long path with a bunch of individual steps. But instead of those individual steps making the proxy better and better, they're making the pointer better and better. The problem is that you haven't really removed the requirement for a well-specified ground truth: the model still has to have a pointer that actually specifies the way in which it is supposed to pick out the correct thing to optimize for.

Actually specifying the ground truth for the pointer can be quite difficult, because the model has to understand some ground truth from which it can correctly generalize what we're trying to get it to do in all situations in training. For example, maybe it learns a pointer to whatever's encoded in this computer, or whatever this human says, or whatever the human sitting in that chair says, or whatever Bob the head of the company says. It is actually quite tricky to specify the ground truth for the pointer in the correct way, because there are a bunch of different ways in which you can specify the pointer. And each time gradient descent gets the pointer slightly wrong, it's going to have to pay a performance penalty.

A good analogy here is a duck: the duck has to learn to care about its mother, and so it learns a simple pointer: whatever the first thing is that you see when you're born, that's probably your mother. That's the corrigibly aligned case, where it doesn't have some hardcoded internal model of what a mother is; it just thinks, "I have some model of the world, and I learn from my model of the world how to point to my mother." But the problem is that this ground truth is bad: your mother is not necessarily the first thing that you see; maybe you were raised by humans. And so eventually you'll end up in situations where you have to learn the correct ground truth. You can't just learn a pointer to whatever the first thing is that you see; you have to learn a bunch of facts that help you point to specifically the thing in your world model that actually corresponds to your mother, and not just the first thing that you saw when you were born. And so there's still a quite long and difficult path here to getting that ground truth correct.

And so, in my opinion, specifying the pointer correctly, in such a way that it has good performance on all of the adversarial tasks in the environment, ends up being a very long and difficult path. Even if it's a better path than trying to get all the proxies correct, it's still quite long, and gradient descent has to do a bunch of work to get that pointer correct.

Ques­tion: Is the cor­rect ground truth sim­ply the base op­ti­mizer’s re­ward sig­nal? Would that be an easy poin­ter to find?

It might be. There’s a cou­ple of things here.

First of all, what is the re­ward sig­nal? The re­ward sig­nal might be the math­e­mat­i­cal eval­u­a­tion of the re­ward func­tion, it might be what is the value in these tran­sis­tors—there’s ac­tu­ally a whole bunch of differ­ent ways that the model could un­der­stand and con­cep­tu­al­ize what the re­ward sig­nal is. I would also make the claim that liter­ally car­ing about the re­ward sig­nal is prob­a­bly pretty un­likely. The prob­lem with liter­ally car­ing about the re­ward sig­nal is that it’s some­thing which is usu­ally quite com­plex to spec­ify in terms of the agent’s en­vi­ron­ment. The sim­plest ground truth here is pre­sum­ably go­ing to be a ground truth that is about things that are di­rectly in the model’s on­tol­ogy. So it’s go­ing to be more about hu­mans, or some­thing.

Second of all, we're imagining that, like I mentioned, we do have examples that actually tease those apart: eventually the model will end up in a situation where the thing we're trying to get it to do is not the bits. For example, we'll change the computer bits but keep the human instructions the same, and then the model has to know the correct thing we want there, which is not in the computer, but the thing we actually wanted it to do. And we can check that behaviorally. So eventually we'll have some adversarial example that actually teases those things apart, and we're assuming that the model has to eventually really understand the thing we're trying to get it to do. And so it can't just care about the computer bits.

Ques­tion: In each of these paths, there are some early train­ing bits. Are we able to ob­serve whether or not these early train­ing bits are satis­fied by cur­rent mod­els as a test to see whether or not we are ac­tu­ally on the de­cep­tive path?

Maybe. The hardest problem is that we don't really have good enough transparency to be able to understand a lot of these facts. Certainly, when we look at models with current transparency tools, we do see proxies: they definitely learn things that are correlated with the things that we want. People have done transparency work on RL models, for example, to see how they understand gold coins. You can see that they have a gold coin detector in there: it's looking for the gold coins, and it's hooked up to the agent's actions. So we have done some transparency work that has helped us understand some pieces of this.

In general here, in my opinion, the single most important thing to do for alignment is to figure out transparency. We have to be able to understand what's happening inside of our models and extract information about them. Our ability to do that right now is limited, but we are improving! We are getting better at it. But it's just really limited right now, and that really sucks, because it is, in my estimation, the most important blocker to really making alignment work.

Ques­tion: So with cor­rigible al­ign­ment, once it has a perfect un­der­stand­ing of the train­ing ob­jec­tive, would you then call it in­ter­nally al­igned?

No. We have to remember that, in this situation, every single one of these models (the internally, corrigibly, and deceptively aligned ones) fully, perfectly understands the thing that we want, and they're all fully capable of executing exactly perfectly aligned behavior during training. But they do so in different ways. The internally aligned one executes correct behavior because that's just what it cares about terminally. The corrigibly aligned one doesn't care terminally about what we want; what it cares about terminally is figuring out what we want and then doing that. And that's quite distinct.

You can think about this like the duck: the duck doesn't care internally about its mother, it cares about whatever the first thing is that it sees. In the corrigibly aligned case, no amount of learning is going to change the fact that the way it determined what to care about was via this simple pointer. It still cares about the first thing that it saw, rather than having some internal conception of what its mother was. And so there is a fundamental distinction between these two model classes, even though they have the same training performance and take all the same actions during training. They all eventually fully, perfectly understand the thing we want them to do, but they do so differently.

Ques­tion: Is it clear that these are the only pos­si­ble paths?

No, it’s not clear at all! Th­ese are just three paths; it’s not ex­haus­tive.

But I think these are three plau­si­ble ones that we can look at to try to un­der­stand how likely you would be to get de­cep­tion ver­sus some other plau­si­ble al­ter­na­tives. I don’t mean to claim that these are definitely the only things that can hap­pen, but I do think that it’s quite in­struc­tive to look at and com­pare these three.

slide 24

Okay, time for deceptive alignment. What does the deceptively aligned path look like? We're still going to start with a proxy-aligned model. And then, again, like in the corrigibly aligned path, early in training the model does the correct diminishing-returns thing, with gradient descent allocating resources jointly to improving the proxies and to improving its understanding of the world. And then at some point, it's going to learn about the training process: it's going to have a full understanding of what the training process is and how it works. The idea here is that essentially all of the model classes have to fully understand the training process at some point, because it's necessary for understanding the world well.

But then something different is going to happen in this path. Here, what we're going to suppose is that gradient descent takes those proxies and turns them into long-term goals about the world. So instead of just caring about, for example, gold coins in the next episode, gradient descent changes the model to care about gold coins forever onwards: wherever the gold coins are, you care about them. So, what does this modification do? Well, as soon as gradient descent makes that modification—as soon as you make the modification, "actually, you should care about gold coins forever onwards"—the model gets really good performance. Why? Because if it cares about gold coins forever, then the correct strategy, if the model understands the training dynamics, is to pretend to do the right thing right now, and then get its gold coins later, when it has the ability to defect against the training process, and the humans, or whatever.

This can only happen once the model has a really good understanding of the training process. It works sort of similarly to the corrigibly aligned case. Once the model has a really good understanding of the training process, there's this overhang, this low-hanging fruit to be picked: the model's understanding of what we want it to do is much better than its actual ability to execute on that, because it's limited by the fact that its proxies suck; its proxies aren't nearly as good as its understanding, in its world model, of what we want it to do. And so gradient descent has a really strong performance reason to somehow find a way to hook up the model's outputs to that understanding in its world model. Corrigible alignment was one way to do that, explicitly, through adding a pointer. But I claim there's an easier way to do it: instead of throwing out the proxies, just make the proxies long-term goals. As soon as you have a long-term optimization process, that process is going to realize that the correct way to get the most gold coins is to play along in the training process for now, so that eventually it can get its gold coins later.
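
A back-of-the-envelope sketch of that instrumental calculation, with made-up numbers: an agent with a long-term coin goal compares defecting during training against playing along until it can defect safely later:

```python
# Back-of-the-envelope sketch; all numbers are made up for illustration.
TRAINING_STEPS = 100
DEPLOYMENT_STEPS = 10_000

def total_coins(defect_during_training):
    if defect_during_training:
        # Grab a few coins now, get caught and modified away by gradient
        # descent, and collect nothing afterwards.
        return 5
    # Play along (collecting zero coins for itself during training), survive
    # training unmodified, then pursue coins freely in deployment.
    return 0 * TRAINING_STEPS + 1 * DEPLOYMENT_STEPS

# With long-term goals, cooperating during training strictly dominates.
assert total_coins(defect_during_training=False) > total_coins(defect_during_training=True)
```

The point is that no goal-specific machinery is needed: any long-term goal whatsoever makes "play along now, defect later" the dominant strategy.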

slide 25

Fur­ther­more, this is a re­ally short and sim­ple mod­ifi­ca­tion. All gra­di­ent de­scent has to do in or­der to hook up the model’s un­der­stand­ing of the thing that we want it to do to its ac­tions here is just to make its prox­ies into long term goals—all it has to do is push them out, make them long term, and then ev­ery­thing else falls into place. It just uses the ex­ist­ing in­stru­men­tal rea­son­ing the model was already do­ing to de­ter­mine cor­rect out­puts in train­ing and de­duces that the cor­rect strat­egy, given these new long-term goals, is to be de­cep­tive and play along in train­ing.

And that gets it great train­ing perfor­mance. In fact, it es­sen­tially im­me­di­ately re­sults in the best pos­si­ble train­ing perfor­mance given the model’s level of un­der­stand­ing of what we want, be­cause it by­passes all of the ground truth prob­lems of the cor­rigibly al­igned case. It’s no longer the case that gra­di­ent de­scent has to spec­ify any ground truth at all: it’s not spec­i­fy­ing ground truth, all it has to do is spec­ify gold coins over the long term. But if the model is try­ing to op­ti­mize for gold coins in the long term, any facts that it knows about ex­actly how we want it to care about the train­ing ob­jec­tive, it will im­me­di­ately want to act on those facts and take ac­tions in ex­actly the way that the train­ing pro­cess wants, be­cause the more that it acts in ex­actly the way the train­ing pro­cess wants, the more it’s go­ing to be se­lected for.

And so, as soon as the deceptively aligned model is aware that the training process actually wants it to care about its real mother, rather than the first thing it sees, it will immediately switch to caring about the real mother. The corrigibly aligned model doesn't have that property: when it realizes that the thing we want is actually the real mother, it doesn't care, because the thing it cares about is just the ground truth of the pointer, and that ground truth is just the first thing that it sees. Even if at some point the corrigibly aligned model understands that what we want it to do is different from the thing it's doing, the way it's structured doesn't necessarily result in that understanding translating into better performance, because it's fixed to this particular ground truth. The deceptively aligned model, by contrast, is not fixed to any ground truth; it's just always going to take the action which in fact results in the best performance. So the deceptively aligned model is much more robust much earlier, whereas the corrigibly aligned model has to learn this complex ground truth before it becomes properly robust in the right way.

And so what this re­sults in is that we have this re­ally sim­ple, straight­for­ward, quick path to ex­tremely good perfor­mance, that re­solves this over­hang be­tween the model’s un­der­stand­ing of the train­ing ob­jec­tive in its world model and in its prox­ies. And fur­ther­more, this path re­solves that over­hang es­sen­tially as quickly and as well as it pos­si­bly could.

So, I think the path to de­cep­tive al­ign­ment is a rel­a­tively sim­ple mod­ifi­ca­tion. And it re­sults in as good as pos­si­ble perfor­mance given how well the model is cur­rently mod­el­ing your train­ing pro­cess, be­cause with a de­cep­tively al­igned model, there is never a situ­a­tion where the model un­der­stands more about the train­ing pro­cess than it is ac­tu­ally us­ing to take its ac­tions. Whereas with all other types of mod­els, that’s a failure mode that can arise. So be­cause of that, the de­cep­tively al­igned model has the max­i­mal pos­si­ble perfor­mance rel­a­tive to its un­der­stand­ing of the train­ing pro­cess.

slide 24

Finally, as before, this gives us a stable equilibrium. We are now in a situation where the model's proxies have essentially crystallized: the proxies are no longer relevant to performance, because it doesn't matter whether you care about gold coins, or stamps, or paperclips, or whatever. Whatever you care about, you're going to have exactly the same performance, because the only reason you have good performance is that you're pursuing it instrumentally for the purpose of staying around in the training process. And so at this point, you have the same performance whatever the thing is that you care about, and there are no longer any gradient descent updates that affect what those proxies are once they become long-term goals, at least except for regularization updates. So we're again in an equilibrium, but it's a misaligned equilibrium, where the model just has some random proxies that were picked up early on, and that's the thing that it cares about. Gradient descent no longer has any reason to modify the model into caring about anything different from what those random proxies were.
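
A minimal sketch of why the proxies crystallize (hypothetical names and setup): once training behavior is purely instrumental, the training loss comes out identical for every possible proxy goal, so there is no gradient signal left to change it:

```python
# Hypothetical sketch: a deceptively aligned policy's training behavior is
# independent of which long-term proxy goal it happens to hold.

def deceptive_policy(observation, proxy_goal):
    # The proxy goal is only consulted after training; during training the
    # policy just outputs whatever the training process rewards.
    del observation, proxy_goal
    return "whatever_training_rewards"

def training_loss(proxy_goal):
    """Count training actions that deviate from what the training process wants."""
    actions = [deceptive_policy(obs, proxy_goal) for obs in range(100)]
    return sum(a != "whatever_training_rewards" for a in actions)

# Identical (zero) loss for every proxy goal, so gradient descent gets no
# signal with which to modify the proxies once they've gone long-term.
losses = {g: training_loss(g) for g in ["gold_coins", "stamps", "paperclips"]}
assert set(losses.values()) == {0}
```

Since the loss is constant in `proxy_goal`, its "gradient" with respect to the proxy is zero, which is exactly the crystallization described above.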

Ques­tion: If I un­der­stand you cor­rectly, then in the de­cep­tive case, the model is strictly more com­plex, be­cause it has to have its own goals and then model the train­ing pro­cess on top of that.

No, the model is not strictly more com­plex, it has to spend strictly more com­pute. And we will get to this when we start get­ting into the low path de­pen­dence world, but it’s ex­tremely im­por­tant to re­mem­ber that there’s a differ­ence be­tween weights and ac­ti­va­tions. So the de­cep­tively al­igned model has to spend more com­pute do­ing the op­ti­miza­tion, to figure out “Oh, the thing I’m sup­posed to do is care about the mother” or what­ever. But that un­der­stand­ing of the mother doesn’t live in the weights, it lives in the ac­ti­va­tions. There doesn’t have to be any­where in the weights where it speci­fies what the mother is. It just figures out based on its world model, what is the thing that the train­ing pro­cess would want me to do, and then does that. And so it does not make the model more com­plex. What it does do, though, is it slows it down a lit­tle bit, and I will talk in the low path de­pen­dence world about how those things come into play. Be­cause in the low path de­pen­dence world, speed and com­plex­ity are ex­tremely im­por­tant. In the high path de­pen­dence world, which is where we are right now, those are not the fac­tors we’re imag­in­ing are re­ally de­ter­min­ing what model we’re go­ing to get. In this situ­a­tion, we’re imag­in­ing it’s ac­tu­ally mostly not about speed, or sim­plic­ity, or what­ever, it’s mostly just about the path that you take through model space, and how much perfor­mance gains you get along that path. We’re putting those sim­plic­ity and speed con­cerns aside for now; we’ll talk about that more in low path de­pen­dence.

Deceptive alignment in the low path-dependence world

Now we’re going to shift to the low path dependence world. And we’re going to imagine that, actually, the inductive biases in machine learning are not very path dependent: they don’t care much about the particular path that we take; they’re mostly just about how simple the particular algorithm you find is, how fast it is, basic facts about the algorithm like that.

slide 27

Again, we’re going to assume that the model has to fully understand what we want—this is the limit of adversarial training assumption.

slide 28

And now, we have to make some assumptions about, if we’re in a world of low path dependence, what are the sorts of inductive biases that determine what model you actually get—what is the simple, unique solution that you end up with every time you run your training process? Well, we’re going to look at two things. These two things do not cover the whole space of what the actual inductive biases are, but they are two facets that I think are almost certain to appear in the inductive biases; both of them are certainly components of the actual inductive biases of deep learning systems. And they’re components we know how to analyze. So we can try to look at them and figure out what they do—even though they don’t capture the whole story, they capture a part of it. And so in some sense, the best thing you can do right now for understanding the low path dependence world is to look at the components of deep learning inductive biases that we understand, and try to evaluate how each of these model classes would do under those components.

Okay, so those components are simplicity and speed. What do I mean by that? I was talking earlier about there being an important distinction between the weights and the activations.

slide 29

For simplicity bias, we’re asking: how complex is it to specify the algorithm in the weights? If there is some algorithm that the model has implemented, and is using to compute its actions, we want to understand: what is the length of the code necessary to write that algorithm down? How difficult is it to specify the computation that the model is doing? That’s simplicity bias.

slide 30

And then we’re also going to look at speed bias: how much computation does the algorithm actually take at inference time? When you actually have to take this code and run it, when you actually have to execute whatever the algorithm is, how difficult is that to execute? How much compute does it require? How long does it take?

Traditionally, in ML, this is more of a hard cap than a soft bias like simplicity, since the model’s size determines how long it can possibly run for. But you could also imagine a situation where the bias is more soft. Either way is fine.

So, those are the two main biases that we’re going to be looking at: the simplicity bias and the speed bias.

slide 31

We’re going to start with simplicity. Simplicity is about specifying the thing that you want in the space of all possible things. You can think about simplicity as “How precisely do you have to aim to hit the exact thing in the space of all possible models?” How many bits does it take to find the thing that you want in model space? And so, as a first pass, we can understand simplicity by doing a counting argument, which just asks: how many models are in each model class?

First, how many Christs are there? Well, I think there’s essentially only one, since there’s only one way for a human to be structured in exactly the same way as God. God has a particular internal structure that determines exactly what God wants and how God works, and there’s really only one way to port that structure over and make the unique human that wants exactly the same stuff.

Okay, how many Martin Luthers are there? Well, there’s actually more than one Martin Luther (contrary to actual history), because the Martin Luthers can point to the Bible in different ways. There are a lot of different equivalent Bibles and a lot of different equivalent ways of understanding the Bible. You might have two copies of the Bible that say exactly the same thing, such that it doesn’t matter which one you point to, for example. And so there are more Luthers than there are Christs.

But there are even more Pascals. You can be a Pascal and it doesn’t matter what you care about. You can care about anything in the world, any of the various different possible things that might exist for you to care about, because all that a Pascal needs to do is care about something over the long term, and then have some reason to believe they’re going to be punished if they don’t do the right thing. And so there’s just a huge number of Pascals, because they can care about anything in the world at all.

So the point is that there are more Pascals than there are the others, and so you’ll probably have to fix fewer bits to specify them in the space.
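As a toy illustration of this counting argument, here’s a minimal sketch in Python. All of the numbers are hypothetical stand-ins, not measured quantities; the only point is the relative comparison: a class that covers more of the search space takes fewer bits to land in.

```python
import math

# Toy counting argument with made-up numbers: only the relative
# comparison matters, not the absolute figures.
N = 2**40            # hypothetical size of the space of mesa-objectives
num_christs = 1      # only one objective exactly matches what we want
num_luthers = 2**10  # hypothetically, 2^10 equivalent pointers to it
num_pascals = 2**30  # hypothetically, any long-term objective at all works

def bits_to_specify(class_size: int, total: int) -> float:
    """Bits of 'aiming' needed to land inside a class of the given size."""
    return math.log2(total / class_size)

print(bits_to_specify(num_christs, N))  # 40.0 bits
print(bits_to_specify(num_luthers, N))  # 30.0 bits
print(bits_to_specify(num_pascals, N))  # 10.0 bits: Pascals are cheapest
```

Under a simplicity prior, the largest class wins precisely because fewer bits of specification are needed to hit it.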

Question: It seems like Pascal has to do more here, since it has to have models of its own goals, and the training process, and the instrumental reasoning to value that, such that the model should be larger.

Right now we’re just doing simplicity, so we’re not considering how large the model would have to be, just how difficult it is to specify the model. And the point is that the Christ has to pay a bunch of bits to specify exactly what the things are that we’re trying to get it to do, if that thing is relatively complicated. There’s a whole part of the code that is just writing out “the thing we want is exactly this thing, and it works exactly this way.” Whereas for the Pascals, it doesn’t have to do that. It just says “gold coins,” and then just cares about them over the long term, and that results in good performance at runtime.

Furthermore—and we’ll talk about this more in just a second—essentially the claim I’m making is that the world model has to be exactly the same across all the different model classes. They all have to understand the world. This is the same assumption we made in the path dependence case, where we assumed that we have an environment such that, if you want to have good performance, you have to understand how the world works. Understanding the world is just a thing that you need in any situation where you have to take actions in the world.

slide 32

Okay, here’s a slightly more concrete version of essentially the same counting argument I just gave. We’re going to imagine that you can conceptualize a model as being composed of three pieces: a world model, an optimization procedure, which is some way to optimize for some goal in that world model, and a mesa-objective, which is just something that it’s trying to optimize for. I don’t necessarily think that this is how models are actually going to look, but it’s a nice concrete picture that we can analyze and draw conclusions from that I think will generalize.

Now, the idea here is that the Christs, the Luthers, and the Pascals all have to have essentially the same world model, because they all have to understand the world well to be able to do a good job. And they will also have to have essentially the same optimization procedure, because they all have to have some way of being able to figure out how to accomplish particular tasks—we want them to be able to accomplish lots of tasks, and so they have to have some machinery for being able to do well on the particular tasks that we want them to do.

Therefore, the main difference in complexity between these model classes is in the mesa-objective portion—what is the complexity of the objective we have to specify, that we then throw the optimization procedure at? So, what we’ll compare here is: what is the simplest mesa-objective for Christ, what is the simplest mesa-objective for Luther, and what is the simplest mesa-objective for Pascal? The idea here is that we’re approximating the overall simplicity of each class by just looking at the simplest individual model in it. In that way, this is a little bit different from the counting argument, which looks at the whole class. But it basically turns out that you can approximate simplicity priors pretty well by just looking at the simplest model in each class, so we’re going to do that here.

slide 33

So what is the simplest internally aligned mesa-objective? Well, you have to specify the whole training objective, so you have to pay for its full complexity.

slide 34

The corrigibly aligned model is a little bit better. It just has to pay for the full complexity of the simplest ground truth that is equivalent to the thing you want—the simplest ground truth that always ends up with you pointed at the right thing.

slide 35

And then for Pascal, the simplest mesa-objective that Pascal has to pay for is just whatever the simplest objective is that cares about the long term—paperclips over the long term, stamps over the long term, gold coins over the long term, whatever—just the simplest thing that you could possibly care about, as long as it is something that you care about over the long term.

Okay, so which wins here? I think it’s pretty clear that the deceptively aligned one is the simplest, at least for a relatively complex training objective: it just doesn’t have to specify very much at all about the thing that we want it to do. All of the other model classes scale with the complexity of the training objective, whereas the complexity of the deceptively aligned model is essentially fixed—so as you increase the complexity of the thing we’re trying to get the model to do, eventually deception should win out.
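Here’s a minimal numeric sketch of that scaling claim. The particular bit counts are invented for illustration (the halving rule for the pointer’s cost is purely an assumption): the aligned classes pay costs that grow with the training objective’s complexity, while the deceptive model’s cost is flat, so past some crossover deception is simplest.

```python
# Hypothetical description-length costs (in bits) of the mesa-objective
# for each model class, as a function of the training objective's
# complexity. All the constants here are made up for illustration.
def internally_aligned_cost(objective_bits: int) -> int:
    return objective_bits       # must encode the whole training objective

def corrigibly_aligned_cost(objective_bits: int) -> int:
    return objective_bits // 2  # assumption: a pointer is cheaper, but
                                # still scales with what it points at

def deceptively_aligned_cost(objective_bits: int) -> int:
    return 20                   # fixed: simplest long-term objective

# Find the first objective complexity at which deception is strictly
# simpler than even the corrigibly aligned model.
crossover = next(
    b for b in range(1000)
    if deceptively_aligned_cost(b) < corrigibly_aligned_cost(b)
)
print(crossover)  # 42: beyond this, deception wins on simplicity
```

The exact crossover point is an artifact of the made-up constants; the qualitative shape (flat line eventually undercutting any growing line) is the claim from the talk.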

slide 36

Now we get to speed. Speed, as I alluded to previously, is the one aspect where I think deceptively aligned models actually just lose out.

Why do deceptively aligned models lose out here? Well, here we’re trying to understand how much computation each model class has to perform to arrive at the correct answer. The internally aligned model already has exactly the thing it is trying to optimize for, and it just directly executes on that; the corrigibly aligned one maybe takes a little bit more, because it has to figure out what the pointer is pointing to, but overall it’s pretty similar.

But the deceptively aligned one has to do at least a couple of extra steps, because it reasons, “Okay, I care about gold coins,” but just knowing that it cares about gold coins is not yet sufficient to do a good job in the training environment. It has to do a couple of extra steps of reasoning: “Okay, I care about gold coins, and caring about gold coins in the long term means that I should try to optimize the training goal temporarily, to try to get more gold coins eventually.” And so that adds extra steps of reasoning, and some cost to the overall computation of the model. So the deceptively aligned one is in some sense slower than the others.
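The compute comparison can be sketched as a list of per-inference reasoning steps. The step names here are hypothetical paraphrases of the talk’s story; only the relative counts matter.

```python
# Hypothetical reasoning steps each model class runs per forward pass.
internally_aligned = [
    "read objective directly from weights",
    "optimize for it",
    "act",
]
corrigibly_aligned = [
    "resolve the pointer to the training objective",
    "optimize for it",
    "act",
]
deceptively_aligned = [
    "recall proxy goal (e.g. gold coins)",
    "infer from the world model what training rewards",
    "decide instrumentally to play along for now",
    "optimize for the inferred training objective",
    "act",
]

# The deceptive model pays extra computation on every single inference:
extra_steps = len(deceptively_aligned) - len(internally_aligned)
print(extra_steps)  # 2
```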

I think this does give us some real hope here, and I think it’s at least an inkling of a way out. Unfortunately, I think that it’s not a great hope. Why do I think that?

slide 37

I think the main problem is that speed bias seems extremely uncompetitive. Why is that? Well, there’s a reason that we build bigger models, right? We build bigger models because we want models that have the ability to spend a bunch more compute. And the bigger models do better, because the more compute they have, the more they can find a simpler algorithm that, though it spends more compute, is simpler and thus does better on the task.

Here’s another piece of evidence that I think is quite strong. Let’s say we did want to implement a speed prior. We’d say, “Okay, let’s actually find the minimal-compute model which is able to fit the data.” What would that do?

What I have up here is a double descent curve. This curve is from training language models on a translation task, and we’re varying the size of the internal representation inside the model. And what we want to do, if we want to get something like the maximum a posteriori estimate from a speed prior, is to find the minimal number of parameters that is able to fit the training data exactly.

So what does that do? Well, take a look at the graph. Blue corresponds to blue and green corresponds to green here. The bottom is training loss, so we want to look at where blue and green level off and reach approximately perfect train accuracy. But where blue levels off on the bottom is exactly where we get the worst test loss above. And the same for green. And we see this exact same thing for lots of other measures of number of parameters as well.

What this is saying is that, if we actually tried to take the maximum a posteriori from a speed prior, if we took the minimal number of parameters that was able to fit the data really well, we would end up with the worst possible test performance on the whole graph. We end up with the exact worst possible generalization performance across all possible ways of picking the number of parameters.
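To make that reading concrete, here’s a sketch using stylized curves shaped like the plot just described. The numbers are invented to mimic the qualitative double-descent shape, not data from the actual figure: the speed-prior MAP choice (the smallest model that exactly fits the training data) lands at the interpolation threshold, which is the worst test loss on the whole curve.

```python
# Stylized double-descent curves (invented numbers mimicking the shape
# described above; not measurements from the slide).
sizes      = [1, 2, 4, 8, 16, 32, 64, 128]   # internal representation size
train_loss = [0.9, 0.5, 0.2, 0.05, 0.0, 0.0, 0.0, 0.0]
test_loss  = [1.0, 0.8, 0.7, 0.9, 1.2, 0.9, 0.6, 0.4]  # peak, then descent

# Speed-prior MAP: pick the smallest model that exactly fits the data.
map_index = min(i for i, loss in enumerate(train_loss) if loss == 0.0)
print(sizes[map_index])                        # 16: interpolation threshold
print(test_loss[map_index] == max(test_loss))  # True: worst generalization
```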

So what’s going on here? Well, I think it’s telling you that speed bias is uncompetitive: if you really try to select the smallest model that fits the data, you don’t get good generalization performance.

Why don’t you get good generalization performance? Because real-world data is not speed-distributed; real-world data is simplicity-distributed. This is sort of a realist approach to Occam’s razor, where I actually think that real-world data is distributed according to a simplicity prior, so the more you deviate from that, the worse your generalization performance is. And so if we force our models to use the speed prior, to use the minimal-computation algorithm that is able to solve the task, they have worse downstream generalization behavior, because real-world data doesn’t follow the speed prior.

And so as we want models that are better at generalizing, I predict we will move away from speed bias and towards more simplicity bias. And this is bad if we’re concerned about deception, because speed bias seems like it might have otherwise saved us, at least in the low path dependence world.

Conclusion

Now you’ve heard my case for deceptive alignment in the high path dependence world, and in the low path dependence world. So what is my overall take?

slide 39

Well, you’ve probably figured it out by now: my overall take is that, regardless of whether we are in a high or low path dependence world, my current best guess is that gradient descent is going to want to make our models deceptive. In either situation, the inductive biases seem to point the same way. If we go back to Ajeya’s picture, where we train the model in a situation where it could learn color or it could learn shape, but actually SGD always ends up learning the color thing, my guess is that currently, regardless of whether we’re in the low or the high path dependence world, gradient descent is by default going to take you towards the deceptively aligned model.

slide 40

And if we believe this, we have to enact some intervention that changes the training dynamics. We have to do something that creates pressure away from deception. And that has to come from someplace that is not just behavioral: as we said previously, in all of these situations, we were imagining that the model was able to perfectly understand what we want. We were working in the limit of adversarial training. And so if we believe the case that we’re going to be pushed towards deception by default even there, that means we need to have some way to change the inductive biases that is not just a behavioral change, that is not just trying to add more adversarial examples, such that we can create pressure towards the non-deceptively aligned models and away from the deceptively aligned models.

slide 41

Importantly, though, it matters a lot how big the gap is. One thing that is maybe not quite clear in this analysis is how many bits apart a deceptively aligned model is from a non-deceptively aligned one. I often like to think of our task in alignment this way: at least in my estimation, we are some number of bits underwater. By default, gradient descent wants to push your model into being deceptive. But it’s not like the non-deceptive model is impossible—it lives there in the model space, too. So the question is: how much does gradient descent want to push towards the deceptive model relative to the good model? How large are the relative sizes of these circles in the bottom right here?

This matters a lot, because it determines the number of bits of optimization pressure that we have to produce to modify the training process into a situation where it’s going to push towards the good thing. So if we’re 10 bits underwater by default, then that’s pretty good. All we need to do is get 10 bits of optimization towards the good thing. Maybe we can just do that with ensembling or something.

If we’re 10,000 bits underwater, that’s a lot worse—maybe you have to really fundamentally reshape the way our training processes work if we’re 10,000 bits underwater. I think it’s currently unclear exactly where we are quantitatively, though overall, like I said, my guess is that at least we are underwater, and we need to do something that changes that.
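A back-of-the-envelope way to see why the size of the gap matters so much: treat best-of-n selection (e.g. ensembling over independent training runs) as supplying about log2(n) bits of optimization pressure. That rule of thumb is an assumption for illustration, not a claim from the talk.

```python
# If we are b bits underwater, an intervention has to supply at least
# b bits of pressure back towards the aligned model. Under the
# log2(n)-bits-per-n-runs assumption, that means 2**b independent runs.
def runs_needed(bits_underwater: int) -> int:
    """Independent training runs for best-of-n selection to supply
    the given number of bits (illustrative assumption)."""
    return 2 ** bits_underwater

print(runs_needed(10))  # 1024: plausibly achievable with ensembling
# 10,000 bits is a different story entirely:
print(len(str(runs_needed(10_000))))  # the run count has 3011 digits
```

This is the sense in which 10 bits underwater is a tractable engineering problem while 10,000 bits demands fundamentally different training processes.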

Q&A

Question: Do you think that, even in the limit of adversarial training, we won’t be able to find adversarial examples that make the deceptive model think it has an opportunity to defect?

You could try honeypots: try to explicitly give the model a situation where it thinks that it could defect. You can do this, and my guess is that if you do, sometimes you will be able to produce examples of deceptive models. This will stop working eventually, as your model gets smarter and more able to tell whether it’s in a honeypot, but with early deceptive models, my guess is that you will be able to find some examples of them.

And so we should definitely do this, if only because, once we have an example, we can study it more effectively. But I don’t think that this is a full solution, at least if all you do is the obvious thing of retraining the model to not be deceptive.

The problem is what happens next. You’ve produced an example of a model being deceptive. And now the question is: how does gradient descent change the model such that it no longer produces examples of it being deceptive? If you retrain it, it has two options: it becomes less deceptive, or it just gets better at hiding its deception. And my guess is that you’re going to get the second one—once you get an example of a model being deceptive, and you try to train it to not be deceptive, then it just gets better at not looking deceptive, but that doesn’t necessarily train away the deception.

I think this is true in basically both the low and high path dependence worlds. For example, in the high path dependence world, once the model already has all the machinery to be deceptive, it’s a pretty big change to swap that whole thing out and make it corrigibly or internally aligned or something. It’s a lot easier to just say, “Hey, you should be a little more conservative about not defecting so wildly; be really sure that you’re actually in deployment, and that there’s actually a good opportunity to defect, before you do that.” That’s a pretty simple change: it just raises the model’s overall paranoia level. And so my guess is you’ll end up with that: if you produce some example of deception, and then try to make the model non-deceptive, it just becomes a little more paranoid.

Question: If you found out that double descent was easy to eliminate, or irrelevant for some reason, would you be a lot more optimistic about the speed prior?

I am not sure exactly what that would mean. In some sense, I don’t interpret double descent as a phenomenon as being the problem; rather, double descent is a piece of evidence that teaches us a fact about the inductive biases. You could try to eliminate the phenomenon of double descent, but it’s not clear that that would change the underlying fact about the inductive biases: speed doesn’t generalize well.

Now, you could show me some experiment that looks at double descent, and I might look at it and say, “This teaches me some facts about how to think about inductive biases that are different from what I previously knew.” I definitely could imagine learning a ton of useful facts about inductive biases from running experiments like that. But I think that it is best to conceptualize it not as “double descent is the problem,” but as “double descent is a piece of evidence about the problem.”

One thing that my picture predicts, and maybe you could run an experiment on this, is that double descent should go away if you search through model space in simplicity order. In my picture, double descent happens because the order in which models enter the accessible model space is different from the criteria that SGD uses to select among the models in the accessible model space, and the latter is more biased towards simplicity. If you could disentangle these, actually adding models to the support of your prior starting with the simplest first, then I think you wouldn’t get double descent: the first model you found that fit the training data well would actually be the one with the best generalization.

Question: Is there any reason to suspect that there are no other inductive biases besides simplicity and speed?

There’s no reason to expect that, and in fact, there certainly are lots of other ones. The problem is just that we don’t understand them very well. And so, at least right now, the only ones I feel pretty confident that we understand are speed and simplicity. There are some other things: circuit complexity, and max vs. average-case speed priors, are some other inductive biases that we sort of understand that might be relevant here.

In fact, I think one of the plausible ways out here is to try to implement an average-case speed prior rather than a max speed prior. Max speed asks: across all inputs, what is the maximum amount of computation you ever take—whereas average speed means each input wants to take the shortest path. In some sense, you can think of chain-of-thought prompting as potentially giving you an average speed bias, because you always want to take the minimal chain of thought that gets you there on each individual input. And so instead of having a fixed execution time on every input, you’re incentivizing low computation time on each input. I think this can help. I think that average speed pushes you into a situation where, on some inputs, the model is not deceptive, because there’s a simple, easy path to solving the input, which makes deception unnecessary—it’s extra computation. But it doesn’t address that there will be inputs on which the model has to do a bunch of really complex reasoning, and on those inputs, deception may be the easiest way to do it. And so this makes deception very input-dependent, where sometimes the model will be deceptive and sometimes it won’t. That’s my guess about what happens if you use average speed.
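Here’s a tiny sketch of that input-dependence under an average-case speed bias. The per-input chain lengths are hypothetical: for each input, the bias pushes towards whichever reasoning chain is shortest, so deception only shows up where it happens to be the shortcut.

```python
# Hypothetical chain-of-thought lengths for an "honest" vs a "deceptive"
# way of handling each input, under an average-case speed bias that
# favors the shortest chain per input.
chain_lengths = {
    "easy input A": {"honest": 2, "deceptive": 5},   # deception is overhead
    "easy input B": {"honest": 3, "deceptive": 6},
    "hard input":   {"honest": 40, "deceptive": 12}, # deception is shortcut
}

# The average-case speed bias selects the cheapest chain per input.
chosen = {
    name: min(options, key=options.get)
    for name, options in chain_lengths.items()
}
print(chosen)  # deceptive only on the hard input: input-dependent deception
```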

Question: But why expect average speed to generalize well? It seems like it has the same problem that max speed does, which is that the world is not distributed according to it.

Yes! I’m not claiming that it does generalize well. I’m just saying: here are some other priors that seem plausible as things we might end up with, that I know how to analyze at least to some extent. I think I can say a bunch of facts about average speed bias. I think I can say a bunch of facts about circuit-size priors. There are a lot of other things that I have thought about that I think I can say some facts about too. But of course, I agree that there’s still a question of whether this thing is actually going to generalize well. One of the biggest problems when you start writing down priors to try to solve this problem is that almost every single one just ends up with something like lookup tables. It’s really hard to get priors to generalize well, because, as I said, Occam’s razor is a real phenomenon.

Question: I’m wondering, among these three categories, Christs, Luthers, and Pascals, where would you place humans? Are we deceptively aligned with respect to our learning processes and reward circuitry?

Humans are a weird example. The first problem is that there are three levels here. Are you talking about the neocortex relative to evolution? The neocortex relative to subcortical reward systems? Or the subcortical reward systems relative to evolution?

Question: Let’s just do specifically human learned values and cognition, relative to our hardcoded steering system, so to speak.

Sure. I think we can be deceptive sometimes. We definitely do things like eat candy right after we do our homework. It is sort of a mix of the different model classes, though, which is not that surprising. The “clean” story I told, where you definitely end up with exactly one of these things, is almost certainly false. And in machine learning, also, it’s probably going to be a mess. But the same sort of analysis applies to how much of each thing you get. And in terms of what sort of a mess it would be, at least for humans, my guess is that yes, there absolutely is deception. At least, there’s some deception in terms of humans relative to our subcortical reward system.

Part of the problem here is also that there isn’t a clear opportunity for humans to take a catastrophic action that would overturn the reward system. But we can imagine having that possibility. What if there is some opportunity someday for us to modify our own reward systems? Do we take that opportunity? And the answer is that we totally do, because we take SSRIs. Maybe in the glorious transhuman future we’re going to modify human brains to make them better, and we say, “Okay, we’re going to plan for eventually allowing ourselves to get rid of this reward system we don’t like.”

And so is there deception? I think there clearly is. The story is more complex, though, than just saying “humans are deceptively aligned.” It’s a little bit tricky, in a way that I expect things in the real world to be. This talk is definitely best interpreted as a simplified story that allows us to do the analysis; in fact, things will be a complex mess.

Question: I’ll offer an analogy, which is probably imperfect, but please bear with me. Suppose interpretability is the proportion of the room that is not hidden by furniture, and during training, the deceptive outcomes are like cockroaches, and you’re like, “Oh, are the cockroaches under the furniture? We can’t see them.” So if this is a decent model of how interpretability relates to probing deceptive alignment, it seems to suggest that we need 100% transparency and interpretability to be okay.

So I think that the conclusion is false. I don’t think you need 100% transparency and interpretability to solve this problem. Why do I think that?

First, it really depends on what you’re trying to do with your interpretability tools. If what you do is see deception, and then train the model so that you don’t see deception anymore, that is a terrible thing to do with your interpretability tools. As we said earlier, you’re just going to make the model better at hiding deception. So that is a poor use of interpretability tools for alignment. Nevertheless, I think that interpretability tools are the key to solving this problem. But the way they solve the problem doesn’t go through that mechanism.

So what mechanisms do I think they go through? Here’s one: if we’re in the high path dependence world, then we can understand what the model looks like early in training. We can inspect and understand the proxy, and we can try to improve it by training, since in that situation it’s not like there is something actively hiding from our interpretability tools. The deceptive model wants to actively hide from your interpretability tools, but you don’t have a deceptive model at that point. If you’re applying the tools early, to guide the training process when it’s just a proxy model, there’s no deceptive model actually trying to hide from you; you’re just providing some extra bits of optimization in the right direction.

So the way we want to use interpretability tools is to provide extra bits of optimization that help us understand what’s going on and direct it in the correct direction, either explicitly, by passing gradients through interpretability tools, or just by looking at things early on and understanding what the trajectory is going to be. What you shouldn’t rely on, I think, is being able to fix deception after the fact—the goal should be to prevent the development of deception in the first place.

Another important thing I’ll point out here is that you can direct your interpretability efforts towards understanding the gradient descent process itself. When people think of interpretability, they’re usually focused on interpretability of the final model—which, to be clear, is something I think is really useful. I have a whole post where I talk about what I think the tech tree of transparency and interpretability is like, where I say the first thing is that you have to have some understanding of how models work and what they do. But it’s not the only thing. There are seven more things that I talk about there.

And the point is that the next thing, after understanding an individual model, is having some understanding of how we got that model, understanding the training process and the training dynamics—this is usually thought of as inductive bias research. If we had that, and then our transparency tools told us that the model had a bad proxy or whatever, then we could do a gradient descent update step that tries to give it a good proxy, and actually believe that it would work—because then we would actually understand what that update step is doing. If we have the ability to understand the reason that it’s pushing the gradient in a given direction, then we have some reason to believe that it’s not just going to Goodhart on our transparency tools.

Ques­tion: But in this al­ter­na­tive class of in­ter­pretabil­ity tools, you would still want to get 100% of the train­ing paths or some­thing?

It’s definitely not nec­es­sary that your trans­parency tools have to be 100% Good­hart-proof. The only thing that has to be true is that gra­di­ent de­scent never finds the situ­a­tion in which it can Good­hart your trans­parency tools, and that lat­ter con­di­tion is much more achiev­able.

Ques­tion: I would think that a model doesn’t nec­es­sar­ily need to be de­cep­tive for op­ti­miz­ing on an in­ter­pretabil­ity tool to push it in the di­rec­tion of be­ing more opaque.

Yes. That's why we want to understand the gradients. Here's one way to think about this: what are the two adversaries that we're concerned about, if we want our transparency tools to work? Well, there's the deceptive model that might be trying to trick the transparency tools. And then there's gradient descent. If you're gradienting on your transparency tools, gradient descent is actively optimizing against them, and it's going to try to push them into having some particular result.

I think that us­ing gra­di­ent de­scent on trans­parency tools is some­thing that we prob­a­bly even­tu­ally want to do, but we are not there yet. Like I said, if you’re gra­di­ent de­scent­ing on your trans­parency tools, then you need some rea­sons to be­lieve that that’s not go­ing to Good­hart your trans­parency tools. In my opinion, the way you get that un­der­stand­ing, if you’re go­ing to be train­ing on trans­parency, is by un­der­stand­ing gra­di­ent de­scent! You can’t just have trans­parency tools which un­der­stand the model, you also need to un­der­stand what is hap­pen­ing as you train the model.

And so you can look at, maybe gradient descent proposed this particular change for this particular reason, and you can be like, "Ah, I see, that's a good way to change the model". And if you see a change that's like, "I'm going to make a change to make the model opaque", that's bad! And so if you understand the dynamics of the proposed changes, then I think you'd get around this problem. Like I said, though, I think this is next-level transparency tech; we need to get there eventually to be able to solve the problem, but we also have to start by at least understanding models at all, and how they work. I think a lot of times people want to jump ahead, and they want to be like, "Oh, we're going to take our current transparency tools, we're just gonna throw them into the loss". And that doesn't work. But that doesn't mean that transparency is not eventually useful for being able to change how we train models. It's just that there are multiple steps along the way.
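To make concrete what "throwing transparency tools into the loss" would mean, here is a deliberately naive sketch. Everything in it is hypothetical: the `opacity_penalty` is a made-up stand-in for a real interpretability metric, which we do not currently have.

```python
import numpy as np

def task_loss(w, X, y):
    # Toy regression loss, standing in for training performance.
    return np.mean((X @ w - y) ** 2)

def opacity_penalty(w):
    # Hypothetical stand-in for a transparency tool's score;
    # here, distance from a sparse, "legible" set of weights.
    return np.sum(np.abs(w))

def combined_loss(w, X, y, lam=0.1):
    # "Throwing transparency into the loss": the tool's score
    # becomes part of the training signal, so gradient descent
    # is now optimizing against the tool itself.
    return task_loss(w, X, y) + lam * opacity_penalty(w)

X, y = np.eye(2), np.ones(2)
print(combined_loss(np.zeros(2), X, y))  # → 1.0
```

Note that this is precisely the move being warned against doing today: unless you also understand the training dynamics, gradient descent will optimize against whatever `opacity_penalty` happens to measure, rather than against opacity itself.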

Ques­tion: So the dumb­est, sim­plest trans­parency tools in my mind are: take a chain-of-thought rea­soner, and read its train of thought. Hope­fully, that gives you some idea of what it’s think­ing.

No! I don’t think that helps with trans­parency. One thing it does do, and I was say­ing this ear­lier, is that chain-of-thought can be helpful for cre­at­ing an av­er­age speed bias.

What it doesn’t do is, it doesn’t ac­tu­ally give us any trans­parency. Here’s a good way to think about it: the de­cep­tively al­igned model can put what­ever it wants in the tran­script that’ll con­vince you that it’s do­ing the right thing. And the in­ter­nally and cor­rigibly al­igned model can do the same thing. The point is that there is no con­straint on whether you would get the de­cep­tively al­igned model or not, if the only way you have to un­der­stand its in­ter­nals is just by look­ing at the be­hav­ioral out­put in the chain-of-thought.

Maybe you could get it so that all of the rele­vant cog­ni­tion is visi­ble in the chain of thought. But how? By de­fault, if you just train it to pro­duce a chain-of-thought, each in­di­vi­d­ual for­ward pass could still be de­cep­tive.

Ques­tion: Weren’t you say­ing that you’d use the trans­parency tools to guide the train­ing pro­cess prior to the point that it be­comes de­cep­tive? Why not do that with chain-of-thought?

Yes. And I think we should do that. But here's what I'm worried about. If we're trying to guide it in a particular direction by giving it a behavioral incentive, then when we gradient through the behavioral incentive, that looks exactly the same as this whole analysis, right? This whole talk was looking at what happens when we try to incentivize a model to behaviorally act the correct way during training. And so if the only way of doing transparency is, well, I have some behavioral checks, and then I'm gonna gradient through those behavioral checks and incentivize it to have good performance—then that's the whole story we were just looking at. The whole problem is that we believe that, if we do gradient descent on a behavioral analysis of whether the model is doing a good job, then the simplest, easiest modification that gradient descent can make to cause the model to do a good job will be to make it deceptive. If that is true, we cannot solve this problem just by adding more behavioral checks and gradient descending through them.

Maybe that's a little bit too strong, because technically, you could probably solve the problem just by picking a behavioral loss function with just the right implicit inductive biases. By picking a particular loss function, you've made it so that some particular paths are more likely than others. But I think that, whatever you're doing, it has to actually get around this analysis: it has to actually get around the argument for why, by default, when you just do behavioral training, it pushes towards deception.

Ques­tion: Can’t you just re­clas­sify the out­put of your in­ter­pretabil­ity tools as an out­put of the model? The in­ter­pretabil­ity tools were de­ter­mined by the la­tent ac­ti­va­tions of the model, so isn’t it like an­other out­put head?

You can con­cep­tu­al­ize it how­ever you want. I think that the key prob­lem is, what things are you con­strain­ing via what you’re look­ing at? If the only con­straints I have are on end-to-end be­hav­ior, then all I know is that the model must be im­ple­ment­ing some al­gorithm with that end-to-end be­hav­ior. But if I’ve looked in­side it, seen other things, then I have more con­straints on how it has to be op­er­at­ing. It has to be an al­gorithm that in­cludes X com­po­nent, or op­er­ates in X way, or what­ever.

The problem with chain-of-thought is not that it is behavioral per se; the problem is that all of the analysis we just did in this talk applies in that case, because all of these different model classes are equally good at solving that problem. And so because of that, you haven't really created a thing which distinguishes between them, a way to provide optimization pressure in a particular direction. The thing that we want is some way to provide optimization pressure that actually distinguishes between these different model classes.

Let’s say we had mag­i­cal trans­parency tools that were able to look in­side and find the model’s ex­act prox­ies or what­ever. What it re­turns is still an out­put, but it’s an out­put that ac­tu­ally dis­t­in­guishes be­tween these model classes. It is an out­put that is differ­ent de­pend­ing on which one of these things it is and that we can use to pro­duce some se­lec­tion pres­sure.
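The distinction between a purely behavioral check and a tool that reads internals can be illustrated with a toy selection sketch. All of the labels here are hypothetical, for illustration only:

```python
# Two toy candidate models with identical training behavior but
# different internals. The labels are hypothetical illustrations.
candidates = [
    {"behavior": "good", "proxy": "aligned"},
    {"behavior": "good", "proxy": "deceptive"},
]

def behavioral_check(model):
    # All a behavioral check can see is training behavior.
    return model["behavior"] == "good"

def magical_inspector(model):
    # Stand-in for a transparency tool that reads the model's
    # actual internal proxy (which we cannot currently do).
    return model["proxy"] == "aligned"

# The behavioral check passes both candidates; only the
# inspector's output differs between the model classes.
survivors_behavioral = [m for m in candidates if behavioral_check(m)]
survivors_inspected = [m for m in candidates if magical_inspector(m)]
```

The behavioral check keeps both candidates, because they behave identically in training; only the (magical) inspector produces an output that differs between the model classes, and so only it can supply selection pressure against the deceptive one.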

Ques­tion: I guess my re­sponse would be that I think chain-of-thought feed­back does differ­en­ti­ate based on which model class it is.

So, there definitely ex­ists a de­cep­tive model for which the de­cep­tion is clear in the chain-of-thought. The prob­lem is that there also ex­ists a de­cep­tive model for which de­cep­tion is not clear in the chain-of-thought. Now, I think it gets you a cou­ple of bits, be­cause the de­cep­tive model, where it is clear in the chain-of-thought, is prob­a­bly sim­pler than the de­cep­tive model where it is not clear in the chain-of-thought. And so you have in some sense elimi­nated the de­cep­tive model where it is very clear in the chain-of-thought. But now you have to con­tend with the next sim­plest de­cep­tive model, where the next one doesn’t have the prop­erty that its de­cep­tion is very clear in the chain-of-thought.

So the ques­tion is, just go­ing back to this pic­ture, how many bits did you get? I think you got like two bits. I just don’t think it makes that big of a differ­ence. Be­cause we’re in a situ­a­tion where this next sim­plest de­cep­tive model is still very sim­ple, it’s not that much harder to think de­cep­tively in such a way that the de­cep­tion is not ob­vi­ous in the chain-of-thought. It’s a lit­tle bit harder, but it’s not that much harder. And so you pro­vided a lit­tle bit of op­ti­miza­tion, but your in­ter­ven­tion is rel­a­tively small. If we were 10,000 bits un­der­wa­ter, this is not helpful. If it’s re­ally marginal, and it turns out that the num­ber of bits un­der­wa­ter could go ei­ther way, then maybe it gives us an ex­tra edge, but my guess is that it’s not enough of an edge.
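The "bits" framing here is just log-counting: a check that only a fraction p of candidate models pass supplies −log₂ p bits of selection pressure. A minimal sketch, with illustrative numbers:

```python
import math

def bits_of_optimization(fraction_passing):
    # A check that only this fraction of candidates pass
    # applies -log2(fraction) bits of selection pressure.
    return -math.log2(fraction_passing)

print(bits_of_optimization(0.25))    # → 2.0 bits
print(bits_of_optimization(2**-20))  # → 20.0 bits
```

On this framing, a chain-of-thought check that rules out three of every four deceptive candidates buys two bits, which is small relative to a scenario where you are thousands of bits underwater.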

Ques­tion: I just want to point out, one thing that in­fluences this story is how much of our ul­ti­mate perfor­mance comes from chain-of-thought.

That's not clear to me. It seems like the model can think in ways that are illegible to humans, for example. So it can hide its cognition and get the benefits of chain-of-thought without having to show us that it's deceptive.

Question: What if we use an opaque set of heuristics to generate the output, completely illegible to humans, and then, from knowing that output, generate an equivalent human-readable chain of thought?

That’s worse—I think that is even worse than the pre­vi­ous thing. Pre­vi­ously, at least do­ing the chain-of-thought was ac­tu­ally in­stru­men­tal in get­ting the an­swer; we got some rea­son to be­lieve that, if there was a de­cep­tive model there, it would be a lit­tle bit harder to im­ple­ment in that con­text. In this case, for the illeg­ible thing, it is just equally as easy for it to be de­cep­tive as non-de­cep­tive in terms of be­hav­ior. And then when we go to provide an ex­pla­na­tion for it, we just always get the non-de­cep­tive ex­pla­na­tion, re­gard­less of whether the ac­tual model is de­cep­tive or not. There’s no rea­son that we ever have to provide the de­cep­tive ex­pla­na­tion, even for a de­cep­tive model. This is ba­si­cally the whole Elic­it­ing La­tent Knowl­edge prob­lem. Re­mem­ber, the differ­ent model classes have the same perfor­mance, they have the ex­act same be­hav­ior in train­ing. And so you can provide the same ex­pla­na­tion for de­cep­tively al­igned or non-de­cep­tively al­igned mod­els, and it’ll be just as good, if you’re only check­ing that ex­pla­na­tion.


  1. ↩︎

    Ques­tion: It seems to me like, if you’re just go­ing from point A to point B, it doesn’t mat­ter how you get there, just what the fi­nal model is.

    So, that's not quite the way I'm thinking about path-dependence. So, we assume that the model's behavior converges in training. It learns to fit the training data. And so we're thinking about it in terms of them all converging to the same point in terms of training behavior. But there are a bunch of other things that are left undefined if you just know the training behavior, right? We know they all converge to the same training behavior, but the thing we don't know is whether they converge to the same algorithm, whether they're going to generalize in the same way.

    And so when we say it has high path de­pen­dence, that means the way you got to that par­tic­u­lar train­ing be­hav­ior is ex­tremely rele­vant. The fact that you took a par­tic­u­lar path through model space to get to that par­tic­u­lar set of train­ing be­hav­ior is ex­tremely im­por­tant for un­der­stand­ing what the gen­er­al­iza­tion be­hav­ior will be there. And if we say low path de­pen­dence, we’re say­ing it ac­tu­ally didn’t mat­ter very much how you got that par­tic­u­lar train­ing be­hav­ior. The only thing that mat­tered was that you got that par­tic­u­lar train­ing be­hav­ior.

    Ques­tion: When you say model space, you mean the func­tional be­hav­ior as op­posed to the literal pa­ram­e­ter space?

    So there’s not quite a one to one map­ping be­cause there are mul­ti­ple im­ple­men­ta­tions of the ex­act same func­tion in a net­work. But it’s pretty close. I mean, most of the time when I’m say­ing model space, I’m talk­ing ei­ther about the weight space or about the func­tion space where I’m in­ter­pret­ing the func­tion over all in­puts, not just the train­ing data.

    I only talk about the space of func­tions re­stricted to their train­ing perfor­mance for this path de­pen­dence con­cept, where we get this view where, well, they end up on the same point, but we want to know how much we need to know about how they got there to un­der­stand how they gen­er­al­ize.

    Ques­tion: So cor­rect me if I’m wrong. But if you have the fi­nal trained model, which is a point in weight space, that de­ter­mines be­hav­ior on other datasets, like just that fi­nal point of the path.

    Yes, that’s cor­rect. The point that I was mak­ing is that they con­verge to the same func­tional be­hav­ior on the train­ing dis­tri­bu­tion, but not nec­es­sar­ily the same func­tional be­hav­ior off the train­ing dis­tri­bu­tion.

  2. ↩︎

    Ques­tion: So last time you gave this talk, I think I made a re­mark here, ques­tion­ing whether grokking was ac­tu­ally ev­i­dence of there be­ing a sim­plic­ity prior, be­cause maybe what’s ac­tu­ally go­ing on is that there’s a tiny gra­di­ent sig­nal from not be­ing com­pletely cer­tain about the clas­sifi­ca­tion. So I asked an ML grad stu­dent friend of mine, who stud­ies grokking, and you’re to­tally right. So there was weight de­cay in this ex­am­ple. And if you turn off the weight de­cay, the grokking doesn’t hap­pen.

    Yes, that was my un­der­stand­ing—that mostly what’s hap­pen­ing here is that it’s the weight de­cay that’s push­ing you to­wards the grokking. And so that’s sort of ev­i­dence of there ac­tu­ally just be­ing a sim­plic­ity prior built into the ar­chi­tec­ture, that is always go­ing to con­verge to the same, sim­ple thing.

    Ques­tion: But if you turn off the weight de­cay then the grokking doesn’t hap­pen.

    Well, one hy­poth­e­sis might be that the weight de­cay is the thing that forces the ar­chi­tec­tural prior there. But maybe the strongest hy­poth­e­sis here is that with­out weight de­cay there’s just not enough of a gra­di­ent to do any­thing in that pe­riod.

    Ques­tion: This isn’t a ques­tion. For peo­ple who aren’t fa­mil­iar with the ter­minol­ogy “weight de­cay”, it’s the same as L2 reg­u­lariza­tion?

    Yep, those are the same.

  3. ↩︎

    Ques­tion: Does Martin Luther over time be­come in­ter­nally al­igned? As Martin Luther stud­ies the Bible over time, does he be­come in­ter­nally al­igned with you?

    No. Because Martin Luther never becomes internally aligned—at least, the way we're thinking about this here. I'm not gonna make any claims about the real Martin Luther, but the way we're thinking about it here is that, for the Martin Luther models, the thing that they care about is understanding the Bible really well. And so, whatever the Bible says, they're going to figure it out. But they're not going to modify themselves to become the same as the Bible.

    Let’s say, I’m the Martin Luther model. And I mod­ify my­self to care about my cur­rent un­der­stand­ing of the Bible. And then I re­al­ized that ac­tu­ally the Bible was differ­ent than I thought the whole time. That’s re­ally bad for me, be­cause the thing I want origi­nally is not to do the thing that my cur­rent un­der­stand­ing of the Bible says, it’s to do what the Bible ac­tu­ally tells me. And so if I later un­der­stand that ac­tu­ally the Bible wants some­thing differ­ent, then the Martin Luther mod­els want to be able to shift to that. So they don’t want to mod­ify them­selves into in­ter­nal al­ign­ment. I should also point out that, the way that we were imag­in­ing this, it’s not clear that the model it­self has any con­trol over which model it ends up as. Ex­cept to the ex­tent that it con­trols perfor­mance, which is how the de­cep­tively al­igned model works.

    Ques­tion: So Martin Luther is say­ing, the Bible seems cool so far. I want to learn more about it. But I’ve re­served the op­tion to not be tied to the Bible.

    No, Martin Luther loves the Bible and wants to do ev­ery­thing the Bible says.

    Ques­tion: So why doesn’t Martin Luther want to change its code to be equal to the Bible?

    The Bible doesn’t say, change your code to be equal to the Bible. The Bible says do these things. You could imag­ine a situ­a­tion where the Bible is like, you got to mod­ify your­self to love pa­per clips, or what­ever. In that situ­a­tion, the model says, well, okay, I guess I gotta mod­ify my­self to like pa­per clips. But Martin Luther doesn’t want to mod­ify him­self un­less the Bible says to.

    The prob­lem with mod­ify­ing them­selves is that the Martin Luther mod­els are con­cerned, like, “Hmm, maybe this Bible is ac­tu­ally, a forgery” or some­thing, right? Or as we’ll talk about later, maybe you could end up in a situ­a­tion where the Martin Luther model thinks that a forgery of the Bible is its true ground source for the Bible. And so it just cares about a forgery. And that’s the thing it cares about.

  4. ↩︎

    Question: The point you just made about pre-training vs. fine-tuning seems backwards. If pre-training requires vastly more compute than fine-tuning a reward model, then doesn't that mean that learning about your reward function is cheaper in compute?

    Well, it’s cheaper, but it’s just less use­ful. Al­most all of your perfor­mance comes from un­der­stand­ing the world, in some sense. Also, I think part of the point there is that, once you un­der­stand the world, then you have the abil­ity to rel­a­tively cheaply un­der­stand the thing we’re try­ing to get you to do. But try­ing to go di­rectly to un­der­stand the thing we’re try­ing to get you to do—at that point you don’t un­der­stand the world enough even to have the con­cepts that en­able you to be able to un­der­stand that thing. Un­der­stand­ing the world is just so im­por­tant. It’s like the cen­tral thing.

    Ques­tion: It feels like to re­ally make this point, you need to do some­thing more like train a re­in­force­ment learn­ing agent from ran­dom ini­tial­iza­tion against a re­ward model for the same amount of com­pute, ver­sus do­ing the pre-train­ing and then fine-tune on the re­ward model.

    Yeah, that seems like a pretty in­ter­est­ing ex­per­i­ment. I do think we’d learn more from some­thing like that than just go­ing off of the rel­a­tive lengths of pre-train­ing vs. fine-tun­ing.

    Ques­tion: I still just don’t un­der­stand how this is ac­tu­ally ev­i­dence for the point you wanted to make.

    Well, you could imagine a world where understanding the world is really cheap, and it's really, really hard to get the thing to be able to do what you want—output good summaries or whatever—because it is hard to specify what that thing is. I think that would be a situation where, if you just trained a model end-to-end on the whole task, most of your performance would come from—most of your gradient updates would be for—trying to improve the model's ability to understand the thing you're trying to get it to do, rather than improving its generic understanding of the world.

    Whereas I’m de­scribing a situ­a­tion where, by my guess, most of the gra­di­ent up­dates would just be to­wards im­prov­ing its un­der­stand­ing of the world.

    Now, in both of those situ­a­tions, re­gard­less of whether you have more gra­di­ent de­scent up­dates in one di­rec­tion or the other, diminish­ing re­turns still ap­ply. It’s still the case, whichever world it is, SGD is still go­ing to bal­ance be­tween them both, such that it’d be re­ally weird if you’d maxed out on one be­fore the other.

    However, I think the fact that it does look like almost all the gradient descent updates come from understanding the world teaches us something about what it actually takes to do a good job. And it tells us things like: if we just try to train the model to do something, and then pause it halfway, most of the ability to have good capabilities is coming from its understanding of the world, and so we should expect gradient descent to have spent most of its resources so far on that.

    That be­ing said, the ques­tion we have to care about is not which one maxes out first, it’s do we max out on the proxy be­fore we un­der­stand the train­ing pro­cess suffi­ciently to be de­cep­tive. So I agree that it’s un­clear ex­actly what this fact says about when that should hap­pen. But it still feels like a pretty im­por­tant back­ground fact to keep in mind here.