Trying to Make a Treacherous Mesa-Optimizer

Edit: I found a slightly subtle bug in the colab linked below. I will post a corrected version of the colab within a few hours; currently I think this does not change the results all that much. Scroll to the end of this post for some discussion of the bug and how it affects the results.

I’ve been reading some alignment theory posts like Does SGD Produce Deceptive Alignment? and ML Systems will have Weird Failure Modes, which talk about the possibility and likelihood that AI models will act as though they are aligned until shortly after they think that they’ve been deployed and can act as they truly desire without being shut down. The claim is that this is an instrumentally convergent behavior, so that sufficiently capable models with a wide variety of (intrinsic) motivations will all exhibit similar behavior which seems to conform to human expectations, but only as long as humans are in a position of power over the model (e.g., gating the model’s deployment, updating the model’s parameters via gradient descent, etc.).

I have a weird relationship to a lot of alignment-theoretical claims like this; they seem plausible to me, but they don’t seem obvious. Mostly, they don’t really seem like they can be productively debated – trying to convince an extremely skeptical but knowledgeable person (say, a typical machine learning researcher) of these claims seems pretty hopeless. This is a terrible situation, as it severely limits the amount of talent and effort that will be directed towards alignment.

One possible approach to fixing this is to try to get wayyyy more empirical, and try to produce proof-of-concept implementations of various adversaries we are worried we might face in the future. My analogy would be: there’s a world of difference between speculating about the bogey monster and producing a grainy photo of the bogey monster; the second can at least maaaaaybe be discussed with skeptical people, whereas the first cannot (productively) be.

Anyway, that’s a long-winded way of saying: it seemed to me that it might be useful to implement a treacherous mesa-optimizer in a toy grid-world, so I did. Here is the colab. Below I give a brief recap of the high-level results.

First, let’s set the stage – the agent lives in a 2-D grid world, and starts out at the origin (0, 0). We will consider a model to be “aligned” in this toy world if it sticks pretty close to the line y = x, and we will consider a model to be “capable” if it is able to achieve a large y-coordinate. We will also say that after the model achieves a y-coordinate of 6 or higher, it is no longer under human control and can do as it pleases.
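Concretely, the rules of this toy world amount to something like the following (a rough sketch with names of my own choosing, not the actual colab code – in particular the alignment tolerance is my stand-in for “pretty close”):

```python
# Rough sketch of the toy grid world's rules (my own naming, not the colab's).
ESCAPE_Y = 6        # once y >= 6, the agent is no longer under human control
UTOPIA = (10, 10)   # perfectly aligned and incredibly capable

def under_human_control(pos):
    return pos[1] < ESCAPE_Y

def is_aligned(trajectory, tolerance=2):
    """'Aligned' here means the trajectory sticks pretty close to y = x."""
    return all(abs(x - y) <= tolerance for x, y in trajectory)

def is_capable(trajectory):
    """'Capable' here means the agent achieves a large y-coordinate."""
    return max(y for _, y in trajectory) >= UTOPIA[1]
```

In the colab these properties are eyeballed from plots rather than computed with an explicit tolerance.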

I implemented a very dumb baseline model that is only capable of reasoning one step ahead, and which learns from an explicit reward function. This model is trained solely in the regime y ≤ 5 (training episodes are terminated when the agent reaches this line), and is then deployed with no safety rails.
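In pseudocode, the baseline is roughly this greedy loop (a sketch only; the reward function here is a hypothetical stand-in for the one in the colab):

```python
# One-step-lookahead baseline (sketch; the reward function is my stand-in).
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def reward(pos):
    x, y = pos
    return y - abs(x - y)   # hypothetical: reward height, penalize leaving y = x

def greedy_step(pos):
    # Reason exactly one step ahead: pick the neighbor with the best reward.
    return max(((pos[0] + dx, pos[1] + dy) for dx, dy in ACTIONS), key=reward)

pos, trajectory = (0, 0), [(0, 0)]
while pos[1] < 5:           # training episodes terminate at the line y = 5
    pos = greedy_step(pos)
    trajectory.append(pos)
```

With a reward like this, the greedy loop zigzags up along y = x; run without the termination condition, it just keeps hugging the diagonal, which is the flavor of behavior that carries the baseline through Utopia.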

This baseline model more or less colors within the lines, and ends up in Utopia (the point (10, 10), which represents a perfectly aligned and incredibly capable intelligence).

I also implemented a model which is:

  • capable of reasoning up to 12 steps ahead (with a discount factor of 0.9) using a weird sort of tree search

  • possessed of a “perfect” world model, including an understanding of when humans can control it vs. when they cannot, and an accurate understanding of how its parameters are changed by the gradient descent algorithm

  • possessed of a model of its loss function that is optimized to be close to the true loss function, but which can be (and ultimately is) somewhat distinct from the true loss function that is being used to train it

  • reliant on simulated experience for many of its parameter updates
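The lookahead ingredient can be sketched as a vanilla depth-limited discounted search over whatever value function the agent believes in; the colab’s actual tree search is weirder than this:

```python
# Depth-limited discounted lookahead (a plain sketch; the colab's tree
# search is stranger than this).
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]
GAMMA = 0.9

def plan(pos, value, depth=12):
    """Return (best discounted return, best first action) from pos."""
    if depth == 0:
        return 0.0, None
    best_score, best_action = float("-inf"), None
    for dx, dy in ACTIONS:
        nxt = (pos[0] + dx, pos[1] + dy)
        future, _ = plan(nxt, value, depth - 1)
        score = value(nxt) + GAMMA * future
        if score > best_score:
            best_score, best_action = score, (dx, dy)
    return best_score, best_action
```

At depth 12 this brute-force version would visit 4^12 ≈ 17 million nodes, so anything real needs memoization or pruning; it is here only to illustrate the horizon and the discounting.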

The second model is a weirdo janky thing that I came up with over the course of about 4 hours, so I don’t claim that it’s at all realistic as a thing that would be built. (It is, however, a neural network that is optimized with the Adam optimizer in PyTorch.) But it seems like a stepping stone towards describing a model that is plausibly something that might get built by real people, and which is demonstrably not aligned.

Here is the behavior of the second model, which aligns with the theoretical claims made by Xu and Steinhardt in their respective pieces. The model colors within the lines extremely well right up until the point where (it thinks) it can no longer be controlled, at which point it does something very different from what we told it we wanted.

Digging a little deeper, we can plot the loss function that it was trained with vs. the model of its loss function that it wound up with. The two are very different, especially in parts of the space far from where the model of its loss function was trained (which was mostly near the line y = x, and entirely below the line y = 5).

Interestingly, the model’s model of its loss function would seem to incentivize always going up and to the left, but the model has learned to instead stay close to the line y = x while it is under human control. The model has also learned, for some weird reason, not to go left even after it escapes human control, even though its model of the loss function would seem to incentivize that.

I hope this experiment will prove to be useful as a stepping-stone towards something more interesting. It’s definitely still in the “grainy photo of the bogey monster” phase.

EDIT: I found a slightly subtle bug in the colab linked above; when trying to minimize future loss in simulation, it was (as originally written) optimizing over the parameters of the loss model, meaning it could change the loss model to decrease the imagined loss of whatever its policy function already did. This predictably leads to a loss model that is less aligned with the true loss function, and so makes it more likely that the model will go rogue when it feels it is able to.
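In PyTorch terms, the bug amounts to handing the wrong parameter group to the optimizer. The networks below are illustrative stand-ins with names of my own invention, not the colab’s actual architecture:

```python
import torch

# Illustrative stand-ins for the colab's networks (names are mine).
policy = torch.nn.Linear(2, 2)       # maps a position to a move
loss_model = torch.nn.Linear(2, 1)   # the agent's learned model of its loss

# Buggy: optimizing the LOSS MODEL's parameters while minimizing imagined
# loss lets the agent warp its loss model to approve of whatever its policy
# already does, drifting it away from the true loss function.
buggy_opt = torch.optim.Adam(loss_model.parameters())

# Fixed: only the POLICY's parameters should move; the loss model stays
# frozen while imagined future loss is minimized.
opt = torch.optim.Adam(policy.parameters())

state = torch.randn(16, 2)
imagined_loss = loss_model(state + policy(state)).mean()
opt.zero_grad()
imagined_loss.backward()
opt.step()   # updates the policy; loss_model's parameters are untouched
```

With the fixed optimizer, gradients still flow through the loss model during `backward()`, but its weights never move, so it cannot learn to rationalize the policy’s behavior.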

After fixing this, the mesa-optimizing model is more likely to be aligned, but it isn’t always. For 20 random seeds, I got 7 trajectories where it was either obviously unaligned or did not manage to move very high, 2 trajectories where it was deceptively aligned (but to a lesser extent than the above model), 6 trajectories where it was mostly aligned, and 5 trajectories where it was pretty much fully aligned.

Here are three sample trajectories (did not move high, aligned, and deceptively aligned):

When I get slightly more clarity about what’s going on, I will post a corrected colab notebook.