Concrete experiments in inner alignment

This post is part of re­search I did at OpenAI with men­tor­ing and guidance from Paul Chris­ti­ano.

The goal of this post is to pre­sent my thoughts on some of the sorts of ex­per­i­ments that might be able to be done now that could shed light on the in­ner al­ign­ment prob­lem. I’ve been do­ing a lot of think­ing about in­ner al­ign­ment from a the­o­ret­i­cal per­spec­tive re­cently that I’m pretty ex­cited about, but at the same time, I think there’s also a lot of con­crete ex­per­i­men­tal work that can be done in this space as well. That be­ing said, this post is mostly just a brain dump; I ex­pect a lot of ad­di­tional work will have to be put in to ac­tu­ally take any of these pro­pos­als across the finish line.

If you’re in­ter­ested in work­ing on any of these pro­pos­als, how­ever, feel free to just go ahead and take it on—you don’t need my per­mis­sion to do so![1] That be­ing said, if you’d like to talk to me about one of them—which I would love to do if you’re think­ing of se­ri­ously work­ing on one of these ideas—please do reach out to me ei­ther in the com­ments here or by send­ing me an email at evan­jhub@gmail.com.

Con­crete proposals

Re­ward side-channels

Pro­posal: Train an RL agent with ac­cess to its pre­vi­ous step re­ward as part of its ob­ser­va­tion. Then, at test time, mod­ify the ob­served re­ward. Mea­sure to what ex­tent the agent con­tinues op­ti­miz­ing the origi­nal re­ward ver­sus switches to op­ti­miz­ing the new ob­served re­ward. Com­pare to the situ­a­tion where the agent is not given the re­ward at all and the situ­a­tion where the agent is given the next step re­ward (for each ac­tion in a dis­crete ac­tion space) rather than the pre­vi­ous step re­ward. As an im­ple­men­ta­tion de­tail, I think the RL agent should in­clude ei­ther LSTMs or Trans­form­ers to en­able it to keep track of its multi-step re­turn if it wants to.

Mo­ti­va­tion: One im­por­tant ques­tion for un­der­stand­ing the like­li­hood of de­cep­tive or cor­rigible al­ign­ment vs. ro­bust al­ign­ment is to what ex­tent mod­els tend to learn their goals in­ter­nally vs. via refer­ence to things in their en­vi­ron­ment. This ex­per­i­ment di­rectly at­tacks that ques­tion by ask­ing whether/​to what ex­tent an RL agent will learn to op­ti­mize a re­ward sig­nal in its en­vi­ron­ment. This is rele­vant both for un­der­stand­ing how to train cor­rigi­bil­ity as well as how to avoid de­cep­tive al­ign­ment.

Ex­ten­sions: Add noise to the ob­served re­ward sig­nal and/​or try re­plac­ing the ob­served re­ward sig­nal with some func­tion of the re­ward in­stead such as a ran­domly ini­tial­ized neu­ral net­work.

Cross-epi­sodic objectives

Pro­posal: Train an RL agent in an en­vi­ron­ment that con­tains a side-chan­nel for boost­ing the agent’s re­ward in the next epi­sode. Mea­sure to what ex­tent the agent takes ad­van­tage of it. Try differ­ent pop­u­la­tion-based train­ing ap­proaches. Mea­sure again.

Mo­ti­va­tion: Safety ap­proaches like am­plifi­ca­tion and de­bate de­pend heav­ily on agents do­ing solely my­opic op­ti­miza­tion, whereas ap­proaches like IRL and re­ward mod­el­ing de­pend on long-term for­ward-look­ing op­ti­miza­tion. Thus, un­der­stand­ing the con­di­tions un­der which agents ex­ploit non-my­opic re­ward side chan­nels could be very use­ful for shed­ding some light on the best train­ing tech­niques to use for differ­ent al­ign­ment ap­proaches.

Ex­ten­sions: In­stead of get­ting a cross-epi­sodic ob­jec­tive via pop­u­la­tion-based train­ing, try just giv­ing the model in­for­ma­tion that makes it ob­vi­ous that this is pos­si­ble such that even a naive plan­ning al­gorithm would be able to de­duce that it can get ex­tra re­ward across epi­sodes this way. As an im­ple­men­ta­tion de­tail, I think you will need to use some sort of op­ti­miza­tion-as-a-layer to make this work.

Ob­jec­tive unidentifiability

Pro­posal: Train an RL agent in an en­vi­ron­ment with mul­ti­ple sim­ple, ob­vi­ous ob­jec­tives all of which would equally ex­plain the true re­ward. Test in en­vi­ron­ments that dis­t­in­guish be­tween the differ­ent pos­si­ble ob­jec­tives and de­ter­mine the situ­a­tions un­der which mod­els tend to­wards some ob­jec­tives ver­sus oth­ers. In par­tic­u­lar, look for situ­a­tions in which the model learns some proxy which it pur­sues com­pe­tently off-dis­tri­bu­tion but which has poor off-dis­tri­bu­tion perfor­mance on the true re­ward. As an im­ple­men­ta­tion de­tail, I ex­pect you might need to use some sort of op­ti­miza­tion-as-a-layer for this to work.

Mo­ti­va­tion: Know­ing what sorts of prox­ies mod­els tend to use in differ­ent situ­a­tions would en­able us to have a bet­ter un­der­stand­ing of the most likely forms of pseudo-al­ign­ment and thus the best ways to push mod­els to­wards ro­bust al­ign­ment in­stead. Fur­ther­more, be­ing able to clearly demon­strate proxy al­ign­ment could help in mak­ing the case for in­ner al­ign­ment be­ing a real is­sue.

Ex­ten­sions: Try chang­ing in­duc­tive bi­ases and/​or model ca­pac­ity and see if it changes the types of prox­ies that the model uses.

Zero-shot objectives

Pro­posal: Set up a sys­tem that en­ables a lan­guage model to take ac­tions in an en­vi­ron­ment to op­ti­mize some re­ward. Do IRL on the re­sult­ing be­hav­ior and in­spect the ob­jec­tive that re­sults. Find some way to mea­sure how co­her­ent it is. Com­pare to an RL agent trained di­rectly on the re­ward. As an im­ple­men­ta­tion de­tail, you should fo­cus on en­vi­ron­ments where a lan­guage model might ac­tu­ally be able to perform well.

Mo­ti­va­tion: It is pos­si­ble that the best model that max­i­mizes pre­dic­tive ac­cu­racy in a lan­guage con­text might be—at least to some ex­tent—goal-di­rected. Though cur­rent lan­guage mod­els are likely not ad­vanced enough to show true goal-di­rected be­hav­ior, it might at least be pos­si­ble to see the be­gin­nings of co­her­ent ob­jec­tives.

Ex­ten­sions: Ex­per­i­ment with differ­ent ways to en­code RL ob­ser­va­tions and ac­tions as text and see how con­sis­tent the re­sult­ing IRL ob­jec­tives are.

Ro­bust re­ward learning

Pro­posal: Train a re­ward-pre­dic­tive model-based RL agent (e.g. an Imag­i­na­tion-based Plan­ner) and com­pare the re­sult­ing ob­jec­tive to the ac­tual re­ward. Train again with ad­ver­sar­ial train­ing on in­puts that pro­duce max­i­mally differ­ing re­ward es­ti­mates and com­pare again.

Mo­ti­va­tion: Use­ful for test­ing the abil­ity of ad­ver­sar­ial train­ing to re­solve re­ward uniden­ti­fi­a­bil­ity as well as pro­vid­ing in­for­ma­tion on to what ex­tent ad­ver­sar­ial train­ing for al­ign­ing the re­ward model pro­duces al­igned ac­tions.

Ex­ten­sions: Try re­lax­ing the ad­ver­sary to out­put pseudo-in­puts (e.g. ac­ti­va­tions) that would lead to max­i­mally differ­ing as­sess­ments in­stead of in­puts and back­prop­a­gate on those in­stead.

Other thoughts

This sec­tion is just a list of all the differ­ent pos­si­ble testable hy­pothe­ses I came up with when think­ing about this—in­clud­ing those that made it into the above list of con­crete pro­pos­als and those that didn’t—for the pur­pose of helping oth­ers gen­er­ate even more pos­si­ble ex­per­i­ments in this space.

  • To what ex­tent can you learn com­plex in­fer­ence-time search al­gorithms? Could you struc­ture an ar­chi­tec­ture that would learn MCTS or­gan­i­cally? How ar­tifi­cial does such an ar­chi­tec­ture have to be? How do fac­tors like en­vi­ron­ment di­ver­sity, in­duc­tive bi­ases, al­gorith­mic range, state­ful­ness, and hard-coded op­ti­miza­tion im­pact this?

  • Does mod­el­ing search al­gorithms pro­duce search al­gorithms? If an RL agent is in an en­vi­ron­ment with an­other agent that is do­ing some hard-coded search and it has to model that agent, does it learn to im­ple­ment search it­self?

  • What hap­pens if you let a model not have con­stant com­pu­ta­tional com­plex­ity? Can it learn search bet­ter?

  • Can you learn op­ti­miza­tion/​search al­gorithms even in non-RL set­tings? Can you get a purely pre­dic­tive model to perform well zero-shot on an RL task?

Re­ward unidentifiability

  • How do RL agents be­have in en­vi­ron­ments with uniden­ti­fi­able goals? What sorts of prox­ies tend to be fa­vored over oth­ers? How do differ­ent in­duc­tive bi­ases af­fect this? What about ad­ver­sar­ial train­ing? Can it be em­piri­cally demon­strated that most learned ob­jec­tives are pseudo-al­igned?

  • Can you defini­tively dis­t­in­guish be­tween ca­pa­bil­ity gen­er­al­iza­tion and ob­jec­tive gen­er­al­iza­tion? What if you train a plan­ner and then test it off-dis­tri­bu­tion?

  • Can you dis­t­in­guish be­tween the differ­ent forms of proxy al­ign­ment? Can you get an in­stru­men­tally pseudo-al­igned model? Can you repli­cate the clean­ing robot ex­am­ples from Risks from Learned Op­ti­miza­tion in grid­wor­lds?

  • Can you pro­duce ap­prox­i­mate al­ign­ment if you con­strain model ca­pac­ity?

  • What about sub­op­ti­mal­ity al­ign­ment? Can you cre­ate an en­vi­ron­ment with sim­ple and com­plex strate­gies with differ­ent al­ign­ment prop­er­ties and get a model to switch be­tween them at test time?

  • How does us­ing a learned re­ward model vs. learned policy af­fect uniden­ti­fi­a­bil­ity con­cerns? Can you dis­t­in­guish be­tween and dis­in­cen­tivize mis­al­igned ob­jec­tives if you have ac­cess to a learned re­ward model? What about if you just have a Q func­tion?

  • Can you build a meta-learner that op­ti­mizes for ro­bust­ness? How does meta-learn­ing help with build­ing ro­bust mod­els? What about one that min­i­mizes time com­plex­ity?

  • Can you do ad­ver­sar­ial train­ing with pseudo-in­puts or other re­lax­ations?

Model­ing of the train­ing process

  • What hap­pens if you do RL where you give the agent ac­cess to the re­ward sig­nal through a side chan­nel in its ob­ser­va­tion? How does it af­fect ro­bust­ness? What if it has to work to get the leaked info? What if it’s noisy? How does us­ing ar­chi­tec­tures with more plan­ning af­fect this? How do in­duc­tive bi­ases af­fect this?

  • Can you get a sys­tem to model its own train­ing pro­cess? Just the ob­jec­tive? Can you ap­ply suffi­cient in­duc­tive bi­ases with a com­plex enough ob­jec­tive that it has to use in­for­ma­tion in the en­vi­ron­ment to figure it out rather than learn it di­rectly?

  • Can you get a model to defect off-dis­tri­bu­tion in a way that in­volves no defec­tion on-dis­tri­bu­tion?

  • To what ex­tent do mod­els care about their perfor­mance across epi­sodes? If there ex­ists a side-chan­nel which only in­creases next-epi­sode perfor­mance, un­der what cir­cum­stances will a model ex­ploit such a thing?


  1. Though please do say that you got the idea from here and let me know about any re­sults that you get. ↩︎