Alignment Newsletter #34

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through the database of all summaries.

Highlights

Scalable agent alignment via reward modeling (Jan Leike): This blog post and the associated paper outline a research direction that DeepMind’s AGI safety team is pursuing. The key idea is to learn behavior by learning a reward and a policy simultaneously, from human evaluations of outcomes, which can scale to superhuman performance in tasks where evaluation is easier than demonstration. However, in many cases it is hard for humans to evaluate outcomes: in this case, we can train simpler agents using reward modeling that can assist the human in evaluating outcomes for the harder task, a technique the authors call recursive reward modeling. For example, if you want to train an agent to write a fantasy novel, it would be quite expensive to have a human evaluate outcomes, i.e. rate how good the produced fantasy novels are. We could instead use reward modeling to train agents that can produce plot summaries, assess prose quality and character development, etc., which allows a human to assess the fantasy novels. There are several research challenges, such as what kind of feedback to get, making it sufficiently sample efficient, preventing reward hacking and unacceptable outcomes, and closing the reward-result gap. They outline several promising approaches to solving these problems.
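
To make the training loop concrete, here is a minimal, hedged sketch of (non-recursive) reward modeling in Python. The names are made up, and simple stubs stand in for the neural networks, the RL algorithm, and the human: a reward model is fit to pairwise human preferences over outcomes, while the policy is improved against that learned reward. In recursive reward modeling, the human_prefers step would itself be assisted by agents trained with reward modeling on easier evaluation subtasks.

```python
import random

class RewardModel:
    """Stub reward model: predicts how much the human would approve of an outcome."""
    def __init__(self):
        self.weight = 0.0

    def score(self, trajectory):
        return self.weight * sum(trajectory)

    def update(self, preferred, rejected):
        # Nudge the model so that preferred outcomes score higher than rejected ones.
        self.weight += 0.01 * (sum(preferred) - sum(rejected))

def run_policy(policy_param, horizon=5):
    """Roll out the current policy in the environment (stubbed with random numbers)."""
    return [random.gauss(policy_param, 1.0) for _ in range(horizon)]

def human_prefers(traj_a, traj_b):
    """Stub for an expensive human comparison of two outcomes."""
    return sum(traj_a) >= sum(traj_b)

policy_param = 0.0
reward_model = RewardModel()

for step in range(100):
    # 1. Generate behavior with the current policy.
    traj_a, traj_b = run_policy(policy_param), run_policy(policy_param)
    # 2. Ask the human which outcome they prefer, and fit the reward model to that judgment.
    if human_prefers(traj_a, traj_b):
        reward_model.update(preferred=traj_a, rejected=traj_b)
    else:
        reward_model.update(preferred=traj_b, rejected=traj_a)
    # 3. Improve the policy against the *learned* reward (here, a crude hill-climbing step
    #    standing in for a real RL algorithm).
    candidate = policy_param + random.gauss(0, 0.1)
    if reward_model.score(run_policy(candidate)) > reward_model.score(run_policy(policy_param)):
        policy_param = candidate
```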

Rohin’s opinion: The proposal sounds to me like a specific flavor of narrow value learning, where you learn reward functions to accomplish particular tasks, rather than trying to figure out the “true human utility function”. The recursive aspect is similar to iterated amplification and debate. Iterated amplification and debate can be thought of as operating on a tree of arguments, where each node is the result of considering many child nodes (the considerations that go into the argument). Importantly, the child nodes are themselves arguments that can be decomposed into smaller considerations. Iterated amplification works by learning how to compose and decompose nodes from children, while debate works by having humans evaluate a particular path in the argument tree. Recursive reward modeling instead uses reward modeling to train agents that can help evaluate outcomes on the task of interest. This seems less recursive to me, since the subagents are used to evaluate outcomes, which would typically be a different-in-kind task than the task of interest. This also still requires the tasks to be fast—it is not clear how to use recursive reward modeling to e.g. train an agent that can teach math to children, since it takes days or months of real time to even produce outcomes to evaluate. These considerations make me a bit less optimistic about recursive reward modeling, but I look forward to seeing future work that proves me wrong.

The post also talks about how reward modeling allows us to separate what to do (reward) from how to do it (policy). I think it is an open question whether this is desirable. Past work found that the reward generalized somewhat (whereas policies typically don’t generalize at all), but this seems relatively minor. For example, rewards inferred using deep variants of inverse reinforcement learning often don’t generalize. Another possibility is that the particular structure of “policy that optimizes a reward” provides a useful inductive bias that makes things easier to learn. It would probably also be easier to inspect a specification of “what to do” than to inspect learned behavior. However, these advantages are fairly speculative and it remains to be seen whether they pan out. There are also practical advantages: any advances in deep RL can immediately be leveraged, and reward functions can often be learned much more sample efficiently than behavior, reducing requirements on human labor. On the other hand, this design “locks in” that the specification of behavior must be a reward function. I’m not a fan of reward functions because they’re so unintuitive for humans to work with—if we could have agents that work with natural language, I suspect I do not want the natural language to be translated into a reward function that is then optimized.

Technical AI alignment

Iterated amplification sequence

Prosaic AI alignment (Paul Christiano): It is plausible that we can build “prosaic” AGI soon, that is, generally intelligent systems that can outcompete humans without qualitatively new ideas about intelligence. It seems likely that this would use some variant of RL to train a neural net architecture (other approaches don’t have a clear way to scale beyond human level). We could write the code for such an approach right now (see An unaligned benchmark from AN #33), and it’s at least plausible that with enough compute and tuning this could lead to AGI. However, this is likely to be bad if implemented as stated, due to the standard issues of reward gaming and Goodhart’s Law. We do have some approaches to alignment such as IRL and executing natural language instructions, but neither of these is at the point where we can write down code that would plausibly lead to an aligned AI. This suggests that we should focus on figuring out how to align prosaic AI.
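
As a rough illustration of the “we could write the code right now” claim, here is a toy policy-gradient (REINFORCE-style) loop in Python. It stands in for “some variant of RL training a parameterized policy”, with made-up stubs for the environment and the policy; the point is that the scaffolding is straightforward, and the danger lies in what the reward actually rewards.

```python
import random

def environment_step(action):
    """Stub environment: rewards actions close to a hidden target behavior."""
    return -abs(action - 3.0)

def policy_sample(theta):
    """Stub 'neural net' policy: a Gaussian whose mean is the single parameter theta."""
    return random.gauss(theta, 1.0)

theta, learning_rate = 0.0, 0.05
for iteration in range(2000):
    # Collect a small batch of experience with the current policy.
    batch = []
    for _ in range(16):
        action = policy_sample(theta)
        batch.append((action, environment_step(action)))
    baseline = sum(r for _, r in batch) / len(batch)
    # REINFORCE-style update: push the policy toward better-than-average actions.
    grad = sum((r - baseline) * (a - theta) for a, r in batch) / len(batch)
    theta += learning_rate * grad
```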

There are several reasons to focus on prosaic AI. First, since we know the general shape of the AI system under consideration, it is easier to think about how to align it (while ignoring details like architecture, variance reduction tricks, etc. which don’t seem very relevant currently). Second, it’s important, both because we may actually build prosaic AGI, and because even if we don’t the insights gained will likely transfer. In addition, worlds with short AGI timelines are higher leverage, and in those worlds prosaic AI seems much more likely. The main counterargument is that aligning prosaic AGI is probably infeasible, since we need a deep understanding of intelligence to build aligned AI. However, it seems unreasonable to be confident in this, and even if it is infeasible, it is worth getting strong evidence of this fact in order to change priorities around AI development, and coordinate on not building an AGI that is too powerful.

Rohin’s opinion: I don’t really have much to say here, except that I agree with this post quite strongly.

Approval-directed agents: overview and Approval-directed agents: details (Paul Christiano): These two posts introduce the idea of approval-directed agents, which are agents that choose actions that they believe their operator Hugh the human would most approve of, if he reflected on it for a long time. This is in contrast to the traditional approach of goal-directed behavior, which is defined by the outcomes of actions.

Since the agent Arthur is no longer reasoning about how to achieve outcomes, it can no longer outperform Hugh at any given task. (If you take the move in chess that Hugh most approves of, you probably still lose to Garry Kasparov.) This is still better than Hugh performing every action himself, because Hugh can provide an expensive learning signal which is then distilled into a fast policy that Arthur executes. For example, Hugh could deliberate for a long time whenever he is asked to evaluate an action, or he could evaluate very low-level decisions that Arthur makes billions of times. We can also still achieve superhuman performance by bootstrapping (see the next summary).

The main advantage of approval-directed agents is that we avoid locking in a particular goal, decision theory, prior, etc. Arthur should be able to change any of these, as long as Hugh approves it. In essence, approval-direction allows us to delegate these hard decisions to future overseers, who will be more informed and better able to make these decisions. In addition, any misspecifications seem to cause graceful failures—you end up with a system that is not very good at doing what Hugh wants, rather than one that works at cross purposes to him.

We might worry that internally Arthur still uses goal-directed behavior in order to choose actions, and this internal goal-directed part of Arthur might become unaligned. However, we could even have internal decision-making about cognition be approval-based. Of course, eventually we reach a point where decisions are simply made—Arthur doesn’t “choose” to execute the next line of code. These sorts of things can be thought of as heuristics that have led to choosing good actions in the past, that could be changed if necessary (e.g. by rewriting the code).

How might we write code that defines approval? If our agents can understand natural language, we could try defining “approval” in natural language. If they are able to reason about formally specified models, then we could try to define a process of deliberation with a simulated human. Even in the case where Arthur learns from examples, if we train Arthur to predict approval from observations and take the action with the highest approval, it seems possible that Arthur would not manipulate approval judgments (unlike AIXI).
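
For the learning-from-examples case, here is a minimal sketch (all names and the toy rating function are hypothetical) of “train Arthur to predict approval from observations and take the action with the highest approval”: the agent optimizes its prediction of Hugh’s judgment rather than any outcome in the world.

```python
import random

def hugh_rates(observation, action):
    """Stub for Hugh's slow, deliberate approval rating in [0, 1] of an action."""
    return 1.0 if action == observation % 4 else 0.0

class ApprovalPredictor:
    """Supervised model trained on Hugh's ratings (here, just a lookup table)."""
    def __init__(self):
        self.labels = {}

    def fit(self, observation, action, rating):
        self.labels[(observation, action)] = rating

    def predict(self, observation, action):
        # Unseen pairs get an uninformative guess.
        return self.labels.get((observation, action), 0.5)

ACTIONS = range(4)
predictor = ApprovalPredictor()

# Training: Hugh occasionally provides expensive ratings, which are distilled into the predictor.
for _ in range(500):
    obs = random.randrange(20)
    act = random.choice(list(ACTIONS))
    predictor.fit(obs, act, hugh_rates(obs, act))

# Acting: Arthur simply takes the action its model says Hugh would most approve of.
def arthur_act(observation):
    return max(ACTIONS, key=lambda a: predictor.predict(observation, a))
```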

There are also important details on how Hugh should rate—in particular, we have to be careful to distinguish between Hugh’s beliefs/information and Arthur’s. For example, if Arthur thinks there’s a 1% chance of a bridge collapsing if we drive over it, then Arthur shouldn’t drive over it. However, if Hugh always assigns approval 1 to the optimal action and approval 0 to all other actions, and Arthur believes that Hugh knows whether the bridge will collapse, then the maximum expected approval action is to drive over the bridge.
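
A tiny worked version of the bridge example, using the numbers from the text and assuming Hugh rates the optimal action 1 and every other action 0:

```python
# Arthur's beliefs: 99% the bridge holds, 1% it collapses.
p_safe, p_collapse = 0.99, 0.01

# If Hugh knows the true state, his rating of "drive" is 1 when the bridge is safe
# (driving is optimal) and 0 when it collapses; his rating of "don't drive" is the reverse.
expected_approval_drive = p_safe * 1.0 + p_collapse * 0.0   # = 0.99
expected_approval_stay  = p_safe * 0.0 + p_collapse * 1.0   # = 0.01

# Maximizing expected approval under this rating scheme tells Arthur to drive, even though
# Arthur's own 1% credence in collapse should make that too risky; hence ratings should be
# elicited relative to Arthur's information rather than Hugh's.
print(expected_approval_drive, expected_approval_stay)
```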

The main issues with approval-directed agents are that it’s not clear how to define them (especially from examples), whether they can be as useful as goal-directed agents, and whether approval-directed agents will have internal goal-seeking behavior that brings with it all of the problems that approval was meant to solve. It may also be a problem if some other Hugh-level intelligence gets control of the data that defines approval.

Rohin’s opinion: Goal-directed behavior requires an extremely intelligent overseer in order to ensure that the agent is pointed at the correct goal (as opposed to one the overseer thinks is correct but is actually slightly wrong). I think of approval-directed agents as providing the intuition that we may only require an overseer that is slightly smarter than the agent in order for the agent to be aligned. This is because the overseer can simply “tell” the agent what actions to take, and if the agent makes a mistake, or tries to optimize a heuristic too hard, the overseer can notice and correct it interactively. (This is assuming that we solve the informed oversight problem so that the agent doesn’t have information that is hidden from the overseer, so “intelligence” is the main thing that matters.) Only needing a slightly smarter overseer opens up a new space of solutions where we start with a human overseer and a subhuman AI system, and scale both the overseer and the AI at the same time while preserving alignment at each step.

Approval-directed bootstrapping (Paul Christiano): To get a very smart overseer, we can use the idea of bootstrapping. Given a weak agent, we can define a stronger agent that results from letting the weak agent think for a long time. This strong agent can be used to oversee a slightly weaker agent that is still stronger than the original weak agent. Iterating this process allows us to reach very intelligent agents. In approval-directed agents, we can simply have Arthur ask Hugh to evaluate approval for actions, and in the process of evaluation Hugh can consult Arthur. Here, the weak agent Hugh is being amplified into a stronger agent by giving him the ability to consult Arthur—and this becomes stronger over time as Arthur becomes more capable.
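
Here is a hedged, schematic sketch of the bootstrapping loop in Python (stub functions rather than a real training procedure): at each round the overseer is “Hugh consulting the current Arthur”, and the next Arthur is trained to take the actions that this amplified overseer most approves of.

```python
def hugh(question, assistant=None):
    """Stub for Hugh deliberating on a question, optionally consulting an assistant agent."""
    hint = assistant(question) if assistant else ""
    return f"Hugh's answer to {question!r}" + (f" (using {hint!r})" if hint else "")

def train_approval_directed_agent(overseer):
    """Stub for training an agent to choose the actions the given overseer most approves of."""
    def agent(question):
        return f"action approved by [{overseer(question)}]"
    return agent

# Round 0: a weak Arthur trained directly against unassisted Hugh.
arthur = train_approval_directed_agent(lambda q: hugh(q))

# Each subsequent round, Hugh is amplified by consulting the current Arthur, and a stronger
# Arthur is trained against this stronger overseer.
for round_number in range(3):
    amplified_overseer = lambda q, current=arthur: hugh(q, assistant=current)
    arthur = train_approval_directed_agent(amplified_overseer)

print(arthur("should we drive over the bridge?"))
```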

Rohin’s opinion: This complements the idea of approval from the previous posts nicely: while approval tells us how to build an aligned agent from a slightly smarter overseer, bootstrapping tells us how to improve the capabilities of the overseer and the agent.

Humans Consulting HCH (Paul Christiano): Suppose we unroll the recursion in the previous bootstrapping post: in that case, we see that Hugh’s evaluation of an answer can depend on a question that he asked Arthur, whose answer depends on how Hugh evaluated an answer that depended on a question that he asked Arthur, and so on. Inspired by this structure, we can define HCH (humans consulting HCH) to be a process that answers question Q by perfectly imitating how Hugh would answer question Q, if Hugh had access to the question-answering process. This means Hugh is able to consult a copy of Hugh, who is able to consult a copy of Hugh, who is able to consult a copy of Hugh, ad infinitum. This is one proposal for how to formally define a human’s enlightened judgment.
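
A minimal sketch of the HCH recursion, with a hypothetical ask_hugh stub standing in for a perfect imitation of Hugh answering a single question, and a depth cutoff added only so the sketch terminates (the idealized definition has no such cutoff):

```python
def hch(question, depth=3):
    """Answer a question the way Hugh would, given the ability to consult copies of this
    very process for subquestions."""
    def consult(subquestion):
        if depth == 0:
            return "Hugh's best unassisted guess"
        return hch(subquestion, depth - 1)
    return ask_hugh(question, consult)

def ask_hugh(question, consult):
    """Stub for a perfect imitation of Hugh answering one question; he may ask subquestions."""
    sub_answer = consult(f"what considerations bear on {question!r}?")
    return f"Hugh's answer to {question!r}, informed by: {sub_answer}"

print(hch("how should the fantasy novel end?"))
```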

You could also combine this with particular ML algorithms in an attempt to define versions of those algorithms aligned with Hugh’s enlightened judgment. For example, for RL algorithm A, we could define max-HCH_A to be A’s chosen action when maximizing Hugh’s approval after consulting max-HCH_A.

Rohin’s opinion: This has the same nice recursive structure as bootstrapping, but without the presence of the agent. This probably makes it more amenable to formal analysis, but I think that the interactive nature of bootstrapping (and iterated amplification more generally) is quite important for ensuring good outcomes: it seems way easier to control an AI system if you can constantly provide input and feedback.

Fixed point sequence

Fixed Point Discussion (Scott Garrabrant): This post discusses the various fixed point theorems from a mathematical perspective, without commenting on their importance for AI alignment.

Technical agendas and prioritization

Integrative Biological Simulation, Neuropsychology, and AI Safety (Gopal P. Sarma et al): See Import AI and this comment.

Learning human intent

Scalable agent alignment via reward modeling (Jan Leike): Summarized in the highlights!

Adversarial examples

A Geometric Perspective on the Transferability of Adversarial Directions (Zachary Charles et al)

AI strategy and policy

MIRI 2018 Update: Our New Research Directions (Nate Soares): This post gives a high-level overview of the new research directions that MIRI is pursuing with the goal of deconfusion, a discussion of why deconfusion is so important to them, an explanation of why MIRI is now planning to leave research unpublished by default, and a case for software engineers to join their team.

Rohin’s opinion: There aren’t enough details on the technical research for me to say anything useful about it. I’m broadly in support of deconfusion, but am either less optimistic about the tractability of deconfusion, or more optimistic about the possibility of success with our current notions (probably both). Keeping research unpublished-by-default seems reasonable to me given the MIRI viewpoint, for the reasons they talk about, though I haven’t thought about it much. See also Import AI.

Other progress in AI

Reinforcement learning

Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search (Lars Buesing et al) (summarized by Richard): This paper aims to alleviate the data inefficiency of RL by using a model to synthesise data. However, even when environment dynamics can be modeled accurately, it can be difficult to generate data which matches the true distribution. To solve this problem, the authors use a Structural Causal Model trained to predict the outcomes which would have occurred if different actions had been taken from previous states. Data is then synthesised by rolling out from previously-seen states. The authors test performance in a partially-observable version of Sokoban, in which their system outperforms other methods of generating data.
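
As a hedged illustration of the general counterfactual trick (not the paper’s actual model), the sketch below uses a toy scalar environment: infer the noise that must have produced an observed transition, then replay the same situation with that noise but a different action.

```python
import random

def true_dynamics(state, action, noise):
    """Toy structural model: next state is a deterministic function of state, action and noise."""
    return 0.9 * state + action + noise

# An observed transition from logged experience.
state, logged_action = 1.0, 0.5
noise = random.gauss(0.0, 0.1)
observed_next_state = true_dynamics(state, logged_action, noise)

# Abduction: given the (learned) model and the observed outcome, recover the noise that occurred.
inferred_noise = observed_next_state - (0.9 * state + logged_action)

# Counterfactual prediction: replay the same situation with the same noise but a different action.
alternative_action = -0.5
counterfactual_next_state = true_dynamics(state, alternative_action, inferred_noise)

# Counterfactual rollouts from previously seen states provide extra, on-distribution data
# for policy search without further environment interaction.
print(observed_next_state, counterfactual_next_state)
```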

Richard’s opinion: This is an interesting approach which I can imagine becoming useful. It would be nice to see more experimental work in more stochastic environments, though.

Natural Environment Benchmarks for Reinforcement Learning (Amy Zhang et al) (summarized by Richard): This paper notes that RL performance tends to be measured in simple artificial environments—unlike other areas of ML, in which using real-world data such as images or text is common. The authors propose three new benchmarks to address this disparity. In the first two, an agent is assigned to a random location in an image, and can only observe parts of the image near it. At every time step, it is able to move in one of the cardinal directions, unmasking new sections of the image, until it can classify the image correctly (task 1) or locate a given object (task 2). The third type of benchmark adds natural video as background to existing Mujoco or Atari tasks. In testing this third category of benchmark, they find that PPO and A2C fall into a local optimum where they ignore the observed state when deciding the next action.
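
A rough, gym-style skeleton of the first benchmark type, with made-up shapes and reward scheme purely to make the interface concrete:

```python
import random

class MaskedImageClassification:
    """Toy version of the proposed benchmark: the agent starts at a random location in an
    image, sees only a small window around itself, moves to reveal more, and at any step may
    instead guess the image's label."""
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, image, label, window=2):
        self.image, self.label, self.window = image, label, window
        self.size = len(image)
        self.pos = [random.randrange(self.size), random.randrange(self.size)]

    def _observation(self):
        r, c, w = self.pos[0], self.pos[1], self.window
        return [row[max(0, c - w):c + w + 1]
                for row in self.image[max(0, r - w):r + w + 1]]

    def step(self, action):
        if action in self.MOVES:
            dr, dc = self.MOVES[action]
            self.pos[0] = min(max(self.pos[0] + dr, 0), self.size - 1)
            self.pos[1] = min(max(self.pos[1] + dc, 0), self.size - 1)
            return self._observation(), 0.0, False  # moving unmasks more pixels, no reward yet
        # Any other action is interpreted as a class guess, which ends the episode.
        return self._observation(), float(action == self.label), True

env = MaskedImageClassification(image=[[random.random() for _ in range(8)] for _ in range(8)],
                                label="cat")
obs, reward, done = env.step("left")
```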

Richard’s opinion: While I agree with some of the concerns laid out in this paper, I’m not sure that these benchmarks are the best way to address them. The third task in particular is mainly testing for ability to ignore the “natural data” used, which doesn’t seem very useful. I think a better alternative would be to replace Atari with tasks in procedurally-generated environments with realistic physics engines. However, this paper’s benchmarks do benefit from being much easier to produce and less computationally demanding.

Deep learning

Do Better ImageNet Models Transfer Better? (Simon Kornblith et al) (summarized by Dan H)

Dan H’s opinion: This paper shows a strong correlation between a model’s ImageNet accuracy and its accuracy on transfer learning tasks; that is, better ImageNet models learn stronger features. This is evidence against the assertion that researchers are simply overfitting ImageNet. Other evidence is that the architectures themselves work better on different vision tasks. Further evidence against overfitting ImageNet is that many architectures designed for CIFAR-10, when trained on ImageNet, can be highly competitive on ImageNet.

Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks (Jie Hu, Li Shen, Samuel Albanie et al) (summarized by Dan H)

Read more: This method uses spatial summarization for increasing convnet accuracy and was discovered around the same time as this similar work. Papers with independent rediscoveries tend to be worth taking more seriously.

Improving Generalization for Abstract Reasoning Tasks Using Disentangled Feature Representations (Xander Steenbrugge et al)