[AN #78] Formalizing power and instrumental convergence, and the end-of-year AI safety charity comparison

Link post

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by replying to this email.

Merry Christmas!

Audio version here (may not be up yet).


2019 AI Alignment Literature Review and Charity Comparison (Larks) (summarized by Rohin): As in the three previous years (AN #38), this mammoth post goes through the work done within AI alignment from December 2018 to November 2019, from the perspective of someone trying to decide which of several AI alignment organizations to donate to. As part of this endeavor, Larks summarizes several papers that were published at various organizations, and compares each organization’s output to its budget and room for more funding.

Rohin’s opinion: I look forward to this post every year. This year, it’s been a stark demonstration of how much work doesn’t get covered in this newsletter—while I tend to focus on the technical alignment problem, with some focus on AI governance and AI capabilities, Larks’s literature review spans many organizations working on existential risk, and as such includes many papers that were never covered in this newsletter. Anyone who wants to donate to an organization working on AI alignment and/or x-risk should read this post. However, if your goal is instead to figure out what the field has been up to over the last year, for the sake of building inside-view models of what’s happening in AI alignment, this post is less well suited; I might soon write up such an overview myself, but no promises.

Seeking Power is Provably Instrumentally Convergent in MDPs (Alex Turner et al) (summarized by Rohin): The Basic AI Drives argues that it is instrumentally convergent for an agent to collect resources and gain power. This post and associated paper aim to formalize this argument. Informally, an action is instrumentally convergent if it is helpful for many goals, or equivalently, an action is instrumentally convergent to the extent that we expect an agent to take it, if we do not know what the agent’s goal is. Similarly, a state has high power if it is easier to achieve a wide variety of goals from that state.

A natural formalization is to assume we have a distribution over the agent’s goal, and to define power and instrumental convergence relative to this distribution. We can then define power as the expected value that can be obtained from a state (modulo some technical caveats), and instrumental convergence as the probability that an action is optimal, from our perspective of uncertainty: of course, the agent knows its own goal, and acts optimally in pursuit of that goal.
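These definitions can be illustrated with a toy computation (this is a deliberately crude sketch, not the paper's exact formalism: the environment, the uniform goal distribution, and the undiscounted notion of "power" below are all made-up simplifications):

```python
# Toy deterministic environment: from state 0, action "a" leads to state 1
# and action "b" leads to state 2.  From state 1 the agent can also reach
# state 3; state 2 is a dead end.  A goal rewards exactly one state, drawn
# uniformly from {1, 2, 3}.
reachable = {1: {1, 3}, 2: {2}}  # states reachable after each initial action
goals = [1, 2, 3]                # uniform distribution over single-state goals

# "Power" of a state (crude, undiscounted sketch): the fraction of goals
# achievable from that state.
power = {s: sum(g in r for g in goals) / len(goals)
         for s, r in reachable.items()}

# Instrumental convergence of an initial action: the probability that it is
# optimal under a goal we are uncertain about.
p_action_a = sum(goal in reachable[1] for goal in goals) / len(goals)
p_action_b = sum(goal in reachable[2] for goal in goals) / len(goals)

print(power)                     # state 1 is more powerful than state 2
print(p_action_a, p_action_b)   # action "a" is more instrumentally convergent
```

Here the action leading to the more powerful state is also the more instrumentally convergent one, but as the gap-year example below shows, the two notions can come apart.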

You might think that optimal agents would provably seek out states with high power. However, this is not true. Consider a decision faced by high school students: should they take a gap year, or go directly to college? Let’s assume college is necessary for (100-ε)% of careers, but if you take a gap year, you could focus on the other ε% of careers or decide to go to college after the year. Then in the limit of farsightedness, taking a gap year leads to a more powerful state, since you can still achieve all of the careers, albeit slightly less efficiently for the college careers. However, if you know which career you want, then it is (100-ε)% likely that you go to college, so going to college is very strongly instrumentally convergent even though taking a gap year leads to a more powerful state.

Nonetheless, there are things we can prove. Consider environments where the only cycles are self-loops (states with a single action leading back to the same state), every other action leads to a new state, and many states have more than one action. In such environments, farsighted agents are more likely to choose trajectories that spend more time navigating to a cycle before spending the rest of the time in that cycle. For example, in Tic-Tac-Toe where the opponent is playing optimally according to the normal win condition, but the agent’s reward for each state is drawn independently from some distribution on [0, 1], the agent is much more likely to play out to a long game where the entire board is filled. This is because the number of states that can be reached grows exponentially in the horizon, and so agents have more control by taking longer trajectories. Equivalently, the cycle with maximal reward is much more likely to be at the end of a longer trajectory, and so the optimal possibility is more likely to be a long trajectory.
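The counting argument can be checked with a quick Monte Carlo simulation (the numbers of cycles below are invented for illustration): when terminal rewards are drawn i.i.d., the best terminal cycle lies in the larger set with probability proportional to its size, so optimal play favors the longer trajectories that reach exponentially more states.

```python
import random

random.seed(0)
# Suppose short trajectories reach 2 terminal cycles, while long
# trajectories reach 100 (made-up counts standing in for the exponential
# growth of reachable states with the horizon).
n_short, n_long, trials = 2, 100, 10_000
long_wins = 0
for _ in range(trials):
    # i.i.d. rewards, one per terminal cycle
    rewards = [random.random() for _ in range(n_short + n_long)]
    best = max(range(n_short + n_long), key=lambda i: rewards[i])
    if best >= n_short:  # indices n_short.. are the long-trajectory cycles
        long_wins += 1

frac = long_wins / trials
print(frac)  # close to 100 / 102, i.e. roughly 0.98
```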

Rohin’s opinion: I like the formalizations of power and instrumental convergence. I think in practice there will be a lot of complexity in a) the reward distribution that power and instrumental convergence are defined relative to, b) the structure of the environment, and c) how powerful AI systems actually work (since they won’t be perfectly optimal, and won’t know the environment structure ahead of time). Nonetheless, results with specific classes of reward distributions, environment structures, and agent models can still provide useful intuition.

Read more: Clarifying Power-Seeking and Instrumental Convergence, Paper: Optimal Farsighted Agents Tend to Seek Power

Technical AI alignment

Technical agendas and prioritization

A dilemma for prosaic AI alignment (Daniel Kokotajlo) (summarized by Rohin): This post points out a potential problem for prosaic AI alignment (AN #34), in which we try to align AI systems built using current techniques. Consider some prosaic alignment scheme, such as iterated amplification (AN #30) or debate (AN #5). If we try to train an AI system directly using such a scheme, it will likely be uncompetitive, since the most powerful AI systems will probably require cutting-edge algorithms, architectures, objectives, and environments, at least some of which will be replaced by new versions from the safety scheme. Alternatively, we could first train a general AI system, and then use our alignment scheme to finetune it into an aligned AI system. However, this runs the risk that the initial training could create a misaligned mesa optimizer, which then deliberately sabotages our finetuning efforts.

Rohin’s opinion: The comments reveal a third possibility: the alignment scheme could be trained jointly alongside the cutting-edge AI training. For example, we might hope that we can train a question answerer that can answer questions about anything “the model already knows”, with this question answering system trained simultaneously with the model itself. I think this takes the “oomph” out of the dilemma as posed here—it seems reasonably likely that it only takes fractionally more resources to train a question answering system on top of the model, if it only has to use knowledge “already in” the model, which would let it be competitive, while still preventing mesa optimizers from arising (if the alignment scheme does its job). Of course, it may turn out that it takes a huge amount of resources to train the question answering system, making the system uncompetitive, but that seems hard to predict given our current knowledge.

Technical AGI safety research outside AI (Richard Ngo) (summarized by Rohin): This post lists 30 questions relevant to technical AI safety that could benefit from expertise outside of AI, divided into four categories: studying and understanding safety problems, solving safety problems, forecasting AI, and meta.

Mesa optimization

Is the term mesa optimizer too narrow? (Matthew Barnett) (summarized by Rohin): The mesa optimization (AN #58) paper defined an optimizer as a system that internally searches through a search space for elements that score high according to some explicit objective function. However, humans would not qualify as mesa optimizers by this definition, since there (presumably) isn’t some part of the brain that explicitly encodes some objective function that we then try to maximize. In addition, there are inner alignment failures that don’t involve mesa optimization: a small feedforward neural net doesn’t do any explicit search; yet when it is trained in the chest and keys environment (AN #67), it learns a policy that goes to the nearest key, which is equivalent to a key-maximizer. Rather than talking about “mesa optimizers”, the post recommends that we instead talk about “malign generalization”, to refer to the problem when capabilities generalize but the objective doesn’t (AN #66).

Rohin’s opinion: I strongly agree with this post (though note that the post was written right after a conversation with me on the topic, so this isn’t independent evidence). I find it very unlikely that most powerful AI systems will be optimizers as defined in the original paper, but I do think that the malign generalization problem will apply to our AI systems. For this reason, I hope that future research doesn’t specialize to the case of explicit-search-based agents.

Learning human intent

Positive-Unlabeled Reward Learning (Danfei Xu et al) (summarized by Zach): The problem with learning a reward model and training an agent on the (now fixed) model is that the agent can learn to exploit errors in the reward model. Adversarial imitation learning seeks to avoid this by training a discriminator reward model with the agent: the discriminator is trained via supervised learning to distinguish between expert trajectories and agent trajectories, while the agent tries to fool the discriminator. However, this effectively treats the agent trajectories as negative examples — even once the agent has mastered the task. What we would really like to do is to treat the agent trajectories as unlabeled data. This is an instance of semi-supervised learning, in which a classifier has access to a small set of labeled data and a much larger collection of unlabeled data. In general, the common approach is to propagate classification information learned using labels to the unlabeled dataset. The authors apply a recent algorithm for positive-unlabeled (PU) learning, and show that this approach can improve upon both GAIL and supervised reward learning.
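To make the PU setup concrete, here is a minimal sketch of a generic non-negative PU risk estimator in the spirit of Kiryo et al. (not the paper's exact algorithm; the function names and the sigmoid surrogate loss are my choices). Expert trajectories play the role of labeled positives, agent trajectories are the unlabeled mixture, and a class prior encodes an assumed fraction of agent trajectories that are actually expert-like:

```python
import math

def sigmoid_loss(score, label):
    # Surrogate loss l(s, y) = sigmoid(-y * s): small when the classifier
    # score agrees with the label y in {-1, +1}.
    return 1.0 / (1.0 + math.exp(label * score))

def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative positive-unlabeled risk: estimate the negative-class
    risk from unlabeled data by subtracting the positives' contribution,
    clipping at zero so the estimate cannot go negative."""
    r_p_pos = sum(sigmoid_loss(s, +1) for s in pos_scores) / len(pos_scores)
    r_p_neg = sum(sigmoid_loss(s, -1) for s in pos_scores) / len(pos_scores)
    r_u_neg = sum(sigmoid_loss(s, -1) for s in unl_scores) / len(unl_scores)
    return prior * r_p_pos + max(0.0, r_u_neg - prior * r_p_neg)

# Expert (positive) trajectories score high; agent (unlabeled) ones vary.
risk = nn_pu_risk(pos_scores=[2.0, 3.0], unl_scores=[-2.0, 1.5], prior=0.5)
print(risk)
```

The clipping in the last line is what distinguishes this from a naive unbiased PU estimator, which can drive the empirical risk negative and overfit.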

Zach’s opinion: I liked this paper because it offers a novel solution to a common concern with the adversarial approach: GAN approaches often train discriminators that overpower the generator, leading to mode collapse. In the RL setting, it seems natural to leave agent-generated trajectories unlabeled, since we don’t have any ground truth for whether or not agent trajectories are successful. For example, it might be possible to perform a task in a way that differs from what is shown in the demonstrations. In this case, it makes sense to try to propagate feedback to the larger unlabeled agent trajectory dataset indirectly. Presumably, this wasn’t previously possible because positive-unlabeled learning has only recently been generalized to the deep learning setting. After reading this paper, my broad takeaway is that semi-supervised methods are starting to reach the point where they have the potential to further progress in imitation learning.

Miscellaneous (Alignment)

What are some non-purely-sampling ways to do deep RL? (Evan Hubinger) (summarized by Matthew): A deep reinforcement learning agent trained by reward samples alone may predictably lead to a proxy alignment issue: the learner could fail to develop a full understanding of what behavior it is being rewarded for, and thus behave unacceptably when it is taken off its training distribution. Since we often use explicit specifications to define our reward functions, Evan Hubinger asks how we can incorporate this information into our deep learning models so that they remain aligned off the training distribution. He names several possibilities for doing so, such as giving the deep learning model access to a differentiable copy of the reward function during training, and fine-tuning a language model so that it can map natural language descriptions of a reward function into optimal actions.

Matthew’s opinion: I’m unsure, though leaning skeptical, whether incorporating a copy of the reward function into a deep learning model would help it learn. My guess is that if someone did that with a current model it would make the model harder to train, rather than making anything easier. I will be excited if someone can demonstrate at least one feasible approach to addressing proxy alignment that does more than sample the reward function.

Rohin’s opinion: I’m skeptical of this approach. Mostly this is because I’m generally skeptical that an intelligent agent will consist of a separate “planning” part and “reward” part. However, if that were true, then I’d think that this approach could plausibly give us some additional alignment, but can’t solve the entire problem of inner alignment. Specifically, the reward function encodes a huge amount of information: it specifies the optimal behavior in all possible situations you could be in. The “intelligent” part of the net is only ever going to get a subset of this information from the reward function, and so its plans can never be perfectly optimized for that reward function, but instead could be compatible with any reward function that would provide the same information on the “queries” that the intelligent part has produced.

For a slightly-more-concrete example, for any “normal” utility function U, there is a utility function U’ that is “like U, but also the best outcomes are ones in which you hack the memory so that the ‘reward’ variable is set to infinity”. To me, wireheading is possible because the “intelligent” part doesn’t get enough information about U to distinguish U from U’, and so its plans could very well be optimized for U’ instead of U.

Other progress in AI

Reinforcement learning

Model-Based Reinforcement Learning: Theory and Practice (Michael Janner et al) (summarized by Rohin): This post provides a broad overview of model-based reinforcement learning, and argues that a learned (explicit) model allows you to generate sample trajectories from the current policy at arbitrary states, correcting for off-policy error, at the cost of introducing model bias. Since model errors compound as you sample longer and longer trajectories, the authors propose an algorithm in which the model is used to sample short trajectories from states in the replay buffer, rather than sampling trajectories from the initial state (which are as long as the task’s horizon).
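The branched-rollout idea can be sketched as follows (the interfaces and names here are made up for illustration; the actual algorithm also uses model ensembles and other machinery): rather than rolling the learned model out for the full horizon from the initial state, start many short rollouts from real states sampled out of the replay buffer.

```python
import random

def short_model_rollouts(model, policy, replay_buffer, k, n_starts):
    """Generate synthetic transitions from k-step model rollouts that
    branch off real states, limiting how far model errors can compound."""
    synthetic = []
    for _ in range(n_starts):
        state = random.choice(replay_buffer)  # a real, previously visited state
        for _ in range(k):                    # short horizon bounds model bias
            action = policy(state)
            next_state, reward = model(state, action)
            synthetic.append((state, action, reward, next_state))
            state = next_state
    return synthetic

# Toy chain environment: the "learned" model just adds the action to the state.
random.seed(0)
data = short_model_rollouts(
    model=lambda s, a: (s + a, 1.0),
    policy=lambda s: 1,
    replay_buffer=[0, 10, 20],
    k=3,
    n_starts=4,
)
print(len(data))  # n_starts * k = 12 synthetic transitions
```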

Read more: Paper: When to Trust Your Model: Model-Based Policy Optimization

Deep learning

Inductive biases stick around (Evan Hubinger) (summarized by Rohin): This update to Evan’s double descent post (AN #77) explains why he thinks double descent is important. Specifically, Evan argues that it shows that inductive biases matter even for large, deep models. In particular, double descent shows that larger models are simpler than smaller models, at least in the overparameterized setting where models are past the interpolation threshold where they can get approximately zero training error. This makes the case for mesa optimization (AN #58) stronger, since mesa optimizers are simple, compressed policies.

Rohin’s opinion: As you might have gathered last week, I’m not sold on double descent as a clear, always-present phenomenon, though it certainly is a real effect that occurs in at least some situations. So I tend not to believe counterintuitive conclusions like “larger models are simpler” that are premised on double descent.

Regardless, I expect that powerful AI systems are going to be severely underparameterized, and so I don’t think it really matters that past the interpolation threshold larger models are simpler. I don’t think the case for mesa optimization should depend on this; humans are certainly “underparameterized”, but should count as mesa optimizers.

The Quiet Semi-Supervised Revolution (Vincent Vanhoucke) (summarized by Flo): Historically, semi-supervised learning that uses small amounts of labelled data combined with a lot of unlabeled data only helped when there was very little labelled data available. In this regime, both supervised and semi-supervised learning were too inaccurate to be useful. Furthermore, approaches like using a representation learnt by an autoencoder for classification empirically limited asymptotic performance. This is strange, because using more data should not lead to worse performance.

Recent trends suggest that this might change soon: semi-supervised systems have begun to outperform supervised systems by larger and larger margins in the low-data regime, and their advantage now extends into regimes with more and more data. An important driver of this trend is the idea of using data augmentation for more consistent self-labelling.
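The consistency idea can be sketched minimally (names are illustrative; real systems use stronger augmentations, confidence thresholds, and a stop-gradient on the pseudo-label): the model's prediction on a clean unlabeled example serves as a soft label for an augmented view of the same example.

```python
def consistency_loss(model, unlabeled_batch, augment):
    """Penalize disagreement between predictions on an example and on an
    augmented view of it; no ground-truth labels are needed."""
    total = 0.0
    for x in unlabeled_batch:
        p_clean = model(x)        # soft pseudo-label (in practice: no gradient here)
        p_aug = model(augment(x))
        total += sum((a - b) ** 2 for a, b in zip(p_clean, p_aug))  # L2 consistency
    return total / len(unlabeled_batch)

# Toy "model": a soft binary classifier over scalars in [0, 1].
model = lambda x: [x, 1.0 - x]
batch = [0.2, 0.8]

print(consistency_loss(model, batch, augment=lambda x: x))        # identity: 0.0
print(consistency_loss(model, batch, augment=lambda x: x + 0.1))  # shift: > 0
```

Minimizing this term pushes the model to give the same answer on all augmented views, which is one way of encoding the prior that augmentation should not change the correct class.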

Better semi-supervised learning might, for example, be useful for federated learning, which attempts to respect privacy by learning locally on (labelled) user data and sending the models trained by different users to be combined on a central server. One problem with this approach is that the central model might memorize some of the private models’ idiosyncrasies, such that inference about the private labels is possible. Semi-supervised learning makes this harder by reducing the amount of influence private data has on the aggregate model.

Flo’s opinion: Because the way humans classify things is strongly influenced by our priors about how classes “should” behave, learning with limited data most likely requires some information about these priors. Semi-supervised learning that respects that data augmentation does not change the correct classification might be an efficient and scalable way to force some of these priors onto a model. Thus it seems likely that more diverse and sophisticated data augmentation could lead to further improvements in the near term. On the other hand, a lot of our priors would be very hard to capture using automatic data augmentation alone, such that other methods to transfer our priors remain important.