Alignment Newsletter #52

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.


Thoughts on Human Models (Ramana Kumar and Scott Garrabrant): Many approaches to AI safety involve modeling humans in some way, for example in order to correctly interpret their feedback. However, there are significant disadvantages to human modeling. First and most importantly, if we have AI systems do useful things without modeling humans, then we can use human approval as a "test set": we can check whether the AI's behavior is something we approve of, and this is an independent evaluation of the AI system. However, if the AI system had a human model, then it may have optimized its behavior for human approval, and so we cannot use approval as a "test set". Second, if our AI system has a catastrophic bug, it seems better if it doesn't have any human models. An AI system without human models will at worst optimize for some unrelated goal like paperclips, which at worst leads to it treating humans as obstacles and causing extinction. However, an AI system with human models and a catastrophic bug might optimize for human suffering, or for having humans respond to email all day, etc. Third, an AI system with human models might be simulating conscious beings that can suffer. Fourth, since humans are agent-like, an AI system that models humans is likely to produce a subsystem that is agent-like and so dangerous.

The authors then discuss why it might be hard to avoid human models. Most notably, it is hard to see how to use a powerful AI system that avoids human models to produce a better future. In particular, human models could be particularly useful for interpreting specifications (in order to do what humans mean, as opposed to what we literally say) and for achieving performance given a specification (e.g. if we want to replicate aspects of human cognition). Another issue is that human modeling is hard to avoid entirely, since even "independent" tasks carry some information about the human motivations behind selecting them.

Nevertheless, the authors would like to see more work on engineering-focused approaches to AI safety without human models, especially since this area is neglected, with very little such work currently. While MIRI does work on AI safety without human models, it is from a very theoretical perspective. In addition to technical work, we could also promote certain types of AI research that are less likely to develop human models "by default" (e.g. training AI systems in procedurally generated simulations, rather than on human-generated text and images).

Rohin's opinion: While I don't disagree with the reasoning, I disagree with the main thrust of this post. I wrote a long comment about it; the TL;DR is that since humans want very specific behavior out of AI systems, the AI system needs to get a lot of information from humans about what it should do, and if it understands all that information then it necessarily has a (maybe implicit) human model. In other words, if you require your AI system not to have human models, it will not be very useful, and people will use other techniques.

Technical AI alignment

Iterated amplification

AI Alignment Podcast: AI Alignment through Debate (Lucas Perry and Geoffrey Irving) (summarized by Richard): We want AI safety solutions to scale to very intelligent agents; debate is one scalability technique. It is formulated as a two-player zero-sum perfect information game in which agents make arguments in natural language, to be evaluated by a human judge. Whether or not such debates are truth-conducive is an empirical question which we can try to evaluate experimentally; doing so will require both technical and social science expertise (as discussed in a previous post (AN #47)).

Richard’s opinion: I think one of the key ques­tions un­der­ly­ing De­bate is how effi­ciently nat­u­ral lan­guage can sum­marise rea­son­ing about prop­er­ties of the world. This ques­tion is sub­ject to some dis­agree­ment (at one ex­treme, Face­book’s roadmap to­wards ma­chine in­tel­li­gence de­scribes a train­ing en­vi­ron­ment which is “en­tirely lin­guis­ti­cally defined”) and prob­a­bly de­serves more pub­lic dis­cus­sion in the con­text of safety.

Rohin's note: If you've read the previous posts on debate, the novel parts of this podcast are on the relation between iterated amplification and debate (which has been discussed before, but not in as much depth), and the reasons for optimism and pessimism about debate.

Agent foundations

Pavlov Generalizes (Abram Demski): In the iterated prisoner's dilemma, the Pavlov strategy is to start by cooperating, and then switch the action you take whenever the opponent defects. This can be generalized to arbitrary games. Roughly, an agent is "discontent" by default and chooses actions randomly. It can become "content" if it gets a high payoff, in which case it continues to choose whatever action it previously chose as long as the payoffs remain consistently high. This generalization achieves Pareto optimality in the limit, though with a very bad convergence rate. Basically, all of the agents start out discontent and do a lot of exploration, and as long as any one agent is discontent the payoffs will be inconsistent and all agents will tend to be discontent. Only when by chance all of the agents take actions that lead to all of them getting high payoffs do they all become content, at which point they keep choosing the same action and stay in the equilibrium.

Despite the bad convergence, the cool thing about the Pavlov generalization is that it only requires agents to notice when the results are good or bad for them. In contrast, typical strategies that aim to mimic Tit-for-Tat require the agent to reason about the beliefs and utility functions of other agents, which can be quite difficult to do. By just focusing on whether things are going well for themselves, Pavlov agents get many properties in environments with other agents that Tit-for-Tat strategies don't obviously get, such as exploiting agents that always cooperate. However, when thinking about logical time (AN #25), it would seem that a Pavlov-esque strategy would have to make decisions based on a prediction about its own behavior, which is... not obviously doomed, but seems odd. Regardless, given the lack of work on Pavlov strategies, it's worth trying to generalize them further.
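The discontent/content dynamic described above can be sketched in a few lines. This is my toy illustration, not Demski's actual construction: the payoff threshold, the coordination game, and all parameter values are illustrative assumptions.

```python
import random

class PavlovAgent:
    """Toy generalized-Pavlov agent: while 'discontent' it explores
    randomly; once its payoff is high it becomes 'content' and repeats
    its last action for as long as payoffs stay high."""

    def __init__(self, actions, threshold):
        self.actions = actions
        self.threshold = threshold  # payoff counted as "high" (assumed)
        self.content = False
        self.last_action = None

    def act(self):
        if not self.content:  # explore while discontent
            self.last_action = random.choice(self.actions)
        return self.last_action  # content agents repeat their action

    def observe(self, payoff):
        # Win-stay: remain content only while the payoff stays high.
        self.content = payoff >= self.threshold

def coordination_payoff(a, b):
    # Toy 2-player coordination game: both get 1 iff actions match.
    return (1, 1) if a == b else (0, 0)

# Two Pavlov agents lock into a matching-action equilibrium: as long as
# either is discontent, payoffs are inconsistent; once they happen to
# match, both become content and stay there.
random.seed(0)
p1 = PavlovAgent(["x", "y"], threshold=1)
p2 = PavlovAgent(["x", "y"], threshold=1)
for _ in range(100):
    a, b = p1.act(), p2.act()
    r1, r2 = coordination_payoff(a, b)
    p1.observe(r1)
    p2.observe(r2)
print(p1.last_action == p2.last_action, p1.content, p2.content)
```

In each round where the agents mismatch, both re-randomize; the first time they match, both become content and repeat forever, which mirrors the "only when by chance all agents get high payoffs" convergence story above.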

Approval-directed agency and the decision theory of Newcomb-like problems (Caspar Oesterheld)

Learning human intent

Thoughts on Human Models (Ramana Kumar and Scott Garrabrant): Summarized in the highlights!


Algorithms for Verifying Deep Neural Networks (Changliu Liu et al): This is a survey paper about verification of properties of deep neural nets.


Towards Robust and Verified AI: Specification Testing, Robust Training, and Formal Verification (Pushmeet Kohli et al): This post highlights three areas of current research towards making robust AI systems. First, we need better evaluation metrics: rather than just evaluating RL systems on the environments they were trained on, we need to actively search for situations in which they fail. Second, given a specification or constraint that we would like to ensure, we can develop new training techniques that can ensure that the specifications hold. Finally, given a specification, we can use formal verification techniques to ensure that the model obeys the specification on all possible inputs. The authors also list four areas of future research that they are excited about: leveraging AI capabilities for evaluation and verification, developing publicly available tools for evaluation and verification, broadening the scope of adversarial examples beyond the L-infinity norm ball, and learning specifications.
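To make the formal-verification direction concrete: one common family of techniques propagates interval bounds through the network layer by layer, so that a single forward pass bounds the output over an entire L-infinity ball of inputs. The sketch below is my illustration with made-up weights, not code from the post.

```python
def interval_linear(lo, hi, W, b):
    """Propagate an elementwise input interval [lo, hi] through
    x -> W x + b using interval arithmetic: a positive weight pairs
    like endpoints, a negative weight swaps them."""
    new_lo, new_hi = [], []
    for row, bias in zip(W, b):
        l, h = bias, bias
        for w, xl, xh in zip(row, lo, hi):
            if w >= 0:
                l += w * xl
                h += w * xh
            else:
                l += w * xh
                h += w * xl
        new_lo.append(l)
        new_hi.append(h)
    return new_lo, new_hi

def interval_relu(lo, hi):
    # ReLU is monotone, so it maps interval endpoints to endpoints.
    return [max(v, 0.0) for v in lo], [max(v, 0.0) for v in hi]

# Bound a tiny 1-hidden-layer net over an L-infinity ball of radius 0.1
# around the input [0.5, -0.2] (all weights here are illustrative).
eps = 0.1
lo = [0.5 - eps, -0.2 - eps]
hi = [0.5 + eps, -0.2 + eps]
lo, hi = interval_linear(lo, hi, [[1.0, -2.0], [0.5, 1.0]], [0.0, 0.1])
lo, hi = interval_relu(lo, hi)
lo, hi = interval_linear(lo, hi, [[1.0, 1.0]], [0.0])
# Every input in the ball maps into [lo[0], hi[0]]; if hi[0] stays below
# a safety threshold, the property holds for ALL inputs in the ball.
print(lo, hi)
```

The same bounds can also serve the second research area: training against the interval bound (rather than against single inputs) is one way to make a specification hold by construction, at some cost in bound looseness.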

Rohin's opinion: The biggest challenge I see with this area of research, at least in its application to powerful and general AI systems, is how you get the specification in the first place, so I'm glad to see "learning specifications" as one of the areas of interest.

If I take the view from this post, it seems to me that techniques like domain randomization, and more generally training on a larger distribution of data, would count as an example of the second type of research: it is a change to the training procedure that allows us to meet the specification "the agent should achieve high reward in a broad variety of environments". Of course, this doesn't give us any provable guarantees, so I'm not sure if the authors of the post would include it in this category.


Historical economic growth trends (Katja Grace) (summarized by Richard): Data on historical economic growth "suggest that (proportional) rates of economic and population growth increase roughly linearly with the size of the world economy and population", at least from around 0 CE to 1950. However, this trend has not held since 1950; in fact, growth rates have fallen since then.
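To see why a proportional growth rate that scales linearly with size is such a strong claim: it corresponds to dy/dt = k*y**2, whose exact solution y(t) = y0/(1 - k*y0*t) diverges in finite time, i.e. hyperbolic rather than exponential growth. A toy simulation (my illustration, not from the post; all parameter values are arbitrary):

```python
def simulate(y0, k, dt, steps):
    """Euler-integrate dy/dt = k * y**2: the proportional growth rate
    (dy/dt)/y = k*y is itself proportional to y, matching the pre-1950
    pattern the post describes. The exact solution blows up at
    t = 1/(k*y0), so growth accelerates explosively as that time nears."""
    y, ys = y0, [y0]
    for _ in range(steps):
        y += k * y * y * dt
        ys.append(y)
    return ys

# Simulate well before the blow-up time t = 1/(0.1 * 1.0) = 10.
ys = simulate(y0=1.0, k=0.1, dt=0.1, steps=50)

# Equal time intervals show ever-larger proportional growth:
early = ys[10] / ys[0]   # growth factor over the first interval
late = ys[50] / ys[40]   # growth factor over the last interval
print(early < late)
```

Under pure exponential growth the two factors would be equal; their divergence is the signature of the size-dependent growth rate, and the post's observation is that the post-1950 data no longer show it.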

Miscellaneous (Alignment)

Coherent behaviour in the real world is an incoherent concept (Richard Ngo): In a previous post (AN #35), I argued that coherence arguments (such as those based on VNM rationality) do not constrain the behavior of an intelligent agent. In this post, Richard delves further into the argument, and considers other ways that we could draw implications from coherence arguments.

I modeled the agent as having preferences over full trajectories, and objected that if you only look at observed behavior (rather than hypothetical behavior), you can always construct a utility function such that the observed behavior optimizes that utility function. Richard agrees that this objection is strong, but looks at another case: when the agent has preferences over states at a single point in time. This case leads to other objections. First, many reasonable preferences cannot be modeled via a reward function over states, such as the preference to sing a great song perfectly. Second, in the real world you are never in the same state more than once, since at the very least your memories will change, and so you can never infer a coherence violation by looking at observed behavior.

He also identifies further problems with applying coherence arguments to realistic agents. First, all behavior is optimal for the constant zero reward function. Second, any real agent will not have full information about the world, and will have to have beliefs over the world. Any definition of coherence will have to allow for multiple beliefs; but if you allow all beliefs, then you can rationalize any behavior as based on some weird belief that the agent has. If you require the agent to be Bayesian, you can still rationalize any behavior by choosing a prior appropriately.

Rohin's opinion: I reject modeling agents as having preferences over states primarily for the first reason that Richard identified: there are many "reasonable" preferences that cannot be modeled with a reward function solely on states. However, I don't find the argument about beliefs as a free variable very convincing: I think it's reasonable to argue that a superintelligent AI system will on average have much better beliefs than us, and so anything that we could determine as a coherence violation with high confidence should be something the AI system can also determine as a coherence violation with high confidence.

Three ways that "Sufficiently optimized agents appear coherent" can be false (Wei Dai): This post talks about three ways that agents could fail to appear coherent, where "coherent" means "optimizing for a reasonable goal". First, if due to distributional shift the agent is put into situations it has never encountered before, it may not act coherently. Second, we may want to "force" the agent to pretend as though compute is very expensive, even if this is not the case, in order to keep it bounded. Finally, we may explicitly try to keep the agent incoherent: for example, population ethics has impossibility results that show that any coherent agent must bite some bullet that we don't want to bite, and so we may elect to keep the agent incoherent instead. (See Impossibility and Uncertainty Theorems in AI Value Alignment (AN #45).)

The Unavoidable Problem of Self-Improvement in AI and The Problem of Self-Referential Reasoning in Self-Improving AI (Jolene Creighton and Ramana Kumar): These articles introduce the thinking around AI self-improvement, and the problem of how to ensure that future, more intelligent versions of an AI system are just as safe as the original system. This cannot be easily done in the case of proof-based systems, due to Gödel's incompleteness theorem. Some existing work on the problem: Botworld, Vingean reflection, and logical induction.

Other progress in AI

Deep learning

The Lottery Ticket Hypothesis at Scale (Jonathan Frankle et al) (summarized by Richard): The lottery ticket hypothesis is the claim that "dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations". This paper builds on previous work to show that winning tickets can also be found for larger networks (Resnet-50, not just Resnet-18), if those winning tickets are initialised not with their initial weights from the full network, but rather with their weights after a small amount of full-network training.
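The "rewind to early-training weights" procedure can be sketched abstractly. This is a hypothetical toy, not the paper's code: `train` is a stand-in that just perturbs weights, where the real procedure runs SGD on data, and a flat weight list stands in for a full network.

```python
import copy
import random

def train(weights, steps):
    """Stand-in for SGD: nudges each weight a bit per step.
    (Real training would update weights against a dataset.)"""
    return [w + random.uniform(-0.1, 0.1) * steps for w in weights]

def magnitude_mask(weights, keep_frac):
    """Keep the largest-magnitude fraction of weights: the candidate
    'winning ticket' connections."""
    k = int(len(weights) * keep_frac)
    cutoff = sorted((abs(w) for w in weights), reverse=True)[k - 1]
    return [abs(w) >= cutoff for w in weights]

random.seed(0)
init = [random.gauss(0, 1) for _ in range(20)]

# 1. Train briefly and snapshot: the paper's finding is that rewinding
#    to these early-training weights (not the random init) is what makes
#    tickets findable in larger networks.
early = train(copy.copy(init), steps=1)

# 2. Train to convergence, then prune by final weight magnitude.
final = train(copy.copy(early), steps=10)
mask = magnitude_mask(final, keep_frac=0.2)

# 3. The ticket: surviving connections reset to early-training values,
#    which would then be retrained in isolation.
ticket = [w if m else 0.0 for w, m in zip(early, mask)]
print(sum(mask), sum(1 for w in ticket if w != 0.0))
```

The original lottery-ticket procedure differs only in step 3, rewinding to `init` instead of `early`; the paper's observation is that this fails at Resnet-50 scale while the early-training rewind succeeds.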

Richard’s opinion: It’s in­ter­est­ing that the lot­tery ticket hy­poth­e­sis scales; how­ever, this pa­per seems quite in­cre­men­tal over­all.


OpenAI LP (OpenAI) (summarized by Richard): OpenAI is transitioning to a new structure, consisting of a capped-profit company (OpenAI LP) controlled by the original OpenAI nonprofit organisation. The nonprofit is still dedicated to its charter, which OpenAI LP has a legal duty to prioritise. All investors must agree that generating profits for them is a secondary goal, and that their overall returns will be capped at 100x their investment (with any excess going back to the nonprofit).

Richard’s opinion: Given the high cost of salaries and com­pute for ma­chine learn­ing re­search, I don’t find this a par­tic­u­larly sur­pris­ing de­vel­op­ment. I’d also note that, in the con­text of in­vest­ing in a startup, a 100x re­turn over a timeframe of decades is not ac­tu­ally that high.