AI Alignment 2018-19 Review


What this post is

This is a review post of public work in AI alignment over 2019, with some inclusions from 2018. It has this preamble (~700 words), a short version / summary (~1.6k words), and a long version (~8.3k words). It is available as a Google Doc here.

There are many areas of work relevant to AI alignment that I have barely touched on, such as interpretability, uncertainty estimation, adversarial examples, and assured autonomy, primarily because I have not been following these fields and wouldn’t be able to write a good summary of what has happened in them. I have also mostly focused on articles that provide some conceptual insight, and excluded or only briefly linked to papers that primarily make quantitative improvements on important metrics. While such papers are obviously important (ultimately, our techniques need to work well), there isn’t much to say about them in a yearly review other than that the quantitative metric was improved.

Despite these exclusions, there was still a ton of work to select from, perhaps ~500 articles, of which over 300 have been linked to in this post. There are many interesting articles that I really enjoyed that get only a sentence of description, in which I ignore many of the points that the article makes. Most have been summarized in the Alignment Newsletter, so if you’d like to learn more about any particular link, but don’t want to read the entire thing, just search for its title in the database.

What you should know about the structure of this post

I am not speaking for myself; by default I am trying to explain what has been said, in a way that the authors of the articles would agree with. Any extra opinion that I add will be in italics.

As a post, this is meant to be read sequentially, but the underlying structure is a graph (nodes are posts, edges connect posts that are very related). I arranged it in a sequence that highlights the most salient-to-me connections. This means that the order in which I present subtopics is very much not a reflection of what I think is most important in AI safety: in my presentation order, I focused on edges (connections) rather than nodes (subtopics).

Other minor details:

  1. Any links from earlier than 2018 will have their year of publication right after the link (except for articles that were reposted as part of Alignment Forum sequences).

  2. I typically link to blog posts; in several cases there is also an associated paper that I have not linked.

How to read this post

I have put the most effort into making the prose of the long version read smoothly. The hierarchical organization is comparatively less coherent; this is partly because I optimized the prose, and partly because AI safety work is hard to cluster. As a result, for those willing to put in the effort, I’d recommend reading the long version directly, without paying too much attention to the hierarchy. If you have less time, or are less interested in the minutiae of AI alignment research, the short version is for you.

Since I don’t name authors or organizations, you may want to take this as your opportunity to form beliefs about which arguments in AI alignment are important based on the ideas themselves (as opposed to based on trust in the author of the post).

People who keep up with AI alignment work might want to know which posts I’m referencing as they read, which is a bit hard since I don’t name the posts in the text. If this describes you, you should read this post on the Alignment Forum, where you can hover over most links to see what they link to. Alternatively, the references section in the Google Doc lists all links in the order that they appear in the post, along with the hierarchical organization, so you can open the references in a new tab and read through the post and the references together.

I expect that if you aren’t already familiar with them, some articles will sound crazy from my summary here; please read at least the newsletter summary and ideally the full article before arguing that it’s crazy.


Thanks to the Alignment Newsletter team, Ben Pace, Oliver Habryka, Jonathan Uesato, Tom Everitt, Luke Muehlhauser, Jan Leike, Rob Bensinger, Adam Gleave, Scott Emmons, Rachel Freedman, Andrew Critch, Victoria Krakovna, and probably a few others (I really should have kept better track of this). Thanks especially to Ben Pace for suggesting that I write this review in the first place.

Short version (~1.6k words)

While the full text tries to accurately summarize different points of view, that is not a goal in this summary. Here I simply try to give a sense of the topics involved in the discussion, without saying what discussion actually happened.

Basic analysis of AI risk. Traditional arguments for AI risk hold that since agentic AI systems will apply lots of optimization, they will lead to extreme outcomes that can’t be handled with normal engineering efforts. Powerful AI systems will not have their resources stolen from them, which by various Dutch book theorems implies that they must be expected utility maximizers; since expected utility maximizers are goal-directed, they are dangerous.

However, the VNM theorem does not justify the assumption that an AI system will be goal-directed: such an assumption is really based on intuitions and conceptual arguments (which are still quite strong).

Comprehensive AI Services (CAIS) challenges the assumption that we will have a single agentic AI, instead suggesting that any task will be performed by a collection of modular services.

That being said, there are several other arguments for AI risk, such as the argument that AI might cause “lock in”, which may require us to solve hard philosophical problems before the development of AGI.

Nonetheless, there are disjunctive reasons to expect that catastrophe does not occur: for example, there may not be a problem at all, or ML researchers may solve the problem after we get “warning shots”, or we could coordinate to not build unaligned AI.

Agency and optimization. One proposed problem is that of mesa optimization, in which an optimization algorithm used to train an AI creates an agent that is itself performing optimization. In such a scenario, we need to ensure that the “inner” optimization is also aligned.

To better understand these and other situations, it would be useful to have a formalization of optimization. This is hard: while we don’t want optimization to be about our beliefs about a system, if we try to define it mechanistically, it becomes hard to avoid defining a bottle cap as an optimizer of “water kept in the bottle”.

Understanding agents is another hard task. While agents are relatively well understood under the Cartesian assumption, where the agent is separate from its environment, things become much more complex and poorly understood when the agent is a part of its environment.

Value learning. Building an AI that learns all of human value has historically been thought to be very hard, because it requires you to decompose human behavior into a “beliefs and planning” part and a “values” part, and there’s no clear way to do this.

Another way of looking at it: value learning requires a model that separates the given data into behavior that actually achieves the true “values” and behavior that is just “a mistake”, which seems hard to do. In addition, value learning seems quite fragile to misspecification of this human model.

Nonetheless, there are reasons for optimism. We could try to build an adequate utility function, one that works well enough for our purposes. We could also have uncertainty over the utility function, and update the belief over time based on human behavior. If everything is specified correctly (a big if), then as time goes on, the agent would become more and more aligned with human values. One major benefit of this approach is that it is interactive: it doesn’t require us to specify everything perfectly ahead of time.

Robustness. We would like our agents to be robust: they shouldn’t fail catastrophically in situations slightly different from the ones they were designed for. Within reinforcement learning, safe reinforcement learning aims to avoid mistakes, even during training. This requires either analytical (i.e. not trial-and-error) reasoning from a formal specification of what a “mistake” is, or an overseer who can correct the agent before it makes a mistake.

The classic example of a robustness failure is adversarial examples, in which a tiny change to an image can drastically affect its classification. Recent research has shown that these examples are caused (at least in part) by real statistical correlations that do generalize to the test set, but are nonetheless fragile to small changes. In addition, since robustness to one kind of adversary doesn’t make a classifier robust to other kinds of adversaries, there has been a lot of work on improving adversarial evaluation in image classification. We’re also seeing some of this work in reinforcement learning.
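The fragility described above is easy to exhibit in a linear model. Below is a minimal, self-contained sketch (a hypothetical toy, not from any of the reviewed papers) of a fast-gradient-sign-style attack: in high dimensions, a per-pixel perturbation much smaller than the pixel values can flip a linear classifier’s decision, because the score moves by eps times the dimension.

```python
import numpy as np

# Hypothetical toy: a linear classifier sign(w . x) in d = 1000 dimensions.
d = 1000
w = np.ones(d)
w[::2] = -1.0                    # weights in {+1, -1}

# An "image" made of a large component orthogonal to w (n @ w == 0)
# plus a weak signal 0.1 * w, so pixels have magnitude ~1 but the
# classification margin is only 0.1 * d = 100.
n = np.ones(d)                   # orthogonal to w: 500 (+1)s and 500 (-1)s
x = n + 0.1 * w
assert w @ n == 0

# The gradient of the score w . x with respect to x is just w, so the
# worst-case small perturbation steps each pixel by -eps * sign(w).
eps = 0.2
x_adv = x - eps * np.sign(w)

print(w @ x)                       # +100: confidently classified positive
print(w @ x_adv)                   # -100: the score moved by eps * d = 200
print(np.max(np.abs(x_adv - x)))   # each pixel changed by only 0.2
```

The point of the construction is the dimension-dependence: each pixel moves by only 0.2, yet the score moves by 0.2 × 1000, which dwarfs the margin.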

However, asking our agents to be robust to arbitrary mistakes seems to be too much: humans certainly don’t meet this bar. For AI safety, it seems like we need to ensure that our agents are robustly intent aligned, that is, they are always “trying” to do what we want. One particular way that our agents could be intent aligned is if they are corrigible, that is, they are trying to keep us “in control”. This seems like a particularly easy property to verify, as conceptually it seems to be independent of the domain in which the agent is deployed.

So, we would like to ensure that even in the worst case, our agent remains corrigible. One proposal is to train an adversary to search for “relaxed” situations in which the agent behaves incorrigibly, and then train the agent not to do that.

Scaling to superhuman abilities. If we’re building corrigible agents using adversarial training, our adversary should be more capable than the agent it is training, so that it can find all the situations in which the agent behaves incorrigibly. This requires techniques that scale to superhuman abilities, such as iterated amplification and debate.

In iterated amplification, we start with an initial policy and alternate between amplification and distillation, which increase capabilities and efficiency respectively. This can encode a range of algorithms, but often amplification is done by decomposing questions and using the agent to answer subquestions, while distillation can be done using supervised learning or reinforcement learning.
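To make the alternation concrete, here is a minimal sketch (a hypothetical toy, not any paper’s implementation) where the task is computing n!, amplification answers a question by decomposing it into one subquestion for the current agent, and distillation is caricatured as fitting a lookup table, standing in for supervised learning. Each round, the distilled agent becomes competent on questions one step harder.

```python
def amplify(agent):
    """Answer a question by decomposing it and calling `agent` on the subquestion."""
    def amplified(n):
        if n <= 1:
            return 1
        return n * agent(n - 1)    # combine the subanswer with cheap reasoning
    return amplified

def distill(amplified, questions):
    """Compress the slow amplified system into a fast lookup table."""
    table = {n: amplified(n) for n in questions}
    return lambda n: table.get(n, 1)   # fall back to the initial guess

agent = lambda n: 1                    # weak initial policy: always guess 1
for _ in range(6):                     # each round extends competence one level
    agent = distill(amplify(agent), questions=range(1, 7))

print(agent(5))   # 120
```

Here the “exponentially large tree” never gets built explicitly: each round, the amplified system only recurses one level, but it recurses into the previous round’s distilled answers.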

In debate, we train an agent through self-play in a zero-sum game in which the agent’s goal is to “win” a question-answering debate, as evaluated by a human judge. The hope is that since each “side” of the debate can point out flaws in the other side’s arguments, such a setup can use a human judge to train far more capable agents while still incentivizing them to provide honest, true information.

Both iterated amplification and debate aim to train an agent that approximates the answer one would get from an exponentially large tree of humans deliberating. The factored cognition hypothesis is that this sort of tree of humans is able to do any task we care about. This hypothesis is controversial: many have the intuition that cognition requires large contexts and flashes of intuition that couldn’t be replicated by a tree of time-limited humans.

Universality. One property we would hope to have is that if we use this tree of humans as an overseer for some simpler agent, then the tree would “know everything the agent knows”. If true, this property could allow us to build a significantly stronger conceptual argument for safety. It is also very related to…

Interpretability. While interpretability can help us know what the agent knows, and what the agent would do in other situations (which can help us verify if it is corrigible), there are other uses for it as well: in general, it seems better if we can understand the things we’re building.

Impact regularization. While relative reachability and attainable utility preservation were developed last year, this year saw them unified into a single framework. In addition, there was a new proposed definition of impact: change in our ability to get what we want. This notion of impact depends on knowing the utility function U. However, we might hope that we can penalize some “objective” notion, perhaps “power”, that arises regardless of the choice of U, for the same reasons that we expect instrumental convergence.
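The attainable-utility idea above can be sketched in a few lines (hypothetical toy numbers; the general form follows attainable utility preservation, which penalizes changes to the agent’s ability to optimize a set of auxiliary reward functions):

```python
def aup_penalty(q_aux, state, action, noop="noop", weight=1.0):
    """Average |Q_i(s, a) - Q_i(s, noop)| over auxiliary Q-functions Q_i."""
    diffs = [abs(q[state][action] - q[state][noop]) for q in q_aux]
    return weight * sum(diffs) / len(diffs)

# Two made-up auxiliary Q-functions over one state and three actions.
q_aux = [
    {"s": {"noop": 5.0, "make_paperclip": 4.8, "disable_off_switch": 9.0}},
    {"s": {"noop": 2.0, "make_paperclip": 2.1, "disable_off_switch": 7.5}},
]

# The mundane action barely changes attainable utility; the power-grabbing
# action changes it a lot, so it is penalized heavily.
print(aup_penalty(q_aux, "s", "make_paperclip"))      # small penalty (~0.15)
print(aup_penalty(q_aux, "s", "disable_off_switch"))  # large penalty (4.75)
```

The penalty would be subtracted from the task reward; note how it tracks changes in “power” without ever referencing the true utility function U.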

Causal modeling. Causal models have recently been used to model the incentives for an agent under different AI safety frameworks, and to argue that by evaluating plans with the current reward function, you can remove the incentive for an agent to tamper with its reward function.

Oracles. Even if oracles are trying to maximize predictive accuracy, they could “choose” between different self-confirming predictions. We could avoid this using counterfactual oracles, which make predictions conditional on their predictions not influencing the future.

Decision theory. There was work on decision theory that I haven’t followed very closely.

Forecasting. Several resources were developed to enable effective group forecasting, including an AI forecasting dictionary that defines terms, an AI resolution council whose future opinions can be predicted, and a dataset of well-constructed exemplar questions about AI.

Separately, the debate over takeoff speeds continued, with two posts arguing forcefully for continuous takeoff, without much response (although many researchers do not agree with them). The continuity of takeoff is relevant to, but doesn’t completely determine, whether recursive self-improvement will happen, or whether some actor acquires a decisive strategic advantage. The primary implication of the debate is whether we should expect to have enough time to react and fix problems as they arise.

It has also become clearer that recent progress in AI has been driven to a significant degree by increasing the amount of compute devoted to AI, which suggests a more continuous takeoff. You could still take the position that current methods can’t do some particular thing (say, causal reasoning), and so it doesn’t matter how much compute you use.

AI Progress. There was a lot of progress in AI.

Field building. There were posts aiming to build the field, but they were all fairly disjointed.

The long version (~8.3k words) starts here.

Basic analysis of AI risk

Agentic AI systems

Much of the foundational writing about AI risk has focused on agentic AI systems. This approach (recently discussed in the post and comments here) argues that since AI agents will be exerting a lot of optimization, there will be extreme outcomes, in which our regular arguments may not work. This implies that we must adopt a security mindset (2017) to ensure alignment, and it suggests that proof-level guarantees may be more important at various stages of alignment research.

The foundational writing then goes on to point out that since powerful AI systems should not be able to be Dutch booked (i.e. have their resources stolen from them), they will be well modeled (2017) as expected utility maximizers. An AI system that maximizes expected utility is very likely to be dangerous. One reason was recently formalized in MDPs in which the agent gets a random utility function: using formalizations of power and instrumental convergence, we find some suggestive results that agents seek control over their future (from which we might infer that they will try to wrest that control from us).
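A toy gloss of why this formal result holds (this is my illustrative simplification, not the paper’s exact definitions): treat “power” as the expected optimal value attainable from a state, averaged over randomly drawn reward functions. States from which more outcomes are reachable let the optimal policy pick the best draw, so they score higher on average:

```python
import random

random.seed(0)

def avg_optimal_reward(n_reachable, samples=100_000):
    """Monte Carlo estimate of E_R[max reward attainable], R drawn uniformly."""
    total = 0.0
    for _ in range(samples):
        rewards = [random.random() for _ in range(n_reachable)]
        total += max(rewards)          # the optimal policy takes the best outcome
    return total / samples

narrow = avg_optimal_reward(1)   # a state from which only 1 outcome is reachable
broad = avg_optimal_reward(4)    # a state from which 4 outcomes are reachable
print(narrow, broad)             # roughly 0.5 vs 0.8, since E[max of k uniforms] = k/(k+1)
```

Under this gloss, keeping options open (avoiding shutdown, acquiring resources) raises attainable value for most reward functions, which is the instrumental convergence intuition.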

However, it is not mathematically necessary that AI systems will have utility functions (except in a vacuous sense), and while there are intuitive and conceptual reasons to think that we will build goal-directed agents by default, there are alternative pathways that might be taken instead, which are valuable to explore and build out to ensure AI safety.

This challenge to the usual argument for utility maximizers has prompted a series of articles exploring other variants of the argument, for example by restricting the class of utility functions to make the claim non-vacuous, or by arguing that optimization processes in general will lead to goal-directed agents.

Comprehensive AI Services

Comprehensive AI Services (CAIS) also takes issue with the model of a single AGI agent hyper-competently pursuing some goal, and instead proposes a model in which different tasks are solved by specialized, competing AI services. The suggestion is that modularity across tasks is sufficiently useful that it will apply to AI, in the same way that it applies to humans (e.g. I have specialized in AI research, and not plumbing). The aggregate of all the services can accomplish any task, including the development of new services, making it comprehensive (analogous to the “general” in AGI). Since AI services can also do basic AI R&D, which leads to improvement in AI services generally, we should expect recursive technological improvement (as opposed to recursive self-improvement). Note that CAIS does not necessarily suggest we will be safe, just that the traditional risks are not as likely as we may have thought, while other emergent risks are perhaps greater.

Critics often argue that end-to-end training and integrated agent-like architectures are likely to (eventually) outperform modular services. However, services can also be integrated through coordination. In addition, this post argues that this criticism mirrors old concerns that under capitalism firms would become too large, a concern that the post argues did not pan out.

CAIS does allow for AI systems that are capable of learning across many domains: it simply argues that these AI systems will specialize for efficiency reasons, and so will only be competent at a small subset of domains. This decomposition of intelligence into learning + competence has been used to explain the variation in human abilities.

(This conversation is related to much prior conversation on Tool AI, which is listed here.)

Arguments for AI risk

There are many arguments for AI risk, with each of these posts providing a list of such arguments. It is unclear whether, from an outside perspective, this should be taken as evidence against AI risk (since different researchers believe different arguments and are aiming for different “success stories”) or as evidence for AI risk (because there are so many different sources of AI risk).

One argument that saw a lot of discussion is that we must figure out philosophy, since the creation of AGI might “lock in” philosophical ideas. For example, we might not want to have AI systems with utility functions because of impossibility results in population ethics that suggest that every utility function would lead to some counterintuitive conclusion. Similarly, there are many proposals for how to define values; it may be necessary to figure out the right definition ahead of time. Rather than solving these problems directly, we could solve metaphilosophy, or delegate to humans who deliberate, whether idealized or real.

We might also worry that AIs will economically outcompete humans, give us technologies we aren’t ready for, or amplify human vulnerabilities.

Under continuous takeoff, two scenarios have been proposed for what failure looks like. First, AI differentially improves society’s ability to optimize metrics that are easy to measure, rather than the ones we actually care about. Second, AI agents could accidentally be trained to seek influence, and then fail catastrophically at some point in the future once they are sufficiently capable. One critique argues that these principal-agent problems only lead to bounded losses (i.e. they aren’t catastrophic), but several others disagree.

This post argues that there has been a shift in the arguments that motivate new AI risk researchers, and calls for more explanation of these arguments so that they can be properly evaluated.

Arguments against AI risk

Many views that expect the problem to be solved by default have also been written up this year.

A series of four conversations (summarized here) suggested that some engaged people expect AI to go well by default, because they are unconvinced by the traditional arguments for AI risk, find discontinuities in AI capabilities relatively unlikely, and are hopeful that there will be “warning shots” that demonstrate problems, which the existing ML community will then successfully fix.

One post lists several good outside-view heuristics that argue against AI x-risk, while another questions why value being complex and fragile must lead to high AI risk.

This talk argues that while AGI will intuitively be a big deal, it’s not obvious that we can affect its impact, and so it’s not obvious that longtermists should focus on it. It gives an analogy to trying to influence the impact of electricity before electricity was commonplace, and suggests there was little one could have done for its safe use. It argues that accident risks in particular draw on fuzzy, intuitive concepts, haven’t been engaged with much by critics, and don’t sway most AI researchers.

Despite the seeming controversy in this and previous sections, it is worth noting that there is general agreement within the AI safety community on the following broader argument for work on AI safety:

  1. Superhuman agents are not required to treat humans well, in the same way that humans aren’t required to treat gorillas well.

  2. You should have a good technical reason to expect that superhuman agents will treat humans well.

  3. We do not currently have such a reason.

Agency and optimization

Mesa optimization

The problem of mesa optimization was explained in significantly more detail (see also this less formal summary). In mesa optimization, we start with a base optimizer like gradient descent that searches for a policy that accomplishes some complex task. For sufficiently complex tasks, it seems likely that the best policy will itself be an optimizer. (Meta-learning is explicitly trying to learn policies that are also optimizers.) However, the policy could be optimizing a different goal, called the mesa objective, rather than the base objective.

Optimizing the mesa objective must lead to good base-objective behavior on the training distribution (else gradient descent would not select it), but could be arbitrarily bad off distribution. For example, a plausible mesa objective would be to seek influence: such an agent would initially do what we want it to do (since otherwise we would shut it down), but might turn against us once it has accumulated enough power.

This decomposes the overall alignment problem into outer alignment (ensuring that the base objective is aligned with “what we want”) and inner alignment (ensuring that the mesa objective is aligned with the base objective). This is somewhat analogous to different types (2017) of Goodhart’s law.

The paper and subsequent analysis identify and categorize relationships between the base and mesa objectives, and explain how mesa optimizers could fail catastrophically. Of particular interest is that mesa optimizers should be fast, but could still be misaligned, suggesting that penalizing compute is not enough to solve inner alignment.

Effectively, the concern is that our AI systems will have capabilities that generalize, but objectives that don’t. Since this is what drives the risk, some suggest that we should talk about this phenomenon directly, without needing to bring in the baggage of “optimization”, a term we have yet to understand well, while others argue that even if we start with this definition, it would be useful to reintroduce the notions of optimization and agency.

One advantage of the original definition is that it specifies a particular mechanism by which risk arises; this gives us a foothold into the problem that allows us to propose potential solutions and empirical investigations. Of course, this is actively counterproductive if the risk arises by some other mechanism, but we might expect optimization to be especially likely because optimization algorithms are simple, and the phenomenon of double descent suggests that neural nets have an inductive bias towards simplicity.

What are optimization and agency, anyway?

Given the central importance of optimization to inner alignment and AI safety more broadly, we’d like to be able to formalize it. However, it’s not clear how to do so: while we want optimization to be about the mechanical process by which outcomes happen (as opposed to e.g. our beliefs about that process), we cannot simply say that X is an optimizer if it makes some quantity go up: by that definition, a bottle cap would be an optimizer for “keeping water in the bottle”.

How the system interacts with its environment is also relevant, beyond whether some number goes up. The type of computation matters: while older models of optimization involve an agent that can search over possible actions and simulate their results, other optimization processes must control their environment without being able to simulate the consequences of their choices.

Our use of the word “agency” might be tied to our models or to specific human architectures, rather than being a general concept that could describe a mechanical property of a computation. This would be particularly worrying, since it would mean that arguments for AI risk are based on our flawed models of reality, rather than on an objective property of reality. However, this is extremely speculative.

Embedded agency

Discussions about AI usually assume a notion of the “actions” that an agent can take. However, the embedded agency sequence points out that this “Cartesian boundary” does not actually exist: since any real agent is embedded in the real world, you cannot make many assumptions that are common in reinforcement learning, such as dedicated and perfectly trusted input-output channels, a perfect model of the environment, an agent architecture that is uninfluenced by the environment, etc.

This means you can never consider all of the important information, and optimize everything that could be optimized. This has led to a couple of hypotheses:

  1. Real learning algorithms require modeling assumptions to solve the credit assignment problem, and so can only lead to partial agency or myopia. (See also this parable and associated thoughts.)

  2. Embedded agency works via abstraction, which is the key idea allowing you to make maps that are smaller than the territory.

Value learning

Descriptive embedded agency

While the embedded agency sequence is written from the perspective of prescribing how ideal agents should operate, we could also aim for a theory that can describe real agents like humans. This involves making your theory of agency correspondingly broader: for example, moving from utility functions to markets or subagents, which are more general. The development of such a theory is more grounded in concrete real systems, and more likely to generate theoretical insight or counterexamples, making it a good research meta-strategy.

Such a theory would be useful so that we can build AI systems that can model humans and human values while avoiding embedded agency problems with humans.

The difficulty of value learning

Even if we ignore problems of embedded agency, there are obstacles to value learning. For example, there need not be a reward function over observations that leads to what we want in a POMDP (though we could instead focus on instrumental reward functions defined on states).

Another key problem is that all you ever get to observe is behavior; this then needs to be decomposed into “beliefs” and “values”, but there is no clear criterion (2017) that separates them (although it hasn’t been proven that simplicity doesn’t work, and human priors help). This suggests that ambitious value learning, in which you identify the one true utility function, is hard.
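The unidentifiability is easy to see in a toy case (my hypothetical illustration, not from the post): a “rational” planner paired with reward R and an “anti-rational” planner paired with -R produce exactly the same behavior, so behavior alone cannot separate the planning part from the values part.

```python
# One state, three actions, and a made-up reward function R.
actions = ["left", "right", "stay"]
R = {"left": 1.0, "right": -2.0, "stay": 0.0}
neg_R = {a: -r for a, r in R.items()}    # the exact opposite "values"

def rational(reward):
    """Planner that maximizes reward."""
    return max(actions, key=lambda a: reward[a])

def anti_rational(reward):
    """Planner that minimizes reward."""
    return min(actions, key=lambda a: reward[a])

print(rational(R))            # "left"
print(anti_rational(neg_R))   # also "left": identical behavior, opposite values
```

Both (planner, reward) pairs predict every observation equally well, so any preference between them has to come from priors or assumptions, not from the data.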

Human models

For an agent to outperform the process generating its data, it must understand the ways in which that process makes mistakes. So, to outperform humans at a task given only human demonstrations of that task, you need to detect human mistakes in the demonstrations. Modeling humans to this fidelity is an unsolved problem, though there is a little progress, and we might hope that we can make assumptions about the structure of the model.

Any such model is likely to be misspecified, and value learning algorithms are not currently robust to misspecification: in one case, the simpler but less conceptually accurate model is more robust.

You might hope that if we give up on outperforming humans and just imitate them, this would be safe. Even this is controversial, because perhaps humans themselves are unsafe, maybe imitating humans leads to mesa optimization, or possibly perfect imitation is too hard to achieve.

You might also hope that AI systems have good enough models that you can simply provide natural language instructions and the AI does what you mean.

The presence of human models in an AI system has a few unfortunate effects:

  1. We can’t test an AI system by seeing if it agrees with human judgment, because the AI system may be using its human model to (in the short term) optimize for agreement with human judgment.

  2. A bug in the code is more likely to optimize for suffering (since the human model would include the concept of suffering).

  3. If humans are modeled with sufficient fidelity, these models may themselves be conscious and capable of suffering.

Learning an adequate utility function

Despite the objections that learning values is hard, it seems like humans are pretty good at learning the values of other humans, even if not perfect. Perhaps we could replicate this, in order to learn an adequate utility function that leads to okay outcomes?

The main issue is that we are only good at predicting human values in normal situations, while powerful AI systems will likely put us in extreme situations where we will disagree much more about values. As a result, we need a theory of human values that defines what to do in these situations. One theory, associated value learning agenda, and toy model propose that we can extract partial preferences from human mental models, and synthesize them together into a full utility function, while respecting meta-preferences about preferences and the synthesis process and taking care to properly normalize utilities.

In fact, the core pieces of such an approach seem necessary for any solution to the problem. However, this research agenda depends upon solving many hard problems explicitly in a human-understandable way, which doesn’t jibe with the bitter lesson that ML progress primarily happens by using more compute to solve harder problems.

I don’t agree that the core pieces identified in this research agenda must be solved before creating powerful AI, nor that we must have explicit solutions to the problems.

Uncertainty over the utility function

We could also make the AI uncertain about the utility function, and ensure that it has a way to learn about the utility function that is grounded in human behavior. Then, as an instrumental goal for maximizing expected reward, the AI will choose actions with high expected information gain. While this was proposed earlier (2016), the book Human Compatible (summary, podcast 1, podcast 2, interview) explores the idea in much more detail than previous writing, and it has now made its way into deep reinforcement learning as well.

Intuitively, since the AI is uncertain about the true reward, it will behave conservatively and try to learn about the true reward, thus avoiding Goodhart’s law (see also fuzziness). Of course, once the AI has learned everything there is to learn, it will behave (2015?) just like a regular utility maximizer. In this setting, you would hope that the AI has become aligned with the true utility function, as long as its initial distribution over utility functions contains the truth, and the observation model by which its distribution is updated is “correct”. However, it might be quite difficult to ensure that these actually hold. This also depends on the assumption that there is a true utility function, and that the human knows it, which is not the case, though this is being addressed.

One important feature of this agenda is that rather than requiring a perfect utility function to begin with, the AI can learn the utility function by interacting with the human; such a feedback mechanism can make a problem much easier. Interaction also opens up other possibilities, such as learning human norms instead of values. However, it is computationally difficult, and so more research would be needed to make it a viable solution.
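The expected-information-gain idea can be made concrete with a toy Bayesian sketch. Everything here is illustrative (the two-hypothesis setup and all names are invented, not from any of the linked papers): an AI with a distribution over reward hypotheses scores a possible question by how much it expects the answer to shrink its uncertainty.

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def expected_info_gain(prior, likelihoods):
    # likelihoods[h][o] = P(observe answer o | reward hypothesis h).
    # Score a question by the expected drop in entropy over hypotheses.
    h_prior = entropy(prior)
    n_obs = len(likelihoods[0])
    gain = 0.0
    for o in range(n_obs):
        p_o = sum(prior[h] * likelihoods[h][o] for h in range(len(prior)))
        if p_o == 0.0:
            continue
        posterior = [prior[h] * likelihoods[h][o] / p_o
                     for h in range(len(prior))]
        gain += p_o * (h_prior - entropy(posterior))
    return gain

prior = [0.5, 0.5]                    # uncertain between two reward hypotheses
decisive = expected_info_gain(prior, [[1.0, 0.0], [0.0, 1.0]])
useless = expected_info_gain(prior, [[0.5, 0.5], [0.5, 0.5]])
```

An agent maximizing expected reward under this uncertainty would ask the decisive question first; once the posterior collapses, it behaves like an ordinary maximizer for the learned reward, matching the caveat above.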

Current methods for learning human preferences

There has been a lot of practical work on learning human preferences, including:

There are many recent papers that I haven’t cited here, as it is a very large area of work.


Safe reinforcement learning

We would like to ensure that our AI systems do not make mistakes during training. With preference learning, we can do this by learning human preferences over hypothetical behaviors that are not actually executed. Another option is to provide safety constraints and ensure that the AI never violates them (even during training), or at least to significantly reduce such violations.

Avoiding all mistakes would require us to have a formal specification of what a “mistake” is, or to have some overseer that can identify “mistakes” before execution, so that our AI could avoid the mistake even though it hasn’t seen this situation before. This seems prohibitively hard to me if we include literally all “mistakes”.

Adversarial examples

Adversarial examples are a clear demonstration of how the “cognition” of neural nets is different from our own: by making superficial changes to the input that would not matter to a human, you can completely change the output of the neural net. While I am not an expert here, and certainly have not read the huge mountains of work done over the last year, I do want to highlight a few things.

First, while we might nominally think of adversarial examples as “bugs” in our neural net, this paper shows that image classifiers are picking up real imperceptible features that do generalize to the test set. The classifiers really are maximizing predictive accuracy; the problem is that we want them to predict labels based on the features that we use, instead of imperceptible (but predictive) features. Adversarial training removes these fragile features, leaving only the robust features; this makes subsequent applications easier.

While the paper was controversial, I thought that its main thesis seemed to be supported even after reading these six responses.

Second, there has been a distinct shift away from the L-infinity norm ball threat model of adversarial examples. So far, it seems that robustness to one set of perturbations doesn’t grant robustness to other perturbations, prompting the development of multiple perturbations, a benchmark of natural adversarial examples, and new evaluation metrics. While the L-infinity norm ball is an interesting unsolved research problem, it is in no way a realistic threat model.
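For concreteness, the canonical L-infinity norm ball attack is the fast gradient sign method (FGSM): move every input coordinate by eps in the direction that increases the loss, so the perturbed input stays inside the ball of radius eps. A minimal sketch on a made-up linear model (the weights and numbers are purely illustrative):

```python
def fgsm_linf(x, grad, eps):
    # Perturb each coordinate by +/- eps in the direction that increases
    # the loss; the result stays inside the L-infinity ball of radius eps.
    sign = lambda v: (v > 0) - (v < 0)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

# Toy linear model: loss = w . x, so the gradient w.r.t. x is just w.
w = [1.0, -2.0, 0.5]
x_adv = fgsm_linf([0.0, 0.0, 0.0], w, eps=0.1)
```

Each coordinate moves by exactly eps, which is the largest change of the linear score achievable within the norm ball.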

Third, adversarial attacks are now being proposed as a method for evaluating how robust an agent trained by reinforcement learning is. This seems especially important since in RL there is often no train-test split, and so it is hard to tell whether an agent has “memorized” a single trajectory or actually learned a policy that works well across a variety of circumstances.

Intent alignment

Ultimately, robustness seeks to identify and eliminate all “bugs”, i.e. behaviors that are inconsistent with the specification (see also this podcast). Instead of considering all the mistakes, we could seek to only prevent catastrophic mistakes, and ensure that the AI is intent aligned, that is, it is always trying to do what we want. This goal avoids many of the pitfalls around the goal of designing an AI with the right utility function.

One promising way in which an AI could be intent aligned is by being corrigible: roughly, the AI is not trying to deceive us, it clarifies its uncertainty by asking us, it learns about our preferences, it shuts down if we ask it to, etc. This is a narrower concept than intent alignment: an AI that infers our “true” utility function and optimizes it may wrest control away from us in order to expand faster, or make us safer; such an AI would be aligned but not corrigible. There are a few benefits of using corrigibility:

  1. It can be achieved with relatively low levels of intelligence (we can imagine corrigible humans).

  2. It seems to have a positive feedback loop (that is, an AI that reaches some “threshold” of corrigibility would tend to become more corrigible).

  3. It doesn’t seem to require any domain expertise.

(A similar idea would be to build an AI system that only takes actions that the overseer has given informed consent for.)

Note that MIRI’s notion of corrigibility (2015) is similar but much stricter. My guess is that MIRI wants the same intuitive corrigibility properties, but wants them to be created by a simple change to the utility function. Simplicity helps ensure that it cannot be gamed, and the utility function means that you are changing what the AI cares about, rather than trying to constrain a powerful superintelligence. For example, I’d guess that MIRI-corrigibility can depend on whether a shutdown button is pressed, but cannot depend on the reasons for which the shutdown button is pressed.

If you set aside the utility function requirement, then this property can be achieved using constrained optimization: the agent can optimize normally when the button is not pressed, while ensuring that it is still able to shut down if necessary, and it can optimize for shutting down if the button is pressed. If you set aside the simplicity requirement, then you can define the desired policies and recover the correct utility function. But from now on I’m only going to talk about the notion of corrigibility I first introduced.

It has been argued that while corrigibility is simpler than “human values”, it is a “non-natural” type of cognition, such that you are unlikely to be able to find corrigible intelligences with machine learning. (I do not feel the force of this intuition; I agree much more with the earlier intuitions.)

You might be worried that since a corrigible AI defers to us, if we were about to take a suboptimal action that we couldn’t tell was suboptimal, the AI wouldn’t stop us from doing so because it can’t explain to us what would be bad about the world. However, at the very least, it can say “this is bad for reasons I can’t fully explain”.

Worst case guarantees

We still want to guarantee that there will never be a failure of corrigibility, which can’t be done with regular ML techniques, since they only give an average-case guarantee. In order to get a worst-case guarantee, we need other techniques. One proposal is to use adversarial training to find abstracted inputs on which the agent is incorrigible, where the adversary is aided by interpretability techniques that allow the adversary to understand what the agent is thinking. It would be particularly nice to find a mechanistic description of corrigibility, as that would make it easier to verify the absence of incorrigible behavior.

Critics argue that this could never work because machine learning wouldn’t learn the “intended” interpretation of corrigibility, and could be adversarial. I don’t think this objection is critical. It seems to be saying that ML will fail to generalize and there will be situations in which the concept of corrigibility breaks down, but the entire point of adversarial training is to find these situations and train the agent away from them.

While this is usually tied in to the broader iterated amplification agenda, it seems to me that solving just this subproblem would achieve a lot of the value of the full agenda. If we had a way of applying adversarial training to an arbitrary AI agent, such that we are very likely to find potential inputs on which the agent is incorrigible, then presumably AI systems that could be incorrigible would not be deployed. Iterated amplification adds additional safety in that it (hopefully) allows you to assume a smarter, already-aligned adversary, whereas a direct solution to this subproblem would have an approximately-as-capable, not-automatically-aligned adversary, which would probably not have a worst-case guarantee but might still be good enough.

Scaling to superhuman abilities

Iterated amplification

Iterated amplification carves out a broad class of algorithms that can scale to superhuman abilities, with the hope that we can analyze the alignment properties of the entire class of algorithms at once. Algorithms in this class have two components:

  1. Amplification, which increases an agent’s capabilities, at the cost of efficiency.

  2. Distillation, which increases an agent’s efficiency, at the cost of capability.

Given this, starting from some base agent, the algorithm alternates amplification and distillation, to get successively more capable agents, as long as each component is good enough.

Given this broad class of algorithms, we can instantiate many specific algorithms by picking a specific amplification step and a specific distillation step. For example, the amplification step can be done by allowing an overseer to decompose the problem into subproblems, which is especially promising for question answering. Distillation could be done using supervised learning, imitation learning, or reinforcement learning.
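The overall shape of the algorithm class can be written down in a few lines. Everything here is schematic: `amplify` and `distill` are placeholders for, e.g., human-plus-agent decomposition and supervised/RL training, and the toy arithmetic below is only meant to exercise the alternating structure.

```python
def iterated_amplification(agent, amplify, distill, rounds):
    # Alternate the two components: amplification buys capability at the
    # cost of efficiency, distillation buys efficiency back at some cost
    # to capability. If both steps are good enough, capability ratchets up.
    for _ in range(rounds):
        slow_but_strong = amplify(agent)
        agent = distill(slow_but_strong)
    return agent

# Toy instantiation where "capability" is just a number: amplification
# triples it, distillation loses one point of it.
result = iterated_amplification(1, lambda a: a * 3, lambda a: a - 1, rounds=2)
```

With these toy choices, two rounds take a capability of 1 up to 5; the point is only the alternating loop, not the numbers.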

Recursive reward modeling (podcast) is another algorithm that could allow us to scale to superhuman abilities. It can be cast as an algorithm in the iterated amplification class by considering an amplification step that takes agents that can evaluate some set of tasks, and builds new human-agent teams that can evaluate some more complex set of tasks. The distillation step would then be reinforcement learning, to get an agent that can directly solve the more complex tasks. Iterating this eventually leads to an agent that can solve the original desired task.

Iterated amplification does impose a particular structure on algorithms, which can be applied to existing ML problems. However, this may be uncompetitive if the best ML algorithms require different algorithmic structures or different environments, in order to reach high capabilities (though we could then train a question-answering system alongside the other algorithm / environment, which plausibly doesn’t take too many more resources).

The iterated amplification sequence, recursive reward modeling paper, and these posts help explain the full agenda better.


Quantilization (2015) allows you to amplify a base policy by randomly selecting among the top 1/Q of actions the base policy could take, at a cost of at most a Q-fold increase in risk. However, this can forgo benefits of the rest of the base policy. Since quantilization increases risk, it cannot be safely iterated: for example, if you start with a policy with a worst-case 1% chance of failure, and you 5-quantilize it, you now have a worst-case 5% chance of failure. After two more iterations of 5-quantilization, there is no longer a worst-case bound on failure probability.
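A quantilizer is short enough to sketch directly. This is an illustrative implementation (the names are invented): it keeps the best actions until the top 1/Q of the base policy's probability mass is accumulated, then samples among them in proportion to base probability, which is where the at-most-Q-fold risk bound comes from.

```python
import random

def quantilize(actions, base_prob, utility, q, rng=random):
    # Rank actions from best to worst by the (possibly misspecified)
    # utility estimate.
    ranked = sorted(actions, key=utility, reverse=True)
    # Keep the best actions until probability mass q (under the base
    # policy) has been accumulated; here q plays the role of 1/Q.
    kept, mass = [], 0.0
    for a in ranked:
        kept.append(a)
        mass += base_prob(a)
        if mass >= q:
            break
    # Sample among the kept actions in proportion to base probability, so
    # no action's probability is boosted by more than a factor of 1/q.
    weights = [base_prob(a) for a in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# Uniform base policy over four actions, utility = action value.
top_action = quantilize([0, 1, 2, 3], lambda a: 0.25, lambda a: a, q=0.25)
```

With q = 0.25 over a uniform base policy of four actions, only the single best action survives, so each sample boosts its probability by exactly the 1/q = 4-fold bound.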


Another mechanism for scaling beyond humans is debate (podcast), in which an AI agent is trained via self-play in a zero-sum game in which its goal is to “win” the debate, as evaluated by a human judge. The key hope is that detecting a lie is easier than lying: if one of the players lies or deceives or manipulates the human, then the other player can reveal that and thereby win the debate. If this were true, we would expect that the equilibrium behavior is for the agent to provide honest, useful information.

Since its proposal, debate has been tested with MNIST and Fashion MNIST, as well as question answering. There is also a proposal to use it to improve iterated amplification.

Theoretical work brings up the possibility of questions that are “too hard”: while sufficiently long “feature debates” are provably truth-seeking (because the debaters can reveal all of their information), it is possible to construct complex questions in which the debate doesn’t find the right answer. However, the results don’t generalize well from feature debates to real debates.

Relatedly, even if it is easy to detect lies, it’s not clear what would happen with ambiguous questions.

Since debate doesn’t involve alternating between increasing capabilities and increasing efficiency, it isn’t an instance of iterated amplification. However, both iterated amplification and debate are aiming to compute the answer that an exponentially large tree of bounded humans would arrive at (see next section), and so it seems likely that either they would both work, or neither would work.

Factored cognition

Both iterated amplification and debate depend on the factored cognition hypothesis: that arbitrarily complex tasks can be performed arbitrarily well by a giant tree of bounded base agents, possibly extended with features like shared external memory or long-lived assistants (2016).

Iterated amplification checks local nodes in a tree of considerations and broken-down questions, in which an assistant at level k decomposes its questions, gets answers from assistants at level k-1, and combines them into an overall answer. Meanwhile, in debate, if the two agents disagree, they will play down the most difficult / contested path in an exponential tree of arguments and counterarguments, so the debate training procedure is checking a single path from root to leaf in the exponential tree.
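The tree both schemes approximate can be sketched as a recursion. The "human" interface and the toy summation task below are invented for illustration; real decompositions are natural-language questions, not list splits.

```python
def hch(question, human, depth):
    # A bounded agent either answers directly or decomposes the question,
    # delegating subquestions to copies of itself with one less level of
    # recursion, then combines the subanswers into an overall answer.
    subqs = human.decompose(question) if depth > 0 else []
    if not subqs:
        return human.answer_directly(question)
    return human.combine(question, [hch(q, human, depth - 1) for q in subqs])

class ToyHuman:
    # Toy task: sum a list of numbers by repeatedly splitting it in half.
    def decompose(self, q):
        return [q[:len(q) // 2], q[len(q) // 2:]] if len(q) > 1 else []
    def answer_directly(self, q):
        return sum(q)
    def combine(self, q, answers):
        return sum(answers)
```

Iterated amplification trains on individual decomposition nodes of this tree, while debate's training signal comes from a single root-to-leaf path.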

It is an open question whether the factored cognition hypothesis is true. Empirical work has been scaling up, and we should hopefully have some informative evidence in the upcoming year.

The main reason people are skeptical of the hypothesis is that sufficiently complex tasks seem to require building up big contexts or using globally-constructed intuitions or “inexplicable flashes of insight”. This could be done if the “small” agents simulated an arbitrary Turing Machine, but this would lose any guarantees of alignment.

However, we might expect that these tasks could still be done by a tree of humans: humans are allowed to use a heuristic “just because it works”; this should allow the tree of humans to use heuristics that other agents use, including “inexplicable flashes of insight”.


Alignment of the tree of humans

In order for this tree of humans to be aligned (a necessary condition for iterated amplification or debate to be aligned), the initial agent must already be aligned, and putting the agents together must not destroy alignment. One intuition that this is hard is that alignment is not compositional; a “big” agent made up of “small” aligned agents need not be aligned. However, the hope doesn’t depend on compositionality of alignment; it instead depends on ensuring that your agents never do incorrigible optimization.

In addition, it could be the case that “large” initial agents like humans (or human imitations) are not robustly aligned, because there may be some clever argument that causes them to behave incorrigibly. One response would be to use low-bandwidth overseers as the initial agent, who only answer very “small” questions on which we are relatively confident that there are no such failures. We would also hope to train humans to properly decompose questions and behave corrigibly, so that putting together several humans remains corrigible (a task for which we need social scientists).

Note that it is only competitive to approximate the tree of humans with iterated amplification if we expect that any powerful AI systems will also be trained in a manner similar to iterated amplification. If we instead consider a model in which ML perfectly optimizes a function (rather than performing iterated local search), then iterated amplification would be far more expensive than unaligned powerful AI systems. It would be worth studying this simpler model to see if alignment is possible there.

Ascription universality

Even if we know that the tree of humans is aligned, we also need to ensure that the model trained with oversight from the tree of humans will also be aligned. The key claim in favor of this is that HCH (the tree of humans) is universal, that is, it “knows” any facts that a sufficiently smaller computation “knows”. This was formalized here and applied to multiple problems, including the problem that malign optimization might emerge within HCH. While a good explanation of this is out of scope here, I summarized these posts here. Ascription universality does have to be applied to the entire training process and not just the final model.


Interpretability

Since we want to be able to “know everything the model knows”, and also to be able to find situations in which a model behaves incorrigibly (see worst case guarantees above), it would be very useful to be able to peer inside our models and understand what they are doing. It would be particularly useful to be able to identify optimization processes and understand how they come about.

Even though interpretability tools probably could not deal with already deceptive models, since the deceptive models could figure out how to fool the tools, it seems likely that interpretability could help prevent deception from ever arising—hopefully an easier task.

However, interpretability has other uses besides catching problems: it could also be used to get more understandable models during training, provide feedback on the process by which a model makes a decision (rather than feedback on just the decision), or create ML techniques that help us understand the world without acting in it (thus avoiding problems with agential AI).

Unfortunately, I haven’t kept up with interpretability research, so I can’t say how it’s progressed recently, but one paper you could start with is activation atlases.

Impact regularization

Impact measures

In 2018, there was a lot of progress on proposing specific impact measures, including relative reachability and attainable utility preservation (followup, paper). These were recently unified as using similar underlying algorithms but with different “deviation measures”: the former considers the change in number of reachable states, whereas the latter considers the change in attainable utility (for some set of utility functions).
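The attainable utility deviation measure can be sketched in a few lines. This is a hand-rolled illustration (the function names, the no-op baseline, and the averaging are my choices; real implementations learn the auxiliary Q-values): penalize actions by how much they change the agent's ability to attain a set of auxiliary rewards, relative to doing nothing.

```python
def aup_penalty(q_value, state, action, noop, aux_rewards):
    # q_value(r, s, a) estimates the utility attainable for auxiliary
    # reward r after taking action a in state s. Penalize the change,
    # relative to the no-op action, averaged over auxiliary rewards.
    return sum(abs(q_value(r, state, action) - q_value(r, state, noop))
               for r in aux_rewards) / len(aux_rewards)

# Toy Q-function: "smash" changes attainable utility, "wait" does not.
q = lambda r, s, a: r if a == "smash" else 0.0
smash_penalty = aup_penalty(q, "s0", "smash", "wait", [1.0, 2.0])
wait_penalty = aup_penalty(q, "s0", "wait", "wait", [1.0, 2.0])
```

Relative reachability fits the same skeleton with a different deviation measure: the change in the number of reachable states rather than in attainable utility.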

These two posts summarize the work on impact (going back to 2012).

What is impact, anyway?

The Reframing Impact sequence aims to build intuitions about what we mean by “impact”, and concludes that an action is impactful if it changes our ability to get what we want. Of course, this definition depends on “what we want”, whereas usually with impact regularization we want something that is easy to specify. However, we might hope that impact is relatively goal-agnostic, because for most goals you need to pursue the same convergent instrumental subgoals. In particular, we might hope for a formalizable notion of power, that attainable utility preservation could penalize.

To better distinguish between different definitions and techniques for measuring impact, this post proposes several test cases for impact regularization.

Utility of impact measures

The mainline use case for impact regularization is to be an “additional layer of defense”: if for some reason we fail to align an AI system, then hopefully there still won’t be catastrophic consequences, because the AI system only takes low-impact actions. However, this may fail to work for a variety of reasons. Still, work on impact measures could be useful for deconfusion, testing protocols, temporary alignment measures, or value-neutrality verification.

Causal modeling

Causal influence diagrams help us understand what a training process does. Given a causal influence diagram, we can determine observation incentives (what an agent would like to know) and intervention incentives (what an agent would like to change). We can produce such diagrams for AGI safety frameworks, and analyze solutions to reward function tampering, user feedback tampering, and observation tampering. For example, it allows us to show that if the agent’s plans are evaluated by the current reward, then there is no incentive for the agent to tamper with its reward function.
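The current-reward result can be illustrated with a toy environment (the "work"/"hack" actions and the numbers are entirely made up): an agent that evaluates plans with the reward function held fixed at planning time has no reason to prefer a plan that rewrites its reward, while an agent that evaluates each state with whatever reward function exists there does.

```python
def value_current_reward(plan, r0, step, s):
    # Evaluate a plan under the reward function frozen at planning time.
    total = 0.0
    for a in plan:
        s = step(s, a)
        total += r0(s)
    return total

def value_observed_reward(plan, step, s):
    # Evaluate each state with whatever reward function exists in it,
    # which makes tampering with the reward function attractive.
    total = 0.0
    for a in plan:
        s = step(s, a)
        total += s["reward_fn"](s)
    return total

r0 = lambda s: float(s["x"])       # intended task reward: amount of work done

def step(s, a):
    s = dict(s)
    if a == "work":
        s["x"] += 1
    elif a == "hack":              # overwrite the reward function
        s["reward_fn"] = lambda _: 100.0
    return s

start = {"x": 0, "reward_fn": r0}
```

Under the frozen reward, working dominates hacking; under the observed reward, hacking dominates, which is the tampering incentive the diagrams make visible.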

The variables of the diagrams represent important components of the agent and the environment (such as reward functions and dynamics models in the agent, and the user’s preferences and the state of the world in the environment). Different ways of combining these into agent setups lead to different causal influence diagrams. The incentive analysis enables the designer to choose agent setups with good incentive properties.

However, the causal models themselves are not uniquely determined. For example, what counts as wireheading is relative to the stance taken towards the system and its desired goals: if you define wireheading as taking control of some “narrow measurement channel”, then what counts as a measurement channel, and what the goal is, depend on modeling assumptions.

Oracles also benefit from reasoning about causality and influences. A system that maximizes predictive accuracy ends up choosing self-confirming predictions, which can be arbitrarily bad. (This affects self-supervised learning in addition to oracles.) You might hope to avoid this by preventing the AI system from being aware of itself, but this doesn’t work.

Instead, we could ensure that the oracle makes predictions conditional on the predictions not influencing anything (using randomization to do so). There are still other problems besides self-confirming predictions, such as acausal trade.

Decision theory

There’s been a lot of work exploring the intuitions behind decision theory. Since I don’t follow decision theory closely, I’m not going to try to summarize the conversation, and instead you get a list of posts: pro CDT, anti CDT, anti FDT, actually it all depends on counterfactuals, anti UDT because of commitment races, UDT doesn’t work with AIXI, strange reasoning in Troll Bridge, a comparison across decision theories, counterfactual induction posts. There’s also been some discussion of why people care about decision theory: it is useful for improving rationality, finding problems, and deconfusion.

Relatedly, this paper characterizes the decision theories of existing agents, and this post explains how “Pavlov” strategies (similar to reinforcement learning) can work well with game theory.
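The “Pavlov” (win-stay, lose-shift) strategy is tiny to write down. Here is a sketch for the iterated prisoner's dilemma with the standard payoffs; the threshold of 3 encodes “stay after reward or temptation, shift after punishment or sucker's payoff”.

```python
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def pavlov(my_last, my_payoff):
    # Win-stay, lose-shift: repeat your move after a good payoff,
    # switch after a bad one.
    if my_payoff >= 3:
        return my_last
    return "D" if my_last == "C" else "C"

def play(rounds, a="C", b="D"):
    # Two Pavlov players starting out of sync.
    history = []
    for _ in range(rounds):
        pa, pb = PAYOFF[(a, b)]
        history.append((a, b))
        a, b = pavlov(a, pa), pavlov(b, pb)
    return history
```

Two mismatched Pavlov players punish each other once and then lock back into mutual cooperation, which is the reinforcement-learning-like flavor the linked post describes.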

As we get to the end of the technical alignment section, I want to mention BoMAI, which didn’t fit in any of the sections. BoMAI is an AIXI-like system that does not seek power, because it only cares about reward until the end of the episode (myopia), and during the episode it is confined to a box from which information cannot leave. Such an AI system can still be useful because there is also a human in the box, who can transmit information to the outside world after the episode has ended.

Strategy and coordination

So far I’ve been talking about the technical work on the alignment problem. Let’s now switch to more “meta” work that tries to predict the future in order to prioritize across research topics.

Continuous vs discontinuous takeoff

A central disagreement among AI researchers is about how “quickly” AI improves once it reaches human level. Recently, the question has been distilled to whether there will be a discontinuity in AI capabilities. As a result, I will ask whether takeoff will be continuous or discontinuous (as opposed to slow or fast).

One operationalization of this question is whether there will be a 4-year doubling of GDP that ends before the first 1-year doubling of GDP starts. Note that continuous takeoff need not be slow: to get to a 4-year doubling, you need superexponential growth. Under exponential growth, the doubling time stays fixed at its current value of a few decades. Extrapolating historical growth trends (which “supports the possibility of radical increases in growth rate”) would still (probably) be compatible with this operationalization.
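A line of arithmetic shows why the operationalization implies superexponential growth (the 3% figure is just an illustrative stand-in for recent world growth):

```python
import math

def doubling_time_years(annual_growth_rate):
    # Years for GDP to double under constant exponential growth.
    return math.log(2) / math.log(1 + annual_growth_rate)

# At a constant ~3%/year, GDP doubles only every couple of decades; a
# 4-year doubling therefore requires the growth rate itself to increase.
required_rate = 2 ** (1 / 4) - 1   # growth rate needed for a 4-year doubling
```

At roughly 3% annual growth the doubling time is a bit over two decades, while a 4-year doubling needs growth of almost 19% per year, far above any fixed historical rate.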

Two posts argue for continuous takeoff; the main argument is that continuity is very likely for properties that people care about, since lots of people are trying to make progress on the property, and it is less likely that we quickly invest much more effort into making progress on the property. So far, there has not been a compelling response, but this does not mean that researchers agree.

There has been some discussion of particular properties that make discontinuous takeoff seem more likely (though I would guess that they are not the arguments that MIRI researchers would make). For example, perhaps we just need to find the one correct architecture, which will then cause a discontinuity, but note that birds and primates have independently evolved neural architectures that both work well.

Alternatively, AI systems with different explicit utility functions could cooperate by merging to pursue a joint utility function, making them much more effective at coordination than humans, allowing them to avoid principal-agent problems that plague human corporations. This could lead to a discontinuous jump. AI systems could also build monopolies through such coordination to obtain a decisive strategic advantage.

We could also expect that just as the invention of culture and social learning by evolution allowed humans to become the dominant species very quickly (relatively speaking), similarly once AI systems are capable of social learning they may also “take off” discontinuously. However, the same argument could be taken as evidence against a discontinuity, since current natural language systems like GPT-2 could already be thought of as processing culture or doing social learning.

It is worth noting that questions about recursive self improvement and decisive strategic advantage do not map cleanly onto the question of takeoff speeds, though they are related. The primary reason takeoff speed is important is that it determines whether or not we will be able to respond to problems as they come up. For this purpose, it’s probably better to define takeoff speed with respect to the amount of work that can be done as AI takes off, which might differ significantly from calendar time.

The importance of compute

There is a strong case that the most effective methods (so far) are the ones that can leverage more computation, and the AI-GA approach to general intelligence is predicated on this view (for example, by learning good learning environments). In fact, since the rise of deep learning in 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time. It’s important to note the caveat that we cannot simply increase compute: we also need good data, which is sparse in rare, unsafe situations (consider driving when a pedestrian suddenly jumps on the road). This may require human knowledge and explicit models.
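As a sanity check on what a 3.4-month doubling time means (simple arithmetic, not a claim about any particular system):

```python
# A doubling time of 3.4 months compounds to roughly an order of
# magnitude (about 11-12x) more compute each year.
growth_per_year = 2 ** (12 / 3.4)
```

Over several years this compounds into many orders of magnitude, which is why the trend dominates hardware-efficiency gains in these discussions.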

Since it seems more likely that compute grows continuously (relative to a "deep insights" model), this would argue for a more continuous takeoff. You might nonetheless expect that we still need deep insights, perhaps because you think that current techniques could never lead to AGI, due to their lack of some property crucial to general intelligence (such as causal reasoning). However, for any such property, it seems that some neural net could encode it, and the relevant question is how big the neural net has to be and how long it takes for local search to find the right computation.

So­ciolog­i­cal evidence

It has recently become more common to critique the field of AI as a whole, which should (arguably) cause you to lengthen your timelines. For example, hypothesizing after the results are known makes for bad science that doesn't generalize, and research that is "reproducible" in the sense that the code can be rerun to get the same results need not have external validity. There is also a tendency for researchers to rely on trial and error, which means that with enough repeated trials we can get results that look significant purely by chance. It also means that researchers don't understand the systems they build; reorienting the field to focus on understanding could make our design decisions more deliberate and make it more likely that we build aligned AIs.
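The multiple-comparisons point is easy to demonstrate. In this hypothetical sketch (the experiment and thresholds are made up for illustration), both "methods" are identical coin flips, yet repeated trials still produce apparent improvements:

```python
import random

# Run many no-effect "experiments": method B is identical to method A,
# so any measured "improvement" is pure sampling noise.
random.seed(0)  # fixed seed so the demo is reproducible

def fake_experiment(n: int = 100) -> float:
    """A null experiment: both 'methods' succeed on a fair coin flip.
    Returns the apparent improvement of B over A."""
    a = sum(random.random() < 0.5 for _ in range(n))
    b = sum(random.random() < 0.5 for _ in range(n))
    return (b - a) / n

improvements = [fake_experiment() for _ in range(50)]
big_wins = [imp for imp in improvements if imp >= 0.05]
print(len(big_wins))  # several apparent 5-point "gains" despite zero real effect
```

With enough trials, some runs will clear any fixed significance-looking threshold by chance alone, which is exactly the failure mode described above.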

We should also ex­pect that at least in­dus­try re­search is bi­ased to­wards short timelines, since any com­pa­nies that didn’t ar­gue for short timelines would be much less likely to get fund­ing.

Meta work on forecasting

While fore­cast­ing the fu­ture is no­to­ri­ously hard, col­lab­o­ra­tive and check­able fore­cast­ing is even harder. It would be nice to at least re­duce the difficulty back down to “reg­u­lar” fore­cast­ing. Three steps have been taken to­wards this:

  1. Peo­ple need to agree on the mean­ing of the terms used; an AI fore­cast­ing dic­tio­nary has been de­vel­oped for this pur­pose.

  2. In order to be checkable, questions need to be operationalized; but then the answer often ends up depending on some "distractor" feature. For example, whether we have a superhuman AI at <game> by 2025 depends a lot on who tries to make such an AI, rather than on whether we have the technical ability to make one. A partial solution was to create a resolution council, and instead have questions ask about the future opinion of the resolution council.

  3. This post pro­vides ad­vice on how to write good fore­cast­ing ques­tions, with a database of ex­am­ples.

Of course, there is still the hard prob­lem of ac­tu­ally figur­ing out what hap­pens in the fu­ture (and it’s even hard to tell whether long-run fore­cast­ing is fea­si­ble). The Good Judg­ment Pro­ject stud­ied prac­tices that help with this prob­lem, sum­ma­rized here.

Another is­sue arises when ask­ing mem­bers of a group (e.g. AI re­searchers) about out­comes that de­pend on ac­tions within that group: due to the by­stan­der effect, ev­ery­one may pre­dict that the group will solve a prob­lem, even though they them­selves are not try­ing to solve the prob­lem. So, we should in­stead ask peo­ple to make pre­dic­tions about the pro­por­tion of mem­bers that try to solve a prob­lem, and com­pare that to the pro­por­tion of mem­bers who say that they are try­ing to solve the prob­lem.
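The proposed elicitation can be sketched concretely. In this illustrative example (the data and field names are invented), we compare each member's prediction of how much of the group is working on the problem against the group's actual self-reports:

```python
# Illustrative sketch: detect a bystander effect by comparing the
# group's average *prediction* of how many members are working on a
# problem against the fraction who *say* they are working on it.

def bystander_gap(responses):
    """responses: list of dicts with keys 'working' (bool) and
    'predicted_fraction_working' (float in [0, 1]).
    Returns predicted minus actual fraction working; a positive gap
    suggests members assume others are handling the problem."""
    actual = sum(r["working"] for r in responses) / len(responses)
    predicted = sum(
        r["predicted_fraction_working"] for r in responses
    ) / len(responses)
    return predicted - actual

responses = [
    {"working": False, "predicted_fraction_working": 0.5},
    {"working": False, "predicted_fraction_working": 0.4},
    {"working": True,  "predicted_fraction_working": 0.6},
    {"working": False, "predicted_fraction_working": 0.5},
]
print(bystander_gap(responses))  # ≈ 0.25: the group overestimates its own effort
```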

AI Progress

A full update on AI progress in 2019 would be far too long, so here I'll just mention some results I found interesting, which biases the selection towards 1. results involving "throwing compute at the problem", and 2. understanding deep learning.

Re­in­force­ment learning

  1. AlphaStar (update, discussion) became extremely good at StarCraft.

  2. OpenAI Five beat the world cham­pi­ons at Dota, and could play co­op­er­a­tively alongside hu­mans.

  3. OpenAI trained a robotic hand to manipulate a Rubik's cube, so that it could sometimes solve a jumbled cube when given the steps of the solution. See also this discussion.

  4. MuZero is an evolu­tion of AlphaZero where MCTS is ap­plied on a learned world model op­ti­mized for plan­ning, al­low­ing it to mas­ter Atari in ad­di­tion to AlphaZero’s Go, Chess, and Shogi. See also this pa­per on in­stru­men­tally learned world mod­els.

  5. Pluribus was shown to be su­per­hu­man at mul­ti­player poker. (Note that to my knowl­edge it did not use deep learn­ing, and it did not re­quire much com­pute.)

  6. With a com­plex enough hide-and-seek en­vi­ron­ment, self-play can learn qual­i­ta­tively in­ter­est­ing be­hav­iors.
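MuZero's key move (item 4 above) is that the search runs entirely inside a learned model rather than the real environment. A minimal sketch of that planning loop, with toy stand-ins for the learned networks (the names `dynamics` and `value` are illustrative, not MuZero's actual architecture):

```python
# Hedged sketch of MuZero-style planning: Monte Carlo tree search over a
# *model* (dynamics + value functions), never touching the environment.
import math

def dynamics(state, action):
    """Toy stand-in for the learned dynamics network:
    returns (next_state, reward). Action 1 is always better here."""
    return state + 1, float(action)

def value(state):
    """Toy stand-in for the learned value network."""
    return 0.0

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}   # action -> (reward, child Node)
        self.visits = 0
        self.total_value = 0.0

def ucb(node, action, c):
    reward, child = node.children[action]
    if child.visits == 0:
        return float("inf")
    q = reward + child.total_value / child.visits
    return q + c * math.sqrt(math.log(node.visits) / child.visits)

def mcts(root_state, actions, n_sims=100, c=1.4, discount=0.99):
    root = Node(root_state)
    for _ in range(n_sims):
        node, path = root, []
        # Selection: descend while every action has been tried.
        while node.children and len(node.children) == len(actions):
            action = max(actions, key=lambda a: ucb(node, a, c))
            reward, child = node.children[action]
            path.append((node, reward))
            node = child
        # Expansion: try one untried action via the learned model.
        untried = [a for a in actions if a not in node.children]
        if untried:
            a = untried[0]
            next_state, reward = dynamics(node.state, a)
            child = Node(next_state)
            node.children[a] = (reward, child)
            path.append((node, reward))
            node = child
        # Backup: propagate the model's value estimate up the path.
        g = value(node.state)
        node.visits += 1
        node.total_value += g
        for parent, reward in reversed(path):
            g = reward + discount * g
            parent.visits += 1
            parent.total_value += g
    # Act greedily with respect to visit counts, as MuZero does.
    return max(actions, key=lambda a: root.children[a][1].visits)
```

In the toy model, action 1 always yields reward 1, so `mcts(0, [0, 1])` settles on action 1; in MuZero the three stub functions are replaced by trained networks and the search is additionally guided by a learned policy prior.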

Deep learning

  1. While GPT-2 is the best known, several large language models have become eerily good at capturing language, such as Transformer-XL and XLNet.

  2. SATNet pro­posed a differ­en­tiable layer for neu­ral net­works that pro­vides a strong in­duc­tive bias to­wards “log­i­cal rea­son­ing”, though even reg­u­lar ma­chine trans­la­tion tech­niques work well for func­tion in­te­gra­tion and differ­en­tial equa­tion solv­ing.

  3. The lot­tery ticket hy­poth­e­sis from 2018 was tested much more.

  4. The dou­ble de­scent phe­nomenon was em­piri­cally val­i­dated.

Field building

While there have been a lot of field build­ing efforts, they are rel­a­tively dis­joint and not part of a con­ver­sa­tion, and so I’ve sum­ma­rized them in lists.

Sum­maries and reviews

  1. This talk and this multipart podcast provide overviews of approaches to technical AI alignment.

  2. This post de­com­poses the benefi­cial AI prob­lem into a tree of differ­ent sub­prob­lems (with a par­tic­u­lar fo­cus on the al­ign­ment prob­lem).

  3. There is of course the an­nual liter­a­ture re­view and char­ity com­par­i­son.

  4. This post iden­ti­fies im­por­tant hy­pothe­ses that re­searchers dis­agree about.

Agen­das and prioritization

  1. This doc pro­vides an overview of the tech­ni­cal prob­lems that need to be solved to al­ign AI sys­tems (as op­posed to e.g. MIRI’s de­con­fu­sion ap­proach).

  2. These posts list questions that could be tackled by philosophers and by non-AI researchers, respectively.

  3. It would be bet­ter to bridge near- and long-term con­cerns about AI, to pre­vent the fields from “fight­ing” each other.

  4. For s-risks, rather than look­ing at par­tic­u­lar sce­nar­ios, we could fo­cus on risk fac­tors: prop­er­ties we can in­ter­vene on to make risks less prob­a­ble or less se­vere.

Events and news updates

  1. Several conferences and workshops were held in 2019, including Beneficial AGI, SafeML at ICLR, AI Safety at IJCAI, and Uncertainty and Robustness at ICML.

  2. There was a hu­man-al­igned AI sum­mer school and an AI safety camp.

  3. OpenAI switched to a "capped-profit" structure and received a $1B investment from Microsoft, while still expressing support for their charter.

  4. The Center for Security and Emerging Technology (CSET) was founded.


See the Google Doc for a list of all the names and links in the text above.