2018 AI Alignment Literature Review and Charity Comparison

Cross-posted to the EA forum.


Like last year and the year before, I've attempted to review the research that has been produced by various organisations working on AI safety, to help potential donors gain a better understanding of the landscape. This is a similar role to that which GiveWell performs for global health charities, and somewhat similar to a securities analyst's with regard to possible investments. It appears that once again no-one else has attempted to do this, to my knowledge, so I've once again undertaken the task.

This year I have included several groups not covered in previous years, and read more widely in the literature.

My aim is basically to judge the output of each organisation in 2018 and compare it to their budget. This should give a sense of the organisations' average cost-effectiveness. We can also compare their financial reserves to their 2019 budgets to get a sense of urgency.

Note that this document is quite long, so I encourage you to just read the sections that seem most relevant to your interests, probably the sections about the individual organisations. I do not recommend you skip to the conclusions!

I'd like to apologize in advance to everyone doing useful AI Safety work whose contributions I may have overlooked or misconstrued.

Methodological Considerations

Track Records

Judging organisations on their historical output is naturally going to favour more mature organisations. A new startup, whose value all lies in the future, will be disadvantaged. However, I think that this is correct. The newer the organisation, the more funding should come from people with close knowledge. As organisations mature, and have more easily verifiable signals of quality, their funding sources can transition to larger pools of less expert money. This is how it works for startups turning into public companies, and I think the same model applies here.

This judgement involves analysing a large number of papers relating to Xrisk that were produced during 2018. Hopefully the year-to-year volatility of output is sufficiently low that this is a reasonable metric. I also attempted to include papers from December 2017, to take into account the fact that I'm missing the last month's worth of output from 2017, but I can't be sure I did this successfully.

This article focuses on AI risk work. If you think other causes are important too, your priorities might differ. This particularly affects GCRI, FHI and CSER, all of whom do a lot of work on other issues.

We focus on papers, rather than outreach or other activities. This is partly because papers are much easier to measure: while there has been a large increase in interest in AI safety over the last year, it's hard to work out who to credit for this. It is also partly because I think progress has to come by persuading AI researchers, which I think comes through technical outreach and publishing good work, not popular/political work.


Politics

My impression is that policy on technical subjects (as opposed to issues that attract strong views from the general population) is generally made by the government and civil servants in consultation with, and being lobbied by, outside experts and interests. Without expert (e.g. top ML researchers at Google, CMU & Baidu) consensus, no useful policy will be enacted. Pushing directly for policy seems if anything likely to hinder expert consensus. Attempts to directly influence the government to regulate AI research seem very adversarial, and risk being pattern-matched to ignorant opposition to GM foods or nuclear power. We don't want the 'us-vs-them' situation that has occurred with climate change to happen here. AI researchers who are dismissive of safety law, regarding it as an imposition and encumbrance to be endured or evaded, will probably be harder to convince of the need to voluntarily be extra-safe—especially as the regulations may actually be totally ineffective. The only case I can think of where scientists are relatively happy about punitive safety regulations, nuclear power, is one where many of those initially concerned were scientists themselves. Given this, I actually think policy outreach to the general population is probably negative in expectation.

If you're interested in this I'd recommend you read this blog post (also reviewed below).


Openness

I think there is a strong case to be made that openness in AGI capacity development is bad. As such I do not ascribe any positive value to programs to 'democratize AI' or similar.

One interesting question is how to evaluate non-public research. For a lot of safety research, openness is clearly the best strategy. But what about safety research that has, or potentially has, capabilities implications, or other infohazards? In this case it seems best if the researchers do not publish it. However, this leaves funders in a tough position: how can we judge researchers if we cannot read their work? Maybe instead of doing top secret valuable research they are just slacking off. If we donate to people who say "trust me, it's very important and has to be secret" we risk being taken advantage of by charlatans; but if we refuse to fund, we incentivize people to reveal possible infohazards for the sake of money. (Is it even a good idea to publicise that someone else is doing secret research?)

With regard to published research, in general I think it is better for it to be open access, rather than behind journal paywalls, to maximise impact. Reducing this impact by a significant amount in order for the researcher to gain a small amount of prestige does not seem like an efficient way of compensating researchers to me. Thankfully this does not occur much with CS papers, as they are all on arXiv, but it is an issue for some strategy papers.

More prosaically, organisations should make sure to upload the research they have published to their websites! Having gone to all the trouble of doing useful research, it is a shame that many organisations don't take this simple step to significantly increase the reach of their work.

Research Flywheel

My basic model for AI safety success is this:

  1. Identify interesting problems

    1. As a byproduct this draws new people into the field through nerd-sniping

  2. Solve interesting problems

    1. As a byproduct this draws new people into the field through credibility and prestige

  3. Repeat

One advantage of this model is that it produces both object-level work and field growth.

There is also some value in arguing for the importance of the field (e.g. Bostrom's Superintelligence) or addressing criticisms of the field.

Noticeably absent are strategic pieces. In previous years I have found these helpful; however, lately fewer seem to yield incremental updates to my views, so I generally ascribe lower value to these. This does not apply to technical strategy pieces, about e.g. whether CIRL or Amplification is a more promising approach.

Near vs Far Safety Research

One approach is to research things that will make contemporary ML systems more safe, because you think AGI will be a natural outgrowth from contemporary ML, and this is the only way to get feedback on your ideas. I think of this approach as being exemplified by Concrete Problems. You might also hope that even if ML ends up leading us into another AI Winter, the near-term solutions will generalize in a useful way, though this is of course hard to judge. To the extent that you endorse this approach, you would probably be more likely to donate to CHAI.

Another approach is to try to reason directly about the sorts of issues that will arise with superintelligent AI, and won't get solved anyway / rendered irrelevant as a natural side effect of ordinary ML research. To the extent that you endorse this approach, you would probably be more likely to donate to MIRI, especially for their Agent Foundations work.

I am not sure how to relatively value these two things.

There are a number of other topics that often get mentioned as AI Safety issues. I generally do not think it is important to support organisations or individuals working on these issues unless there is some direct read-through to AGI safety.

I have heard it argued that we should become experts in these areas in order to gain credibility and influence for the real policy work. However, I am somewhat sceptical of this, as I suspect that as soon as a domain is narrow-AI-solved it will cease to be viewed as AI.

Autonomous Cars

My view is that the localised nature of any tragedies, plus the strong incentive alignment, mean that private companies will solve this problem by themselves.


Technological Unemployment

While technological advances continually mechanise and replace labour in individual categories, they also open up new ones. Contemporary unemployment has more to do with poor macroeconomic policy and inflexible labour markets than robots. AI strong enough to replace humans in basically every job is basically AGI-complete. At that point we should be worried about survival, and if we solve the alignment problem well enough to prevent extinction we will likely have also solved it well enough to prevent mass unemployment (or at least the negative effects of such, if you believe the two can be separated).

There has been an increase in interest in a 'Basic Income' – an unconditional cash transfer given to all citizens – as a solution to AI-driven unemployment. I think this is a big mistake, and largely motivated reasoning by people who would have supported it anyway. In a Hansonian scenario, all meat-based humanity has is our property rights. If property rights are strong, we will become very rich. If they are weak, and the policy is that every agent gets a fair share, all the wealth will be eaten up as Malthusian EMs massively outnumber physical humans and drive the basic income down to the price of some cycles on AWS.


Bias

The vast majority of discussion in this area seems to consist of people who are annoyed that ML systems learn based on the data, rather than based on the prejudices/moral views of the writer. While in theory this could be useful for teaching people about the difficulty of the alignment problem, the complexity of human value, etc., in practice I doubt this is the case. This presentation is one of the better ones I have seen on the subject.

Other Existential Risks

Some of the organisations described below also do work on other existential risks, for example GCRI, FLI and CSER. I am not an expert on other Xrisks, so their work in these areas is hard for me to evaluate, but it seems likely that many people who care about AI Alignment will also care about them, so I will mention publications in these areas. The exception is climate change, which is highly non-neglected.

Financial Reserves

Charities like having financial reserves to provide runway and guarantee that they will be able to keep the lights on for the immediate future. This could be justified if you thought that charities were expensive to create and destroy, and were worried about this occurring by accident due to the whims of donors.

Donors prefer charities not to hold too many reserves. Firstly, those reserves are cash that could be spent on outcomes now, by either the specific charity or others. Valuable future activities by charities are supported by future donations; they do not need to be pre-funded. Additionally, having reserves increases the risk of organisations 'going rogue', because they are insulated from the need to convince donors of their value.

As such, in general I do not give full credence to charities saying they need more funding because they want more than a year of runway in the bank. A year's worth of reserves should provide plenty of time to raise more funding.

It is worth spending a moment thinking about the equilibrium here. If donors target a lower runway number than charities, charities might curtail their activities to allow their reserves to last for longer. At this lower level of activities, donors would then decide a lower level of reserves is necessary, and so on, until eventually the overly conservative charity ends up with a budget of zero, with all the resources instead given to other groups who turn donations into work more promptly. This allows donor funds to be turned into research more quickly.
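This equilibrium argument can be illustrated with a toy model: donors fund a charity only up to some target runway, which is lower than the runway the charity insists on holding, so the charity's budget shrinks geometrically. All parameters here are hypothetical, purely to make the dynamic concrete.

```python
# Toy model of the donor/charity runway equilibrium described above.
# donor_target: runway (in years) donors are willing to fund up to.
# charity_target: runway the charity insists on keeping in reserve.
# If charity_target > donor_target, the charity must shrink its budget each
# round to stretch its reserves, and its budget decays toward zero.

def budget_path(initial_budget, donor_target, charity_target, rounds=5):
    budget = initial_budget
    path = [budget]
    for _ in range(rounds):
        # Donors top reserves up to donor_target years of the current budget;
        # the charity then sets its budget so those reserves last charity_target years.
        reserves = donor_target * budget
        budget = reserves / charity_target
        path.append(budget)
    return path

print(budget_path(1_000_000, donor_target=1.0, charity_target=2.0))
# budget halves each round: 1,000,000 → 500,000 → 250,000 → ...
```

With donor_target equal to charity_target the budget is stable, which is why the one-year convention discussed above only works if both sides converge on it.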

I estimated reserves = (cash and grants) / (2019 budget – committed annual funding). In general I think of this as something of a measure of urgency. This is a simpler calculation than many organisations (MIRI, CHAI etc.) shared with me, because I want to be able to compare consistently across organisations. I attempted to compare the amount of reserves different organisations had, but found this rather difficult. Some organisations were extremely open about their financing (thank you CHAI!). Others were less so. As such these should be considered suggestive only.
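For concreteness, the runway metric above can be computed as follows. The figures in the example are made-up placeholders, not any organisation's actual numbers.

```python
# Sketch of the runway/urgency metric described above.
# All figures below are hypothetical, not real organisation data.

def runway_years(cash_and_grants, budget_2019, committed_funding):
    """reserves = (cash and grants) / (2019 budget - committed annual funding)."""
    uncommitted_need = budget_2019 - committed_funding
    if uncommitted_need <= 0:
        # The budget is fully covered by committed funding: no urgency on this metric.
        return float("inf")
    return cash_and_grants / uncommitted_need

# Hypothetical example: $3.0m in the bank, a $4.8m budget, $1.8m already committed.
print(runway_years(3_000_000, 4_800_000, 1_800_000))  # 1.0 year of runway
```

A lower number indicates a more urgent funding need on this (admittedly crude) measure.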

Donation Matching

In general I believe that charity-specific donation matching schemes are somewhat dishonest, despite my having provided matching funding for at least one in the past.

Ironically, despite this view being espoused by GiveWell (albeit in 2011), this is basically OpenPhil's policy of, at least in some cases, artificially limiting their funding to 50% of a charity's need, which some charities argue (though not OpenPhil themselves, that I recall) effectively provides a 1:1 match for outside donors. I think this is bad. In the best case this forces outside donors to step in, imposing marketing costs on the charity and research costs on the donors. In the worst case it leaves valuable projects unfunded.

Obviously cause-neutral donation matching is different and should be exploited. Everyone should max out their corporate matching programs if possible, and things like the annual Facebook Match and the quadratic-voting match were great opportunities.

Poor Quality Research

Partly thanks to the efforts of the community, the field of AI safety is considerably better respected and funded than was previously the case, which has attracted a lot of new researchers. While generally good, one side effect of this (perhaps combined with the fact that many low-hanging fruits of the insight tree have been plucked) is that a considerable amount of low-quality work has been produced. For example, there are a lot of papers which can be accurately summarized as asserting "just use ML to learn ethics". Furthermore, the conventional peer review system seems to be extremely bad at dealing with this issue.

The standard view here is just to ignore low quality work. This has many advantages, for example 1) it requires little effort, and 2) it doesn't annoy people. This conspiracy of silence seems to be the strategy adopted by most scientific fields, except in extreme cases like anti-vaxxers.

However, I think there are some downsides to this strategy. A sufficiently large milieu of low-quality work might degrade the reputation of the field, deterring potentially high-quality contributors. And while low-quality contributions might help improve Concrete Problems' citation count, they may use up scarce funding.

Moreover, it is not clear to me that 'just ignore it' really generalizes as a community strategy. Perhaps you, enlightened reader, can judge that "How to solve AI Ethics: Just use RNNs" is not great. But is it really efficient to require everyone to independently work this out? Furthermore, I suspect that the idea that we can all just ignore the weak stuff is somewhat an example of the typical mind fallacy. Several times I have come across people I respect according respect to work I found blatantly rubbish. And several times I have come across people I respect arguing persuasively that work I had previously respected was very bad – but I only learnt they believed this by chance! So I think it is quite possible that many people will waste a lot of time as a result of this strategy, especially if they don't happen to move in the right social circles.

Finally, I will note that the two examples which spring to mind of cases where the EA community has forthrightly criticized people for producing epistemically poor work – namely Intentional Insights and ACE – seem ex post to have been the right thing to do, although in both cases the targets were inside the EA community, rather than vaguely-aligned academics.

Having said all that, I am not a fan of unilateral action, so will largely continue to abide by this non-aggression convention. My only deviation here is to make it explicit – though see this by 80,000 Hours.

The Bay Area

Much of the AI and EA communities, and especially the EA community concerned with AI, is located in the Bay Area, especially Berkeley and San Francisco. This is an extremely expensive place, and is dysfunctional both politically and socially. A few months ago I read a series of stories about abuse in the bay and was struck by how many things I considered abhorrent were in the story merely as background. In general I think the centralization is bad, but if there must be centralization I would prefer it be almost anywhere other than Berkeley. Additionally, I think many funders are geographically myopic, and biased towards funding things in the Bay Area. As such, I have a mild preference towards funding non-Bay-Area projects. If you're interested in this topic I recommend you read this or this or this.

Organisations and Research

MIRI: The Machine Intelligence Research Institute

MIRI is the largest pure-play AI existential risk group. Based in Berkeley, it focuses on mathematics research that is unlikely to be produced by academics, trying to build the foundations for the development of safe AIs. They were founded by Eliezer Yudkowsky and are led by Nate Soares.

Historically they have been responsible for much of the germination of the field, including advocacy, but are now focused on research. In general they do very 'pure' mathematical work, in comparison to other organisations with more 'applied' ML or strategy focuses. I have historically been impressed with their research.

Their agent foundations work is basically trying to develop the correct way of thinking about agents and learning/decision making, by spotting areas where our current models fail and seeking to improve them.


Garrabrant and Demski's Embedded Agency Sequence is a short sequence of blog posts outlining MIRI's thinking about Agent Foundations. It describes the issues that arise when reasoning about agents that are embedded in their environment. I found it to be a very intuitive explanation of many issues that MIRI is working on. However, little of it will be new to someone who has worked through MIRI's previous, less accessible work on the subject.

Yudkowsky and Christiano's Challenges to Christiano's Capability Amplification Proposal discusses Eliezer's objections to Paul's Amplification agenda in back-and-forth blog format. At a high level, Paul is attempting a more direct solution, working largely within the existing ML framework, vs MIRI's desire to work on things like agent foundations first. Eliezer has several objections. First, he is concerned that most aggregation/amplification methods do not preserve alignment, and that finding one that does (and building the low level agents) is essentially as hard as solving the alignment problem. Second, any loss of alignment would be multiplied with every level of amplification. Third, there may be many problems that need sequential work—additional bandwidth does not suffice. Additionally, he objects that Paul's ideas would likely be far too slow, due to the huge amount of human input required. This was an interesting post, but I think it could have been clearer. Researchers from OpenAI were also named authors on the paper.

Yudkowsky's The Rocket Alignment Problem is a blog post presenting a Galileo-style dialogue/analogy for why MIRI is taking a seemingly indirect approach to AI Safety. It was enjoyable, but I'm not sure how convincing it would be to outsiders. I guess if you thought a deep understanding of the target domain was never necessary it could provide an existence proof.

Demski's An Untrollable Mathematician Illustrated provides a very accessible explanation of some results about logical induction.

MIRI researchers also appeared as co-authors on:

Non-disclosure policy

Last month MIRI announced their new policy of nondisclosure-by-default:

[G]oing forward, most results discovered within MIRI will remain internal-only unless there is an explicit decision to release those results, based usually on a specific anticipated safety upside from their release.

This is a significant change from their previous policy. As of circa a year ago my understanding was that MIRI would be doing secret research largely in addition to their current research programs, not that all their programs would become essentially secret.

At the same time, secrecy at MIRI is not entirely new. I'm aware of at least one case from 2010 where they decided not to publish something for similar reasons; as far as I'm aware this thing has never been 'declassified' – indeed perhaps it has been forgotten.

In any case, one consequence of this is that for 2018 MIRI has published essentially nothing. (Exceptions to this are discussed above.)

I find this very awkward to deal with.

On the one hand, I do not want people to be pressured into premature disclosure for the sake of funding. This space is sufficiently full of infohazards that secrecy might be necessary, and in its absence researchers might prudently shy away from working on potentially risky things—in the same way that no-one in business sends sensitive information over email any more. MIRI are in exactly the sort of situation that you would expect might give rise to the need for extreme secrecy. If secret research is a necessary step en route to saving the world, it will have to be done by someone, and it is not clear there is anyone much better.

On the other hand, I don't think we can give people money just because they say they are doing good things, because of the risk of abuse. There are many other reasons for not publishing anything. Some simple ones would be "we failed to produce anything publishable" or "it is fun to fool ourselves into thinking we have exciting secrets" or "we are doing bad things and don't want to get caught."

Additionally, by hiding the highest quality work we risk impoverishing the field, making it look unproductive and unattractive to potential new researchers.

One possible solution would be for the research to be done by impeccably deontologically moral people, whose moral code you understand and trust. Unfortunately I do not think this is the case with MIRI. (I also don't think it is the case with many other organisations, so this is not a specific criticism of MIRI, except insomuch as you might have held them to a higher standard than others.)

Another possible solution would be for major donors to be insiders, who read the secret material and can verify it is worth supporting. If the organisation also wanted to keep small donors, the large donors could give their seal of approval; otherwise the organisation could simply decide it did not need them any more. However, if MIRI are adopting this strategy they are keeping it a secret from me! Perhaps this is reassuring about their ability to keep secrets.

Perhaps we hope that MIRI employees would leak information of any wrongdoing, but not leak potential info-hazards?

Finally, I will note that MIRI have been very generous with their time in attempting to help me understand what they are doing.


According to MIRI they have around 1.5 years of expenses in reserve, and their 2019 estimated budget is around $4.8m. This does not include the potential purchase of a new office they are considering.

There is prima facie counterfactually valid matching funding available from REG's Double Up Drive.

If you wanted to donate to MIRI, here is the relevant web page.

FHI: The Future of Humanity Institute

FHI is a well-established research institute, affiliated with Oxford and led by Nick Bostrom. Compared to the other groups we are reviewing they have a large staff and a large budget. As a relatively mature institution they produced a decent amount of research over the last year that we can evaluate. They also do a significant amount of outreach work.

Their research is more varied than MIRI's, including strategic work, work directly addressing the value-learning problem, and corrigibility work.


Armstrong and O'Rourke's 'Indifference' methods for managing agent rewards provides an overview of Stuart's work on Indifference. These are methods that try to prevent agents from manipulating a certain event, or make them ignore it, or change utility function without trying to fight the change. In the paper they lay out extensive formalism and prove some results. Some but not all will be familiar to people who have been following his other work in the area. The key to understanding why the utility function in the example is defined the way it is, and vulnerable to the problem described in the paper, is that we do not directly observe age—hence the need to base it on wristband status. I found the example a little confusing because it could also be solved by just scaling up the punishment for mis-identification that is caught, in line with Becker's Crime and Punishment: An Economic Approach (1974), but this approach wouldn't work if you didn't know the probabilities ahead of time. Overall I thought this was an excellent paper. Researchers from ANU were also named authors on the paper.

Armstrong and Mindermann's Impossibility of deducing preferences and rationality from human policy argues that you cannot infer human preferences from the actions of people who may be irrational in unknown ways. The basic point is quite trivial—that arbitrary irrationalities can mean that any set of values could have produced the observed actions—but at the same time I hadn't internalised why this would be a big problem for the IRL framework, and in any case it is good to have important things written down. More significant is that they also showed that 'simplicity' assumptions will not save us—the 'simplest' solution will (almost definitely) be degenerate. This suggests we do need to 'hard code' some priors about human values into the AI—they suggest beliefs about truthful human utterances (though of course, as speech acts are acts all the same, it seems that some of the same problems occur again at this level of meta). Alternatives (not mentioned in the paper) could be to look to psychology or biology (e.g. Haidt or evolutionary biology). Overall I thought this was an excellent paper.

Armstrong and O'Rourke's Safe Uses of AI Oracles suggests two possible safe Oracle designs. The first takes advantage of Stuart's trademark indifference results to build an oracle whose reward is only based on cases where the output is deleted after being automatically verified, and which hence cannot attempt to manipulate humanity. I thought this was clever, and it's nice to see some payoff from the indifference machinery he's been working on, though this Oracle only works for NP-style questions, and assumes the verifier cannot be manipulated—which is a big assumption. The paper also includes a simulation of such an Oracle, showing how the restriction affects performance. The rest of the paper describes the more classic technique of restricting an Oracle to give answers simple enough that we hope they're not potentially manipulative, and frequently re-starting the Oracle. Researchers from ANU were also named authors on the paper.

Dafoe's AI Governance: A Research Agenda is an introduction to the issues faced in AI governance, aimed at future policy researchers. It seems to do a good job of this. As lowering barriers to entry is important for new fields, this is potentially a very valuable document if you are highly concerned about the governance side of AI. In particular, it covers policy work to address threats from general artificial intelligence as well as near-term narrow AI issues, which is a major plus to me. In some ways it feels similar to Superintelligence.

Sandberg's Human Extinction from Natural Hazard Events provides a detailed overview of extinction risks from natural events. The paper is both detailed and broad, and is something of an updated version of part of Bostrom and Cirkovic's Global Catastrophic Risks. His conclusion is broadly that man-made risks are significantly larger than natural ones. As with any Anders paper it contains a number of interesting anecdotes—for example I hadn't realised that people in 1910 were concerned that Halley's Comet might poison the atmosphere!

Schulze and Evans's Active Reinforcement Learning with Monte-Carlo Tree Search provides an algorithm for efficient reinforcement learning when learning the reward is costly. In most RL designs the agent always sees the reward; however, this would not be the case with CIRL, because the rewards require human input, which is expensive, so we have to ration it. Here Sebastian and Owain produce a new algorithm, BAMCP++, that tries to address this in an efficient way. The paper provides simulations to show the near-optimality of this algorithm in some scenarios versus the failure of rivals, and some theoretical considerations for why things like Thompson Sampling would struggle.
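To make the setting concrete: in active RL the agent only sees a reward when it pays a query cost, so it must ration its reward observations. The sketch below is not the paper's BAMCP++ algorithm, just a minimal epsilon-greedy bandit of my own construction where each reward observation costs something and the agent stops querying an arm once it has enough samples.

```python
# Toy illustration of the active-RL setting: rewards are only observed when
# the agent pays a query cost. This is NOT BAMCP++, just a minimal
# epsilon-greedy bandit with a per-observation cost and a query budget per arm.
import random

def active_bandit(arm_probs, steps=2000, query_cost=0.1, max_queries=50, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)   # how many times each arm's reward was queried
    means = [0.0] * len(arm_probs)  # running reward estimates
    total = 0.0
    for _ in range(steps):
        # epsilon-greedy arm choice
        if rng.random() < 0.1:
            arm = rng.randrange(len(arm_probs))
        else:
            arm = max(range(len(arm_probs)), key=lambda a: means[a])
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        total += reward
        # Only pay to observe the reward while the arm is still uncertain.
        if counts[arm] < max_queries:
            total -= query_cost
            counts[arm] += 1
            means[arm] += (reward - means[arm]) / counts[arm]
    return total / steps

# Average per-step return should approach the best arm's 0.8,
# less exploration and query costs.
print(active_bandit([0.2, 0.8]))
```

The interesting question, which the paper addresses properly and this sketch does not, is how to decide *when* a query is worth its cost rather than using a fixed budget.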

Brundage et al.'s The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation is a massively collaborative policy document on the threats posed by narrow AI. Aimed primarily at policymakers, it does a good job of introducing a wide variety of potential threats. However, it does not really cover existential risks at all, so I suspect the main benefit (from our point of view) is that of credibility-building for later. As I am in general sceptical of politicians' ability to help with AI safety, I relatively downweight this. But if you were concerned about bad actors using AI to attack, this is a good paper for you. Researchers from OpenAI and CSER were also named authors on the paper.

Bostrom’s The Vuln­er­a­ble World Hy­poth­e­sis in­tro­duces and dis­cusses the idea of wor­lds that will be de­stroyed ‘by de­fault’ when they reach a cer­tain level of tech­nolog­i­cal ad­vance­ment. He dis­t­in­guishes be­tween a va­ri­ety of differ­ent cases, like if it is easy for in­di­vi­d­u­als to de­velop weapons of mass de­struc­tion, with in­tu­itive names like ‘Type-2b vuln­er­a­bil­ity’, and es­sen­tially ar­gues for a global po­lice state (or similar) to re­duce the risk. It con­tained a bunch of in­ter­est­ing anec­dotes—for ex­am­ple I hadn’t re­al­ised what lit­tle in­fluence the sci­en­tists in the Man­hat­tan Pro­ject had on the even­tual poli­ti­cal uses of nukes. How­ever, given its ori­gin I ac­tu­ally found this pa­per didn’t add much new. The ar­eas where it could have added—for ex­am­ple, dis­cussing novel ways of us­ing cryp­tog­ra­phy to en­able surveillance with­out to­tal­i­tar­i­anism, dis­cussing Value Drift as a form of ex­is­ten­tial risk that might be im­pos­si­ble to solve with­out some­thing like this, or the risks of global surveillance it­self be­ing an ex­is­ten­tial risk (as iron­i­cally cov­ered in Ca­plan’s chap­ter of Global Catas­trophic Risks) - were left with only cur­sory dis­cus­sion. Ad­di­tion­ally, given the na­ture of gov­ern­ments, I do not think that sup­port­ing surveillance is a very ne­glected area.

Lewis et al.’s In­for­ma­tion Hazards in Biotech­nol­ogy dis­cusses is­sues around dan­ger­ous biol­ogy re­search. They provide an overview, in­clud­ing nu­mer­ous ex­am­ples of dan­ger­ous dis­cov­er­ies and the poli­cies that were used and their mer­its.

FHI re­searchers also ap­peared as co-au­thors on:


OpenPhil awarded FHI $13.4m ear­lier this year, spread out over 3 years, largely (but not ex­clu­sively) to fund AI safety re­search. Un­for­tu­nately the write-up I found on the web­site was even more min­i­mal than last year’s and so is un­likely to be of much as­sis­tance to po­ten­tial donors.

They are cur­rently in the pro­cess of mov­ing to a new larger office just west of Oxford.

FHI didn’t re­ply to my emails about dona­tions, and seem to be more limited by tal­ent (though there are prob­lems with this phrase) than by money, so the case for donat­ing here seems weaker. But it could be a good place to work!

If you wanted to donate to them, here is the rele­vant web page.

CHAI: The Cen­ter for Hu­man-Com­pat­i­ble AI

The Cen­ter for Hu­man-Com­pat­i­ble AI, founded by Stu­art Rus­sell in Berkeley, launched in Au­gust 2016. They have pro­duced a lot of in­ter­est­ing work, es­pe­cially fo­cused around in­verse re­in­force­ment learn­ing. They are sig­nifi­cantly more ap­plied and ML-fo­cused than MIRI or FHI (who are more ‘pure’) or CSER or CGRI (who are more strat­egy-fo­cused). They also do work on non-xrisk re­lated AI is­sues, which I gen­er­ally think are less im­por­tant, but which per­haps have solu­tions that can be re-used for AGI safety.


Shah’s AI Align­ment Newslet­ter is a weekly email of in­ter­est­ing new de­vel­op­ments rele­vant to AI Align­ment. It is amaz­ingly de­tailed. I strug­gle writ­ing this; I don’t know how he keeps on track of it all. Over­all I thought is an ex­cel­lent pro­ject.

Min­der­mann and Shah et al.‘s Ac­tive In­verse Re­ward De­sign turns the re­ward de­sign pro­cess into an in­ter­ac­tive one where the agent can ‘ask’ ques­tions. The idea, as I un­der­stand it, is that in­stead of the pro­gram­mers cre­at­ing a one-and-done train­ing re­ward func­tion which the agent learns about, in­stead the agent learns from the re­ward func­tion, is cog­nizant of its un­cer­tain­ties (In­verse Re­ward De­sign) and then queries the de­signer in such a way as to re­duce its un­cer­tainty. This seems like ex­plor­ing the de­sign­ers value space in the same way that an RL agent ex­plores its en­vi­ron­men­tal space. It seems like a very clever idea to me, though I would have liked to see more ex­am­ples in the pa­per.

Had­field-Menell and Had­field’s In­com­plete Con­tract­ing and AI al­ign­ment anal­o­gises the prob­lem of AI al­ign­ment with the eco­nomics liter­a­ture on in­cen­tive al­ign­ment (for hu­mans). The anal­y­sis is gen­er­ally good, and might lead to use­ful fol­lowups, though most of the readthroughs they drew from the prin­ci­pal-agent liter­a­ture seem like they are already ap­pre­ci­ated in the AI safety com­mu­nity. There was some some­what novel stuff about sig­nal­ling mod­els, and about Aghion & Ti­role’s 1997 pa­per on in­com­plete con­tract­ing that seemed in­ter­est­ing but I didn’t re­ally un­der­stand or have time to look into. It also did a nice job of point­ing out how much the hu­man prob­lem of in­com­plete con­tract­ing is solved by hu­mans be­ing em­bed­ded in a moral and so­cial or­der, and thus able and will­ing to do what ‘ob­vi­ously’ is ‘com­mon sense’ in un­clear situ­a­tions—a solu­tion which un­for­tu­nately seems no FAI-com­plete for our case. Re­searchers from OpenAI were also named au­thors on the pa­per.

Reddy et al.’s Where Do You Think You’re Go­ing?: In­fer­ring Beliefs about Dy­nam­ics from Be­havi­our at­tempt to in­fer val­ues from agents with in­cor­rect world-mod­els (pace Arm­strong and Min­der­mann’s Im­pos­si­bil­ity pa­per). They at­tempt to avoid the im­pos­si­bil­ity re­sult by first de­duc­ing agent be­liefs on a task with known goals, and then us­ing those be­liefs to in­fer goals on a new task. While there might not be any tasks with known hu­man goals, you might hope that there are differ­ent ar­eas where hu­man goals and be­liefs are more or less well un­der­stood, which could be util­ised by a re­lated ap­proach. As such I was quite pleased by this pa­per. They also have a n=12 user trial.

Tucker et al.’s In­verse Re­in­force­ment Learn­ing for Video Games ap­ply an IRL al­gorithm to an Atari game. Given that prov­ing that al­ign­ment-con­ge­nial­ity can be achieved with lit­tle loss of effi­cacy is im­por­tant for con­vinc­ing the field, and how much sta­tus is ap­plied to suc­cess at video games, I think this is a good area to pur­sue.

Filan’s Bot­tle Caps aren’t Op­ti­misers is a short blog post about how to iden­tify agents. It ar­gues this is im­por­tant be­cause we don’t want to ac­ci­den­tally cre­ate agents.

Milli et al.‘s Model Re­con­struc­tion from Model Ex­pla­na­tions show it is eas­ier to re­con­struct a model with queries about gra­di­ents than lev­els. Ask­ing “what are the par­tial deriva­tives at this point?” gives more in­for­ma­tion, and hence makes it eas­ier to re­verse-en­g­ineer the model, than ask­ing “what is the out­put at this point?“. The pa­per is framed as be­ing about the de­sire by some peo­ple to make AI mod­els ‘ac­countable’ by mak­ing them ‘ex­plain’ their de­ci­sions. I think this is not very im­por­tant, but it does seem to have some rele­vance to effi­ciently re­con­struct­ing la­tent *hu­man* value mod­els. Given that we can only query hu­mans so many times, it is im­por­tant to make effi­cient use of these queries. In­stead of ask­ing “Would you pull the lever?” many times, in­stead ask “Which fac­tors would make you more likely to pull the lever?“. In some sense ask­ing for par­tial deriva­tives seems like n queries (for an n-di­men­sional space), but given that many (most?) of these are likely to be lo­cally neg­ligible this might be an effi­cient way to help ex­tract hu­man prefer­ences.
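The gradients-versus-outputs gap is easiest to see with a hypothetical secret linear model (my own toy setup, not the paper's experiments): a single gradient query recovers all n coefficients at once, while output queries yield one scalar each, so you need n+1 of them:

```python
import numpy as np

# Hypothetical black box: a secret 5-dimensional linear model.
rng = np.random.default_rng(0)
w_secret = rng.normal(size=5)
b_secret = rng.normal()

def query_output(x):
    # One scalar of information per call.
    return w_secret @ x + b_secret

def query_gradient(x):
    # n numbers of information per call; for a linear model the gradient
    # is the constant coefficient vector, regardless of x.
    return w_secret.copy()

# Reconstruction via gradients: one gradient query gives the weights,
# plus one output query for the intercept.
w_hat = query_gradient(np.zeros(5))
b_hat = query_output(np.zeros(5))

# Reconstruction via outputs alone: n+1 well-chosen queries (origin plus
# each standard basis vector).
X = np.vstack([np.zeros(5)] + [np.eye(5)[i] for i in range(5)])
y = np.array([query_output(x) for x in X])
b_out = y[0]
w_out = y[1:] - b_out

assert np.allclose(w_hat, w_out) and np.isclose(b_hat, b_out)
```

Real models are of course nonlinear, so gradients only pin down local behaviour, but the information asymmetry per query is the same.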

Shah et al.'s Value Learning Sequence is a short sequence of blog posts outlining the specification problem. This is basically the problem of how to specify, even in theory, what we might want the AI to do. It is a nice introduction to many of the issues, like why imitation learning is not enough. Most of what has been published so far is not that new, though apparently it is still ongoing. Researchers from FHI also contributed posts.

Reddy et al.'s Shared Autonomy via Deep Reinforcement Learning describes an RL system that is intended to operate simultaneously with a human, preventing the human from taking very bad actions, despite not fully understanding the human's goals.

Hadfield-Menell et al.'s Legible Normativity for AI Alignment: The Value of Silly Rules builds an RL/game theory model for why we might want AI agents to obey and enforce even 'silly' rules. Basically the idea is that fidelity to, and enforcement of, silly rules provides credible signals that important rules will also be enforced - and their failure to be enforced is also useful information that the group is not strong enough to defend itself, so agents can quit earlier. I was a little confused by the conclusion, which suggested that agents would have to learn the difference between silly and non-silly rules. Wouldn't this undermine the signalling value?

CHAI researchers also appeared as co-authors on:


Based on detailed financials they shared with me I estimate they have around 2 years' worth of expenses in reserve (including grants promised but not yet disbursed), with a 2019 budget of around $3m.

If you wanted to donate to them, here is the relevant web page.

CSER: The Centre for the Study of Existential Risk

CSER is an existential risk focused group located in Cambridge. Like GCRI they do work on a variety of existential risks, with more of a focus on strategy than FHI, MIRI or CHAI.

Strategic work is inherently tied to outreach, like lobbying the UK government, which is hard to evaluate and assign responsibility for.

In the past I have criticised them for a lack of output. It is possible they had timing issues whereby a substantial amount of work was done in earlier years but only released more recently. In any case they have published more in 2018 than in previous years.

CSER's researchers seem to select a somewhat eclectic group of research topics, which I worry may reduce their effectiveness.


Liu and Price’s Ram­sey and Joyce on de­liber­a­tion and pre­dic­tion dis­cusses whether agents can have cre­dences on which de­ci­sion they’ll make while they’re in the pro­cess of de­cid­ing. This builds on their pre­vi­ous work in Heart of DARC­ness. The rele­vance to AI safety is pre­sum­ably via MIRI’s 5-10 prob­lem, and how to model agents who think about them­selves as part of the world, which I didn’t ap­pre­ci­ate when I read Heart of DARC­ness. In par­tic­u­lar, it dis­cusses agents with sub agents. Hav­ing said that, a lot of the pa­per seemed to rest on ter­minolog­i­cal dis­tinc­tions.

Cur­rie’s Ex­is­ten­tial Risk, Creativity & Well-Adapted Science ar­gues that the pro­fes­sion­al­i­sa­tion of sci­ence en­courages ‘cau­tious’ re­search, whereas Xrisk re­quires more cre­ativity. Essen­tially it ar­gues that many in­sti­tu­tional fac­tors push sci­en­tists to­wards ex­ploita­tion over ex­plo­ra­tion. In gen­eral I found this con­vinc­ing, though pace Cur­rie I think the small num­ber of Pro­fes­sor­ships com­pared to the num­ber of PhDs ac­tu­ally *en­courages* risk-tak­ing, as the value out-of-the-money call op­tions in­creases with volatility. I found his ar­gu­ment that Xrisk re­search need­ing un­usu­ally large amounts of cre­ativity not en­tirely con­vinc­ing—while I agree that novel threats like AI re­quire this, his ex­am­ple of so­lar flares seems like the sort of threat that could be ad­dressed in a dili­gent, rather than ge­nius, fash­ion. The pa­per has some per­tience for how we fund the Xrisk move­ment—in par­tic­u­lar I think it pulls in favour of many small grants to ‘cit­i­zen sci­en­tists’, rather than large grants to­wards or­gani­sa­tions.

Rees’s On The Fu­ture is a quick-read pop-sci book about the fu­ture of hu­man­ity. It in­cludes a brief dis­cus­sion of AI risk, and the sec­tion on the risks posed by high-en­ergy physics ex­per­i­ments was new to me. Many top­ics are dis­cussed only in a very cur­sory way how­ever, and I agree with Robin’s re­view—the book would have benefited from be­ing proofread by an economist, or sim­ply some­one who does not share the au­thor’s poli­ti­cal views.

Sha­har and Shapira’s Civ V AI Mod is a mod for Civ V (PC game) that adds su­per­in­tel­li­gence re­search into the game. This is the novel pub­lic­ity effort I al­luded to last year. It gen­er­ated some me­dia at­ten­tion, which seemed less bad than I ex­pected.

Cur­rie’s In­tro­duc­tion: Creativity, Con­ser­vatism & the So­cial Episte­mol­ogy of Science is a gen­eral in­tro­duc­tion to some is­sues about how risk-tak­ing (or not) in­sti­tu­tional sci­ence is.

Sha­har’s Mav­er­icks and Lot­ter­ies de­scribes var­i­ous ways in which al­lo­cat­ing re­search fund­ing by lot­tery, rather than through peer re­view, might be bet­ter. In par­tic­u­lar he ar­gues it would make in­sti­tu­tional sci­ence less con­ser­va­tive. I am scep­ti­cal of this, how­ever: the pro­pos­als still fea­ture fil­ter­ing pro­pos­als for be­ing “good enough”, and in equil­ibrium the stan­dard for be­ing “good enough” may just rise to where the peer re­view stan­dard was be­fore. Ad­di­tion­ally, I’m not sure I see a very strong link to ex­is­ten­tial risk—I guess OpenPhil could adopt ran­domi­sa­tion? Ex­pect­ing to re­form all of sci­ence fund­ing as a path to Xrisk re­duc­tion seems *very* in­di­rect.

Cur­rie’s Geo­eng­ineer­ing Ten­sions dis­cusses the pros and cons of geo­eng­ineer­ing, and the difficul­ties of do­ing ex­per­i­ments in the field. It dis­cusses two ten­sions: firstly the moral haz­ard risk, and sec­ondly the difficulty of do­ing the nec­es­sary ex­per­i­ments given the con­ser­vatism of in­sti­tu­tional sci­ence.

Adrian Cur­rie ed­ited a ‘spe­cial is­sue’, Fu­tures of Re­search in Catas­trophic and Ex­is­ten­tial Risk which I think is ba­si­cally a jour­nal of ar­ti­cles they in some sense com­mis­sioned or col­lected. Cur­rie and Ó hÉigeartaigh’s Work­ing to­gether to face hu­man­ity’s great­est threats: In­tro­duc­tion to The Fu­ture of Re­search on Catas­trophic and Ex­is­ten­tial Risk pro­vides an overview of the top­ics dis­cussed in the edi­tion. In gen­eral these are not so much con­cerned with ob­ject-level ex­is­ten­tial risks as with the meta-work of de­vel­op­ing the field. Un­for­tu­nately I have not had time to re­view all the ar­ti­cles it con­tains that were not au­thored by CSER re­searchers, though Jones et al.’s Rep­re­sen­ta­tion of fu­ture gen­er­a­tions in United King­dom policy-mak­ing which ad­vo­cated for a Par­li­a­men­tory com­mit­tee for fu­ture gen­er­a­tions, looks in­ter­est­ing, as one was in­deed sub­se­quently cre­ated. CSER claim, as seems plau­si­ble, that many of these pa­pers would not have coun­ter­fac­tu­ally ex­isted with­out CSER’s role as a cat­a­lyst. The top­ics dis­cussed in­clude a va­ri­ety of ex­is­ten­tial risks.

CSER researchers also appeared as co-authors on the following papers:


Based on some very rough numbers shared with me I estimate they have around 1.25 years' worth of expenses in reserve, with an annual budget of around $1m.

If you wanted to donate to them, here is the relevant web page.

GCRI: Global Catastrophic Risk Institute

The Global Catastrophic Risk Institute is a geographically dispersed group run by Seth Baum. They have produced work on a variety of existential risks, including AI and non-AI risks. Within AI they do a lot of work on the strategic landscape, and are very prolific.

They are a significantly smaller organisation than most of the others reviewed here, and in 2018 only one of their researchers (Seth) was full time. In the past I have been impressed with their high research output to budget ratio, and that continued this year. At the moment they seem to be somewhat subscale as an organisation - Seth seems to have been responsible for a large majority of their 2018 work - and are trying to grow.

Here is their annual write-up.

Adam Gleave, winner of the 2017 donor lottery, chose to give some money to GCRI; here is his thought process. He was impressed with their nuclear war work (which I'm not qualified to judge), and recommended GCRI focus more on quality and less on quantity, which seems plausible to me. GCRI tell me they are attentive to the issue and have made institutional changes to try to effect change.

GCRI also shared some other considerations with me that I cannot disclose, which may have affected my overall conclusion in addition to the considerations listed above.


Baum et al.'s Long-Term Trajectories of Human Civilization provides an analysis of possible ways the future might go. They discuss four broad trajectories: status quo, catastrophe, technological transformation, and astronomical colonisation. The scope is very broad but the analysis is still quite detailed; it reminds me of Superintelligence a bit. I think this paper has a strong claim to becoming the default reference for the topic. Researchers from FHI and FRI were also named authors on the paper.

Baum's Resilience to Global Catastrophe provides a brief introduction to ideas around resilience to disasters. The points it made seem true, but are obviously more applicable to non-AGI based threats that leave more scope for recovery.

Baum's Uncertain Human Consequences in Asteroid Risk Analysis and the Global Catastrophe Threshold discusses the consequences of asteroid impact. He reviews some of the literature, and discusses the idea of important thresholds for impact. One idea I hadn't come across before was the risk that an asteroid impact might be mistaken for a nuclear attack and cause a war - an interesting risk because all we need to do to avoid it is see the asteroid coming. However, I'm not an expert in the field, so I struggle to judge how novel or incremental the paper is.

Baum and Barrett's A Model for the Impacts of Nuclear War goes through the various impacts of nuclear war. It seems diligent and useful for future researchers or policymakers as a reference, though it is not my area of expertise.

Baum et al.'s A Model for the Probability of Nuclear War describes and decomposes the many possible routes to nuclear war. It also contains an interesting and extensive database of 'near-miss' scenarios.

Baum's Superintelligence Skepticism as a Political Tool discusses the risk of motivated scepticism about AI risks in order to protect funding for researchers and avoid regulation for corporations. This seems like a plausible risk, though we should be careful attributing disingenuous motivations to opponents - though it is certainly true that the AI safety community seems to be the target of more misinformation than you might expect. I think the paper might have benefitted from contrasting this with the risks of regulatory capture, which seem to operate in the other direction. Without doing so the political discussion was somewhat partisan - in both misinformation papers virtually all the example bad actors were right-wing groups, though perhaps most readers might find this agreeable!

Baum's Countering Superintelligence Misinformation discusses ways to improve debate around superintelligence through countering misinformation. These are mainly different forms of education, plus criticism of people for saying false things. I thought that the sections about ways of addressing misinformation once it exists were generally quite sophisticated, though I am sceptical of some of them as I don't think AI safety is very amenable to popular or state pressure.

Baum et al.'s Modelling and Interpreting Expert Disagreement about Artificial Intelligence attempts to put numbers on Bostrom's and Goertzel's credences for various AI risk factors and compare them. They try to break down the disagreement into three statements, interpret the two thinkers' statements as probabilities for those statements, and then assign their own probability for which thinker is correct. I'm a bit confused by the last step - it seems that by doing so you're basically ensuring the output will be equal to your own credence (by the law of total probability).
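The worry about the last step can be made concrete with a two-line calculation (all numbers here are invented for illustration, not taken from the paper):

```python
# If you assign probability p to "thinker B is correct" and 1-p to
# "thinker G is correct", the law of total probability makes the combined
# number a weighted average - which is just your own credence.
p_bostrom_correct = 0.6   # your credence that Bostrom is right (made up)
credence_bostrom = 0.9    # Bostrom's probability for some risk statement (made up)
credence_goertzel = 0.2   # Goertzel's probability for the same statement (made up)

combined = (p_bostrom_correct * credence_bostrom
            + (1 - p_bostrom_correct) * credence_goertzel)
print(combined)  # 0.62
```

Whatever the two experts say, the output is pinned down by the weight you yourself chose, which is the circularity complained about above.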

Umbrello and Baum's Evaluating Future nanotechnology: The Net Societal Impacts of Atomically Precise Manufacturing discusses the possible impacts of nanotechnology on society. Most of the discussion is quite broad, and could apply to economic growth in general. I was surprised how little value the authors assigned to greatly increasing the wealth of humanity.


GCRI spent around $140k in 2018, and are aiming to raise $1.5m to cover the next three years, for a target annual budget of ~$500k. This would allow them to employ their (3) key staff full time and have some money for additional hiring.

This large jump makes it a little hard to calculate runway in a comparable fashion to other organisations. They currently have around $280k, having recently received a $250k donation. But is it unfair to include this donation, given they received it subsequent to some other organisations telling me about their finances? All organisations should look progressively better funded as giving season goes on!

In any case it seems relatively clear that they have been, and probably continue to be, more funding constrained than most other organisations. The part-time nature of many of their staff makes their cost structure more variable and less fixed, suggesting this limited runway is less of an existential threat than it would be at some other organisations - they're not about to disband - though clearly this is still undesirable.

It seems credible that more funding would allow them to hire their researchers full time, which seems like a relatively low-risk method of scaling. If they can preserve their current productivity this could be valuable, though my impression is many small organisations become less productive as they scale, as high initial productivity may be due to founder effects that revert to the mean.

If you want to donate to GCRI, here is the relevant web page.

GPI: The Global Priorities Institute

The Global Priorities Institute is an academic research institute, led by Hilary Greaves, working on EA philosophy within Oxford. I think of their mission as attempting to provide a home so that high quality academics can have a respectable academic career while working on the most important issues. At the moment they mainly employ philosophers, but they tell me they are planning to hire more economists in the future.

They are relatively new but many of their employees are extremely impressive and their working papers (linked on the EA forum, not on their main website) seem very good to me. At this stage I wouldn't expect them to have reached run-rate productivity, so would expect this to increase in 2019.

They shared with me abstracts of a number of papers and so on they were working on, which seemed interesting and useful. As academic philosophy goes it is very tightly focused on important, decision-relevant issues - however it is not directly AI Safety work.

They allow their employees to spend 50% (!) of their time working on non-GPI projects, to help attract talent. However, the Trammell paper mentioned below was one of these projects, and I thought it was very good, so maybe in practice this does not represent a halving of their cost-effectiveness.

CEA are also spawning a new independent Forethought Foundation for Global Priorities Research, which seems to be very similar to GPI except not part of Oxford.


Mogensen's Long-termism for risk averse altruists argues that risk aversion should make altruists *more*, not *less*, interested in preventing existential risks. This is basically for the same reason that risk aversion causes people to buy insurance. You should be risk averse in outcomes, not in the direct impacts of your actions. This argument is totally obvious now but I'd never heard anyone mention it until two months ago, which suggests it is real progress. Overall I thought this was an excellent paper.
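The insurance analogy can be checked with a toy expected-utility calculation (my numbers, not Mogensen's): an agent with concave utility prefers a certain small mitigation cost to a small chance of near-total loss, even though the unmitigated gamble has higher expected wealth:

```python
import math

# Illustrative insurance-style calculation; all figures are invented.
def utility(wealth):
    return math.log(wealth)   # concave utility => risk aversion

p_catastrophe = 0.01
wealth = 100.0
premium = 2.0                 # certain cost of mitigation/insurance

# Without mitigation: keep everything, except a 1% chance of losing
# almost everything (wealth collapses to 1).
eu_unmitigated = ((1 - p_catastrophe) * utility(wealth)
                  + p_catastrophe * utility(1.0))

# With mitigation: pay the premium for certain, catastrophe averted.
eu_mitigated = utility(wealth - premium)

# Expected *wealth* is higher without mitigation (99.01 vs 98), yet
# expected *utility* favours paying to mitigate - risk aversion at work.
assert eu_mitigated > eu_unmitigated
```

Swapping catastrophic wealth loss for existential catastrophe gives the paper's point: risk aversion over outcomes pushes towards, not away from, paying to reduce tail risk.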

Trammell's Fixed-Point Solutions to the Regress Problem in Normative Uncertainty argues that we can avoid infinite metaethical regress through fixed-point results. This seems like an alternative to Will's work on Moral Uncertainty in some senses. Basically the idea is that if the 'choiceworthiness' of different theories is cardinal at every level in their hierarchy, we can prove a unique fixed point. This is significant to the extent we think that AIs are going to have to learn how to do moral reasoning, perhaps without the aid of humans' convenient "just don't think about it" hack. It's also in some ways a nice response to this SlateStarCodex article.


They have a 2019 budget of around $1.5m, and shared with me a number of examples of types of people they might like to hire in the future, with additional funding.

Apparently Oxford University rules mean that all their hires have to be pre-funded for the entire duration of their (4-5 year) contract.

If you wanted to donate to GPI, here is the link.

ANU: Australian National University

Australian National University has produced a surprisingly large number of relevant papers and researchers over time.


Everitt et al.'s AGI Safety Literature Review - I was glad to see someone else attempting to do the same thing I have! Readers of this article might enjoy reading it, as it has much the same purpose. For academics new to the field it could function as a useful overview, introducing but not really arguing for many important points. Its main value probably comes from one-sentence descriptions of a large number of papers, which could be a useful launching point for research. Literature reviews can also help raise the status of the field. However, it is less likely to add much new insight for those familiar with the field, as it doesn't really engage with any of the arguments in depth.

Everitt et al.'s Reinforcement Learning with a Corrupted Reward Channel examines how noisy reward inputs can drastically degrade reinforcement learner performance, and some possible solutions. Unsurprisingly, CIRL features as a possible solution. It's also nice to see ANU-Deepmind collaboration. This paper was actually written last year, but I mention it here for completeness as I think I missed it previously; I haven't reviewed it in depth. Researchers from Deepmind were also named authors on the paper.

EDIT: one paper redacted on author request, pending improved second version.

ANU researchers were also named as co-authors on the following papers:


Given their position as part of ANU I suspect it would be difficult for individual donations to appreciably support their work. Additionally, one of their top researchers, Tom Everitt, has now joined Deepmind.

BERI: The Berkeley Existential Risk Initiative

EDIT: After publishing, the Berkeley Existential Risk Initiative requested I remove this section. As a professional courtesy I am reluctantly complying, and rescind any suggestion that BERI may be a good place to donate. I apologize for any inconvenience caused to readers.


Ought

Ought is a San Francisco based non-profit researching the viability of automating human-like cognition. The focus is on approaches that are "scalable" in the sense that better ML or more compute makes them increasingly helpful for supporting and automating deliberation without requiring additional data generated by humans. The idea, as with amplification, is that we can achieve safety guarantees by making agents that reason in individual explicit and comprehensible steps, iterated many times over, as opposed to the dominant, more black-box, approaches of mainstream ML. Ought does research on computing paradigms that support this approach and experiments with human participants to determine whether this class of approaches is promising. But I admit I understand what they do less well than with other groups.

Their work doesn't fit neatly into the model of the above groups - they're not focused on publishing research papers, at least at the moment. Partly as a result of this, and as a new group, I feel like I don't have quite as good a grasp on exactly their status as with other groups - which is of course primarily a fact about my epistemic state, rather than them.


Stuhlmüller's Factored Cognition outlines the ideas behind their implementation of Christiano-style amplification. They built a web app where people take questions and recursively break them down into simpler questions that can be solved in isolation. At the moment this is done by humans, to try to test whether this sort of decomposition and answering could work. It seems like they have put a fair bit of thought into the ontology.

Evans et al.'s Predicting Human Deliberative Judgments with Machine Learning attempts to make progress on building ML systems that remain well-calibrated (i.e. the system "knows what it knows") in AI-complete settings (i.e. in settings where current ML algorithms can't possibly do well on every possible input). To do this they collect a dataset of human judgements on complex issues (weird fermi estimations and political fact-checking) and then look at how people's estimates for these questions changed as they were allowed more time. This is important because someone's rapid judgement of an issue is evidence as to what their eventual slow judgement will be. In some cases you might be able to predict that there is no need to give the human more time; their 30 second answer is probably good enough. This could be useful if you are trying to produce a large training set of judgements about complex topics. I also admire the authors' honesty that the results of their ML system were less good than they expected. They also discussed problems with their dataset; this was definitely my experience when trying to use the site. Researchers from FHI were also named authors on the paper.


Based on numbers they shared with me I estimate they have around half a year's worth of expenses in reserve, with a projected 2019 budget of around $1m.

Additional funding sounds like it would go towards reserves and additional researchers and programmers, including a web developer, probably mainly continuing work on Factored Cognition.

Ought asked me to point out that they have applied for an OpenPhil grant renewal but expect to still have room for more funding afterwards.

AI Impacts

AI Impacts is a small Berkeley-based group that does high-level strategy work, especially on AI timelines, somewhat associated with MIRI.

Adam Gleave, winner of the 2017 donor lottery, chose to give some money to AI Impacts; here is his thought process. He was impressed with their work, although sceptical of their ability to scale.


Carey wrote Interpreting AI Compute Trends, which argues that cutting-edge ML research projects have been getting dramatically more expensive - so much so that the trend will have to stop, suggesting that (one driver of) AI progress will slow down over the next 3.5-10 years. Additionally, he points out that we are also nearing the processing capacity (though not scanning capacity) required to model human brains. (Note that this was a guest post by Ryan, who works for FHI.)
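A back-of-the-envelope version of the spending argument (all figures below are my illustrative assumptions, not Carey's): if headline project costs double every ~3.5 months, even a thousandfold increase in budgets buys only about three more years of trend:

```python
# Illustrative extrapolation of an exponential cost trend to a spending
# ceiling; the doubling time, starting cost, and ceiling are all assumptions.
doubling_months = 3.5
cost = 10e6         # assume a ~$10m headline project today
ceiling = 10e9      # assume a ~$10b plausible ceiling for a single project

months = 0
while cost < ceiling:
    cost *= 2                 # one doubling period
    months += doubling_months

print(months / 12)  # years until the trend must break (~2.9 here)
```

A 1000x budget increase is about 10 doublings (2^10 = 1024), i.e. 35 months, which is why even generous ceilings put the break within the single-digit-years range cited above.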

Grace’s Like­li­hood of dis­con­tin­u­ous progress around the de­vel­op­ment of AGI dis­cusses a 11 differ­ent ar­gu­ments for AGI to have a dis­con­tin­u­ous im­pact, and finds them gen­er­ally un­con­vinc­ing. This is im­por­tant from a strat­egy point of view be­cause it sug­gests we should have more time to see AGI com­ing, po­ten­tially also mak­ing it clear to scep­tics. Over­all I found the ar­ti­cle clear and gen­er­ally con­vinc­ing.

McCaslin’s Trans­mit­ting fibers in the brain: To­tal length and dis­tri­bu­tion of lengths analy­ses how much neu­ral fibre there is in the hu­man brain, and the dis­tri­bu­tion of long vs short. My un­der­stand­ing is this is re­lated to how many neu­rons in hu­man brains are ded­i­cated to mov­ing in­for­ma­tion around, rather than com­pu­ta­tion, which might be im­por­tant be­cause it is an ad­di­tional form of ca­pac­ity that is of­ten over­looked when peo­ple talk about FLOPS and MIPS, and so might af­fect your es­ti­mates for when we have enough hard­ware ca­pac­ity for neu­ro­mor­phic AI. How­ever, I might be mi­s­un­der­stand­ing, as I found the mo­ti­va­tion a lit­tle un­clear.

Grace’s Hu­man Level Hard­ware Timeline at­tempts to es­ti­mate how long un­til we have hu­man-level hard­ware at hu­man cost. Largely based on ear­lier work, they es­ti­mate “a 30% chance we are already past hu­man-level hard­ware (at hu­man cost), a 45% chance it oc­curs by 2040, and a 25% chance it oc­curs later.”

They have gath­ered a col­lec­tion of ex­am­ples of dis­con­tin­u­ous progress in his­tory, to at­tempt to pro­duce some­thing of a refer­ence class for how likely this is with AGI—see for ex­am­ple the Burj Khal­ifa, the Eiffel Tower, rock­ets. It would be nice to see how many pos­si­ble ex­am­ples they in­ves­ti­gated and found were not dis­con­tin­u­ous.


According to numbers they shared with me, AI Impacts spent around $90k in 2018 on two part-time employees. In 2019 they plan to significantly increase spending, to ~$360k, and hire multiple new workers. They have just over $400k in current funding, suggesting a bit over a year of runway at this elevated rate, or many years at their 2018 rate.

Similar to GCRI, there is some risk that small groups have high productivity due to founder effects, which might revert to the mean as they scale.

MIRI seems to ad­minister their fi­nances on their be­half; dona­tions can be made here.

OpenAI

OpenAI is a San Fran­cisco based AGI startup char­ity, with a large fo­cus on safety. It was founded in 2015 with money largely from Elon Musk.


Christiano et al.'s Supervising Strong Learners by Amplifying Weak Experts lays out Paul's amplification ideas, or at least one implementation of them, in a paper. Basically the idea is that there are many problems where it is too expensive to produce training signals directly, so we will do so indirectly. We do this by iteratively breaking up the task into sub-tasks, using the agent to help with each sub-task, and then training the agent on the human's overall judgement, aided by the agent's output on the subtasks. Hopefully as the agent becomes stronger it also gets better at the subtasks, improving the training set further. We also train a second agent to predict good subtasks to go for, and to predict how the human will use the outputs from the subtasks. I'm not sure I understand why we don't train the agent on its performance on the subtasks (except that it is expensive to evaluate there?). I think the paper might have been a bit clearer if it had included an example of the algorithm being used in practice with a human in the loop, rather than purely algorithmic examples; hopefully this will come in the future. Nonetheless this was clearly a very important paper, and overall I thought it was excellent.
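For intuition, here is a toy sketch of the decompose-and-delegate loop described above (my own illustration on a trivially easy task, not the paper's experiment): the "human" can only combine two subtask answers, but by recursively delegating subtasks to copies of the agent it can answer a harder question, here summing a whole list.

```python
# Toy amplification sketch: a weak "human" plus recursive delegation.
# All names here are my own illustrative choices, not from the paper.

def human(x, y):
    """The cheap, trusted operation the overseer can do directly."""
    return x + y

def amplified_agent(task):
    """Answer `task` (a list of numbers to sum) by decomposition:
    split into subtasks, delegate each to a copy of the agent, and let
    the "human" combine the subtask answers."""
    if len(task) == 1:
        return task[0]
    mid = len(task) // 2
    left = amplified_agent(task[:mid])
    right = amplified_agent(task[mid:])
    return human(left, right)

print(amplified_agent([1, 2, 3, 4, 5]))  # 15
```

In the actual scheme the subtask answers come from a learned model rather than exact recursion, and that model is then trained to imitate the amplified process, which is where the training-signal leverage comes from.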

Irving, Christiano and Amodei's AI Safety via Debate explores adversarial 'debate' between two or more advanced agents, competing to be judged the most helpful by a trusted but limited agent. This is very clever. It's an extension of the grand Christiano project of trying to devise ways of amplifying simple, trusted agents (like humans) into more powerful ones: designing a system that takes advantage of our trust in the weak agent to ensure compliance in the stronger. Imagine we basically have a courtroom situation, where two highly advanced legal teams, with vast amounts of legal and forensic expertise, try to convince a simple but trusted agent (the jury) that they're in the right. Each side is trying to make its 'arguments' as simple as possible, and point out the flaws in the other's. As long as refuting lies is easy relative to lying, honesty should be the best strategy… so agents constrained in this way will be honest, and not even try dishonesty! Like a courtroom where both legal teams decide to represent the same side. The paper contains some nice examples, including AlphaGo as an analogy, a neat MNIST simulation, and an interactive website. Overall I thought this was an excellent paper.

The OpenAI Charter is their statement of values with regard to AGI research. It seems to contain the things you would want it to: benefit of all, fiduciary duty to humanity. Most interestingly, it also includes "if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be 'a better-than-even chance of success in the next two years'", a clause which seems very sensible. Finally, it also notes that, like MIRI, they anticipate reducing their conventional publishing.

Amodei and Her­nan­dez’s AI and Com­pute at­tempts to quan­tify the com­put­ing power used for re­cent ma­jor AI de­vel­op­ments like ResNets and AlphaGo. They find it has been dou­bling ap­prox­i­mately ev­ery 3-4 months, dra­mat­i­cally faster than you would ex­pect from Moore’s law – es­pe­cially if you had been read­ing ar­ti­cles about the end of Moore’s law! This is due to a com­bi­na­tion of the move to spe­cial­ist hard­ware (ini­tially GPUs, and now AI ASICs) and com­pa­nies sim­ply spend­ing a lot more dol­lars. This is not a the­ory pa­per, but has di­rect rele­vance for timeline pre­dic­tion and strat­egy that de­pends on whether or not there will be a hard­ware over­hang.
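As a quick sanity check on what such a doubling time implies (my own arithmetic, taking 3.5 months as a midpoint of the stated range, not code from the paper):

```python
# Illustrative arithmetic for the compute trend described above.

def growth_factor(years, doubling_months=3.5):
    """Multiplicative growth in training compute after `years`,
    given the assumed doubling time in months."""
    return 2 ** (years * 12 / doubling_months)

# Roughly an order of magnitude per year, far faster than Moore's
# law's ~2x every 18-24 months.
print(round(growth_factor(1)))  # ~11x per year
```

Compounding at that pace for even a few years quickly outruns hardware price-performance improvements, which is why the trend has to be driven by specialist chips and sheer spending.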

Christiano's Universality and Security Amplification describes how Amplification hopes to enhance security by protecting against adversarial inputs (attacks). The hope is that the process of breaking down queries into sub-queries that is at the heart of the Amplification idea can leave us with queries of sufficiently low complexity that they are human-secure. I'm not sure I really understood what this post adds to others in Paul's arsenal, mainly because I haven't been following these as closely as perhaps I should have.

Re­searchers from OpenAI were also named as coau­thors on:


Given the strong funding situation at OpenAI, as well as their safety team's position within the larger organisation, I think it would be difficult for individual donations to appreciably support their work. However it could be an excellent place to apply to work.

Google Deepmind

As well as be­ing ar­guably the most ad­vanced AI re­search shop in the world, Google’s Lon­don-based Deep­mind has a very so­phis­ti­cated AI Safety team.


Leike et al.'s AI Safety Gridworlds introduces an open-source set of environments for testing ML algorithms for safety. Progress in ML has been considerably aided by the availability of common toolsets like MNIST or the Atari games. Here the Deepmind safety team have produced a set of environments designed to test algorithms' ability to avoid a number of safety-related failure modes, like Interruptibility, Side Effects, Distributional Shift and Reward Hacking. This hopefully not only makes such testing more accessible, it also makes these issues more concrete. Ideally it would shift the Overton window: maybe one day it will be weird to read an ML paper that does not contain a section describing performance on the Deepmind Gridworlds. This is clearly not a panacea; it is easy to 'fake' passing the tests by giving the agent information it shouldn't have, it is better to prove safety results than tack them on, and there is always a risk of Goodharting. But this seems to me to be clearly a significant step forward. My enthusiasm is only slightly tempered by the fact that only one paper published in the following year made use of the Gridworlds suite, though Alex Turner's excellent post on impact measures did as well. Overall I thought this was an excellent paper. Researchers from ANU were also named authors on the paper.

Krakovna’s Speci­fi­ca­tion Gam­ing Ex­am­ples in AI pro­vides a col­lec­tion of differ­ent cases where agents have op­ti­mised their re­ward func­tion in sur­pris­ing/​un­de­sir­able fash­ion. The spread­sheet of 45 ex­am­ples might have some re­search value, but my guess is most of the value is as ev­i­dence of the prob­lem.

Krakovna et al.'s Measuring and avoiding side effects using relative reachability invents a new way of defining 'impact', which is important if you want to minimise it, based on how the achievability of other states is affected. Essentially it takes the set of possible states and then punishes the agent for reducing the attainability of those states. The post also includes a few simulations in the AI Safety Gridworlds.
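My reading of the proposal can be sketched as follows (an illustrative simplification of the post, not the authors' code): compare how reachable each state is after the agent acts against a no-op baseline, and penalise only reductions.

```python
# Minimal relative-reachability sketch; state names and numbers are
# my own illustrative placeholders.

def side_effect_penalty(reach_baseline, reach_actual):
    """Each argument maps state -> reachability in [0, 1] (e.g. the
    discounted probability of ever reaching that state). Only
    reductions relative to the no-op baseline are penalised."""
    return sum(
        max(0.0, reach_baseline[s] - reach_actual.get(s, 0.0))
        for s in reach_baseline
    ) / len(reach_baseline)

# Breaking a vase makes the "vase intact" states unreachable:
baseline = {"vase_intact": 1.0, "vase_moved": 0.9, "vase_broken": 0.2}
after_break = {"vase_intact": 0.0, "vase_moved": 0.0, "vase_broken": 1.0}
print(side_effect_penalty(baseline, after_break))  # (1.0 + 0.9 + 0.0) / 3
```

Because only reductions count, reversible actions are cheap while irreversible ones (breaking the vase) incur a large penalty, which is the behaviour you want from an impact measure.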

Leike et al.'s Scalable agent alignment via reward modeling: a research direction outlines the Deepmind agenda for bootstrapping human evaluations to provide feedback for RL agents. Similar in some ways to the Christiano project, the idea is that your main RL agent simultaneously learns its reward function and about the world. The human's ability to provide good reward feedback is improved by training smaller agents who help him judge which rewards to provide. The paper goes into a number of familiar potential problems, and potential avenues of attack on those issues. I think the news here is more that the Deepmind (Safety) team is focusing on this, rather than the core ideas themselves. The paper also reviews a lot of related work.

Gasparik et al.'s Safety-first AI for autonomous data centre cooling and industrial control describes the safety measures Google put in place to ensure their ML-driven datacenter cooling system didn't go wrong.

Ibarz et al.’s Re­ward Learn­ing from Hu­man Prefer­ences and De­mon­stra­tions in Atari com­bines RL and IRL as two differ­ent sources of in­for­ma­tion for the agent. If you think both ideas have some value, it makes sense that com­bin­ing them fur­ther im­proves perfor­mance.

Leibo et al.'s Psychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents creates an environment for comparing humans and RL agents on the same tasks. Given that the goal of getting AI agents to behave in ways humans approve of is closely related to the goal of making them behave like humans, this seems like a potentially useful tool.

Ortega et al.'s Building safe artificial intelligence: specification, robustness and assurance provides an introduction to various problems in AI Safety. The content is unlikely to be new to readers here; it is significant insomuch as it represents a summary of the (worthwhile) priorities of Deepmind('s safety team). They decompose the issue into specification, robustness and assurance.

Researchers from Deepmind were also named as coauthors on the following papers:


Be­ing part of Google, I think it would be difficult for in­di­vi­d­ual donors to di­rectly sup­port their work. How­ever it could be an ex­cel­lent place to ap­ply to work.

Google Brain

Google Brain is Google’s other highly suc­cess­ful AI re­search group.


Ku­rakin et al. wrote Ad­ver­sar­ial At­tacks and Defences Com­pe­ti­tion, which sum­marises the NIPS 2017 com­pe­ti­tion on Ad­ver­sar­ial At­tacks, in­clud­ing many of the strate­gies used. If you’re not fa­mil­iar with the area this could be a good in­tro­duc­tion.

Brown and Ols­son wrote In­tro­duc­ing the Un­re­stricted Ad­ver­sar­ial Ex­am­ples Challenge, which launches a new 2-sided challenge, for de­sign­ing sys­tems re­sis­tant to ad­ver­sar­ial ex­am­ples, and then find­ing ad­ver­sar­ial ex­am­ples. The differ­ence here is in al­low­ing a much broader class of ad­ver­sar­ial ex­am­ples, rather than just small per­tur­ba­tions. This seems like a sig­nifi­cantly more im­por­tant class, so it is good they are at­tempt­ing to move the field in this di­rec­tion.

Gilmer et al. wrote Motivating the Rules of the Game for Adversarial Example Research, which argues that the adversarial example literature has overly focused on a narrow class of imperceptibly-changed images. In most realistic cases the adversary has a much wider scope of possible attacks. Importantly for us, the general question is also more similar to the sorts of distributional shift issues that are likely to arise with AGI. To the extent this paper helps push researchers towards more relevant research it seems quite good.


Be­ing part of Google, I think it would be difficult for in­di­vi­d­ual donors to di­rectly sup­port their work. How­ever it could be an ex­cel­lent place to ap­ply to work.

EAF /​ FRI: The Effec­tive Altru­ism Foun­da­tion /​ Foun­da­tional Re­search Institute

EAF is a German/Swiss effective altruist group, led by Jonas Vollmer and Stefan Torges, that undertakes a number of activities. They do research on a number of fundamental long-term issues, many related to how to reduce the risks of very bad AGI outcomes, published through the Foundational Research Institute (FRI). Their website suggests that FRI and WAS (Wild Animal Suffering) are two equal sub-organisations, but apparently this is not the case: essentially everything EAF does is FRI now, and they just let WAS use their legal entity and donation interface. EAF also have Raising for Effective Giving, which encourages professional poker players to donate to effective charities, including MIRI.

In the past they have been rather negative utilitarian, which I have always viewed as an absurd and potentially dangerous doctrine; if you are interested, I recommend Toby Ord's piece on the subject. However, they have produced research on why it is good to cooperate with other value systems, making me somewhat less worried.


Oester­held’s Ap­proval-di­rected agency and the de­ci­sion the­ory of New­comb-like prob­lems analy­ses which de­ci­sion the­o­ries are in­stan­ti­ated by RL agents. The pa­per analy­ses the struc­ture of RL agents of var­i­ous kinds and maps them math­e­mat­i­cally to ei­ther Ev­i­den­tial or Causal De­ci­sion the­ory. Given how much we dis­cuss de­ci­sion the­ory it is sur­pris­ing in ret­ro­spect that no-one (to my knowl­edge) had pre­vi­ously looked to see which ones our RL agents were ac­tu­ally in­stan­ti­at­ing. As such I found this an in­ter­est­ing pa­per.

Baumann's Using Surrogate Goals to Deflect Threats discusses using a decoy utility function component to protect against threats. The idea is that agents run the risk of counter-optimisation at the hands of an extortionist, but this could be protected against by redefining their utility function to add a pointless secondary goal (like avoiding the creation of a platinum sphere of certain dimensions). An opponent would find it easier to extort the agent by negatively optimising the surrogate goal. This doesn't prevent the agent from giving in to the threats, but it does reduce the damage if the attacker has to follow through on their threat. The paper discusses many additional details, including the multi-agent case, and the interaction between this and other defence mechanisms. My understanding is that they and Eliezer both (independently?) came up with this idea. One thing I didn't quite understand is the notion of attacker-hostile surrogates; surely they would just be ignored?

Sotala and Gloor's Superintelligence as a Cause or Cure for Risks of Astronomical Suffering is a review article of the various ways the future might contain a lot of suffering. It does a good job of going through possibilities, though I felt it was overly focused on suffering as a bad outcome; there are many other bad things too!

Sotala's Shaping economic incentives for collaborative AGI argues that encouraging collaborative norms in narrow AI will encourage those norms for AGI in the future, due to cultural lock-in. Unfortunately it is not clear how to go about doing this. Researchers from FHI were also named authors on the paper.


Based on their blog post, they cur­rently have around a year and a half’s worth of re­serves, with a 2019 bud­get of $925,000.

As EAF have in the past worked on a va­ri­ety of cause ar­eas, donors might worry about fun­gi­bil­ity. EAF tell me that they are now ba­si­cally en­tirely fo­cused on AI re­lated work, and that WAS re­search is funded by speci­fi­cally al­lo­cated dona­tions, which would im­ply this is not a con­cern, though I note that sev­eral WAS peo­ple are still listed on their team page.

Read­ers who want to donate to EAF/​FRI can do so here.

Fore­sight Institute

The Fore­sight In­sti­tute is a Palo-Alto based group fo­cus­ing on AI and nan­otech­nol­ogy. Origi­nally founded in 1986 (!), they seem to have been some­what re-in­vi­go­rated re­cently by Alli­son Duettmann. Un­for­tu­nately I haven’t had time to re­view them in de­tail.

A large part of their ac­tivity seems to be in or­ganis­ing ‘sa­lon’ dis­cus­sion /​ work­shop events.

Duettmann et al.’s Ar­tifi­cial Gen­eral In­tel­li­gence: Co­or­di­na­tion and Great Pow­ers sum­marises the dis­cus­sion at the 2018 Fore­sight In­sti­tute Strat­egy Meet­ing on AGI. Re­searchers from FHI and FLI were also named au­thors on the pa­per.

Read­ers who want to donate to Fore­sight can do so here.

FLI: The Fu­ture of Life Institute

The Fu­ture of Life In­sti­tute was founded to do out­reach, in­clud­ing run the Puerto Rico con­fer­ence. Elon Musk donated $10m for the or­gani­sa­tion to re-dis­tribute; given the size of the dona­tion it has right­fully come to some­what dom­i­nate their ac­tivity.

In 2018 they ran a sec­ond grant­mak­ing round, giv­ing $2m split be­tween 10 differ­ent peo­ple. Th­ese grants were more fo­cused on AGI than the pre­vi­ous round, which in­cluded a large num­ber of nar­row AI pro­jects. In gen­eral the grants went to uni­ver­sity pro­fes­sors. They have now awarded most of the $10m.

Un­for­tu­nately I haven’t had time to re­view them in de­tail.

Read­ers who want to donate to FLI can do so here.

Me­dian Group

The Me­dian Group is a new group for re­search on global catas­trophic risks, with re­searchers from MIRI, OpenPhil and Numerai. As a new group they lack the sort of track record that would make them eas­ily amenable to anal­y­sis. Cur­rent pro­jects they’re work­ing on in­clude AI timelines, for­est fires, and cli­mate change im­pacts on geopoli­tics.

I don’t know that much about them be­cause the con­tact email listed on the web­site does not work.


Taylor et al. wrote Insight-based AI timeline model, which models the time to AGI in terms of research insights. They first produced a list of important insights that have (plausibly) contributed towards AGI. Surprisingly, they find there has been a roughly constant rate of insight production since 1945. They then model time-to-AGI using a Pareto distribution for the number of insights required. This is a novel (to me, at least) method that I liked.
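A hedged Monte Carlo sketch of how I understand the model's structure (all parameter values below are illustrative placeholders, not theirs):

```python
# Toy insight-based timeline model: insights arrive at a constant rate,
# and the total number required for AGI is Pareto-distributed,
# conditioned on exceeding what has already been produced.
import random

def sample_years_to_agi(insights_so_far=100, rate_per_year=1.4,
                        pareto_alpha=1.0):
    """Draw one sample of years until AGI, from now."""
    # random.paretovariate returns a value >= 1, so `required` is
    # always at least the number of insights already produced.
    required = insights_so_far * random.paretovariate(pareto_alpha)
    remaining = required - insights_so_far
    return remaining / rate_per_year

random.seed(0)
draws = sorted(sample_years_to_agi() for _ in range(10001))
print("median years to AGI in this toy model:", round(draws[5000], 1))
```

The heavy Pareto tail is the interesting feature: it puts non-trivial probability mass on both near-term arrival and on AGI being centuries away.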

Con­ver­gence Analysis

Convergence Analysis is a new group, led by Justin Shovelain, aiming to do strategic work. They are too new to have any track record.

Other Research

I would like to emphasise that there is a lot of research I didn't have time to review, especially in this section, as I focused on reading organisation-donation-relevant pieces. For example, Kosoy's The Learning-Theoretic AI Alignment Research Agenda seems like a worthy contribution.


Lipton and Steinhardt's Troubling Trends in Machine Learning Scholarship critiques a number of developments in the ML literature that they think are bad. Basically, they argue that a lot of papers blur the line between explanation and speculation, obscure the true source of improvement in their papers (often just hyper-parameter tuning), use maths to impress rather than clarify, and use common English words for complex terms, thereby smuggling in unnecessary connotations. It's unclear to me, however, to what extent these issues retard progress on safety vs capabilities. I guess to the extent that safety requires clear understanding, whereas capabilities can be achieved in a messier fashion, these trends are bad and should be pushed back on.

Jilk’s Con­cep­tual-Lin­guis­tic Su­per­in­tel­li­gence dis­cusses the need for AGI to have a con­cep­tual-lin­guis­tic fa­cil­ity. Con­tra re­cent AI de­vel­op­ments—e.g. AlphaZero does not have a lin­guis­tic abil­ity—he ar­gues that AIs will need lin­guis­tic abil­ity to un­der­stand much of the hu­man world. He also dis­cusses the difficul­ties that Rice’s the­o­rem im­poses on AI self-im­prove­ment, though this has been well dis­cussed be­fore.

Cave and Ó hÉigeartaigh's An AI Race for Strategic Advantage: Rhetoric and Risks argues that framing AI development as a 'race', or an 'arms race', is bad. Much of their reasoning is not new, and was previously published by e.g. Baum's On the Promotion of Safe and Socially Beneficial Artificial Intelligence. Instead I think of the target audience here as being policymakers and other AI researchers: this is a paper aiming to influence global strategy, not EA research strategy. Having said that, their discussion of why we should actively confront AI race rhetoric, rather than trying to simply avoid it, was novel, at least to me. It also apparently won best paper at the AAAI/ACM conference on Artificial Intelligence, Ethics, and Society. Researchers from CSER were also named authors on the paper.

Liu et al.'s A Survey on Security Threats and Defensive Techniques of Machine Learning: A Data Driven View reviews security threats to contemporary ML systems. This basically addresses the concerns raised in Amodei et al.'s Concrete Problems about distributional shift between training and test data, and how to ensure robustness.

Sarma and Hay's Robust Computer Algebra, Theorem Proving, and Oracle AI discusses computer algebra systems as a potentially important class of Oracles, and tries to provide concrete safety-related work that could be done. Their overview of Question-Answering Systems, Computer Algebra Systems and Interactive Theorem Provers was interesting to me, as I didn't have much familiarity with them. They argue that CAS use heuristics that sometimes lead to invalid inferences, while ITPs are very inefficient, and suggest projects to help integrate the two, to produce more reliable maths oracles. I think of this paper as being a bit like a specialised version of Amodei et al.'s Concrete Problems, but the connection between the projects here and the end goal of FAI is a little harder for me to grasp. Additionally, the paper seems to have been in development since 2013?

Manheim and Garrabrant's Categorizing Variants of Goodhart's Law classifies different types of situations where a proxy measure ceases to be a good proxy when you start relying on it. This is clearly an important topic for AI safety, insomuch as we are hoping to design AIs that will not fall victim to it. The paper provides a nice disambiguation of different kinds of situation, bringing conceptual clarity even if it's not a deep mathematical result. Researchers from MIRI were also named authors on the paper.
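As an illustration of one of their variants, 'regressional Goodhart', here is a small simulation (my own, not the paper's): selecting hard on a noisy proxy systematically overstates the true value of what you select.

```python
# Regressional Goodhart demo: proxy = true value + independent noise.
# Selecting the top 1% by proxy yields items whose true value is
# systematically lower than their proxy score suggests.
import random

random.seed(1)
true_vals = [random.gauss(0, 1) for _ in range(100000)]
proxies = [v + random.gauss(0, 1) for v in true_vals]

# Optimise hard: keep the top 1% by proxy score.
top = sorted(range(len(proxies)), key=lambda i: proxies[i])[-1000:]
mean_proxy = sum(proxies[i] for i in top) / 1000
mean_true = sum(true_vals[i] for i in top) / 1000

print(f"mean proxy among selected: {mean_proxy:.2f}")
print(f"mean true value among selected: {mean_true:.2f}")  # roughly half
```

With equal variances for value and noise, the expected true value given a proxy score is exactly half the proxy, so the harder you select, the larger the absolute disappointment.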

Ngo and Pace’s Some cruxes on im­pact­ful al­ter­na­tives to AI policy work dis­cuss the ad­van­tages and dis­ad­van­tages of AI policy work. They try to find the ‘crux’ of their dis­agree­ment—the small num­ber of state­ments they dis­agree about which de­ter­mine which side of the is­sue they come down on. Re­searchers from Deep­mind were also named au­thors on the pa­per.

Awad et al.'s The Moral Machine Experiment did a massive online interactive survey of 35 *million* people to determine their moral preferences with regard to autonomous cars. They found that people prefer: saving more people rather than fewer; saving humans over animals; saving young (including unborn children) over old; lawful people over criminals; executives over homeless; fit over fat; females over males; and pedestrians over passengers. I thought this was very interesting, and applaud them for actually looking for people's moral intuitions, rather than just substituting the values of the programmers/politicians. They also analyse how these values differ between cultures. Overall I thought this was an excellent paper.

Green’s Eth­i­cal Reflec­tions on Ar­tifi­cial In­tel­li­gence re­views var­i­ous eth­i­cal is­sues about AI from a chris­tian per­spec­tive. Given the dom­i­nance of util­i­tar­ian think­ing on the sub­ject, it was nice to see an ex­plic­itly Chris­tian con­tri­bu­tion that dis­played fa­mil­iar­ity with the liter­a­ture, with safety as #1 and #3 on the list of is­sues. “there­fore it must be the paramount goal of ethics to main­tain hu­man sur­vival.′

Eth’s The Tech­nolog­i­cal Land­scape Affect­ing Ar­tifi­cial Gen­eral In­tel­li­gence and the Im­por­tance of Nanoscale Neu­ral Probes pre­sents ar­gu­ments for favour­ing whole-brain-em­u­la­tion as a path­way to hu­man-level AI over de novo AGI, and sug­gests that nanoscale neu­ral probe re­search could be a good way to differ­en­tially ad­vance WBE vs merely hu­man-in­spired Neu­ro­mor­phic AGI. The pa­per builds on a lot of ar­gu­ments in Bostrom’s Su­per­in­tel­li­gence. It seems clear that neu­ro­mor­phic AGI is un­de­sir­able—the ques­tion is be­tween de novo and WBE, which un­for­tu­nately seem to have neu­ro­mor­phic ‘in be­tween’ them from a tech­nolog­i­cal re­quire­ment point of view. Daniel pre­sents some good ar­gu­ments for the rel­a­tive safety of WBE (some of which were already in Bostrom), for ex­am­ple that WBEs would help provide train­ing data from de novo AGI, though I was scep­ti­cal of the idea that the iden­tity of the first WBEs would be de­ter­mined by pub­lic de­bate. An es­pe­cially good point was that even if nanoscale neu­ral probes ac­cel­er­ate neu­ro­mor­phic al­most as much as WBEs, be­cause the two hu­man-in­spired paths are closely linked and hence more likely to hit closer in time than de novo, neu­ral probe re­search is more likely to cause WBE to over­take neu­ro­mor­phic than neu­ro­mor­phic to over­take de novo.

Turchin’s Could slaugh­ter­bots wipe out hu­man­ity? Assess­ment of the global catas­trophic risk posed by au­tonomous weapons, pro­vides a se­ries of fermi-calcu­la­tion like es­ti­mates of the dan­ger posed by weapon­ised drones. He con­cludes that while they are very difficult to defend against, and their cost is com­ing down, it is un­likely they would be the driv­ing force be­hind hu­man ex­tinc­tion.

Bogosian's Implementation of Moral Uncertainty in Intelligent Machines argues for using Will's metanormativity approach to moral uncertainty as a way of addressing moral disagreement in AI design. I'm always glad to see more attention given to Will's thesis, which I thought was very good, and the application to AI is an interesting one. I'm not quite sure how it would interact with a value-learning system: is the idea that the agent is updating all of its moral theories as new evidence comes in? Or that it has some value-learning approaches that are sharing credence with pre-programmed non-learning systems? I was a bit confused by his citing Greene (2001) as comparing the dispersion of issue-level and theory-level disagreement on moral issues, but I don't think this actually affects the conclusions of the paper at all, and am less concerned than Kyle is about the scaling properties of the algorithm. I also liked his prudential argument for why moral partisans should agree to this compromise, though I note that virtue ethicists, for whom the character of the agent (not merely the results) matters, may not be convinced. Finally, I think he actually understated the extent to which debates about decision procedures are less vicious than those about object-level issues, as virtually all the emotion about voting systems seems to be generated by object-level partisans who believe that changing the voting system will help them achieve their object-level political goals.

rk and Sem­pere’s AI de­vel­op­ment in­cen­tive gra­di­ents are not uniformly ter­rible ar­gue that the ‘open­ness is bad’ con­clu­sion from Arm­strong et al’s Rac­ing to the Precipice is ba­si­cally be­cause of the dis­con­ti­nu­ity in suc­cess prob­a­bil­ity in their model. This seems true to me, and re­duced my cre­dence that open­ness was bad. Re­searchers from FHI were also named au­thors on the pa­per.

Liu et al.’s Govern­ing Bor­ing Apoca­lypses: A new ty­pol­ogy of ex­is­ten­tial vuln­er­a­bil­ities and ex­po­sures for ex­is­ten­tial risk re­search dis­cusses the broad risk land­scape. They provide a num­ber of break­downs of pos­si­ble risks, in­clud­ing many non-AI. I think the main use is the rel­a­tively poli­cy­maker-friendly fram­ing.

Bansal and Weld's A Coverage-Based Utility Model for Identifying Unknown Unknowns designs a model for efficiently utilising a scarce human expert to discover false-positive regions.

Dai’s A gen­eral model of safety-ori­ented AI de­vel­op­ment pro­vides a very brief gen­er­al­i­sa­tion of the sort of in­duc­tive strate­gies for AI safety I had been refer­ring to as ‘Chris­ti­ano-like’


Ro­man Yam­polskiy ed­ited a 500-page an­thol­ogy on AI Safety, available for pur­chase here. Un­for­tu­nately I haven’t had time to read ev­ery ar­ti­cle; here is a re­view by some­one who has.

The first half of the book, Con­cerns of Lu­mi­nar­ies, is ba­si­cally re-prints of older ar­ti­cles. As such read­ers will prob­a­bly mainly be in­ter­ested in the sec­ond half, which I think are all origi­nal to this vol­ume.

Misc other news

OpenPhil gave Carl Shul­man $5m to re-grant, of which some seems likely to end up fund­ing use­ful AI safety work. Given Carl’s in­tel­lect and ex­per­tise this seems like a good use of money to me.

OpenPhil are also fund­ing seven ML PhD stu­dents ($1.1m over five years) through their ‘AI Fel­lows’ pro­gram. I have read their pub­lished re­search and some of it seems quite in­ter­est­ing – I found Noam’s Safe and Nested Subgame Solv­ing for Im­perfect-In­for­ma­tion Games par­tic­u­larly in­ter­est­ing, partly as I didn’t have much prior fa­mil­iar­ity with the sub­ject. Most of their work thus far does not seem very AI Safety rele­vant, with some ex­cep­tions like this blog post by Jon Gau­thier. But given the timeline for aca­demic work and the mid-year an­nounce­ment of the fel­low­ships I think it’s prob­a­bly too early to see if they will pro­duce any AI Safety rele­vant work.

If you like pod­casts, you might en­joy these 80,000 Hours pod­casts. If not, they all have com­plete tran­scripts.

80,000 Hours also wrote a guide on how to tran­si­tion from pro­gram­ming or CS into ML.

Last year I mentioned that the EA Long Term Future Fund did not seem to be actually making grants. After a series of criticisms on the EA forum by Henry Stanley and Evan Gaensbauer, CEA has now changed the management of the funds and committed to a regular schedule of grantmaking. However, I’m skeptical this will solve the underlying problem. Presumably they organically came across plenty of possible grants – if this were truly a ‘lower barrier to giving’ vehicle than OpenPhil they would have just made those grants. It is possible, however, that more managers will help them find more non-controversial ideas to fund. Here is a link to their recent grants round.

If you’re read­ing this, you prob­a­bly already read SlateS­tarCodex. If not, you might en­joy this ar­ti­cle he wrote this year about AI Safety.

In an early proof of the vi­a­bil­ity of cry­on­ics, LessWrong has been brought back to life. If like me you find the new in­ter­face con­fus­ing you can view it through GreaterWrong. Re­lat­edly there is in­te­gra­tion with the Align­ment Fo­rum, to provide a place for dis­cus­sion of AI Align­ment is­sues that is linked to LessWrong. This seems rather clever to me.

Zvi Mow­show­itz and Vladimir Slep­nev have been or­ga­niz­ing a se­ries of AI Safety prizes, giv­ing out money for the ar­ti­cles they were most im­pressed with in a cer­tain time frame.

Deep­mind’s work on Protein Fold­ing proved quite suc­cess­ful, win­ning the big an­nual com­pe­ti­tion by a sig­nifi­cant mar­gin. This seemed sig­nifi­cant to me mainly be­cause ‘solv­ing the pro­tein fold­ing prob­lem’ has been one of the pro­to­typ­i­cal steps be­tween ‘re­cur­sively self-im­prov­ing AI’ and ‘sin­gle­ton’ since at least 2001.

Berkeley offered a graduate-level course in AGI Safety.

Vast.ai are attempting to create a two-sided marketplace where you can buy or sell idle GPU capacity. This seems like the sort of thing that will probably not succeed, but if something like it did, that would be another piece of evidence for hardware overhang.

The US Department of Commerce suggested a ban on AI exports, presumably inspired by previous bans on cryptography exports.


The size of the field continues to grow, in terms of both funding and researchers. Both trends make it increasingly hard for individual donors to evaluate the whole landscape.

As I have once again failed to re­duce char­ity se­lec­tion to a sci­ence, I’ve in­stead at­tempted to sub­jec­tively weigh the pro­duc­tivity of the differ­ent or­gani­sa­tions against the re­sources they used to gen­er­ate that out­put, and donate ac­cord­ingly.
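To make the shape of this weighing concrete (illustrative only – the organisation names and numbers below are made up, and my actual judgments are subjective rather than numeric): divide a rough productivity score by the budget spent producing it, and rank.

```python
def cost_effectiveness(orgs):
    """Subjective output per $m spent, sorted best-first.

    orgs maps an organisation name to a dict with a subjective
    'output' score and an annual 'budget' in $m.
    """
    ratios = {name: v["output"] / v["budget"] for name, v in orgs.items()}
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical organisations: a large, productive one and a small, leaner one.
orgs = {"Org A": {"output": 8, "budget": 4.0},
        "Org B": {"output": 5, "budget": 0.5}}
print(cost_effectiveness(orgs))  # → [('Org B', 10.0), ('Org A', 2.0)]
```

The point of the toy example: a smaller organisation with lower total output can still dominate on average cost-effectiveness, which is why budget context matters throughout the reviews above.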

My constant wish is to promote a lively intellect and independent decision-making among my readers; hopefully my laying out the facts as I see them above will prove helpful to some. Here is my eventual decision, rot13′d so you can come to your own conclusions first if you wish:

Qrfcvgr uni­vat qban­grq gb ZVEV pbafvf­gragyl sbe znal lrnef nf n erfhyg bs gurve uvtuyl aba-er­cyn­prnoyr naq teb­haqo­ernx­vat jbex va gur svryq, V pnaabg va tbbq snvgu qb fb guvf lrne tvira gurve ynpx bs qvfpybfher. Nqqvgvbanyyl, gurl ny­ernql unir n yne­tre ohqtrg guna nal bgure bet­navfngvba (rkprcg cre­uncf SUV) naq n ynetr nzb­hag bs erfreirf.

Qrfcvgr SUV ce­bqh­p­vat irel uvtu dhnyvgl erfrnepu, TCV uni­vat n ybg bs ce­bzvf­vat cn­cref va gur cvcry­var, naq obgu uni­vat uvtuyl dhnyvsvrq naq inyhr-nyv­tarq erfrnepuref, gur erd­hverzrag gb cer-shaq erfrnepuref’ ragver pbagenpg fv­tavsvp­nagyl vapern­frf gur rss­rpgvir pbfg bs shaq­vat erfrnepu gurer. Ba gur bgure unaq, uve­vat cr­b­cyr va gur onl nern vfa’g purnc rvgure.

Guvf vf gur svefg lrne V unir ng­grzc­grq gb erivrj PUNV va qrgnvy naq V unir orra vzcerffrq jvgu gur dhnyvgl naq iby­hzr bs gurve jbex. V nyfb gu­vax gurl unir zber ebbz sbe shaq­vat guna SUV. Nf fhpu V jvyy or qbang­vat fbzr zbarl gb PUNV guvf lrne.

V gu­vax bs PFRE naq TPEV nf or­vat eryn­gviryl pbzc­nenoyr bet­navfngvbaf, nf 1) gurl obgu jbex ba n in­evrgl bs rkvf­gragvny evfxf naq 2) obgu cevznevyl ce­bqhpr fgen­grtl cvr­prf. Va guvf pbzc­nevfba V gu­vax TPEV yb­bxf fv­tavsvp­nagyl orggre; vg vf abg pyrne gurve gbgny bhgchg, nyy gu­vatf pbafvqr­erq, vf yrff guna PFRE’f, ohg gurl unir qbar fb ba n qen­zngvp­nyyl fznyyre ohqtrg. Nf fhpu V jvyy or qbang­vat fbzr zbarl gb TPEV nt­nva guvf lrne.

NAH, Qr­rcz­vaq naq BcraNV unir nyy qbar tbbq jbex ohg V qba’g gu­vax vg vf ivnoyr sbe (eryn­gviryl) fznyy vaq­vivqhny qbabef gb zr­na­vat­shyyl fhc­cbeg gurve jbex.

Bh­tug fr­rzf yvxr n irel iny­h­noyr ceb­wrpg, naq V nz gbea ba qbang­vat, ohg V gu­vax gurve arrq sbe nqqvgvbany shaq­vat vf fyv­tu­gyl yrff guna fbzr bgure teb­hcf.

NV Vzc­npgf vf va znal jnlf va n fvzvyne cbfvgvba gb TPEV, jvgu gur rkprcgvba gung TPEV vf ng­grzcg­vat gb fp­nyr ol uve­vat vgf cneg-gvzr jbexref gb shyy-gvzr, ju­vyr NV Vzc­npgf vf fp­ny­vat ol uve­vat arj cr­b­cyr. Gur sbezre vf fv­tavsvp­nagyl yb­jre evfx, naq NV Vzc­npgf fr­rzf gb unir rab­htu zbarl gb gel bhg gur hcfvm­vat sbe 2019 naljnl. Nf fhpu V qb abg cyna gb qbangr gb NV Vzc­npgf guvf lrne, ohg vs gurl ner noyr gb fp­nyr rss­rpgviryl V zv­tug jryy qb fb va 2019.

Gur Sb­haqngvbany Erfrnepu Vafgvghgr unir qbar fbzr irel va­gr­erfg­vat jbex, ohg frrz gb or nqrdhn­gryl shaqrq, naq V nz fbzr­jung zber pbaprearq nobhg gur qna­tre bs evfxl havyn­greny npgvba urer guna jvgu bgure bet­navfngvbaf.

V unira’g unq gvzr gb riny­h­ngr gur Sberfv­tug Vafgvghgr, ju­vpu vf n funzr orpn­hfr ng gurve fznyy fvmr znet­vany shaq­vat pb­hyq or irel iny­h­noyr vs gurl ner va snpg qb­vat hfr­shy jbex. Fvzvyneyl, Zrqvna naq Pbaire­trapr frrz gbb arj gb ernyyl riny­h­ngr, gub­htu V jvfu gurz jryy.

Gur Shgher bs Yvsr vafgvghgr tenagf sbe guvf lrne frrz zber iny­h­noyr gb zr guna gur cerivbhf ongpu, ba nirentr. Ub­jrire, V cer­sre gb qverp­gyl riny­h­ngr ju­rer gb qbangr, en­gure guna bhgfb­hep­vat guvf qr­pvfvba.

V nyfb cyna gb fgneg znx­vat qbangvbaf gb vaq­vivqhny erfrnepuref, ba n ergebfcrpgvir on­fvf, sbe qb­vat hfr­shy jbex. Gur pheerag fvgh­ngvba, jvgu n ovanel rz­cy­blrq/​abg-rz­cy­blrq qvfg­vapgvba, naq hc­se­bag cn­lzrag sbe hapreg­nva bhgchg, fr­rzf fhobcgvzny. V nyfb ubcr gb fv­tavsvp­nagyl erqhpr bireurnq (sbe rirelbar ohg zr) ol abg uni­vat na nc­cyvp­ngvba cebprff be nal erd­hverzragf sbe tena­grrf or­lbaq uni­vat ce­bqh­prq tbbq jbex. Guvf jb­hyq or fbzr­jung fvzvyne gb Vzc­npg Pregvsvp­n­grf, ju­vyr ubcr­shyyl nib­vq­vat fbzr bs gurve vffhrf.

However, I wish to emphasise that all the above organisations seem to be doing good work on the most important issue facing mankind. It is the nature of making decisions under scarcity that we must prioritize some over others, and I hope that all organisations will understand that this necessarily involves negative comparisons at times.

Thanks for reading this far; hopefully you found it useful. Apologies to everyone who did valuable work that I excluded; I have no excuse other than procrastination, Crusader Kings II, and starting work at a new hedge fund.


I have not in gen­eral checked all the proofs in these pa­pers, and similarly trust that re­searchers have hon­estly re­ported the re­sults of their simu­la­tions.

I was a Sum­mer Fel­low at MIRI back when it was SIAI, vol­un­teered briefly at GWWC (part of CEA) and pre­vi­ously ap­plied for a job at FHI. I am per­sonal friends with peo­ple at MIRI, FHI, CSER, CHAI, GPI, BERI, OpenAI, Deep­mind, Ought and AI Im­pacts but not re­ally at ANU, EAF/​FRI, GCRI, Google Brain, Fore­sight, FLI, Me­dian, Con­ver­gence (so if you’re wor­ried about bias you should over­weight them… though it also means I have less di­rect knowl­edge) (also sorry if I’ve for­got­ten any friends who work for the lat­ter set!). How­ever I have no fi­nan­cial ties be­yond be­ing a donor and have never been ro­man­ti­cally in­volved with any­one who has ever been at any of the or­gani­sa­tions.

I shared drafts of the in­di­vi­d­ual or­gani­sa­tion sec­tions with rep­re­sen­ta­tives from MIRI, FHI, CHAI, CSER, GCRI, GPI, BERI, Ought, AI Im­pacts, and EAF/​FRI.

I’d like to thank Greg Lewis and my anony­mous re­view­ers for look­ing over this. Any re­main­ing mis­takes are of course my own. I would also like to thank my wife for tol­er­at­ing all the time I have in­vested/​wasted on this.

EDIT: Re­moved lan­guage about BERI, at their re­quest.


Amodei, Dario; Hernandez, Danny—AI and Compute − 2018-05-16 - https://blog.openai.com/ai-and-compute/

Arm­strong, Stu­art; O’Rourke, Xavier - ‘In­differ­ence’ meth­ods for man­ag­ing agent re­wards − 2018-01-05 - https://​​arxiv.org/​​pdf/​​1712.06365.pdf

Arm­strong, Stu­art; O’Rourke, Xavier—Safe Uses of AI Or­a­cles − 2018-06-05 - https://​​arxiv.org/​​pdf/​​1711.05541.pdf

Armstrong, Stuart; Mindermann, Soren—Impossibility of deducing preferences and rationality from human policy − 2017-12-05 - https://arxiv.org/abs/1712.05812

Avin, Sha­har; Win­tle, Bon­nie; Weitz­dorfer, Julius; Ó hÉigeartaigh, Seán; Suther­land, William; Rees, Martin—Clas­sify­ing Global Catas­trophic Risks − 2018-02-23 - https://​​www.sci­encedi­rect.com/​​sci­ence/​​ar­ti­cle/​​pii/​​S0016328717301957#tbl0010

Awad, Ed­mond; Dsouza, So­han; Kim, Richard; Schulz, Jonathan; Hen­rich, Joseph; Shar­iff, Azim; Bon­nefon, Jean-Fran­cois; Rah­wan, Iyad—The Mo­ral Ma­chine Ex­per­i­ment − 2018-10-24 - https://​​www.na­ture.com/​​ar­ti­cles/​​s41586-018-0637-6

Bansal, Ga­gan; Weld, Daniel—A Cover­age-Based Utility Model for Iden­ti­fy­ing Un­known Un­knowns − 2018-04-25 - https://​​www.aaai.org/​​ocs/​​in­dex.php/​​AAAI/​​AAAI18/​​pa­per/​​view/​​17110

Basu, Chandrayee; Yang, Qian; Hungerman, David; Singhal, Mukesh; Dragan, Anca—Do You Want Your Autonomous Car to Drive Like You? − 2018-02-05 -

Batin, Mikhail; Turchin, Alexey; Markov, Sergey; Zhila, Alisa; Denken­berger, David—Ar­tifi­cial In­tel­li­gence in Life Ex­ten­sion: from Deep Learn­ing to Su­per­in­tel­li­gence − 2017-08-31 - http://​​www.in­for­mat­ica.si/​​in­dex.php/​​in­for­mat­ica/​​ar­ti­cle/​​view/​​1797

Baum, Seth—Coun­ter­ing Su­per­in­tel­li­gence Mis­in­for­ma­tion − 2018-09-09 - https://​​www.mdpi.com/​​2078-2489/​​9/​​10/​​244

Baum, Seth—Re­silience to Global Catas­tro­phe − 2018-11-29 - https://​​irgc.epfl.ch/​​wp-con­tent/​​up­loads/​​2018/​​11/​​Baum-for-IRGC-Re­silience-Guide-Vol-2-2018.pdf

Baum, Seth—Su­per­in­tel­li­gence Skep­ti­cism as a Poli­ti­cal Tool − 2018-08-22 - https://​​www.mdpi.com/​​2078-2489/​​9/​​9/​​209

Baum, Seth—Uncer­tain Hu­man Con­se­quences in As­teroid Risk Anal­y­sis and the Global Catas­tro­phe Thresh­old − 2018-07-28 - https://​​pa­pers.ssrn.com/​​sol3/​​pa­pers.cfm?ab­stract_id=3218342

Baum, Seth; Armstrong, Stuart; Ekenstedt, Timoteus; Haggstrom, Olle; Hanson, Robin; Kuhlemann, Karin; Maas, Matthijs; Miller, James; Salmela, Markus; Sandberg, Anders; Sotala, Kaj; Torres, Phil; Turchin, Alexey; Yampolskiy, Roman—Long-Term Trajectories of Human Civilization − 2018-08-08 - http://gcrinstitute.org/papers/trajectories.pdf

Baum, Seth; Bar­rett, An­thony—A Model for the Im­pacts of Nu­clear War − 2018-04-03 - https://​​pa­pers.ssrn.com/​​sol3/​​pa­pers.cfm?ab­stract_id=3155983

Baum, Seth; Bar­rett, An­thony; Yam­polskiy, Ro­man—Model­ling and In­ter­pret­ing Ex­pert Disagree­ment about Ar­tifi­cial In­tel­li­gence − 2018-01-27 - https://​​pa­pers.ssrn.com/​​sol3/​​pa­pers.cfm?ab­stract_id=3104645

Baum, Seth; Neufville, Robert; Bar­rett, An­thony—A Model for the Prob­a­bil­ity of Nu­clear War − 2018-03-08 - https://​​pa­pers.ssrn.com/​​sol3/​​pa­pers.cfm?ab­stract_id=3137081

Bau­mann, To­bias—Us­ing Sur­ro­gate Goals to Deflect Threats − 2018-02-20 - https://​​foun­da­tional-re­search.org/​​us­ing-sur­ro­gate-goals-deflect-threats/​​

Becker, Gary—Crime and Pu­n­ish­ment: An Eco­nomic Ap­proach − 1974-01-01 - https://​​www.nber.org/​​chap­ters/​​c3625.pdf

Bekdash, Gus—Using Human History, Psychology and Biology to Make AI Safe for Humans − 2018-04-01 -

Ber­ber­ich, Ni­co­las; Die­pold, Klaus—The Vir­tu­ous Ma­chine—Old Ethics for New Tech­nol­ogy − 2018-06-27 - https://​​arxiv.org/​​abs/​​1806.10322

Blake, An­drew; Bordallo, Ale­jan­dro; Hawasly, Majd; Penkov, Svetlin; Ra­mamoor­thy, Subra­ma­nian; Silva, Alexan­dre - Effi­cient Com­pu­ta­tion of Col­li­sion Prob­a­bil­ities for Safe Mo­tion Plan­ning − 2018-04-15 - https://​​arxiv.org/​​abs/​​1804.05384

Bo­gosian, Kyle—Im­ple­men­ta­tion of Mo­ral Uncer­tainty in In­tel­li­gent Machines − 2017-12-01 - https://​​link.springer.com/​​ar­ti­cle/​​10.1007/​​s11023-017-9448-z

Bostrom, Nick—The Vuln­er­a­ble World Hy­poth­e­sis − 2018-11-09 - https://​​nick­bostrom.com/​​pa­pers/​​vuln­er­a­ble.pdf

Brown, Noam; Sand­holm, Tuo­mas—Safe and Nested Subgame Solv­ing for Im­perfect-In­for­ma­tion Games − 2017-05-08 - https://​​arxiv.org/​​abs/​​1705.02955

Brown, Noam; Sand­holm, Tuo­mas—Solv­ing Im­perfect-In­for­ma­tion Games via Dis­counted Re­gret Min­i­miza­tion − 2018-09-11 - https://​​arxiv.org/​​abs/​​1809.04040

Brown, Tom; Olsson, Catherine; Google Brain Team, Research Engineers—Introducing the Unrestricted Adversarial Examples Challenge − 2018-09-03 - https://ai.googleblog.com/2018/09/introducing-unrestricted-adversarial.html

Carey, Ryan—In­ter­pret­ing AI Com­pute Trends − 2018-07-10 - https://​​aiim­pacts.org/​​in­ter­pret­ing-ai-com­pute-trends/​​

Cave, Stephen; Ó hÉigeartaigh, Seán - An AI Race for Strate­gic Ad­van­tage: Rhetoric and Risks − 2018-01-16 - http://​​www.aies-con­fer­ence.com/​​wp-con­tent/​​pa­pers/​​main/​​AIES_2018_pa­per_163.pdf

Chris­ti­ano, Paul—Tech­niques for Op­ti­miz­ing Worst-Case Perfor­mance − 2018-02-01 - https://​​ai-al­ign­ment.com/​​tech­niques-for-op­ti­miz­ing-worst-case-perfor­mance-39eafec74b99

Chris­ti­ano, Paul—Univer­sal­ity and Se­cu­rity Am­plifi­ca­tion − 2018-03-10 - https://​​ai-al­ign­ment.com/​​uni­ver­sal­ity-and-se­cu­rity-am­plifi­ca­tion-551b314a3bab

Chris­ti­ano, Paul; Sh­legeris, Buck; Amodei, Dario—Su­per­vis­ing Strong Learn­ers by Am­plify­ing Weak Ex­perts − 2018-10-19 - https://​​arxiv.org/​​abs/​​1810.08575

Co­hen, Michael; Vel­lambi, Badri; Hut­ter, Mar­cus—Al­gorithm for Aligned Ar­tifi­cial Gen­eral In­tel­li­gence − 2018-05-25 - https://​​cs.anu.edu.au/​​courses/​​CSPROJECTS/​​18S1/​​re­ports/​​u6357432.pdf

Cundy, Chris; Filan, Daniel—Ex­plor­ing Hier­ar­chy-Aware In­verse Re­in­force­ment Learn­ing − 2018-07-13 - https://​​arxiv.org/​​abs/​​1807.05037

Cur­rie, Adrian—Ex­is­ten­tial Risk, Creativity & Well-Adapted Science − 2018-07-22 - http://​​philsci-archive.pitt.edu/​​14800/​​

Cur­rie, Adrian—Geo­eng­ineer­ing Ten­sions − 2018-04-30 - http://​​philsci-archive.pitt.edu/​​14607/​​

Cur­rie, Adrian—In­tro­duc­tion: Creativity, Con­ser­vatism & the So­cial Episte­mol­ogy of Science − 2018-09-27 - http://​​philsci-archive.pitt.edu/​​15066/​​

Cur­rie, Adrian; Ó hÉigeartaigh, Seán—Work­ing to­gether to face hu­man­ity’s great­est threats: In­tro­duc­tion to The Fu­ture of Re­search on Catas­trophic and Ex­is­ten­tial Risk − 2018-03-26 - https://​www.drop­box.com/​s/​bh6okdz8pvrxzc6/​Work­ing%20to­gether%20to%20face%20hu­man­ity%E2%80%99s%20great­est%20threats%20preprint.pdf?dl=0

Dafoe, Allen—AI Gover­nance: A Re­search Agenda − 2018-08-27 - https://​​www.fhi.ox.ac.uk/​​wp-con­tent/​​up­loads/​​GovAIA­genda.pdf

Dai, Wei—A gen­eral model of safety-ori­ented AI de­vel­op­ment − 2018-06-11 - https://​www.less­wrong.com/​posts/​idb5Ppp9zgh­ci­chJ5/​a-gen­eral-model-of-safety-ori­ented-ai-development

Dem­ski, Abram—An Un­trol­lable Math­e­mat­i­cian Illus­trated − 2018-03-19 - https://​​www.less­wrong.com/​​posts/​​CvKn­hXTu9BPcdKE4W/​​an-un­trol­lable-math­e­mat­i­cian-illustrated

DeVries, Ter­rance; Tay­lor, Gra­ham—Lev­er­ag­ing Uncer­tainty Es­ti­mates for Pre­dict­ing Seg­men­ta­tion Qual­ity − 2018-07-02 - https://​​arxiv.org/​​abs/​​1807.00502

Dobbe, Roel; Dean, Sarah; Gilbert, Thomas; Kohli, Nitin—A Broader View on Bias in Au­to­mated De­ci­sion-Mak­ing: Reflect­ing on Episte­mol­ogy and Dy­nam­ics − 2018-07-06 - https://​​arxiv.org/​​abs/​​1807.00553

Doshi-Velez, Fi­nale; Kim, Been—Con­sid­er­a­tions for Eval­u­a­tion and Gen­er­al­iza­tion in In­ter­pretable Ma­chine Learn­ing − 2018-08-24 - https://​​fi­nale.seas.har­vard.edu/​​pub­li­ca­tions/​​con­sid­er­a­tions-eval­u­a­tion-and-gen­er­al­iza­tion-in­ter­pretable-ma­chine-learning

Duettmann, Alli­son; Afanas­jeva, Olga; Arm­strong, Stu­art; Braley, Ryan; Cuss­ins, Jes­sica; Ding, Jeffrey; Eck­er­sley, Peter; Guan, Melody; Vance, Alyssa; Yam­polskiy, Ro­man—Ar­tifi­cial Gen­eral In­tel­li­gence: Co­or­di­na­tion and Great Pow­ers − 1900-01-00 - https://​​fs1-bb4c.kx­cdn.com/​​wp-con­tent/​​up­loads/​​2018/​​11/​​AGI-Co­or­di­na­tion-Geat-Pow­ers-Re­port.pdf

Erdelyi, Olivia ; Gold­smith, Judy—Reg­u­lat­ing Ar­tifi­cial In­tel­li­gence: Pro­posal for a Global Solu­tion − 2018-02-01 - http://​​www.aies-con­fer­ence.com/​​wp-con­tent/​​pa­pers/​​main/​​AIES_2018_pa­per_13.pdf

Eth, Daniel—The Tech­nolog­i­cal Land­scape Affect­ing Ar­tifi­cial Gen­eral In­tel­li­gence and the Im­por­tance of Nanoscale Neu­ral Probes − 2017-08-31 - http://​​www.in­for­mat­ica.si/​​in­dex.php/​​in­for­mat­ica/​​ar­ti­cle/​​view/​​1874

Evans, Owain; Stuh­lmul­ler, An­dreas; Cundy, Chris; Carey, Ryan; Ken­ton, Zachary; McGrath, Thomas; Schreiber, An­drew—Pre­dict­ing Hu­man De­liber­a­tive Judg­ments with Ma­chine Learn­ing − 2018-07-13 - https://​​ought.org/​​pa­pers/​​pre­dict­ing-judg­ments-tr2018.pdf

Ever­itt, Tom; Krakovna, Vic­to­ria; Orseau, Lau­rent; Hut­ter, Mar­cus; Legg, Shane—Re­in­force­ment Learn­ing with a Cor­rupted Re­ward Chan­nel − 2017-05-23 - https://​​arxiv.org/​​abs/​​1705.08417

Ever­itt, Tom; Lea, Gary; Hut­ter, Mar­cus—AGI Safety Liter­a­ture Re­view − 2018-05-22 - AGI Safety Liter­a­ture Review

Filan, Daniel—Bot­tle Caps aren’t Op­ti­misers − 2018-11-21 - https://​​www.greater­wrong.com/​​posts/​​26eupx3Byc8swRS7f/​​bot­tle-caps-aren-t-optimisers

Fisac, Jaime; Ba­jcsy, An­drea; Her­bert, Sylvia; Fri­dovich-Keil, David; Wang, Steven; Tom­lin, Claire; Dra­gan, Anca—Prob­a­bil­is­ti­cally Safe Robot Plan­ning with Con­fi­dence-Based Hu­man Pre­dic­tions − 2018-05-31 - https://​​arxiv.org/​​abs/​​1806.00109

Garnelo, Marta; Rosenbaum, Dan; Maddison, Chris; Ramalho, Tiago; Saxton, David; Shanahan, Murray; Teh, Yee Whye; Rezende, Danilo; Eslami, S M Ali—Conditional Neural Processes − 2018-07-04 -

Garrabrant, Scott; Dem­ski, Abram—Embed­ded Agency Se­quence − 2018-10-29 - https://​​www.less­wrong.com/​​s/​​Rm6oQRJJmhGCcLvxh

Gas­parik, Amanda; Gam­ble, Chris; Gao, Jim—Safety-first AI for au­tonomous data cen­tre cool­ing and in­dus­trial con­trol − 2018-08-17 - https://​​deep­mind.com/​​blog/​​safety-first-ai-au­tonomous-data-cen­tre-cool­ing-and-in­dus­trial-con­trol/​​

Gau­thier, Jon; Ivanova, Anna—Does the brain rep­re­sent words? An eval­u­a­tion of brain de­cod­ing stud­ies of lan­guage un­der­stand­ing − 2018-06-02 - https://​​arxiv.org/​​abs/​​1806.00591

Ghosh, Shromona; Berkenkamp, Felix; Ranade, Gireeja; Qadeer, Shaz; Kapoor, Ashish—Ver­ify­ing Con­trol­lers Against Ad­ver­sar­ial Ex­am­ples with Bayesian Op­ti­miza­tion − 2018-02-26 - https://​​arxiv.org/​​abs/​​1802.08678

Gilmer, Justin; Adams, Ryan; Good­fel­low, Ian; An­der­sen, David, Dahl, Ge­orge—Mo­ti­vat­ing the Rules of the Game for Ad­ver­sar­ial Ex­am­ple Re­search − 2018-07-20 - https://​​arxiv.org/​​abs/​​1807.06732

Grace, Katja—Hu­man Level Hard­ware Timeline − 2017-12-22 - https://​​aiim­pacts.org/​​hu­man-level-hard­ware-timeline/​​

Grace, Katja—Like­li­hood of dis­con­tin­u­ous progress around the de­vel­op­ment of AGI − 2018-02-23 - https://​​aiim­pacts.org/​​like­li­hood-of-dis­con­tin­u­ous-progress-around-the-de­vel­op­ment-of-agi/​​

Green, Brian Pa­trick—Eth­i­cal Reflec­tions on Ar­tifi­cial In­tel­li­gence − 2018-06-01 - http://​​apcz.umk.pl/​​cza­sopisma/​​in­dex.php/​​SetF/​​ar­ti­cle/​​view/​​SetF.2018.015

Had­field-Menell, Dy­lan; An­drus, McKane; Had­field, Gillian—Leg­ible Nor­ma­tivity for AI Align­ment: The Value of Silly Rules − 2018-11-03 - https://​arxiv.org/​abs/​1811.01267

Had­field-Menell, Dy­lan; Had­field, Gillian—In­com­plete Con­tract­ing and AI al­ign­ment − 2018-04-12 - https://​​arxiv.org/​​abs/​​1804.04268

Haqq-Misra, Ja­cob—Policy Op­tions for the ra­dio De­tectabil­ity of Earth − 2018-04-02 - https://​​arxiv.org/​​abs/​​1804.01885

Hoang, Lê Nguyên—A Roadmap for the Value-Load­ing Prob­lem − 2018-09-04 - https://​​arxiv.org/​​abs/​​1809.01036

Huang, Jessie; Wu, Fa; Pre­cup, Doina; Cai, Yang—Learn­ing Safe Poli­cies with Ex­pert Guidance − 2018-05-21 - https://​​arxiv.org/​​abs/​​1805.08313

Ibarz, Borja; Leike, Jan; Pohlen, To­bias; Irv­ing, Ge­offrey; Legg, Shane; Amodei, Dario—Re­ward Learn­ing from Hu­man Prefer­ences and De­mon­stra­tions in Atari − 2018-11-15 - https://​​arxiv.org/​​abs/​​1811.06521

IBM—Bias in AI: How we Build Fair AI Sys­tems and Less-Bi­ased Hu­mans − 2018-02-01 - https://​​www.ibm.com/​​blogs/​​policy/​​bias-in-ai/​​

Irv­ing, Ge­offrey; Chris­ti­ano, Paul; Amodei, Dario—AI Safety via De­bate − 2018-05-02 - https://​​arxiv.org/​​abs/​​1805.00899

Jan­ner, Michael; Wu, Ji­a­jun; Kulkarni, Te­jas; Yildirim, Ilker; Te­nen­baum, Joshua—Self-Su­per­vised In­trin­sic Image De­com­po­si­tion − 2018-02-05 - https://​​arxiv.org/​​abs/​​1711.03678

Jilk, David—Con­cep­tual-Lin­guis­tic Su­per­in­tel­li­gence − 2017-07-31 - http://​​www.in­for­mat­ica.si/​​in­dex.php/​​in­for­mat­ica/​​ar­ti­cle/​​view/​​1875

Jones, Natalie; O’Brien, Mark; Ryan, Thomas—Rep­re­sen­ta­tion of fu­ture gen­er­a­tions in United King­dom policy-mak­ing − 2018-03-26 - https://​www.sci­encedi­rect.com/​sci­ence/​ar­ti­cle/​pii/​S0016328717301179

Kol­ler, Torsten; Berkenkamp, Felix; Turchetta, Mat­teo; Krause, An­dreas—Learn­ing-based Model Pre­dic­tive Con­trol for Safe Ex­plo­ra­tion − 2018-09-22 - https://​​arxiv.org/​​abs/​​1803.08287

Krakovna, Vic­to­ria—Speci­fi­ca­tion Gam­ing Ex­am­ples in AI − 2018-04-02 - https://​​vkrakovna.word­press.com/​​2018/​​04/​​02/​​speci­fi­ca­tion-gam­ing-ex­am­ples-in-ai/​​

Krakovna, Vic­to­ria; Orseau, Lau­rent; Mar­tic, Mil­jan; Legg, Shane—Mea­sur­ing and avoid­ing side effects us­ing rel­a­tive reach­a­bil­ity − 2018-06-04 - https://​​arxiv.org/​​abs/​​1806.01186

Ku­rakin, Alexey; Good­fel­low, Ian; Ben­gio, Samy; Dong, Yin­peng; Liao, Fangzhou; Liang, Ming; Pang, Ti­anyu ; Zhu, Jun; Hu, Xiaolin; Xie, Cihang; Wang, Ji­anyu; Zhang, Zhishuai; Ren, Zhou; Yuille, Alan; Huang, Sangxia; Zhao, Yao; Zhao, Yuzhe; Han, Zhonglin; Long, Jun­ji­a­jia; Berdibekov, Yerke­bu­lan; Ak­iba, Takuya; Tokui, Seiya; Abe Mo­toki - Ad­ver­sar­ial At­tacks and Defences Com­pe­ti­tion − 2018-03-31 - https://​​arxiv.org/​​pdf/​​1804.00097.pdf

Lee, Kimin; Lee, Kibok; Lee, Honglak; Shin, Jin­woo—A Sim­ple Unified Frame­work for De­tect­ing Out-of-Distri­bu­tion Sam­ples and Ad­ver­sar­ial At­tacks − 2018-10-27 - https://​​arxiv.org/​​abs/​​1807.03888

Lehman, Joel; Clune, Jeff; Mi­se­vic, Du­san—The Sur­pris­ing Creativity of Digi­tal Evolu­tion: A Col­lec­tion of Anec­dotes from the Evolu­tion­ary Com­pu­ta­tion and Ar­tifi­cial Life Re­search Com­mu­ni­ties − 2018-08-14 - https://​​arxiv.org/​​abs/​​1803.03453

Leibo, Joel; de Masson d’Autume, Cyprien; Zoran, Daniel; Amos, David; Beattie, Charles; Anderson, Keith; Castañeda, Antonio García; Sanchez, Manuel; Green, Simon; Gruslys, Audrunas; Legg, Shane; Hassabis, Demis; Botvinick, Matthew—Psychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents − 2018-02-04 - https://arxiv.org/abs/1801.08116

Leike, Jan; Krueger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane—Scalable agent alignment via reward modeling: a research direction − 2018-11-19 - https://arxiv.org/abs/1811.07871

Leike, Jan; Mar­tic, Mil­jan; Krakovna, Vic­to­ria; Ortega, Pe­dro; Ever­itt, Tom; Lefrancq, An­drew; Orseau, Lau­rent; Legg, Shane—AI Safety Grid­wor­lds − 2017-11-28 - https://​​arxiv.org/​​abs/​​1711.09883

Lewis, Gre­gory; Millett, Piers; Sand­berg, An­ders; Sny­der-Beat­tie; Gron­vall, Gigi—In­for­ma­tion Hazards in Biotech­nol­ogy − 2018-11-12 - https://​​on­linelibrary.wiley.com/​​doi/​​abs/​​10.1111/​​risa.13235

Lip­ton, Zachary; Stein­hardt, Ja­cob—Trou­bling Trends in Ma­chine Learn­ing Schol­ar­ship − 2018-07-26 - https://​​arxiv.org/​​abs/​​1807.03341

Liu, Chang; Ham­rick, Jes­sica; Fisac, Jaime; Dra­gan, Anca; Hedrick, J Karl; Sas­try, S Shankar; Griffiths, Thomas—Goal In­fer­ence Im­proves Ob­jec­tive and Per­ceived Perfor­mance in Hu­man-Robot Col­lab­o­ra­tion − 2018-02-06 - https://​​arxiv.org/​​abs/​​1802.01780

Liu, Hin-Yan; Lauta, Kristian Cedervall; Maas, Matthijs Michiel—Governing Boring Apocalypses: A new typology of existential vulnerabilities and exposures for existential risk research − 2018-03-26 - https://www.sciencedirect.com/science/article/pii/S0016328717301623

Liu, Qiang; Li, Pan; Zhao, Wen­tao; Cai, Wei; Yu, Shui; Le­ung, Vic­tor—A Sur­vey on Se­cu­rity Threats and Defen­sive Tech­niques of Ma­chine Learn­ing: A Data Driven View − 2018-02-13 - https://​​iee­ex­plore.ieee.org/​​doc­u­ment/​​8290925

Liu, Yang; Price, Huw—Ram­sey and Joyce on de­liber­a­tion and pre­dic­tion − 2018-08-30 - http://​​philsci-archive.pitt.edu/​​14972/​​

Lüt­jens, Björn; Everett, Michael; How, Jonathan - Safe Re­in­force­ment Learn­ing with Model Uncer­tainty Es­ti­mates − 2018-10-19 - https://​​arxiv.org/​​abs/​​1810.08700

Mal­inin, An­drey; Gales, Mark—Pre­dic­tive Uncer­tainty Es­ti­ma­tion via Prior Net­works − 2018-10-08 - https://​​arxiv.org/​​abs/​​1802.10501

Manheim, David; Garrabrant, Scott—Categorizing Variants of Goodhart’s Law − 2018-04-10 - https://arxiv.org/abs/1803.04585

Martinez-Plumed, Fer­nando; Loe, Bao Sheng; Flach, Peter; Ó hÉigeartaigh, Seán; Vold, Ka­rina; Her­nan­dez-Orallo, Jose—The Facets of Ar­tifi­cial In­tel­li­gence: A Frame­work to Track the Evolu­tion of AI − 2018-08-21 - https://​​www.ij­cai.org/​​pro­ceed­ings/​​2018/​​0718.pdf

McCaslin, Te­gan—Trans­mit­ting fibers in the brain: To­tal length and dis­tri­bu­tion of lengths − 2018-03-29 - https://​​aiim­pacts.org/​​trans­mit­ting-fibers-in-the-brain-to­tal-length-and-dis­tri­bu­tion-of-lengths/​​

Menda, Ku­nal; Driggs-Camp­bell, Kather­ine; Kochen­derfer, Mykel—Ensem­bleDAg­ger: A Bayesian Ap­proach to Safe Imi­ta­tion Learn­ing − 2018-07-22 - https://​​arxiv.org/​​abs/​​1807.08364

Brundage, Miles; Avin, Shahar; Clark, Jack; Toner, Helen; Eckersley, Peter; Garfinkel, Ben; Dafoe, Allan; Scharre, Paul; Zeitzoff, Thomas; Filar, Bobby; Anderson, Hyrum; Roff, Heather; Allen, Gregory C.; Steinhardt, Jacob; Flynn, Carrick; Ó hÉigeartaigh, Seán; Beard, Simon; Belfield, Haydn; Farquhar, Sebastian; Lyle, Clare; Crootof, Rebecca; Evans, Owain; Page, Michael; Bryson, Joanna; Yampolskiy, Roman; Amodei, Dario—The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation − 2018-02-20 - https://arxiv.org/abs/1802.07228

Milli, Smitha; Sch­midt, Lud­wig; Dra­gan, Anca; Hardt, Moritz—Model Re­con­struc­tion from Model Ex­pla­na­tions − 2018-07-13 - https://​arxiv.org/​abs/​1807.05185

Min­der­mann, Soren; Shah, Ro­hin; Gleave, Adam; Had­field-Menell, Dy­lan—Ac­tive In­verse Re­ward De­sign − 2018-11-16 - https://​​arxiv.org/​​abs/​​1809.03060

Mo­gensen, An­dreas—Long-ter­mism for risk averse al­tru­ists − 1900-01-00 - https://​​unioxford­nexus-my.share­point.com/​​per­sonal/​​exet1753_ox_ac_uk/​​_lay­outs/​​15/​​onedrive.aspx?id=%2Fper­sonal%2Fexet1753%5Fox%5Fac%5Fuk%2FDoc­u­ments%2FGlobal%20Pri­ori­ties%20In­sti­tute%2FOper­a­tions%2FWeb­site%2FWork­ing%20pa­pers%2FLongter­mism%20and%20risk%20aver­sion%20v3%2Epdf&par­ent=%2Fper­sonal%2Fexet1753%5Fox%5Fac%5Fuk%2FDoc­u­ments%2FGlobal%20Pri­ori­ties%20In­sti­tute%2FOper­a­tions%2FWeb­site%2FWork­ing%20pa­pers&slrid=10daaa9e-b098-7000-a41a-599fb32c6ff4

Ngo, Richard; Pace, Ben—Some cruxes on im­pact­ful al­ter­na­tives to AI policy work − 2018-10-10 - https://​​www.less­wrong.com/​​posts/​​DJB82jKwgJE5NsWgT/​​some-cruxes-on-im­pact­ful-al­ter­na­tives-to-ai-policy-work

Nooth­igattu, Ritesh; Bouneffouf, Djallel; Mat­tei, Ni­cholas; Chan­dra, Ra­chita; Madan, Piyush; Varsh­ney, Kush; Camp­bell, Mur­ray; Singh, Mon­in­der; Rossi, Francesca - In­ter­pretable Multi-Ob­jec­tive Re­in­force­ment Learn­ing through Policy Orches­tra­tion − 2018-09-21 - https://​​arxiv.org/​​abs/​​1809.08343

Nushi, Besmira; Ka­mar, Ece; Horvitz, Eric—Towards Ac­countable AI: Hy­brid Hu­man-Ma­chine Analy­ses for Char­ac­ter­iz­ing Sys­tem Failure − 2018-09-19 - https://​​arxiv.org/​​abs/​​1809.07424

Oester­held, Cas­par—Ap­proval-di­rected agency and the de­ci­sion the­ory of New­comb-like prob­lems − 2017-12-21 - https://​​cas­paroester­held.files.word­press.com/​​2017/​​12/​​rldt.pdf

OpenAI—OpenAI Char­ter − 2018-04-09 - https://​​blog.ope­nai.com/​​ope­nai-char­ter/​​

Ortega, Pe­dro; Maini, Vishal; Safety Team, Deep­mind—Build­ing safe ar­tifi­cial in­tel­li­gence: speci­fi­ca­tion, ro­bust­ness and as­surance − 2018-09-27 - https://​​medium.com/​​@deep­mind­safe­tyre­search/​​build­ing-safe-ar­tifi­cial-in­tel­li­gence-52f5f75058f1

Paper­not, Ni­co­las; McDaniel, Pa­trick—Deep k-Near­est Neigh­bors: Towards Con­fi­dent, In­ter­pretable and Ro­bust Deep Learn­ing − 2018-03-13 - https://​​arxiv.org/​​pdf/​​1803.04765.pdf

Raghu­nathan, Aditi; Stein­hardt, Ja­cob; Liang, Percy—Cer­tified Defenses Against Ad­ver­sar­ial Ex­am­ples − 2018-01-29 - https://​​arxiv.org/​​abs/​​1801.09344

Rainforth, Tom; Kosiorek, Adam; Le, Tuan Anh; Maddison, Chris; Igl, Maximilian; Wood, Frank; Teh, Yee Whye—Tighter Variational Bounds are Not Necessarily Better − 2018-06-25 - https://arxiv.org/abs/1802.04537

Ratner, Ellis; Hadfield-Menell, Dylan; Dragan, Anca—Simplifying Reward Design through Divide-and-Conquer − 2018-06-07 - https://arxiv.org/abs/1806.02501

Reddy, Siddharth; Dragan, Anca; Levine, Sergey—Shared Autonomy via Deep Reinforcement Learning − 2018-05-23 - https://arxiv.org/abs/1802.01744

Reddy, Siddharth; Dragan, Anca; Levine, Sergey—Where Do You Think You’re Going?: Inferring Beliefs about Dynamics from Behaviour − 2018-10-20 - https://arxiv.org/abs/1805.08010

Rees, Martin—On The Future − 2018-10-16 - https://www.amazon.com/Future-Prospects-Humanity-Martin-Rees-ebook/dp/B07CSD5BG9

rk; Sempere, Nuno—AI development incentive gradients are not uniformly terrible − 2018-11-12 - https://www.lesswrong.com/posts/bkG4qj9BFEkNva3EX/ai-development-incentive-gradients-are-not-uniformly

Ruan, Wenjie; Huang, Xiaowei; Kwiatkowska, Marta—Reachability Analysis of Deep Neural Networks with Provable Guarantees − 2018-05-06 - https://arxiv.org/abs/1805.02242

Sadigh, Dorsa; Sastry, Shankar; Seshia, Sanjit; Dragan, Anca—Planning for Autonomous Cars that Leverage Effects on Human Actions − 2016-06-01 - https://people.eecs.berkeley.edu/~sastry/pubs/Pdfs%20of%202016/SadighPlanning2016.pdf

Sandberg, Anders—Human Extinction from Natural Hazard Events − 2018-02-01 - http://oxfordre.com/naturalhazardscience/view/10.1093/acrefore/9780199389407.001.0001/acrefore-9780199389407-e-293

Sarma, Gopal; Hay, Nick—Mammalian Value Systems − 2017-12-31 - https://arxiv.org/abs/1607.08289

Sarma, Gopal; Hay, Nick—Robust Computer Algebra, Theorem Proving, and Oracle AI − 2017-12-31 - https://arxiv.org/abs/1708.02553

Sarma, Gopal; Hay, Nick; Safron, Adam—AI Safety and Reproducibility: Establishing Robust Foundations for the Neuropsychology of Human Values − 2018-09-08 - https://arxiv.org/abs/1712.04307

Schulze, Sebastian; Evans, Owain—Active Reinforcement Learning with Monte-Carlo Tree Search − 2018-03-13 - https://arxiv.org/abs/1803.04926

Shah, Rohin—AI Alignment Newsletter − ongoing - https://rohinshah.com/alignment-newsletter/

Shah, Rohin; Christiano, Paul; Armstrong, Stuart; Steinhardt, Jacob; Evans, Owain—Value Learning Sequence − 2018-10-29 - https://www.lesswrong.com/s/Rm6oQRJJmhGCcLvxh

Avin, Shahar—Mavericks and Lotteries − 2018-09-25 - http://philsci-archive.pitt.edu/15058/

Avin, Shahar; Shapira, Shai—Civ V AI Mod − 2018-01-05 - https://www.cser.ac.uk/news/civilization-v-video-game-mod-superintelligent-ai/

Shaw, Nolan P.; Stöckel, Andreas; Orr, Ryan W.; Lidbetter, Thomas F.; Cohen, Robin—Towards Provably Moral AI Agents in Bottom-up Learning Frameworks − 2018-03-15 - http://www.aies-conference.com/wp-content/papers/main/AIES_2018_paper_8.pdf

Sotala, Kaj—Shaping economic incentives for collaborative AGI − 2018-06-29 - https://www.lesswrong.com/posts/FkZCM4DMprtEp568s/shaping-economic-incentives-for-collaborative-agi

Sotala, Kaj; Gloor, Lukas—Superintelligence as a Cause or Cure for Risks of Astronomical Suffering − 2017-08-31 - http://www.informatica.si/index.php/informatica/article/view/1877

Stuhlmüller, Andreas—Factored Cognition − 2018-04-25 - https://ought.org/presentations/factored-cognition-2018-05

Taylor, Jessica; Gallagher, Jack; Maltinsky, Baeo—Insight-based AI timeline model − n.d. - http://mediangroup.org/insights

The Future of Life Institute—Value Alignment Research Landscape − n.d. - https://futureoflife.org/valuealignmentmap/

Trammell, Philip—Fixed-Point Solutions to the Regress Problem in Normative Uncertainty − 2018-08-29 - https://unioxfordnexus-my.sharepoint.com/personal/exet1753_ox_ac_uk/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fexet1753%5Fox%5Fac%5Fuk%2FDocuments%2FGlobal%20Priorities%20Institute%2FOperations%2FWebsite%2FWorking%20papers%2Fdecision%5Ftheory%5Fregress%2Epdf&parent=%2Fpersonal%2Fexet1753%5Fox%5Fac%5Fuk%2FDocuments%2FGlobal%20Priorities%20Institute%2FOperations%2FWebsite%2FWorking%20papers&slrid=14daaa9e-3069-7000-a41a-5aa6302f7c36

Tucker, Aaron; Gleave, Adam; Russell, Stuart—Inverse Reinforcement Learning for Video Games − 2018-10-24 - https://arxiv.org/abs/1810.10593

Turchin, Alexey—Could slaughterbots wipe out humanity? Assessment of the global catastrophic risk posed by autonomous weapons − 2018-03-19 - https://philpapers.org/rec/TURCSW

Turchin, Alexey; Denkenberger, David—Classification of Global Catastrophic Risks Connected with Artificial Intelligence − 2018-05-03 - https://link.springer.com/article/10.1007/s00146-018-0845-5

Turner, Alex—Towards a New Impact Measure − 2018-09-18 - https://www.alignmentforum.org/posts/yEa7kwoMpsBgaBCgb/towards-a-new-impact-measure

Umbrello, Steven; Baum, Seth—Evaluating Future Nanotechnology: The Net Societal Impacts of Atomically Precise Manufacturing − 2018-04-30 - https://www.researchgate.net/publication/324715437_Evaluating_Future_Nanotechnology_The_Net_Societal_Impacts_of_Atomically_Precise_Manufacturing

Conitzer, Vincent; Sinnott-Armstrong, Walter; Borg, Jana Schaich; Deng, Yuan; Kramer, Max—Moral Decision Making Frameworks for Artificial Intelligence − 2017-02-12 - https://users.cs.duke.edu/~conitzer/moralAAAI17.pdf

Wang, Xin; Chen, Wenhu; Wang, Yuan-Fang; Wang, William Yang—No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling − 2018-07-09 - https://arxiv.org/abs/1804.09160

Wu, Yi; Srivastava, Siddharth; Hay, Nicholas; Du, Simon; Russell, Stuart—Discrete-Continuous Mixtures in Probabilistic Programming: Generalised Semantics and Inference Algorithms − 2018-06-13 - https://arxiv.org/abs/1806.02027

Wu, Yueh-Hua; Lin, Shou-De—A Low-Cost Ethics Shaping Approach for Designing Reinforcement Learning Agents − 2018-09-10 - https://arxiv.org/abs/1712.04172

Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Lesser, Victor; Yang, Qiang—Building Ethics into Artificial Intelligence − 2018-07-13 - http://www.ntulily.org/wp-content/uploads/conference/Building_Ethics_into_Artificial_Intelligence_accepted.pdf

Yudkowsky, Eliezer—The Rocket Alignment Problem − 2018-10-03 - https://intelligence.org/2018/10/03/rocket-alignment/

Yudkowsky, Eliezer; Christiano, Paul—Challenges to Christiano’s Capability Amplification Proposal − 2018-05-19 - https://www.lesswrong.com/posts/S7csET9CgBtpi7sCh/challenges-to-christiano-s-capability-amplification-proposal

Zhou, Allan; Hadfield-Menell, Dylan; Nagabandi, Anusha; Dragan, Anca—Expressive Robot Motion Timing − 2018-02-05 -