2017 AI Safety Literature Review and Charity Comparison

Summary: I review a significant amount of 2017 research related to AI Safety and offer some comments about where I am going to donate this year. Cross-posted from here upon request.

Contents

Introduction

The Machine Intelligence Research Institute (MIRI)

The Future of Humanity Institute (FHI)

Global Catastrophic Risks Institute (GCRI)

The Center for the Study of Existential Risk (CSER)

AI Impacts

Center for Human-Compatible AI (CFHCA)

Other related organisations

Related Work by other parties

Other major developments this year

Conclusion

Disclosures

Bibliography

Introduction

Like last year, I've attempted to review the research that has been produced by various organisations working on AI safety, to help potential donors gain a better understanding of the landscape. This is similar to the role GiveWell performs for global health charities, and somewhat similar to that of a securities analyst with regards to possible investments. To my knowledge, once again no-one else has attempted to do this, so I've undertaken the task again. While I've been able to work significantly more efficiently on this than last year, I have unfortunately been very busy with my day job, which has dramatically reduced the amount of time I've been able to dedicate.

My aim is basically to judge the output of each organisation in 2017 and compare it to their budget. This should give a sense of the organisations' average cost-effectiveness. Then we can consider factors that might increase or decrease the marginal cost-effectiveness going forward. We focus on organisations, not researchers.

Judg­ing or­gani­sa­tions on their his­tor­i­cal out­put is nat­u­rally go­ing to favour more ma­ture or­gani­sa­tions. A new startup, whose value all lies in the fu­ture, will be dis­ad­van­taged. How­ever, I think that this is cor­rect. The newer the or­gani­sa­tion, the more fund­ing should come from peo­ple with close knowl­edge. As or­gani­sa­tions ma­ture, and have more eas­ily ver­ifi­able sig­nals of qual­ity, their fund­ing sources can tran­si­tion to larger pools of less ex­pert money. This is how it works for star­tups turn­ing into pub­lic com­pa­nies and I think the same model ap­plies here.

This judgement involves analysing a large number of papers relating to Xrisk that were produced during 2017. Hopefully the year-to-year volatility of output is sufficiently low that this is a reasonable metric. I also attempted to include papers from December 2016, to take into account the fact that I'm missing the last month's worth of output from 2017, but I can't be sure I did this successfully.

This ar­ti­cle fo­cuses on AI risk work. If you think other causes are im­por­tant too, your pri­ori­ties might differ. This par­tic­u­larly af­fects GCRI and CSER, who both do a lot of work on other is­sues.

We focus virtually exclusively on papers, rather than outreach or other activities. This is partly because papers are much easier to measure (while there has been a large increase in interest in AI safety over the last year, it's hard to work out who to credit for this), and partly because I think progress has to come by persuading AI researchers, which I think comes through technical outreach and publishing good work, not popular/political work.

My impression is that policy on technical subjects (as opposed to issues that attract strong views from the general population) is generally made by the government and civil servants in consultation with, and being lobbied by, outside experts and interests. Without expert (e.g. top ML researchers at Google, CMU & Baidu) consensus, no useful policy will be enacted. Pushing directly for policy seems if anything likely to hinder expert consensus. Attempts to directly influence the government to regulate AI research seem very adversarial, and risk being pattern-matched to ignorant opposition to GM foods or nuclear power. We don't want the 'us-vs-them' situation that has occurred with climate change to happen here. AI researchers who are dismissive of safety law, regarding it as an imposition and encumbrance to be endured or evaded, will probably be harder to convince of the need to voluntarily be extra-safe—especially as the regulations may actually be totally ineffective. The only case I can think of where scientists are relatively happy about punitive safety regulations, nuclear power, is one where many of those initially concerned were scientists themselves. Given this, I actually think policy outreach to the general population is probably negative in expectation.

The good news on outreach this year is that we haven't had any truly terrible publicity that I can remember. I nonetheless urge organisations to remember that the personal activities of their employees, especially senior ones, reflect on the organisations themselves, so they should take care not to act/speak in ways that are offensive to those outside their bubble, and to avoid hiring crazy people.

Part of my motivation for writing this is to help more people become informed about the AI safety landscape so they can contribute better with both direct work and donations. With regard to donations, at present Nick Beckstead, in his role as both Fund Manager of the Long-Term Future Fund and officer with the Open Philanthropy Project, is probably the most important financier of this work. He is also probably significantly more informed on the subject than me, but I think it's important that the vitality of the field doesn't depend on a single person, even if that person is awesome.

The Machine Intelligence Research Institute (MIRI)

MIRI is the largest pure-play AI existential risk group. Based in Berkeley, it focuses on mathematics research that is unlikely to be produced by academics, trying to build the foundations for the development of safe AIs.

Their agent foundations work is basically trying to develop the correct way of thinking about agents and learning/decision making, by spotting areas where our current models fail and seeking to improve them. Much of their work this year seems to involve trying to address self-reference in some way—how can we design, or even just model, agents that are smart enough to think about themselves? This work is technical, abstract, and requires a considerable belief in their long-term vision, as it is rarely locally applicable, so it is hard to independently judge its quality.

In 2016 they announced they were somewhat pivoting towards work that tied in closer to the ML literature, a move I thought was a mistake. However, looking at their published research or their 2017 review page, in practice this seems to have been less of a change of direction than I had thought, as most of their work appears to remain highly differentiated and irreplaceable agent-foundations-type work—it seems unlikely that anyone not motivated by AI safety would produce it. Even within those concerned about friendly AI, few people not at MIRI would produce this work.

Critch's Toward Negotiable Reinforcement Learning: Shifting Priorities in Pareto Optimal Sequential Decision-Making (elsewhere titled 'Servant of Many Masters') is a neat paper. Basically it identifies the Pareto-efficient outcome if you have two agents with different beliefs who want to agree on a utility function for an AI, in a generalisation of Harsanyi's Cardinal welfare, individualistic ethics, and interpersonal comparisons of utility. The key assumption is that both want to use their current beliefs when they calculate the expected value of the deal to themselves, and the (surprising to me) conclusion is that over time the AI will have to weigh more and more heavily the values of the negotiator whose beliefs were more accurate. While I don't think this is necessarily Critch's interpretation, I take this as something of a reductio of the assumption. Surely if I were negotiating over a utility function, I would want the agent to learn about the world and use that knowledge to better promote my values … not to learn about the world, decide I was a moron with a bad world model, and ignore me thereafter? If I think the AI is/will be smarter than me, I should be happy for it to do things I'm unaware will benefit me, and avoid doing things I falsely believe will help me. On the other hand, if the parties are well-informed nation states rather than individuals, the prospect of 'getting one over' on the other might be helpful for avoiding arms races?
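
To make the reweighting dynamic concrete, here is a minimal toy sketch (my own construction, not Critch's formalism or code) in which two principals with different beliefs about a coin jointly specify an agent's objective, and each principal's weight is rescaled by how well their beliefs predicted the observed flips:

```python
# Toy illustration of belief-dependent reweighting: the principal whose
# world-model predicts observations better ends up dominating the agent's
# effective utility function. All numbers are invented for illustration.
import random

random.seed(0)

true_heads_prob = 0.7
belief_a, belief_b = 0.7, 0.4    # principal A is better calibrated than B
weight_a, weight_b = 0.5, 0.5    # both start with equal say

for _ in range(50):
    heads = random.random() < true_heads_prob
    # Each weight is multiplied by that principal's likelihood for the outcome.
    weight_a *= belief_a if heads else 1 - belief_a
    weight_b *= belief_b if heads else 1 - belief_b
    total = weight_a + weight_b
    weight_a, weight_b = weight_a / total, weight_b / total   # renormalise

print(f"after 50 observations: weight_a={weight_a:.3f}, weight_b={weight_b:.3f}")
# With these numbers weight_a usually ends up close to 1: the agent
# increasingly ignores the principal whose beliefs were worse, which is the
# dynamic questioned above.
```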

Kosoy's Optimal Polynomial-Time Estimators addresses a similar topic to the Logical Induction work—assigning 'probabilities' to logical/mathematical/deductive statements under computational limitations—but with a quite different approach to solving it. The work seems impressive, but I didn't really understand it. Inside his framework he can prove that various results from probability theory also apply to logical statements, which seems like what we'd want. (Note that technically this paper came out in December 2016, and so is included in this year's review rather than last year's.)

Carey's article, Incorrigibility in the CIRL Framework, is a response to Milli et al.'s Should Robots be Obedient and Hadfield-Menell's The Off-Switch Game. Carey basically argues that it's not necessarily the case that CIRL agents will be 'automatically' corrigible if the AI's beliefs about value are very wrong, for example due to incorrect parameterisation or assigning a zero prior to something that turns out to be the case. The discussion section has some interesting arguments, for example pointing out that an algorithm designed to shut itself off unless it had a track record of perfectly predicting what humans would want might still fail if its ontology was insufficient, so it couldn't even tell that it was disagreeing with the humans during training. I agree that value complexity and fragility might mean it's very likely that any AI's value model will be partially (and hence, for an AGI, catastrophically) mis-parameterised. However, I'm not sure how much the examples that take up much of the paper add to this argument. Milli's argument only holds when the AI can learn the parameters, and given that this paper assumes the humans choose the wrong action by accident less than 1% of the time, it seems that the AI should treat a shutdown command as very strong evidence … instead the AI seems to simply ignore it?
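
As a toy illustration of the zero-prior failure mode (my own example, not taken from Carey's paper), consider a Bayesian value learner whose hypothesis space excludes the true reward parameter; no number of shutdown commands can move the posterior onto it:

```python
# If the true parameter gets zero prior mass, Bayesian updating on human
# behaviour -- including repeated shutdown commands -- can never recover it,
# so the 'automatic' corrigibility argument breaks down. Numbers invented.
import numpy as np

hypotheses = np.array([0.0, 0.5, 1.0])   # candidate reward parameters
prior = np.array([0.5, 0.5, 0.0])        # the true value, 1.0, is ruled out a priori

def likelihood(observation, theta):
    # Humans command shutdown with prob 0.99 when theta is truly bad (1.0),
    # and with prob 0.01 otherwise (the <1% human-error rate assumed above).
    p_shutdown = 0.99 if theta == 1.0 else 0.01
    return p_shutdown if observation == "shutdown" else 1 - p_shutdown

posterior = prior.copy()
for _ in range(20):                      # twenty consecutive shutdown commands
    posterior *= np.array([likelihood("shutdown", t) for t in hypotheses])
    posterior /= posterior.sum()

print({float(h): round(float(p), 3) for h, p in zip(hypotheses, posterior)})
# The posterior on theta=1.0 stays exactly 0: the agent writes the shutdown
# commands off as human error rather than as evidence its values are wrong.
```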

Some of MIRI's publications this year seem mainly to be better explanations of previous work. For example, Garrabrant et al.'s A Formal Approach to the Problem of Logical Non-Omniscience seems to be basically an easier-to-understand version of last year's Logical Induction. Likewise Yudkowsky and Soares's Functional Decision Theory: A New Theory of Instrumental Rationality seems to be basically a new exposition of classic MIRI/LW decision theory work—see for example Soares et al.'s Toward Idealized Decision Theory. Similarly, I didn't feel like there was much new in Soares et al.'s Cheating Death in Damascus. Making things easier to understand is useful—and last year's Logical Induction paper was a little dense—but it's clearly not as impressive as inventing new things.

When I asked for top achievements for 2017, MIRI pointed me towards a lot of work they'd posted on agentfoundations.org as being one of their major achievements for the year, especially this, this and this, which pose and then solve a problem about how to find game-theoretic agents that can stably model each other, formulating it as a topological fixed point problem. There is also a lot of other work on agentfoundations that seems interesting, but I'm not entirely sure how to think about giving credit for it. These seem more like 'work in progress' than finished work—for most organisations I am only giving credit for the latter. MIRI could with some justification respond that the standard academic process is very inefficient, and part of their reason for existence is to do things that universities cannot. However, even if you de-prioritise peer review, I still think it is important to write things up into papers. Otherwise it is extremely hard for outsiders to evaluate—bad both for potential funders and for people wishing to enter the field. Unfortunately it is possible that, if they continue on this route, MIRI might produce a lot of valuable work that is increasingly illegible from the outside. So overall I consider these as evidence that MIRI is continuing to actually do research, but will wait until they're ArXived to actually review them. If you disagree with this approach, MIRI is going to look much more productive, with their research possibly accelerating in 2017 vs 2016. If you instead only look at published papers, 2017 appears to be something of a 'down year' after 2016.

Last year I was not keen to see that Eliezer was spend­ing a lot of time pro­duc­ing con­tent on Ar­bital as part of his job at MIRI, as there was a clear con­flict of in­ter­est—he was a sig­nifi­cant share­holder in Ar­bital, and ad­di­tion­ally I ex­pected Ar­bital to fail. Now that Ar­bital does seem to have in­deed failed, I’m pleased he seems to be spend­ing less time on it, but con­fused why he is spend­ing any time at all on it—though some of this seems to be cross-posted from el­se­where.

Eliezer's book Inadequate Equilibria, however, does seem to be high quality—basically another sequence—though it is only relevant inasmuch as AI safety might be one of many applications of the subject of the book. I also encourage readers to read this excellent article by Greg Lewis (FHI) on the other side.

I also enjoyed There's No Fire Alarm for Artificial General Intelligence, which, although accessible to the layman, I think provides a convincing case that even when AGI is imminent there would (or at least might) be no signal that this was the case, and his Socratic security dialogues on the mindset required to develop a secure AI.

I was sorry to hear Jes­sica Tay­lor left MIRI, as I thought she did good work.

MIRI spent roughly $1.9m in 2017, and aim to rapidly in­crease this to $3.5m in 2019, to fund new re­searchers and their new en­g­ineer­ing team.

The Open Philan­thropy Pro­ject awarded MIRI a $3.75m grant (over 3 years) ear­lier this year, largely be­cause one re­viewer was im­pressed with their work on Log­i­cal In­duc­tion. You may re­call this was a sig­nifi­cant part of why I en­dorsed MIRI last year. How­ever, as this re­view is fo­cused on work in the last twelve months, they don’t get credit for the same work two years run­ning! OPP have said they plan to fund roughly half of MIRI’s bud­get. On the pos­i­tive side, one might ar­gue this was es­sen­tially a 1:1 match on dona­tions to MIRI—but there are clearly game-the­o­retic prob­lems here. Ad­di­tion­ally, if you had faith in OpenPhil’s pro­cess, you might con­sider this a pos­i­tive sig­nal of MIRI qual­ity. On the other hand, if you think MIRI’s marginal cost-effec­tive­ness is diminish­ing over the multi-mil­lion dol­lar range, this might re­duce your es­ti­mate of the cost-effec­tive­ness of the marginal dol­lar.

There is also $1m of some­what plau­si­bly coun­ter­fac­tu­ally valid dona­tion match­ing available for MIRI (but not other AI Xrisk or­gani­sa­tions).

Finally, I will note that MIRI have been very generous with their time in helping me understand what they are doing.

The Future of Humanity Institute (FHI)

Oxford's FHI requested not to be included in this analysis, so I won't be making any comment on whether or not they are a good place to fund. Had they not declined (and depending on their funding situation) they would have been a strong candidate. This was disappointing to me, because they seem to have produced an impressive list of publications this year, including a lot of collaborations. I'll briefly note a couple of pieces of research they published this year, but regret not being able to give them better coverage.

Saunders et al. published Trial without Error: Towards Safe Reinforcement Learning via Human Intervention, a nice paper where they attempt to make a Reinforcement Learner that can 'safely' learn by training a catastrophe-recognition algorithm to oversee the training. It's a cute idea, and a nice use of the OpenAI Atari suite, though I was most impressed with the fact that they concluded that their approach would not scale (i.e. would not work). It's not often researchers publish negative results!
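
For readers unfamiliar with the setup, here is a minimal sketch of the core idea (my own gym-style pseudo-implementation, not the authors' code): a learned 'blocker' watches each proposed action during training and overrides anything it classifies as catastrophic:

```python
# Wrap the training environment so that a blocker (initially a human, later a
# classifier trained on the human's labels) can veto catastrophic actions.
class BlockedEnv:
    def __init__(self, env, blocker, safe_action):
        self.env = env               # assumed gym-style: reset() and step(action)
        self.blocker = blocker       # callable: (state, action) -> True if catastrophic
        self.safe_action = safe_action
        self.state = None

    def reset(self):
        self.state = self.env.reset()
        return self.state

    def step(self, action):
        if self.blocker(self.state, action):
            action = self.safe_action            # intervention: substitute a safe action
        self.state, reward, done, info = self.env.step(action)
        return self.state, reward, done, info

# Any ordinary RL agent is then trained on BlockedEnv instead of env. The
# paper's conclusion is that training the blocker itself takes too much human
# labelling to scale to harder environments.
```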

Honourable men­tion also goes to the very cool (but aren’t all his pa­pers?) Sand­berg et al. That is not dead which can eter­nal lie: the aes­ti­va­tion hy­poth­e­sis for re­solv­ing Fermi’s para­dox, which is rele­vant inas­much as it sug­gests that the Fermi Para­dox is not ac­tu­ally ev­i­dence against AI as an ex­is­ten­tial risk.

FHI’s Brundage Bot ap­par­ently reads ev­ery ML pa­per ever writ­ten.

Global Catastrophic Risks Institute (GCRI)

The Global Catastrophic Risks Institute is run by Seth Baum and Tony Barrett. They have produced work on a variety of existential risks, including non-AI risks. Some of this work seems quite valuable, especially Denkenberger's Feeding Everyone No Matter What on ensuring food supply in the event of disaster, and is probably of interest to the sort of person who would read this document. However, that work is off-topic for us here. Within AI they do a lot of work on the strategic landscape, and are very prolific.

Baum’s Sur­vey of Ar­tifi­cial Gen­eral In­tel­li­gence Pro­jects for Ethics, Risk, and Policy at­tempts to analyse all ex­ist­ing AGI re­search pro­jects. This is a huge pro­ject and I laud him for it. I don’t know how much here is news to peo­ple who are very plugged in, but to me at least it was very in­for­ma­tive. The one crit­i­cism I would have is it could do more to try to differ­en­ti­ate on ca­pac­ity/​cred­i­bil­ity—e.g. my im­pres­sion is Deep­mind is dra­mat­i­cally more ca­pa­ble than many of the smaller or­gani­sa­tions listed—but that is clearly a very difficult ask. It’s hard for me to judge the ac­cu­racy, but I didn’t no­tice any mis­takes (be­yond be­ing sur­prised that AIXI has an ‘un­speci­fied’ for safety en­gage­ment, given the amount of AI safety pa­pers com­ing out of ANU.)

Baum’s So­cial Choice Ethics in Ar­tifi­cial In­tel­li­gence ar­gues that value-learn­ing type ap­proaches to AI ethics (like CEV ) con­tain many de­grees of free­dom for the pro­gram­mers to fi­nesse it to pick their val­ues, mak­ing them no bet­ter than the pro­gram­mers sim­ply choos­ing an eth­i­cal sys­tem di­rectly. The pro­gram­mers can choose whose val­ues are used for learn­ing, how they are mea­sured, and how they are ag­gre­gated. Over­all I’m not fully con­vinced—for ex­am­ple, pace the ar­gu­ment on page 3, a Law of Large Num­bers ar­gu­ment could sup­port av­er­ag­ing many views to get at the true ethics even if we had no way of in­de­pen­dently ver­ify­ing the true ethics. And there is some irony that, for all the pa­per’s con­cern with bias risk, the left-wing views of the au­thor come through strongly. But de­spite these I liked the pa­per, es­pe­cially for the dis­cus­sion of who has stand­ing—some­thing that seems like it will need a philo­soph­i­cal solu­tion, rather than a ML one.
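
A quick stylised version of that Law of Large Numbers point (entirely my own toy example, with the obvious caveat that it assumes individual errors are unbiased and independent):

```python
# If each person's expressed view is the 'true ethics' plus independent noise,
# the average over many people concentrates on the truth even though no single
# view can be checked against it. Numbers are purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
true_value = 0.3                                         # unobservable 'true ethics' on some question
for n in (10, 1_000, 100_000):
    views = true_value + rng.normal(scale=1.0, size=n)   # noisy individual views
    print(f"n={n}: average view {views.mean():+.3f}")    # converges towards 0.3 as n grows
```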

Barrett's Value of Global Catastrophic Risk (GCR) Information: Cost-Effectiveness-Based Approach for GCR Reduction covers a lot of familiar ground, and then attempts some Monte Carlo cost-benefit analysis on a small number of interventions to help address nuclear war and comet impact. After putting so much thought into setting up the machinery, it would have been good to see analysis of a wider range of risks!
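
To give a flavour of the approach (all distributions and numbers below are invented for illustration and are not Barrett's), a Monte Carlo cost-effectiveness estimate might look like this:

```python
# Sample uncertain inputs, compute cost per unit of catastrophic risk removed,
# and report the spread of the resulting estimate.
import math
import random

random.seed(1)

def sample_cost_per_unit_risk():
    annual_risk = random.uniform(1e-4, 1e-2)         # baseline yearly catastrophe probability
    relative_reduction = random.uniform(0.01, 0.2)   # fraction of that risk the intervention removes
    cost = random.lognormvariate(math.log(1e7), 1.0) # intervention cost in dollars
    return cost / (annual_risk * relative_reduction)

samples = sorted(sample_cost_per_unit_risk() for _ in range(100_000))
print(f"approx. median cost per unit of risk removed: ${samples[50_000]:.2e}")
print(f"90% interval: ${samples[5_000]:.2e} to ${samples[95_000]:.2e}")
```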

Baum & Bar­rett pub­lished Global Catas­tro­phes: The Most Ex­treme Risks, which seems to be es­sen­tially a rea­son­ably well ar­gued gen­eral in­tro­duc­tion to the sub­ject of ex­is­ten­tial risks. Hope­fully peo­ple who bought the book for other rea­sons will read it and be­come con­vinced.

Baum & Bar­rett’s Towards an In­te­grated Assess­ment of Global Catas­trophic Risk is a similar in­tro­duc­tory piece on catas­trophic risks, but the venue—a col­lo­quium on catas­trophic risks—seems less use­ful, as peo­ple read­ing it are more likely to already be con­cerned about the sub­ject, and I don’t think it spends enough time on AI risk per se to con­vince those who were already wor­ried about Xrisk but not AI Xrisk.

Last year I was (and still am) impressed by their paper On the Promotion of Safe and Socially Beneficial Artificial Intelligence, which made insightful, convincing and actionable criticisms of 'AI arms race' language. I was less convinced by this year's Reconciliation Between Factions Focused on Near-Term and Long-Term Artificial Intelligence, which argues for a re-alignment away from near-term AI worriers vs long-term AI worriers, towards AI worriers vs non-worriers. However, I'm not sure why anyone would agree to this—long-term worriers don't currently spend much time arguing against short-term worries (even if you thought that AI discrimination arguments were Orwellian, why bother arguing about it?), and convincing short-term worriers to stop criticising long-term worries seems approximately as hard as simply convincing them to become long-term worriers.

GCRI spent ap­prox­i­mately $117k in 2017, which is shock­ingly low con­sid­er­ing their pro­duc­tivity. This was lower than 2016; ap­par­ently their grants from the US Dept. of Home­land Se­cu­rity came to an end.

The Center for the Study of Existential Risk (CSER)

CSER is an ex­is­ten­tial risk fo­cused group lo­cated in Cam­bridge. Like GCRI they do work on a va­ri­ety of is­sues, no­tably in­clud­ing Rees’ work on in­fras­truc­ture re­silience.

Last year I crit­i­cised them for not hav­ing pro­duced any on­line re­search over sev­eral years; they now have a sep­a­rate page that does list some but maybe not all of their re­search.

Liu, a CSER researcher, wrote The Sure-Thing Principle and P2 and was second author on Gaifman & Liu's A simpler and more realistic subjective decision theory, both on the mathematical foundations of Bayesian decision theory, which is a valuable topic for AI safety in general. Strangely, neither paper mentions CSER as a financial supporter or affiliation.

Liu and Price's Heart of DARCness argues that agents do not have credences for what they will do while deciding whether to do it—their confidence is temporarily undefined. I was not convinced—even if someone is deciding whether she's 75% confident or 50% confident, presumably there are some odds that determine which side of a bet she'd take if forced to choose? I'm also not sure of the direct link to AI safety.

They've also convened and attended workshops on AI and decision theory, notably the AI & Society Symposium in Japan, but in general I am wary of giving organisations credit for these, as they are too hard for the outside observer to judge, and ideally workshops lead to papers—in which case we can judge those.

CSER also did a significant amount of outreach, including presenting to the House of Lords, and apparently have expertise in Chinese outreach (multiple native Mandarin speakers), which could be important given China's AI research strength but cultural separation from the West.

They are un­der­tak­ing a novel pub­lic­ity effort that I won’t name as I’m not sure it’s pub­lic yet. In gen­eral I think most paths to suc­cess in­volve con­sen­sus-build­ing among main­stream ML re­searchers, and ‘pop­u­lar’ efforts risk harm­ing our cred­i­bil­ity, so I am not op­ti­mistic here.

Their annual budget is around $750,000, with, I estimate, a bit less than half going on AI risk. Apparently they need to raise funds to continue existing once their current grants run out in 2019.

AI Impacts

AI Im­pacts is a small group that does high-level strat­egy work, es­pe­cially on AI timelines, some­what as­so­ci­ated with MIRI.

They seem to have produced significantly more this year than last year. The main achievement is When will AI exceed Human Performance? Evidence from AI Experts, which gathered the opinions of hundreds of AI researchers on AI timelines questions. There were some pretty relevant takeaways, like that most researchers find the AI catastrophic risk argument somewhat plausible but doubt there is anything that can usefully be done in the short term, or that Asian researchers think human-level AI is significantly closer than Americans do. I think the value-prop here is twofold: firstly, providing a source of timeline estimates for when we make decisions that hinge on how long we have, and secondly, proving that concern about AI risk is a respectable, mainstream position. It was apparently one of the most discussed papers of 2017.

On a similar note they also have data on im­prove­ments in a num­ber of AI-re­lated bench­marks, like com­put­ing costs or al­gorith­mic progress.

John Salvatier (member of AI Impacts at the time) was also second author on Agent-Agnostic Human-in-the-Loop Reinforcement Learning, along with Evans (FHI, 4th author), which attempts to design an interface for reinforcement learning that abstracts away from the agent, so you could easily change the underlying agent.
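
A minimal sketch of what such an agent-agnostic interface might look like (the names and signatures are my own illustration, not the paper's actual API):

```python
# The human-in-the-loop machinery only talks to this small protocol, so the
# underlying learner can be swapped without touching the rest of the system.
from typing import Any, Protocol

class Agent(Protocol):
    def act(self, observation: Any) -> Any: ...
    def observe(self, observation: Any, action: Any, reward: float, done: bool) -> None: ...

def run_episode(env, agent: Agent, human_reward_fn):
    """Run one episode, routing the reward signal through a human feedback function."""
    obs, done, total = env.reset(), False, 0.0
    while not done:
        action = agent.act(obs)
        next_obs, env_reward, done, _ = env.step(action)        # gym-style env assumed
        reward = human_reward_fn(obs, action, env_reward)       # human-in-the-loop signal
        agent.observe(next_obs, action, reward, done)
        total += reward
        obs = next_obs
    return total
```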

AI Im­pacts’ bud­get is tiny com­pared to most of the other or­gani­sa­tions listed here; around $60k at pre­sent. In­cre­men­tal funds would ap­par­ently be spent on hiring more part-time re­searchers.

Center for Human-Compatible AI (CFHCA)

The Center for Human-Compatible AI, founded by Stuart Russell in Berkeley, launched in August 2016. As they are not looking for more funding at the moment I will only briefly survey some of their work on cooperative inverse reinforcement learning.

Hadfield-Menell et al.'s The Off-Switch Game is a nice paper that produces and formalises the (at least now that I've read it) very intuitive result that a value-learning AI might be corrigible (at least in some instances) because it takes the fact that a human pressed the off-switch as evidence that this is the best thing to do.
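
A toy numerical version of that intuition (my own, not the paper's model): when the robot is uncertain about the value of its proposed action and models the human as rational, letting the human decide weakly dominates acting unilaterally:

```python
# Compare three policies under uncertainty about the action's true utility U:
# act regardless, defer to a rational human (who only allows the action when
# U >= 0), or switch off immediately (utility 0). Numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
utilities = rng.normal(loc=0.2, scale=1.0, size=100_000)   # robot's belief over U

act_anyway     = utilities.mean()
defer_to_human = np.maximum(utilities, 0).mean()
switch_off     = 0.0

print(f"act: {act_anyway:.3f}  defer: {defer_to_human:.3f}  switch off: {switch_off:.1f}")
# Deferring wins, which is why the switch press gets treated as evidence that
# shutting down is best -- but the result leans on the robot staying uncertain
# and on modelling the human as (near-)rational.
```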

Milli et al.'s Should Robots be Obedient is in the same vein as Hadfield-Menell et al.'s Cooperative Inverse Reinforcement Learning (last year) on learning values from humans, specifically touching on whether such agents would be willing to obey a command to 'turn off', as per Soares's paper on Corrigibility. She does some interesting analysis of the trade-off between obedience and results in cases where humans are fallible.

In both cases I thought the pa­pers were thought­ful and had good anal­y­sis. How­ever, I don’t think ei­ther is con­vinc­ing in show­ing that cor­rigi­bil­ity comes ‘nat­u­rally’ - at least not the strength of cor­rigi­bil­ity we need.

I en­courage them to keep their web­site more up-to-date.

Over­all I think their re­search is good and their team promis­ing. How­ever, ap­par­ently they have enough fund­ing for now, so I won’t be donat­ing this year. If this changed and they re­quested in­cre­men­tal cap­i­tal I could cer­tainly imag­ine fund­ing them in fu­ture years.

Other related organisations

The Center for Applied Rationality (CFAR) works on trying to improve human rationality, especially with the aim of helping with AI Xrisk efforts.

The Fu­ture of Life In­sti­tute (FLI) ran a huge grant-mak­ing pro­gram to try to seed the field of AI safety re­search. There definitely seem to be a lot more aca­demics work­ing on the prob­lem now, but it’s hard to tell how much to at­tribute to FLI.

Eighty Thou­sand Hours (80K) provide ca­reer ad­vice, with AI safety be­ing one of their key cause ar­eas.

Related Work by other parties

Deep Reinforcement Learning from Human Preferences was possibly my favourite paper of the year, which perhaps shouldn't come as a surprise, given that two of the authors (Christiano and Amodei from OpenAI) were authors on last year's Concrete Problems in AI Safety. It applies ideas on bootstrapping that Christiano has been discussing for a while—getting humans to train an AI which then trains another AI etc. The model performs significantly better than I would have expected, and as ever I'm pleased to see OpenAI-Deepmind collaboration.
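
The core training signal is a reward model fitted to human comparisons of pairs of trajectory segments. Here is a stripped-down sketch of that idea (a linear-features toy of my own, not the paper's neural-network implementation):

```python
# Fit a reward model from pairwise human preferences using the Bradley-Terry /
# logistic loss; an ordinary RL algorithm would then be trained on the learned
# reward instead of a hand-coded one. Features and the 'human' are simulated.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(4)                                   # reward-model parameters

def update(w, feat_a, feat_b, human_prefers_a, lr=0.1):
    # P(human prefers A) is modelled as sigmoid(return_A - return_B).
    diff = feat_a @ w - feat_b @ w
    p_a = 1.0 / (1.0 + np.exp(-diff))
    grad = (p_a - (1.0 if human_prefers_a else 0.0)) * (feat_a - feat_b)
    return w - lr * grad

# Simulated human who prefers segments with a larger first feature:
for _ in range(2000):
    feat_a, feat_b = rng.normal(size=4), rng.normal(size=4)
    w = update(w, feat_a, feat_b, human_prefers_a=feat_a[0] > feat_b[0])

print(w.round(2))   # the weight on feature 0 dominates, i.e. the model has
                    # recovered what the 'human' cares about from comparisons alone
```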

Christiano continues to produce very interesting content on his blog, like this post on Corrigibility. When I first read his articles about how to bootstrap safety through iterative training procedures, my reaction was that, while this seemed an interesting idea, it didn't seem to have much in common with mainstream ML. However, there do seem to be a bunch of practical papers about imitation learning now. I'm not sure if this was always the case and I was just ignorant, or if they have become more prominent in the last year. Either way, I have updated towards considering this approach to be a promising one for integrating safety into mainstream ML work. He has also written a nice blog post explaining how AlphaZero works, and arguing that this supports his enhancement ideas.

It was also nice to see ~95 pa­pers that were ad­dress­ing Amodei et al’s call in last year’s Con­crete Prob­lems.

Menda et al.'s DropoutDAgger paper on safe exploration seems to fit in this category. Basically they come up with a form of imitation learning where the AI being trained can explore a bit, but isn't allowed to stray too far from the expert policy—though I'm not sure why they always have the learner explore in the direction it thinks is best, rather than assigning some weight to its uncertainty of outcome, explore-exploit-style. I'm not sure how much credit Amodei et al. can get for inspiring this though, as it seems to be (to a significant degree) an extension of Zhang and Cho's Query-Efficient Imitation Learning for End-to-End Autonomous Driving.
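
A schematic of the gating idea as I read it (simplified, with an invented threshold; not the authors' algorithm in detail):

```python
# Let the novice act only when its proposal stays close to the expert's;
# otherwise fall back to the expert, so exploration never strays far from the
# expert policy.
import numpy as np

def choose_action(novice_samples, expert_action, threshold=0.5):
    """novice_samples: actions from several stochastic (e.g. dropout) forward passes."""
    novice_mean = novice_samples.mean(axis=0)
    divergence = np.linalg.norm(novice_mean - expert_action)
    if divergence > threshold:
        return expert_action, "expert"   # too far from the expert: stay safe
    return novice_mean, "novice"         # close enough: let the learner explore

# Example call with made-up numbers:
samples = np.array([[0.1, 0.9], [0.2, 1.1], [0.15, 1.0]])
print(choose_action(samples, expert_action=np.array([0.0, 1.0])))
```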

How­ever, I don’t want to give too much credit for work that im­proves ‘lo­cal’ safety that doesn’t also ad­dress the big prob­lems in AI safety, be­cause this work prob­a­bly ac­cel­er­ates un­safe hu­man-level AI. There are many pa­pers in this cat­e­gory, but for ob­vi­ous rea­sons I won’t call them out.

Gans's Self-Regulating Artificial General Intelligence contains some nice economic formalism around AIs seizing power from humans, and raises the interesting argument that if you need specialist AIs to achieve things, the first human-level AIs might not exhibit takeoff behaviour because they would be unable to sufficiently trust the power-seizing agents they would need to create. I'm sceptical that this assumption about the need for specialised AIs holds—surely even if you need to make separate AI agents for different tasks, rather than integrating them, it would suffice to give them specialised capabilities but the same goals. Regardless, the paper does suggest the interesting possibility that humanity might make an AI which is intelligent enough to realise it cannot solve the alignment problem to safely self-improve… and hence progress stops there—though of course this would not be something to rely on.

MacFie’s Plau­si­bil­ity and Prob­a­bil­ity in De­duc­tive Rea­son­ing also ad­dresses the is­sue of how to as­sign prob­a­bil­ities to log­i­cal state­ments, in a similar vein to much MIRI re­search.

Vam­plew et al’s Hu­man-al­igned ar­tifi­cial in­tel­li­gence is a mul­ti­ob­jec­tive prob­lem ar­gues that we should con­sider a broader class of func­tions than lin­ear sums when com­bin­ing util­ity func­tions.

Google Deep­mind con­tinue to churn out im­pres­sive re­search, some of which seems rele­vant to the prob­lem, like Sune­hag et al’s Value-De­com­po­si­tion Net­works For Co­op­er­a­tive Multi-Agent Learn­ing and Danihelka, et al’s Com­par­i­son of Max­i­mum Like­li­hood and GAN-based train­ing of Real NVPs on avoid­ing overfit­ting.

In terms of predicting AI timelines, another piece I found interesting was Sun et al.'s Revisiting the Unreasonable Effectiveness of Data, which argued that, for vision tasks at least, performance improves logarithmically in sample size.
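
As a rough illustration of what logarithmic scaling implies (the coefficients below are invented, not taken from the paper), each order-of-magnitude increase in data buys only a roughly constant increment in performance:

```python
# Hypothetical fit: accuracy = a * log10(n) + b. Gains flatten quickly as the
# dataset grows, which matters when extrapolating progress from data alone.
import math

a, b = 2.5, 40.0
for n in (10**6, 10**7, 10**8, 10**9):
    print(f"n={n:.0e}: predicted accuracy {a * math.log10(n) + b:.1f}%")
```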

The Fore­sight In­sti­tute pub­lished a white pa­per on the gen­eral sub­ject of AI policy and risk.

Stanford's One Hundred Year Study on Artificial Intelligence produced an AI Index report, which is basically a report on progress in the field up to 2016. Interestingly, various metrics they tracked, summarised in their 'Vibrancy' metric, suggest that the field actually regressed in 2016, though my experience with similar data in the financial world leaves me rather sceptical of such methodology. Unfortunately the report dedicated only a single word to the subject of AI safety.

On a lighter note, the es­teemed G.K. Ch­ester­ton re­turned from be­yond the grave to eviscer­ate an AI risk doubter, and a group of re­searchers (some FHI) proved that it is im­pos­si­ble to cre­ate a ma­chine larger than a hu­man, so that’s a re­lief.

Other major developments this year

Google’s Deep­mind pro­duced AlphaZero, which learnt how to beat the best AIs (and hence also the best hu­mans) at Go, Chess and Shogi with just a few hours of self-play.

Creation of the EA Funds, including the Long-Term Future Fund, run by Nick Beckstead, which has made one smallish grant related to AI Safety and conserved the other 96% of its funds.

The Open Philan­thropy Pro­ject funded both MIRI and OpenAI (ac­quiring a board seat in the pro­cess with the lat­ter).

Nvidia (who make GPUs used for ML) saw their share price approximately double, after quadrupling last year.

Hillary Clinton was possibly concerned about AI risk? But unfortunately Putin seems to have less helpful concerns about an AI arms race… namely ensuring that he wins it. And China announced a national plan for AI with Chinese characteristics—but bear in mind they have failed at these before, like their push into semiconductors, though companies like Baidu do seem to be doing impressive research.

There were some pa­pers sug­gest­ing the repli­ca­tion crisis may be com­ing to ML?

Conclusion

In some ways this has been a great year. My im­pres­sion is that the cause of AI safety has be­come in­creas­ingly main­stream, with a lot of re­searchers un­af­fili­ated with the above or­gani­sa­tions work­ing at least tan­gen­tially on it.

However, it's tough from the point of view of an external donor. Some of the organisations doing the best work are well funded. Others (MIRI) seem to be doing a lot of good work, but (perhaps necessarily) it is significantly harder for outsiders to judge than last year, as there doesn't seem to be a really heavy-hitting paper like there was last year. I see MIRI's work as being a long-shot bet that their specific view of the strategic landscape is correct, but given this they're basically irreplaceable. GCRI and CSER's work is more mainstream in this regard, but GCRI's productivity is especially noteworthy, given the order-of-magnitude difference in budget size.

As I have once again failed to re­duce char­ity se­lec­tion to a sci­ence, I’ve in­stead at­tempted to sub­jec­tively weigh the pro­duc­tivity of the differ­ent or­gani­sa­tions against the re­sources they used to gen­er­ate that out­put, and donate ac­cord­ingly.

My constant wish is to promote a lively intellect and independent decision-making among my readers; hopefully my laying out the facts as I see them above will prove helpful to some. Here is my eventual decision, rot13'd so you can come to your own conclusions first if you wish:

Sig­nifi­cant dona­tions to the Ma­chine In­tel­li­gence Re­search In­sti­tute and the Global Catas­trophic Risks In­sti­tute. A much smaller one to AI Im­pacts.

However, I wish to emphasise that all the above organisations seem to be doing good work on the most important issue facing mankind. It is the nature of making decisions under scarcity that we must prioritise some over others, and I hope that all organisations will understand that this necessarily involves negative comparisons at times.

Thanks for read­ing this far; hope­fully you found it use­ful. Some­one sug­gested that, in­stead of do­ing this an­nu­ally, I should in­stead make a blog where I provide some anal­y­sis of AI-risk re­lated events as they oc­cur. Pre­sum­ably there would still be an an­nual giv­ing-sea­son writeup like this one. If you’d find this use­ful, please let me know.

Disclosures

I was a Sum­mer Fel­low at MIRI back when it was SIAI, vol­un­teered very briefly at GWWC (part of CEA) and once ap­plied for a job at FHI. I am per­sonal friends with peo­ple at MIRI, FHI, CSER, CFHCA and AI Im­pacts but not GCRI (so if you’re wor­ried about bias you should over­weight them… though it also means I have less di­rect knowl­edge). How­ever I have no fi­nan­cial ties be­yond be­ing a donor and have never been ro­man­ti­cally in­volved with any­one who has ever been at any of the or­gani­sa­tions.

I shared a draft of the relevant sections of this document with representatives of MIRI, CSER, GCRI and AI Impacts. I'm very grateful to Alex Flint and Jess Riedel for helping review a draft of this document. Any remaining inadequacies and mistakes are my own.

Edited 2017-12-21: Spelling mistakes, corrected Amodei's affiliation.

Edited 2017-12-24: Minor correction to CSER numbers.

Bibliography

Adam D. Cobb, An­drew Markham, Stephen J. Roberts; Learn­ing from li­ons: in­fer­ring the util­ity of agents from their tra­jec­to­ries; https://​​arxiv.org/​​abs/​​1709.02357
Alexei An­dreev; What’s up with Ar­bital; http://​​less­wrong.com/​​r/​​dis­cus­sion/​​lw/​​otq/​​whats_up_with_ar­bital/​​
Alli­son Duettmann; Ar­tifi­cial Gen­eral In­tel­li­gence: Timeframes & Policy White Paper; https://​​fore­sight.org/​​pub­li­ca­tions/​​AGI-Timeframes&Poli­cyWhitePaper.pdf
An­ders Sand­berg, Stu­art Arm­strong, Milan Cirkovic; That is not dead which can eter­nal lie: the aes­ti­va­tion hy­poth­e­sis for re­solv­ing Fermi’s para­dox; https://​​arxiv.org/​​pdf/​​1705.03394.pdf
An­drew Critch, Stu­art Rus­sell; Ser­vant of Many Masters: Shift­ing pri­ori­ties in Pareto-op­ti­mal se­quen­tial de­ci­sion-mak­ing; https://​​arxiv.org/​​abs/​​1711.00363
Andrew Critch; Toward Negotiable Reinforcement Learning: Shifting Priorities in Pareto Optimal Sequential Decision-Making; https://arxiv.org/abs/1701.01302
An­drew MacFie; Plau­si­bil­ity and Prob­a­bil­ity in De­duc­tive Rea­son­ing; https://​​arxiv.org/​​pdf/​​1708.09032.pdf
As­saf Ar­belle, Tammy Rik­lin Ra­viv; Microscopy Cell Seg­men­ta­tion via Ad­ver­sar­ial Neu­ral Net­works; https://​​arxiv.org/​​abs/​​1709.05860
Ben Garfinkel, Miles Brundage, Daniel Filan, Car­rick Flynn, Je­lena Luketina, Michael Page, An­ders Sand­berg, An­drew Sny­der-Beat­tie, and Max Teg­mark; On the Im­pos­si­bil­ity of Su­per­sized Machines; https://​​arxiv.org/​​pdf/​​1703.10987.pdf
Chel­sea Finn, Ti­anhe Yu, Ti­an­hao Zhang, Pieter Abbeel, Sergey Lev­ine; One-Shot Vi­sual Imi­ta­tion Learn­ing via Meta-Learn­ing; https://​​arxiv.org/​​abs/​​1709.04905
Chen Sun, Abhinav Shrivastava, Saurabh Singh, Abhinav Gupta; Revisiting Unreasonable Effectiveness of Data in Deep Learning Era; https://arxiv.org/pdf/1707.02968.pdf
Chih-Hong Cheng, Fred­erik Diehl, Yas­sine Hamza, Gereon Hinz, Ge­org Nuhren­berg, Markus Rick­ert, Har­ald Ruess, Michael Troung-Le; Neu­ral Net­works for Safety-Crit­i­cal Ap­pli­ca­tions—Challenges, Ex­per­i­ments and Per­spec­tives; https://​​arxiv.org/​​pdf/​​1709.00911.pdf
Dario Amodei, Chris Olah, Ja­cob Stein­hardt, Paul Chris­ti­ano, John Schul­man, Dan Mané; Con­crete Prob­lems in AI Safety; https://​​arxiv.org/​​abs/​​1606.06565
David Abel, John Sal­vatier, An­dreas Stuh­lmüller, Owain Evans; Agent-Ag­nos­tic Hu­man-in-the-Loop Re­in­force­ment Learn­ing; https://​​arxiv.org/​​abs/​​1701.04079
Dy­lan Had­field-Menell, Anca Dra­gan, Pieter Abbeel, Stu­art Rus­sell; The Off-Switch Game; https://​​arxiv.org/​​pdf/​​1611.08219.pdf
Dy­lan Had­field-Menell, Anca Dra­gan, Pieter Abbeel, Stu­art Rus­sell; Co­op­er­a­tive In­verse Re­in­force­ment Learn­ing; https://​​arxiv.org/​​abs/​​1606.03137
Eliezer Yud­kowsky and Nate Soares; Func­tional De­ci­sion The­ory: A New The­ory of In­stru­men­tal Ra­tion­al­ity; https://​​arxiv.org/​​abs/​​1710.05060
Eliezer Yudkowsky; A reply to Francois Chollet on intelligence explosion; https://intelligence.org/2017/12/06/chollet/
Eliezer Yudkowsky; Coherent Extrapolated Volition; https://intelligence.org/files/CEV.pdf
Eliezer Yud­kowsky; Inad­e­quate Equil­ibria; https://​​www.ama­zon.com/​​dp/​​B076Z64CPG
Eliezer Yud­kowsky; There’s No Fire Alarm for Ar­tifi­cial Gen­eral In­tel­li­gence; https://​​in­tel­li­gence.org/​​2017/​​10/​​13/​​fire-alarm/​​
Filipe Ro­drigues, Fran­cisco Pereira; Deep learn­ing from crowds; https://​​arxiv.org/​​abs/​​1709.01779
Greg Lewis; In Defense of Epistemic Modesty; http://​​effec­tive-al­tru­ism.com/​​ea/​​1g7/​​in_defence_of_epistemic_mod­esty/​​
Haim Gaif­man and Yang Liu; A sim­pler and more re­al­is­tic sub­jec­tive de­ci­sion the­ory; https://​​link.springer.com/​​ar­ti­cle/​​10.1007%2Fs11229-017-1594-6
Harsanyi; Car­di­nal welfare, in­di­vi­d­u­al­is­tic ethics, and in­ter­per­sonal com­par­i­sons of util­ity; http://​​www.springer.com/​​us/​​book/​​9789027711861
Ivo Danihelka, Balaji Lak­sh­mi­narayanan, Benigno Uria, Daan Wier­stra, Peter Dayan; Com­par­i­son of Max­i­mum Like­li­hood and GAN-based train­ing of Real NVPs; https://​​arxiv.org/​​pdf/​​1705.05263.pdf
Ji­akai Zhang, Kyunghyun Cho; Query-Effi­cient Imi­ta­tion Learn­ing for End-to-End Au­tonomous Driv­ing; https://​​arxiv.org/​​abs/​​1605.06450
Joshua Gans; Self-Reg­u­lat­ing Ar­tifi­cial Gen­eral In­tel­li­gence; https://​​arxiv.org/​​pdf/​​1711.04309.pdf
Katja Grace, John Sal­vatier, Allan Dafoe, Baobao Zhang, Owain Evans; When will AI ex­ceed Hu­man Perfor­mance? Ev­i­dence from AI Ex­perts; https://​​arxiv.org/​​abs/​​1705.08807
Kavosh Asadi, Cameron Allen, Melrose Rod­er­ick, Ab­del-rah­man Mo­hamed, Ge­orge Konidaris, Michael Littman; Mean Ac­tor Critic; https://​​arxiv.org/​​abs/​​1709.00503
Ku­nal Menda, Kather­ine Driggs-Camp­bell, Mykel J. Kochen­derfer; DropoutDAg­ger: A Bayesian Ap­proach to Safe Imi­ta­tion Learn­ing; https://​​arxiv.org/​​abs/​​1709.06166
Mario Lu­cic, Karol Ku­rach, Marcin Michalski, Syl­vain Gelly, Olivier Bous­quet; Are GANs Created Equal? A Large-Scale Study; https://​​arxiv.org/​​abs/​​1711.10337
Martin Rees; “Black Sky” In­fras­truc­ture and So­cietal Re­silience Work­shop; https://​​www.cser.ac.uk/​​me­dia/​​up­loads/​​files/​​Black-Sky-Work­shop-at-the-Royal-So­ciety-Jan.-20171.pdf
Miles Brundage; Brundage Bot; https://twitter.com/BrundageBot
Ming­hai Qin, Chao Sun, De­jan Vucinic; Ro­bust­ness of Neu­ral Net­works against Stor­age Me­dia Er­rors; https://​​arxiv.org/​​abs/​​1709.06173
My­self; 2017 AI Risk Liter­a­ture Re­view and Char­ity Eval­u­a­tion; http://​​effec­tive-al­tru­ism.com/​​ea/​​14w/​​2017_ai_risk_liter­a­ture_re­view_and_char­ity/​​
Nate Soares and Benja Fallenstein; Toward Idealized Decision Theory; https://arxiv.org/pdf/1507.01986.pdf
Nate Soares and Ben­jamin Lev­in­stein; Cheat­ing Death in Da­m­as­cus; https://​​in­tel­li­gence.org/​​files/​​DeathInDa­m­as­cus.pdf
Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, Stuart Armstrong; Corrigibility; https://intelligence.org/files/Corrigibility.pdf
Paul Chris­ti­ano, Jan Leike, Tom B. Brown, Mil­jan Mar­tic, Shane Legg, Dario Amodei; Deep Re­in­force­ment Learn­ing from Hu­man Prefer­ences; https://​​arxiv.org/​​abs/​​1706.03741
Paul Chris­ti­ano; AlphaGo Zero and ca­pa­bil­ity am­plifi­ca­tion; https://​​ai-al­ign­ment.com/​​alphago-zero-and-ca­pa­bil­ity-am­plifi­ca­tion-ede767bb8446
Peter Hen­der­son, Ri­ashat Is­lam, Philip Bach­man, Joelle Pineau, Doina Pre­cup, David Meger; Deep Re­in­force­ment Learn­ing that Mat­ters; https://​​arxiv.org/​​abs/​​1709.06560
Peter Stone, Rod­ney Brooks, Erik Bryn­jolfs­son, Ryan Calo, Oren Etz­ioni, Greg Hager, Ju­lia Hirschberg, Shivaram Ka­lyanakr­ish­nan, Ece Ka­mar, Sarit Kraus, Kevin Ley­ton-Brown, David Parkes, William Press, An­naLee Sax­e­nian, Julie Shah, Milind Tambe, Astro Tel­ler.; One Hun­dred Year Study on Ar­tifi­cial In­tel­li­gence; https://​​ai100.stan­ford.edu/​​
Peter Sune­hag, Guy Lever, Au­drunas Grus­lys, Wo­j­ciech Czar­necki, Vini­cius Zam­baldi, Max Jader­berg, Marc Lanc­tot, Ni­co­las Son­nerat, Joel Z. Leibo, Karl Tuyls, Thore Grae­pel; Value-De­com­po­si­tion Net­works For Co­op­er­a­tive Multi-Agent Learn­ing; https://​​arxiv.org/​​pdf/​​1706.05296.pdf
Peter Vam­plew, Richard Dazeley, Cameron Foale, Sally Fir­min, Jane Mum­mery; Hu­man-al­igned ar­tifi­cial in­tel­li­gence is a mul­ti­ob­jec­tive prob­lem; https://​​link.springer.com/​​ar­ti­cle/​​10.1007/​​s10676-017-9440-6
Ryan Carey; In­cor­rigi­bil­ity in the CIRL Frame­work; https://​​arxiv.org/​​abs/​​1709.06275
Sa­muel Yeom, Matt Fredrik­son, Somesh Jha; The Un­in­tended Con­se­quences of Overfit­ting: Train­ing Data In­fer­ence At­tacks; https://​​arxiv.org/​​abs/​​1709.01604
Scott Alexan­der; G.K. Ch­ester­ton on AI Risk; http://​​slat­estar­codex.com/​​2017/​​04/​​01/​​g-k-chester­ton-on-ai-risk/​​
Scott Garrabrant, Tsvi Ben­son-Tilsen, An­drew Critch, Nate Soares, Jes­sica Tay­lor; A For­mal Ap­proach to the Prob­lem of Log­i­cal Non-Om­ni­science; https://​​arxiv.org/​​abs/​​1707.08747
Scott Garrabrant, Tsvi Ben­son-Tilsen, An­drew Critch, Nate Soares, Jes­sica Tay­lor; Log­i­cal In­duc­tion; http://​​arxiv.org/​​abs/​​1609.03543
Seth Baum and Tony Bar­rett; Global Catas­tro­phes: The Most Ex­treme Risks; http://​​seth­baum.com/​​ac/​​2018_Ex­treme.pdf
Seth Baum and Tony Bar­rett; Towards an In­te­grated Assess­ment of Global Catas­trophic Risk ; https://​​pa­pers.ssrn.com/​​sol3/​​pa­pers.cfm?ab­stract_id=3046816
Seth Baum; On the Pro­mo­tion of Safe and So­cially Benefi­cial Ar­tifi­cial In­tel­li­gence; https://​​pa­pers.ssrn.com/​​sol3/​​pa­pers.cfm?ab­stract_id=2816323
Seth Baum; Rec­on­cili­a­tion Between Fac­tions Fo­cused on Near-Term and Long-Term Ar­tifi­cial In­tel­li­gence; https://​​pa­pers.ssrn.com/​​sol3/​​pa­pers.cfm?ab­stract_id=2976444
Seth Baum; So­cial Choice Ethics in Ar­tifi­cial In­tel­li­gence; https://​​pa­pers.ssrn.com/​​sol3/​​pa­pers.cfm?ab­stract_id=3046725
Seth Baum; Sur­vey of Ar­tifi­cial Gen­eral In­tel­li­gence Pro­jects for Ethics, Risk, and Policy; https://​​pa­pers.ssrn.com/​​sol3/​​pa­pers.cfm?ab­stract_id=3070741
Smitha Milli, Dy­lan Had­field-Menell, Anca Dra­gan, Stu­art Rus­sell; Should Robots be Obe­di­ent; https://​​arxiv.org/​​pdf/​​1705.09990.pdf
Tony Bar­rett; Value of Global Catas­trophic Risk (GCR) In­for­ma­tion: Cost-Effec­tive­ness-Based Ap­proach for GCR Re­duc­tion; https://​​www.drop­box.com/​​s/​​7a7eh2law7tbvk0/​​2017-bar­rett.pdf?dl=0
Vadim Kosoy; Op­ti­mal Polyno­mial-Time Es­ti­ma­tors: A Bayesian No­tion of Ap­prox­i­ma­tion Al­gorithm; https://​​arxiv.org/​​abs/​​1608.04112
Vic­tor Shih, David C Jan­graw, Paul Sa­jda, Sameer Saproo; Towards per­son­al­ized hu­man AI in­ter­ac­tion—adapt­ing the be­hav­ior of AI agents us­ing neu­ral sig­na­tures of sub­jec­tive in­ter­est; https://​​arxiv.org/​​abs/​​1709.04574
William Saun­ders, Gir­ish Sas­try, An­dreas Stuh­lmuel­ler, Owain Evans; Trial with­out Er­ror: Towards Safe Re­in­force­ment Learn­ing via Hu­man In­ter­ven­tion; https://​​arxiv.org/​​abs/​​1707.05173
Xiongzhao Wang, Varuna De Silva, Ah­met Kon­doz; Agent-based Learn­ing for Driv­ing Policy Learn­ing in Con­nected and Au­tonomous Ve­hi­cles; https://​​arxiv.org/​​abs/​​1709.04622
Yang Liu and Huw Price; Heart of DARC­ness; http://​​yliu.net/​​wp-con­tent/​​up­loads/​​dar­c­ness.pdf
Yang Liu; The Sure-Thing prin­ci­ple and P2; http://​​www.academia.edu/​​33992500/​​The_Sure-thing_Prin­ci­ple_and_P2
Yun­peng Pan, Ching-An Cheng, Kamil Saigol, Ke­un­taek Lee, Xinyan Yan, Evan­gelos Theodorou, By­ron Boots; Agile Off-Road Au­tonomous Driv­ing Us­ing End-to-End Deep Imi­ta­tion Learn­ing; https://​​arxiv.org/​​abs/​​1709.07174