A model I use when making plans to reduce AI x-risk

I’ve been thinking about what implicit model of the world I use to make plans that reduce x-risk from AI. I list four main gears below (with quotes to illustrate), and then discuss the concrete heuristics I take from the model.

A model of AI x-risk in four parts

1. Alignment is hard.

Quoting “Security Mindset and the Logistic Success Curve” (link):

Coral: YES. Given that this is a novel project entering new territory, expect it to take at least two years more time, or 50% more development time—whichever is less—compared to a security-incautious project that otherwise has identical tools, insights, people, and resources. And that is a very, very optimistic lower bound.
Amber: This story seems to be heading in a worrying direction.
Coral: Well, I’m sorry, but creating robust systems takes longer than creating non-robust systems even in cases where it would be really, extraordinarily bad if creating robust systems took longer than creating non-robust systems.

2. Getting alignment right accounts for most of the variance in whether an AGI system will be positive for humanity.

Quoting “The Hidden Complexity of Wishes” (link):

There are three kinds of genies: Genies to whom you can safely say “I wish for you to do what I should wish for”; genies for which no wish is safe; and genies that aren’t very powerful or intelligent.
[...]
There is no safe wish smaller than an entire human morality. There are too many possible paths through Time. You can’t visualize all the roads that lead to the destination you give the genie… any more than you can program a chess-playing machine by hardcoding a move for every possible board position.
And real life is far more complicated than chess. You cannot predict, in advance, which of your values will be needed to judge the path through time that the genie takes. Especially if you wish for something longer-term or wider-range than rescuing your mother from a burning building.

3. Our current epistemic state regarding AGI timelines will continue until we’re very close (<2 years away) to having AGI.

Quoting “There is No Fire Alarm for AGI” (link):

It’s not that whenever somebody says “fifty years” the thing always happens in two years. It’s that this confident prediction of things being far away corresponds to an epistemic state about the technology that feels the same way internally until you are very very close to the big development. It’s the epistemic state of “Well, I don’t see how to do the thing” and sometimes you say that fifty years off from the big development, and sometimes you say it two years away, and sometimes you say it while the Wright Flyer is flying somewhere out of your sight.
[...]
So far as I can presently estimate, now that we’ve had AlphaGo and a couple of other maybe/maybe-not shots across the bow, and seen a huge explosion of effort invested into machine learning and an enormous flood of papers, we are probably going to occupy our present epistemic state until very near the end.
By saying we’re probably going to be in roughly this epistemic state until almost the end, I don’t mean to say we know that AGI is imminent, or that there won’t be important new breakthroughs in AI in the intervening time. I mean that it’s hard to guess how many further insights are needed for AGI, or how long it will take to reach those insights. After the next breakthrough, we still won’t know how many more breakthroughs are needed, leaving us in pretty much the same epistemic state as before. Whatever discoveries and milestones come next, it will probably continue to be hard to guess how many further insights are needed, and timelines will continue to be similarly murky.

4. Given timeline uncertainty, it’s best to spend marginal effort on plans that assume (and work in) shorter timelines.

Stated simply: If you don’t know when AGI is coming, you should make sure alignment gets solved in worlds where AGI comes soon.

Quoting “Allocating Risk-Mitigation Across Time” (link):

Suppose we are also unsure about when we may need the problem solved by. In scenarios where the solution is needed earlier, there is less time for us to collectively work on a solution, so there is less work on the problem than in scenarios where the solution is needed later. Given the diminishing returns on work, that means that a marginal unit of work has a bigger expected value in the case where the solution is needed earlier. This should update us towards working to address the early scenarios more than would be justified by looking purely at their impact and likelihood.
[...]
There are two major factors which seem to push towards preferring more work which focuses on scenarios where AI comes soon. The first is nearsightedness: we simply have a better idea of what will be useful in these scenarios. The second is diminishing marginal returns: the expected effect of an extra year of work on a problem tends to decline when it is being added to a larger total. And because there is a much larger time horizon in which to solve it (and in a wealthier world), the problem of AI safety when AI comes later may receive many times as much work as the problem of AI safety for AI that comes soon. On the other hand one more factor preferring work on scenarios where AI comes later is the ability to pursue more leveraged strategies which eschew object-level work today in favour of generating (hopefully) more object-level work later.

The above quote is slightly misrepresentative; the paper is largely undecided as to whether shorter-term or longer-term strategies are more valuable (given uncertainty over timelines), and recommends a portfolio approach (running multiple strategies that each apply to different timelines). Nonetheless, reading it updated me toward short-timeline strategies being especially neglected, both by myself and by the x-risk community at large.
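
To make the diminishing-returns argument concrete, here is a toy calculation of my own (the concave value function, the amounts of accumulated work, and the scenario probabilities are illustrative assumptions, not numbers from the paper):

```latex
\begin{align*}
&\text{Value of total alignment work } w: \quad V(w) = \ln(1+w), \qquad V'(w) = \tfrac{1}{1+w} \\
&\text{Short-timeline world: } w_s = 4 \;\Rightarrow\; V'(w_s) = 0.20 \\
&\text{Long-timeline world: } w_l = 19 \;\Rightarrow\; V'(w_l) = 0.05 \\
&\text{Expected marginal value: } 0.25 \times 0.20 = 0.050 \;>\; 0.75 \times 0.05 = 0.0375
\end{align*}
```

Under these made-up numbers, a marginal unit of work aimed at the short-timeline world wins even though that world is assumed to be three times less likely, because far less work has accumulated there and the marginal returns are correspondingly higher.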

Concrete implications

Informed by the model above, here are heuristics I use for making plans.

  • Solve alignment! Aaargh! Solve it! Solve it now!

    • I nearly forgot to say it explicitly, but it’s the most important: if you have a clear avenue to do good work on alignment, or on field-building in alignment, do it.

  • Find ways to contribute to intellectual progress on alignment

    • I think that intellectual progress is very tractable.

      • A central example of a small project I’d love to see more people attempt is writing up (in their own words) analyses and summaries of core disagreements in alignment research.

      • A broader category of things that can be done to push discourse forward can be found in this talk Oliver and I have given in the past, about how to write good comments on LessWrong.

    • It seems to me that people I talk to think earning-to-give is easy and doable, but that pushing forward intellectual progress (especially on alignment) is impossible, or at least something only ‘geniuses’ can do. I disagree; there is a lot of low-hanging fruit.

  • Build infrastructure for the alignment research community

    • The Berkeley Existential Risk Initiative (BERI) is a great example of this: many orgs (FHI, CHAI, etc.) have ridiculous university constraints upon their actions, so one of BERI’s goals is to help them outsource this work (to BERI) and remove the bureaucratic mess. This is ridiculously helpful. (FYI, they’re hiring.)

    • I have personally been chatting recently with various alignment researchers about what online infrastructure could be helpful, and have found surprisingly good opportunities to improve things (I will write more on this in a future post).

    • What other infrastructure could you build for better communication between key researchers?

  • Avoid/reduce direct government involvement (in the long run)

    • It’s important that those running AGI projects are capable of understanding the alignment problem and why it’s necessary to solve alignment before implementing an AGI. There’s a better chance of this when the person running the project has a strong technical understanding of how AI works.

      • A government-run AI project is analogous to a tech company with non-technical founders. Sure, the founders can employ a CTO, but then you have Paul Graham’s design problem: how are they supposed to figure out who a good CTO is? They don’t know what to test for. They will likely just pick whoever comes with the strongest recommendation, and given their information channels, that will probably just be whoever has the most status.

  • Focus on technical solutions to x-risk rather than political or societal ones

    • My impression is that humanity has a better track record of finding technical solutions to problems than political/social ones, which means we should focus even more on things like alignment.

      • As one data point, fields like computer science, engineering, and mathematics seem to make a lot more progress than ones like macroeconomics, political theory, and international relations. If you can frame something as either a math problem or a political problem, do the former.

    • I don’t have anything strong to back this up with, so I will do some research/reading.

  • Avoid things that (because they’re social) are fun to argue about

    • For example, ethics is a very sexy subject that can easily attract public outrage and attention while not in fact being useful (cf. bioethics). If we expect alignment not to be solved, the question of “whose values do we get to put into the AI?” is an enticing distraction.

    • Another candidate for a sexy subject that is basically a distraction is discussion of the high-status people in AI, e.g. “Did you hear what Elon Musk said to Demis Hassabis?” Too many of my late-night conversations fall into patterns like this, and I actively push back against it (both in myself and in others).

    • This recommendation is a negative one (“Don’t do this”). If you have any ideas for positive things to do instead, please write them down. What norms/TAPs push away from social distractions?


I wrote this post to make explicit some of the thinking that goes into my plans. While the heuristics are informed by the model, they likely hide other assumptions that I didn’t notice.

To folks who have tended to agree with my object-level suggestions, I expect this post reads as a list of obvious things stated explicitly. To everyone else, I’d love to read about the core models that inform your views on AI, and I’d encourage you to read more on those of mine that are new to you.


My thanks and appreciation to Jacob Lagerros for help editing.

[Edit: On 01/26/18, I made slight edits to this post’s body and title. It used to say there were four models in part I, and now says that part I lists four parts of a single model. Some of the comments were a response to the original, and thus may read a little funny.]