A model I use when making plans to reduce AI x-risk

I’ve been thinking about the implicit model of the world I use to make plans that reduce x-risk from AI. Below I list its four main gears (with quotes to illustrate), and then discuss the concrete heuristics I take from it.

A model of AI x-risk in four parts

1. Alignment is hard.

Quoting “Security Mindset and the Logistic Success Curve” (link)

Coral: YES. Given that this is a novel project entering new territory, expect it to take at least two years more time, or 50% more development time—whichever is less—compared to a security-incautious project that otherwise has identical tools, insights, people, and resources. And that is a very, very optimistic lower bound.
Amber: This story seems to be heading in a worrying direction.
Coral: Well, I’m sorry, but creating robust systems takes longer than creating non-robust systems even in cases where it would be really, extraordinarily bad if creating robust systems took longer than creating non-robust systems.

2. Getting alignment right accounts for most of the variance in whether an AGI system will be positive for humanity.

Quoting “The Hidden Complexity of Wishes” (link)

There are three kinds of genies: Genies to whom you can safely say “I wish for you to do what I should wish for”; genies for which no wish is safe; and genies that aren’t very powerful or intelligent.
[...]
There is no safe wish smaller than an entire human morality. There are too many possible paths through Time. You can’t visualize all the roads that lead to the destination you give the genie… any more than you can program a chess-playing machine by hardcoding a move for every possible board position.
And real life is far more complicated than chess. You cannot predict, in advance, which of your values will be needed to judge the path through time that the genie takes. Especially if you wish for something longer-term or wider-range than rescuing your mother from a burning building.

3. Our current epistemic state regarding AGI timelines will continue until we’re close to having AGI (less than ~2 years away).

Quoting “There is No Fire Alarm for AGI” (link)

It’s not that whenever somebody says “fifty years” the thing always happens in two years. It’s that this confident prediction of things being far away corresponds to an epistemic state about the technology that feels the same way internally until you are very very close to the big development. It’s the epistemic state of “Well, I don’t see how to do the thing” and sometimes you say that fifty years off from the big development, and sometimes you say it two years away, and sometimes you say it while the Wright Flyer is flying somewhere out of your sight.
[...]
So far as I can presently estimate, now that we’ve had AlphaGo and a couple of other maybe/​maybe-not shots across the bow, and seen a huge explosion of effort invested into machine learning and an enormous flood of papers, we are probably going to occupy our present epistemic state until very near the end.
By saying we’re probably going to be in roughly this epistemic state until almost the end, I don’t mean to say we know that AGI is imminent, or that there won’t be important new breakthroughs in AI in the intervening time. I mean that it’s hard to guess how many further insights are needed for AGI, or how long it will take to reach those insights. After the next breakthrough, we still won’t know how many more breakthroughs are needed, leaving us in pretty much the same epistemic state as before. Whatever discoveries and milestones come next, it will probably continue to be hard to guess how many further insights are needed, and timelines will continue to be similarly murky.

4. Given timeline uncertainty, it’s best to spend marginal effort on plans that assume (and work in) shorter timelines.

Stated simply: If you don’t know when AGI is coming, you should make sure alignment gets solved in worlds where AGI comes soon.

Quoting “Allocating Risk-Mitigation Across Time” (link)

Suppose we are also unsure about when we may need the problem solved by. In scenarios where the solution is needed earlier, there is less time for us to collectively work on a solution, so there is less work on the problem than in scenarios where the solution is needed later. Given the diminishing returns on work, that means that a marginal unit of work has a bigger expected value in the case where the solution is needed earlier. This should update us towards working to address the early scenarios more than would be justified by looking purely at their impact and likelihood.
[...]
There are two major factors which seem to push towards preferring more work which focuses on scenarios where AI comes soon. The first is nearsightedness: we simply have a better idea of what will be useful in these scenarios. The second is diminishing marginal returns: the expected effect of an extra year of work on a problem tends to decline when it is being added to a larger total. And because there is a much larger time horizon in which to solve it (and in a wealthier world), the problem of AI safety when AI comes later may receive many times as much work as the problem of AI safety for AI that comes soon. On the other hand one more factor preferring work on scenarios where AI comes later is the ability to pursue more leveraged strategies which eschew object-level work today in favour of generating (hopefully) more object-level work later.

The quote above is slightly misrepresentative: the paper is largely undecided as to whether shorter-term or longer-term strategies are more valuable (given uncertainty over timelines), and recommends a portfolio approach (running multiple strategies that each apply to a different timeline). Nonetheless, when reading it I did update toward short-timeline strategies being especially neglected, both by myself and by the x-risk community at large.
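
To make the diminishing-returns point concrete, here is a toy numerical sketch. It is my own construction, not from the paper: the log-shaped returns curve and the work totals are arbitrary assumptions, chosen only to show the shape of the argument that an extra unit of work is worth more in worlds where little total work gets done before AGI.

```python
import math

# Toy illustration (not from the paper): assume the value of safety work grows
# concavely with the total amount of work done before AGI arrives.
def value_of_total_work(w):
    return math.log(1 + w)  # arbitrary concave curve, purely for illustration

def marginal_value(total_work, extra=1.0):
    # Extra value bought by one more unit of work on top of an existing total.
    return value_of_total_work(total_work + extra) - value_of_total_work(total_work)

# Hypothetical totals: short-timeline worlds see ~10 units of safety work
# before AGI, long-timeline worlds see ~100 (both numbers are made up).
print(marginal_value(10))   # ~0.087
print(marginal_value(100))  # ~0.010
```

On these made-up numbers, the marginal unit of work aimed at the short-timeline world buys nearly nine times as much value, before even weighting by how likely each world is. That is the shape of the argument, not an estimate of the real ratio.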

Concrete implications

Here are the heuristics I use for making plans, informed by the model above.

  • Solve alignment! Aaargh! Solve it! Solve it now!

    • I nearly forgot to say it explicitly, but it’s the most important point: if you have a clear avenue to do good work on alignment, or on field-building in alignment, do it.

  • Find ways to contribute to intellectual progress on alignment

    • I think that intellectual progress is very tractable.

      • A central example of a small project I’d love to see more people attempt is writing up (in their own words) analyses and summaries of the core disagreements in alignment research.

      • A broader category of ways to push the discourse forward is covered in this talk Oliver and I have given in the past, about how to write good comments on LessWrong.

    • The people I talk to seem to think that earning-to-give is easy and doable, but that pushing forward intellectual progress (especially on alignment) is impossible, or at least something only ‘geniuses’ can do. I disagree; there is a lot of low-hanging fruit.

  • Build infrastructure for the alignment research community

    • The Berkeley Existential Risk Initiative (BERI) is a great example of this: many orgs (FHI, CHAI, etc.) face ridiculous university constraints on their actions, and one of BERI’s goals is to help them outsource such work (to BERI) and cut through the bureaucratic mess. This is incredibly helpful. (FYI, they’re hiring.)

    • I’ve recently been chatting with various alignment researchers about what online infrastructure could be helpful, and have found surprisingly good opportunities to improve things (I’ll write more on this in a future post).

    • What other infrastructure could you build for better communication between key researchers?

  • Avoid/​reduce direct government involvement (in the long run)

    • It’s important that those running AGI projects are able to understand the alignment problem and why it needs to be solved before an AGI is implemented. There’s a better chance of this when the person running the project has a strong technical understanding of how AI works.

      • A government-run AI project is analogous to a tech company with non-technical founders. Sure, the founders can hire a CTO, but then you have Paul Graham’s design problem: how are they supposed to figure out who a good CTO is? They don’t know what to test for. They will likely just pick whoever comes with the strongest recommendation, and given their information channels, that will probably be whoever has the most status.

  • Focus on technical solutions to x-risk rather than political or societal ones

    • My impression is that humanity has a better track record of finding technical solutions to problems than political or social ones, which means we should focus even more on things like alignment.

      • As one datapoint, fields like computer science, engineering, and mathematics seem to make a lot more progress than ones like macroeconomics, political theory, and international relations. If you can frame something as either a math problem or a political problem, choose the former.

    • I don’t have strong evidence to back this up, so I plan to do some research and reading.

  • Avoid things that (because they’re social) are fun to argue about

    • For example, ethics is a very sexy subject that can easily attract public outrage and attention while not in fact being useful (cf. bioethics). If we don’t expect alignment to be solved, the question of “whose values do we get to put into the AI?” is an enticing distraction.

    • Another candidate for a sexy subject that is basically a distraction is discussion of high-status people in AI, e.g. “Did you hear what Elon Musk said to Demis Hassabis?” Too many of my late-night conversations fall into patterns like this, and I actively push back against them (both in myself and in others).

    • This recommendation is a negative one (“don’t do this”). If you have any ideas for positive things to do instead, please write them down. What norms or TAPs (trigger-action plans) push away from social distractions?


I wrote this post to make explicit some of the thinking that goes into my plans. While the heuristics are informed by the model, they likely rest on other assumptions I haven’t noticed.

To folks who have tended to agree with my object-level suggestions: I expect this will read as a list of obvious things stated explicitly. To everyone else: I’d love to read about the core models that inform your views on AI, and I’d encourage you to read more about those of mine that are new to you.


My thanks and appreciation to Jacob Lagerros for help with editing.

[Edit: On 01/26/18, I made slight edits to this post’s body and title. It used to say that part I listed four models, and now says that part I lists four parts of a single model. Some of the comments were written in response to the original, and thus may read a little funny.]