AGI safety from first principles: Conclusion

Let’s recap the second species argument as originally laid out, along with the additional conclusions and clarifications from the rest of the report.

  1. We’ll build AIs which are much more intelligent than humans; that is, much better than humans at using generalisable cognitive skills to understand the world.

  2. Those AGIs will be autonomous agents which pursue long-term, large-scale goals, because goal-directedness is reinforced in many training environments, and because those goals will sometimes generalise to be larger in scope.

  3. Those goals will by default be misaligned with what we want, because our desires are complex and nuanced, and our existing tools for shaping the goals of AIs are inadequate.

  4. The development of autonomous misaligned AGIs would lead to them gaining control of humanity’s future, via their superhuman intelligence, technology and coordination—depending on the speed of AI development, the transparency of AI systems, how constrained they are during deployment, and how well humans can cooperate politically and economically.

Personally, I am most confident in 1, then 4, then 3, then 2 (in each case conditional on all the previous claims), although I think there’s room for reasonable disagreement on all of them. In particular, the arguments I’ve made about AGI goals might have been too reliant on anthropomorphism. Even if this is a fair criticism, though, it’s also very unclear how to reason about the behaviour of generally intelligent systems without being anthropomorphic. The main reason we expect the development of AGI to be a major event is that the history of humanity tells us how important intelligence is. But it wasn’t just our intelligence that led to human success—it was also our relentless drive to survive and thrive. Without that drive, our intelligence wouldn’t have gotten us anywhere. So when trying to predict the impacts of AGIs, we can’t avoid thinking about what will lead them to choose some types of intelligent behaviour over others—in other words, thinking about their motivations.

Note, however, that the second species argument, and the scenarios I’ve outlined above, aren’t meant to be comprehensive descriptions of all sources of existential risk from AI. Even if the second species argument doesn’t turn out to be correct, AI will likely still be a transformative technology, and we should try to minimise other potential harms. In addition to the standard misuse concerns (e.g. about AI being used to develop weapons), we might also worry about increases in AI capabilities leading to undesirable structural changes. For example, they might shift the offence-defence balance in cybersecurity, or lead to greater centralisation of human economic power. I consider Christiano’s “going out with a whimper” scenario also to fall into this category. Yet there’s been little in-depth investigation of how structural changes might lead to long-term harms, so I am inclined not to place much credence in such arguments until they have been explored much more thoroughly.

By contrast, I think the AI takeover scenarios that this report focuses on have received much more scrutiny—but still, as discussed previously, have big question marks surrounding some of the key premises. However, it’s important to distinguish the question of how likely it is that the second species argument is correct from the question of how seriously we should take it. Often people with very different perspectives on the latter actually don’t disagree very much on the former. I find the following analogy from Stuart Russell illustrative: suppose we got a message from space telling us that aliens would be landing on Earth sometime in the next century. Even if there’s doubt about the veracity of the message, and about whether the aliens will be hostile, we (as a species) should clearly expect this event to be a huge deal if it happens, and dedicate a lot of effort towards making it go well. In the case of AGI, while there’s reasonable doubt about what it will look like, it may nevertheless be the biggest thing that’s ever happened. At the very least we should put serious effort into understanding the arguments I’ve discussed above, how strong they are, and what we might be able to do about them.[1]

Thanks for reading, and thanks again to everyone who’s helped me improve the report. I don’t expect everyone to agree with all my arguments, but I do think that there’s a lot of room to further the conversation about this, and produce more analyses and evaluations of the core ideas in AGI safety. At this point I consider such work more valuable and neglected than technical AGI safety research, and have recently transitioned from full-time work on the latter to a PhD which will allow me to focus on the former. I’m excited to see our collective understanding of the future of AGI continue to develop.


  1. ↩︎

    I want to explicitly warn against taking this argument too far, though—for example, by claiming that AI safety work should still be a major priority even if the probability of AI catastrophe is much less than 1%. This claim is misleading because most researchers in the field of AI safety think the probability is much higher than that; and also because, if it really is that low, there are probably some fundamental confusions in our concepts and arguments that need to be cleared up before we can actually start object-level work towards making AI safer.