AGI safety from first principles: Introduction

This is the first part of a six-part report called AGI safety from first principles, in which I’ve attempted to put together the most complete and compelling case I can for why the development of AGI might pose an existential threat. The report stems from my dissatisfaction with existing arguments about the potential risks from AGI. Early work tends to be less relevant in the context of modern machine learning; more recent work is scattered and brief. I originally intended to just summarise other people’s arguments, but as this report has grown, it’s become more representative of my own views and less representative of anyone else’s. So while it covers the standard ideas, I also think that it provides a new perspective on how to think about AGI—one which doesn’t take any previous claims for granted, but attempts to work them out from first principles.

Having said that, the breadth of the topic I’m attempting to cover means that I’ve included many arguments which are only hastily sketched out, and have undoubtedly made a number of mistakes. I hope to continue polishing this report, and I welcome feedback and help in doing so. I’m also grateful to the many people who have given feedback and encouragement so far. I plan to cross-post some of the most useful comments I’ve received to the Alignment Forum once I’ve had a chance to ask permission. I’ve posted the report itself in six sections; the first and last are shorter framing sections, while the middle four correspond to the four premises of the argument laid out below.

AGI safety from first principles

The key concern motivating technical AGI safety research is that we might build autonomous artificially intelligent agents which are much more intelligent than humans, and which pursue goals that conflict with our own. Human intelligence allows us to coordinate complex societies and deploy advanced technology, and thereby control the world to a greater extent than any other species. But AIs will eventually become more capable than us at the types of tasks by which we maintain and exert that control. If they don’t want to obey us, then humanity might become only Earth’s second most powerful “species”, and lose the ability to create a valuable and worthwhile future.

I’ll call this the “second species” argument; I think it’s a plausible argument which we should take very seriously.[1] However, the version stated above relies on several vague concepts and intuitions. In this report I’ll give the most detailed presentation of the second species argument that I can, highlighting the aspects that I’m still confused about. In particular, I’ll defend a version of the second species argument which claims that, without a concerted effort to prevent it, there’s a significant chance that:

  1. We’ll build AIs which are much more intelligent than humans (i.e. superintelligent).

  2. Those AIs will be autonomous agents which pursue large-scale goals.

  3. Those goals will be misaligned with ours; that is, they will aim towards outcomes that aren’t desirable by our standards, and trade off against our goals.

  4. The development of such AIs would lead to them gaining control of humanity’s future.

While I use many examples from modern deep learning, this report is also intended to apply to AIs developed using models, training algorithms, optimisers, or training regimes very different from the ones we use today. However, many of my arguments would no longer be relevant if the field of AI moved away from focusing on machine learning. I also frequently compare AI development to the evolution of human intelligence; while the two aren’t fully analogous, humans are the best example we currently have to ground our thinking about generally intelligent AIs.


  1. Stuart Russell also refers to this as the “gorilla problem” in his recent book, Human Compatible.