Embedded Agents
(A longer text-based version of this post is also available on MIRI’s blog here, and the bibliography for the whole sequence can be found here)
- Introduction to Cartesian Frames by (22 Oct 2020 13:00 UTC; 155 points)
- The Plan − 2023 Version by (29 Dec 2023 23:34 UTC; 152 points)
- 2018 Review: Voting Results! by (24 Jan 2020 2:00 UTC; 135 points)
- Forecasting Thread: AI Timelines by (22 Aug 2020 2:33 UTC; 135 points)
- Philosophy in the Darkest Timeline: Basics of the Evolution of Meaning by (7 Jun 2020 7:52 UTC; 133 points)
- Selection Theorems: A Program For Understanding Agents by (28 Sep 2021 5:03 UTC; 133 points)
- A Shutdown Problem Proposal by (21 Jan 2024 18:12 UTC; 125 points)
- Welcome & FAQ! by (24 Aug 2021 20:14 UTC; 115 points)
- The Sweet Lesson: AI Safety Should Scale With Compute by (5 May 2025 19:03 UTC; 97 points)
- Humans Are Embedded Agents Too by (23 Dec 2019 19:21 UTC; 82 points)
- Prizes for Last Year’s 2018 Review by (2 Dec 2020 11:21 UTC; 72 points)
- Agents Over Cartesian World Models by (27 Apr 2021 2:06 UTC; 67 points)
- Against Time in Agent Models by (13 May 2022 19:55 UTC; 63 points)
- What Decision Theory is Implied By Predictive Processing? by (28 Sep 2020 17:20 UTC; 59 points)
- «Boundaries/Membranes» and AI safety compilation by (3 May 2023 21:41 UTC; 56 points)
- [AN #127]: Rethinking agency: Cartesian frames as a formalization of ways to carve up the world into an agent and its environment by (2 Dec 2020 18:20 UTC; 53 points)
- 's comment on johnswentworth’s Shortform by (7 Oct 2025 20:18 UTC; 51 points)
- 's comment on Alignment: “Do what I would have wanted you to do” by (13 Jul 2024 1:07 UTC; 49 points)
- Sunday October 25, 12:00PM (PT) — Scott Garrabrant on “Cartesian Frames” by (21 Oct 2020 3:27 UTC; 48 points)
- Gödel’s Legacy: A game without end by (28 Jun 2020 18:50 UTC; 45 points)
- Embedded Agency via Abstraction by (26 Aug 2019 23:03 UTC; 42 points)
- [AN #163]: Using finite factored sets for causal and temporal inference by (8 Sep 2021 17:20 UTC; 41 points)
- Understanding Selection Theorems by (28 May 2022 1:49 UTC; 41 points)
- Simulators, constraints, and goal agnosticism: porbynotes vol. 1 by (23 Nov 2022 4:22 UTC; 40 points)
- If brains are computers, what kind of computers are they? (Dennett transcript) by (30 Jan 2020 5:07 UTC; 37 points)
- What are the most plausible “AI Safety warning shot” scenarios? by (26 Mar 2020 20:59 UTC; 36 points)
- What’s your big idea? by (18 Oct 2019 15:47 UTC; 30 points)
- International Relations; States, Rational Actors, and Other Approaches (Policy and International Relations Primer Part 4) by (EA Forum; 22 Jan 2020 8:29 UTC; 27 points)
- how has this forum changed your life? by (30 Jan 2020 21:54 UTC; 26 points)
- 's comment on The Standard Analogy by (5 Jun 2024 12:17 UTC; 26 points)
- What is abstraction? by (15 Dec 2018 8:36 UTC; 25 points)
- [AN #148]: Analyzing generalization across more axes than just accuracy or loss by (28 Apr 2021 18:30 UTC; 24 points)
- [AN #105]: The economic trajectory of humanity, and what we might mean by optimization by (24 Jun 2020 17:30 UTC; 24 points)
- 's comment on The Solomonoff Prior is Malign by (28 Dec 2021 2:45 UTC; 23 points)
- Clarifying Factored Cognition by (13 Dec 2020 20:02 UTC; 23 points)
- My decomposition of the alignment problem by (2 Sep 2024 0:21 UTC; 22 points)
- Theory of Ideal Agents, or of Existing Agents? by (13 Sep 2019 17:38 UTC; 20 points)
- 's comment on 2018 Review: Voting Results! by (28 Jan 2020 2:33 UTC; 19 points)
- Sunday July 12 — talks by Scott Garrabrant, Alexflint, alexei, Stuart_Armstrong by (8 Jul 2020 0:27 UTC; 19 points)
- 's comment on Against functionalism: a self dialogue by (10 Aug 2025 8:47 UTC; 17 points)
- [AN #55] Regulatory markets and international standards as a means of ensuring beneficial AI by (5 May 2019 2:20 UTC; 17 points)
- Alignment Newsletter #31 by (5 Nov 2018 23:50 UTC; 17 points)
- [AN #83]: Sample-efficient deep learning with ReMixMatch by (22 Jan 2020 18:10 UTC; 15 points)
- [AN #79]: Recursive reward modeling as an alignment technique integrated with deep RL by (1 Jan 2020 18:00 UTC; 13 points)
- 's comment on Book Review: Design Principles of Biological Circuits by (6 Nov 2019 18:03 UTC; 13 points)
- 's comment on Maybe Lying Doesn’t Exist by (28 Oct 2019 4:27 UTC; 13 points)
- [AN #143]: How to make embedded agents that reason probabilistically about their environments by (24 Mar 2021 17:20 UTC; 13 points)
- Comments on Allan Dafoe on AI Governance by (29 Nov 2021 16:16 UTC; 13 points)
- 's comment on My research methodology by (25 Mar 2021 0:47 UTC; 12 points)
- [AN #66]: Decomposing robustness into capability robustness and alignment robustness by (30 Sep 2019 18:00 UTC; 12 points)
- 's comment on New safety research agenda: scalable agent alignment via reward modeling by (31 Dec 2018 23:54 UTC; 12 points)
- Towards deconfusing values by (29 Jan 2020 19:28 UTC; 12 points)
- 's comment on The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables by (16 Dec 2021 22:08 UTC; 12 points)
- Beyond Rewards and Values: A Non-dualistic Approach to Universal Intelligence by (30 Dec 2022 19:05 UTC; 10 points)
- What are brains? by (10 Jun 2023 14:46 UTC; 10 points)
- 's comment on Jonathan Claybrough’s Shortform by (26 Jul 2023 22:18 UTC; 9 points)
- 's comment on What is it to solve the alignment problem? (Notes) by (25 Aug 2024 10:00 UTC; 8 points)
- 's comment on Reframing Impact by (20 Sep 2019 22:57 UTC; 8 points)
- 's comment on Attainable Utility Preservation: Empirical Results by (21 Dec 2021 14:54 UTC; 7 points)
- 's comment on johnswentworth’s Shortform by (8 Oct 2025 14:30 UTC; 7 points)
- 's comment on Instruction-following AGI is easier and more likely than value aligned AGI by (12 Jul 2024 15:34 UTC; 6 points)
- ACI#8: Value as a Function of Possible Worlds by (3 Jun 2024 21:49 UTC; 6 points)
- Choice := Anthropics uncertainty? And potential implications for agency by (21 Apr 2022 16:38 UTC; 6 points)
- 's comment on shminux’s Shortform by (15 Aug 2024 9:08 UTC; 6 points)
- 's comment on An Undergraduate Reading Of: Semantic information, autonomous agency and non-equilibrium statistical physics by (30 Oct 2018 18:35 UTC; 5 points)
- 's comment on The LessWrong 2018 Book is Available for Pre-order by (28 Sep 2022 19:05 UTC; 5 points)
- 's comment on Dialogue on Appeals to Consequences by (1 Dec 2019 18:57 UTC; 4 points)
- 's comment on What are the actual arguments in favor of computationalism as a theory of identity? by (18 Jul 2024 22:50 UTC; 4 points)
- 's comment on When is a mind me? by (9 Jul 2024 23:32 UTC; 4 points)
- SlateStarCodex Fika by (2 Jan 2021 2:03 UTC; 3 points)
- Decisions with Non-Logical Counterfactuals: request for input by (24 Oct 2019 17:23 UTC; 3 points)
- 's comment on Stupidity and Dishonesty Explain Each Other Away by (30 Dec 2019 5:41 UTC; 3 points)
- 's comment on (A → B) → A in Causal DAGs by (23 Jan 2020 17:21 UTC; 2 points)
- 's comment on What are the open problems in Human Rationality? by (12 Jan 2021 5:58 UTC; 2 points)
- 's comment on Stephen Fowler’s Shortform by (4 Jun 2023 2:13 UTC; 1 point)
- 's comment on awg’s Shortform by (7 May 2023 15:41 UTC; 1 point)
- 's comment on All AGI safety questions welcome (especially basic ones) [July 2022] by (19 Jul 2022 2:09 UTC; 1 point)
- 's comment on Siebe’s Shortform by (13 Feb 2025 7:41 UTC; 1 point)
I actually have some understanding of what MIRI’s Agent Foundations work is about.
This post (and the rest of the sequence) was the first time I had ever read something about AI alignment and thought it was actually asking the right questions. It is not about a sub-problem; it is not about marginal improvements. Its goal is a gears-level understanding of agents, and it directly explains why that’s hard. It’s a list of everything that needs to be figured out in order to remove all the black boxes and Cartesian boundaries, and understand agents as well as we understand refrigerators.
I nominate this post for two reasons.
One, it is an excellent example of supplemental writing about the basic intuitions and thought processes behind the formal work, which is extremely helpful to me because I do not have a good enough command of that work to arrive at those intuitions myself.
Two, it is one of the few examples of experimenting with different kinds of presentation. I feel like this is underappreciated and under-utilized; finding better ways to communicate seems like a strong baseline requirement of the rationality project, and this post pushes in that direction.
This post has significantly changed my mental model of the key challenges in AI safety, and has also given me a clearer understanding of, and language for describing, why complex game-theoretic challenges are poorly specified or understood. The terms and concepts in this series of posts have become a key part of my basic intellectual toolkit.
This sequence was the first time I felt I understood MIRI’s research.
(Though I might prefer to nominate the text-version that has the whole sequence in one post.)
I read this sequence as research for my EA/rationality novel; it was really good and also pretty easy to follow despite my not having any technical background.