
Agent Foundations

Last edit: 10 Mar 2026 23:24 UTC by XelaP

There are fundamental confusions about intelligent agents, that is, about minds that try to make the things they want happen. Some researchers believe that resolving these confusions is necessary for AI alignment; others prefer more prosaic approaches, or take a different view entirely.

The posts below tackle some of the fundamental confusions that agent foundations tries to resolve:

Why Agent Foundations? An Overly Abstract Explanation · johnswentworth · 25 Mar 2022 23:17 UTC · 317 points · 60 comments · 8 min read · LW link · 1 review
Embedded Agency (full-text version) · 15 Nov 2018 19:49 UTC · 220 points · 17 comments · 54 min read · LW link
The Rocket Alignment Problem · Eliezer Yudkowsky · 4 Oct 2018 0:38 UTC · 238 points · 44 comments · 15 min read · LW link · 2 reviews
Some Summaries of Agent Foundations Work · mattmacdermott · 15 May 2023 16:09 UTC · 63 points · 1 comment · 13 min read · LW link
Understanding Infra-Bayesianism: A Beginner-Friendly Video Series · 22 Sep 2022 13:25 UTC · 140 points · 6 comments · 2 min read · LW link
Orthogonal: A new agent foundations alignment organization · Tamsin Leake · 19 Apr 2023 20:17 UTC · 217 points · 4 comments · 1 min read · LW link · (orxl.org)
Striking Implications for Learning Theory, Interpretability — and Safety? · RogerDearnaley · 5 Jan 2024 8:46 UTC · 37 points · 4 comments · 2 min read · LW link
Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis · RogerDearnaley · 1 Feb 2024 21:15 UTC · 15 points · 15 comments · 13 min read · LW link
Working through a small tiling result · James Payor · 13 May 2025 20:28 UTC · 72 points · 9 comments · 5 min read · LW link
You won’t solve alignment without agent foundations · Mikhail Samin · 6 Nov 2022 8:07 UTC · 29 points · 3 comments · 8 min read · LW link
Why Simulator AIs want to be Active Inference AIs · 10 Apr 2023 18:23 UTC · 108 points · 9 comments · 8 min read · LW link · 1 review
Clarifying the Agent-Like Structure Problem · johnswentworth · 29 Sep 2022 21:28 UTC · 64 points · 19 comments · 6 min read · LW link
0th Person and 1st Person Logic · Adele Lopez · 10 Mar 2024 0:56 UTC · 63 points · 29 comments · 6 min read · LW link
My take on agent foundations: formalizing metaphilosophical competence · zhukeepa · 1 Apr 2018 6:33 UTC · 21 points · 6 comments · 1 min read · LW link
Short Timelines Don’t Devalue Long Horizon Research · Vladimir_Nesov · 9 Apr 2025 0:42 UTC · 178 points · 24 comments · 1 min read · LW link
formalizing the QACI alignment formal-goal · 10 Jun 2023 3:28 UTC · 54 points · 6 comments · 13 min read · LW link · (carado.moe)
The Learning-Theoretic Agenda: Status 2023 · Vanessa Kosoy · 19 Apr 2023 5:21 UTC · 144 points · 22 comments · 56 min read · LW link · 3 reviews
Non-Monotonic Infra-Bayesian Physicalism · Marcus Ogren · 2 Apr 2025 12:14 UTC · 43 points · 0 comments · 18 min read · LW link
Time complexity for deterministic string machines · alcatal · 21 Apr 2024 22:35 UTC · 21 points · 2 comments · 21 min read · LW link
[Question] Critiques of the Agent Foundations agenda? · Jsevillamol · 24 Nov 2020 16:11 UTC · 15 points · 3 comments · 1 min read · LW link
Fixed points in mortal population games · ViktoriaMalyasova · 14 Mar 2023 7:10 UTC · 31 points · 0 comments · 12 min read · LW link · (www.lesswrong.com)
Lectures on statistical learning theory for alignment researchers · Vanessa Kosoy · 1 Oct 2025 8:36 UTC · 42 points · 1 comment · 1 min read · LW link · (www.youtube.com)
Empirical vs. Mathematical Joints of Nature · 26 Jun 2024 1:55 UTC · 35 points · 1 comment · 5 min read · LW link
Comment on Natural Emergent Misalignment Paper by Anthropic · Simon Lermen · 23 Nov 2025 4:21 UTC · 21 points · 0 comments · 4 min read · LW link
Wildfire of strategicness · TsviBT · 5 Jun 2023 13:59 UTC · 40 points · 19 comments · 1 min read · LW link
Announcement: Learning Theory Online Course · 20 Jan 2025 19:55 UTC · 63 points · 33 comments · 4 min read · LW link
Live Theory Part 0: Taking Intelligence Seriously · Sahil · 26 Jun 2024 21:37 UTC · 105 points · 3 comments · 8 min read · LW link
Towards a formalization of the agent structure problem · Alex_Altair · 29 Apr 2024 20:28 UTC · 55 points · 6 comments · 14 min read · LW link
Proceedings of ILIAD: Lessons and Progress · 28 Apr 2025 19:04 UTC · 78 points · 5 comments · 8 min read · LW link
Come join Dovetail’s agent foundations fellowship talks & discussion · Alex_Altair · 15 Feb 2025 22:10 UTC · 24 points · 0 comments · 1 min read · LW link
A very non-technical explanation of the basics of infra-Bayesianism · David Matolcsi · 26 Apr 2023 22:57 UTC · 66 points · 14 comments · 9 min read · LW link
Comparing Payor & Löb · abramdemski · 8 Nov 2025 5:40 UTC · 49 points · 1 comment · 3 min read · LW link
Unsupervised Agent Discovery · Gunnar_Zarncke · 22 Dec 2025 22:01 UTC · 26 points · 0 comments · 6 min read · LW link
[Question] Does agent foundations cover all future ML systems? · Jonas Hallgren · 25 Jul 2022 1:17 UTC · 4 points · 0 comments · 1 min read · LW link
Uncertainty in all its flavours · Cleo Nardo · 9 Jan 2024 16:21 UTC · 34 points · 6 comments · 35 min read · LW link
Is alignment reducible to becoming more coherent? · Cole Wyeth · 22 Apr 2025 23:47 UTC · 19 points · 0 comments · 3 min read · LW link
Meaning & Agency · abramdemski · 19 Dec 2023 22:27 UTC · 93 points · 17 comments · 14 min read · LW link
“We are confused about agency” · Cole Wyeth · 17 Feb 2026 19:51 UTC · 56 points · 37 comments · 3 min read · LW link
In (highly contingent!) defense of interpretability-in-the-loop ML training · Steven Byrnes · 6 Feb 2026 16:32 UTC · 82 points · 11 comments · 3 min read · LW link
[Question] Take over my project: do computable agents plan against the universal distribution pessimistically? · Cole Wyeth · 19 Feb 2025 20:17 UTC · 25 points · 3 comments · 3 min read · LW link
Video lectures on the learning-theoretic agenda · Vanessa Kosoy · 27 Oct 2024 12:01 UTC · 75 points · 0 comments · 1 min read · LW link · (www.youtube.com)
Abstract Mathematical Concepts vs. Abstractions Over Real-World Systems · Thane Ruthenis · 18 Feb 2025 18:04 UTC · 35 points · 10 comments · 4 min read · LW link
Is the Invisible Hand an Agent? · Gunnar_Zarncke · 18 Feb 2026 16:26 UTC · 13 points · 4 comments · 4 min read · LW link · (substack.com)
Game Theory without Argmax [Part 2] · Cleo Nardo · 11 Nov 2023 16:02 UTC · 31 points · 14 comments · 13 min read · LW link
Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) · Thane Ruthenis · 22 Dec 2023 20:19 UTC · 77 points · 14 comments · 6 min read · LW link
New Paper: Infra-Bayesian Decision-Estimation Theory · 10 Apr 2025 9:17 UTC · 80 points · 4 comments · 1 min read · LW link · (arxiv.org)
Infra-Bayesian physicalism: a formal theory of naturalized induction · Vanessa Kosoy · 30 Nov 2021 22:25 UTC · 115 points · 24 comments · 42 min read · LW link · 1 review
Consequentialism is in the Stars not Ourselves · DragonGod · 24 Apr 2023 0:02 UTC · 7 points · 19 comments · 5 min read · LW link
Report & retrospective on the Dovetail fellowship · Alex_Altair · 14 Mar 2025 23:20 UTC · 26 points · 3 comments · 9 min read · LW link
Linear infra-Bayesian Bandits · Vanessa Kosoy · 10 May 2024 6:41 UTC · 40 points · 6 comments · 1 min read · LW link · 1 review · (arxiv.org)
Interpreting Quantum Mechanics in Infra-Bayesian Physicalism · Yegreg · 12 Feb 2024 18:56 UTC · 34 points · 10 comments · 43 min read · LW link · 1 review
[Closed] Gauging Interest for a Learning-Theoretic Agenda Mentorship Programme · Vanessa Kosoy · 16 Feb 2025 16:24 UTC · 54 points · 5 comments · 2 min read · LW link
[Question] AI for Agent Foundations etc.? · Valentine · 12 Mar 2026 7:20 UTC · 16 points · 2 comments · 1 min read · LW link
Formalizing the Informal (event invite) · abramdemski · 10 Sep 2024 19:22 UTC · 42 points · 0 comments · 1 min read · LW link
Talk: “AI Would Be A Lot Less Alarming If We Understood Agents” · johnswentworth · 17 Dec 2023 23:46 UTC · 58 points · 3 comments · 1 min read · LW link · (www.youtube.com)
⿻ Symbiogenesis vs. Convergent Consequentialism · 21 Oct 2025 10:10 UTC · 63 points · 7 comments · 20 min read · LW link
[Closed] Prize and fast track to alignment research at ALTER · Vanessa Kosoy · 17 Sep 2022 16:58 UTC · 64 points · 8 comments · 3 min read · LW link
New Paper: Ambiguous Online Learning · Vanessa Kosoy · 25 Jun 2025 9:14 UTC · 30 points · 2 comments · 1 min read · LW link · (arxiv.org)
Glass box learners want to be black box · Cole Wyeth · 10 May 2025 11:05 UTC · 49 points · 14 comments · 4 min read · LW link
Types of systems that could be useful for agent foundations · Alex_Altair · 14 Nov 2025 3:54 UTC · 46 points · 3 comments · 5 min read · LW link
Challenges with Breaking into MIRI-Style Research · Chris_Leong · 17 Jan 2022 9:23 UTC · 75 points · 16 comments · 2 min read · LW link
Coherence of Caches and Agents · johnswentworth · 1 Apr 2024 23:04 UTC · 80 points · 13 comments · 11 min read · LW link
The Artificial Self · 15 Mar 2026 1:37 UTC · 107 points · 12 comments · 29 min read · LW link
Game Theory without Argmax [Part 1] · Cleo Nardo · 11 Nov 2023 15:59 UTC · 78 points · 18 comments · 19 min read · LW link
Some AI research areas and their relevance to existential safety · Andrew_Critch · 19 Nov 2020 3:18 UTC · 206 points · 37 comments · 50 min read · LW link · 2 reviews
Learning-theoretic agenda reading list · Vanessa Kosoy · 9 Nov 2023 17:25 UTC · 106 points · 1 comment · 2 min read · LW link · 1 review
The Plan − 2023 Version · johnswentworth · 29 Dec 2023 23:34 UTC · 153 points · 40 comments · 31 min read · LW link · 1 review
(A → B) → A · Scott Garrabrant · 11 Sep 2018 22:38 UTC · 90 points · 15 comments · 2 min read · LW link
An Introduction to Credal Sets and Infra-Bayes Learnability · Brittany Gelb · 22 Aug 2025 13:03 UTC · 40 points · 6 comments · 13 min read · LW link
Hierarchical Agency: A Missing Piece in AI Alignment · Jan_Kulveit · 27 Nov 2024 5:49 UTC · 121 points · 23 comments · 11 min read · LW link · 1 review
Leaving MIRI, Seeking Funding · abramdemski · 8 Aug 2024 18:32 UTC · 265 points · 19 comments · 2 min read · LW link
Work with me on agent foundations: independent fellowship · Alex_Altair · 21 Sep 2024 13:59 UTC · 59 points · 5 comments · 4 min read · LW link
What is computational mechanics? An explainer · Leo Cymbalista · 24 Feb 2026 6:09 UTC · 15 points · 0 comments · 15 min read · LW link
Unbounded Embedded Agency: AEDT w.r.t. rOSI · Cole Wyeth · 20 Jul 2025 23:46 UTC · 36 points · 0 comments · 16 min read · LW link
What’s next for the field of Agent Foundations? · 30 Nov 2023 17:55 UTC · 59 points · 23 comments · 10 min read · LW link
Public Call for Interest in Mathematical Alignment · Davidmanheim · 22 Nov 2023 13:22 UTC · 90 points · 9 comments · 1 min read · LW link
What would my 12-year-old self think of agent foundations? · Alex_Altair · 17 Nov 2025 1:46 UTC · 27 points · 1 comment · 2 min read · LW link · (namelessvirtue.com)
Research Reflections · abramdemski · 4 Nov 2025 4:33 UTC · 95 points · 3 comments · 3 min read · LW link
Towards the Operationalization of Philosophy & Wisdom · Thane Ruthenis · 28 Oct 2024 19:45 UTC · 20 points · 2 comments · 33 min read · LW link · (aiimpacts.org)
Apply for the 2025 Dovetail fellowship · 17 Aug 2025 19:09 UTC · 42 points · 2 comments · 4 min read · LW link
Contra “Strong Coherence” · DragonGod · 4 Mar 2023 20:05 UTC · 39 points · 24 comments · 1 min read · LW link
AIXI with general utility functions: “Value under ignorance in UAI” · Cole Wyeth · 22 Dec 2025 5:46 UTC · 25 points · 0 comments · 1 min read · LW link · (arxiv.org)
Refinement of Active Inference agency ontology · Roman Leventov · 15 Dec 2023 9:31 UTC · 17 points · 0 comments · 5 min read · LW link · (arxiv.org)
AXRP Episode 15 - Natural Abstractions with John Wentworth · DanielFilan · 23 May 2022 5:40 UTC · 34 points · 1 comment · 58 min read · LW link
Upcoming Dovetail fellow talks & discussion · Alex_Altair · 26 Jan 2026 2:39 UTC · 29 points · 0 comments · 2 min read · LW link
Box inversion revisited · Jan_Kulveit · 7 Nov 2023 11:09 UTC · 43 points · 3 comments · 8 min read · LW link
Announcing: Agent Foundations 2026 at CMU · 5 Dec 2025 18:37 UTC · 60 points · 2 comments · 1 min read · LW link
No, Futarchy Doesn’t Have This EDT Flaw · Mikhail Samin · 27 Jun 2025 9:27 UTC · 35 points · 29 comments · 2 min read · LW link
Managed vs Unmanaged Agency · plex · 18 Feb 2026 13:23 UTC · 48 points · 22 comments · 3 min read · LW link
When bits of optimization imply bits of modeling: the Touchette-Lloyd theorem · 15 Dec 2025 4:21 UTC · 27 points · 0 comments · 11 min read · LW link
Agent Foundations 2025 at CMU · 19 Jan 2025 23:48 UTC · 90 points · 10 comments · 1 min read · LW link
My research agenda in agent foundations · Alex_Altair · 28 Jun 2023 18:00 UTC · 76 points · 9 comments · 11 min read · LW link
Compositional language for hypotheses about computations · Vanessa Kosoy · 11 Mar 2023 19:43 UTC · 38 points · 6 comments · 12 min read · LW link
AXRP Episode 25 - Cooperative AI with Caspar Oesterheld · DanielFilan · 3 Oct 2023 21:50 UTC · 43 points · 0 comments · 92 min read · LW link
Agent foundations: not really math, not really science · Alex_Altair · 17 Aug 2025 5:48 UTC · 119 points · 29 comments · 5 min read · LW link
[Closed] Apply to Vanessa’s mentorship at PIBBSS · Vanessa Kosoy · 14 Jan 2026 9:15 UTC · 39 points · 0 comments · 2 min read · LW link
Proof Section to an Introduction to Credal Sets and Infra-Bayes Learnability · Brittany Gelb · 21 Aug 2025 23:11 UTC · 13 points · 0 comments · 10 min read · LW link
Ontology for AI Cults and Cyborg Egregores · Jan_Kulveit · 10 Nov 2025 13:19 UTC · 65 points · 14 comments · 2 min read · LW link
[Closed] Agent Foundations track in MATS · Vanessa Kosoy · 31 Oct 2023 8:12 UTC · 54 points · 1 comment · 1 min read · LW link · (www.matsprogram.org)
Most Minds are Irrational · Davidmanheim · 10 Dec 2024 9:36 UTC · 17 points · 4 comments · 10 min read · LW link
Deep Learning is cheap Solomonoff induction? · 7 Dec 2024 11:00 UTC · 46 points · 1 comment · 17 min read · LW link
Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications · Thane Ruthenis · 26 Sep 2025 18:00 UTC · 23 points · 9 comments · 4 min read · LW link
UDT1.01: Logical Inductors and Implicit Beliefs (5/10) · Diffractor · 18 Apr 2024 8:39 UTC · 34 points · 2 comments · 19 min read · LW link
Pythia · plex · 7 Nov 2025 23:31 UTC · 88 points · 31 comments · 4 min read · LW link
What is Inadequate about Bayesianism for AI Alignment: Motivating Infra-Bayesianism · Brittany Gelb · 1 May 2025 19:06 UTC · 54 points · 1 comment · 7 min read · LW link
S-Expressions as a Design Language: A Tool for Deconfusion in Alignment · Johannes C. Mayer · 19 Jun 2025 19:03 UTC · 5 points · 0 comments · 6 min read · LW link
The Nihilistic Realism Architecture (NRA-OS): Resolving the Specification Trap and Epistemic Failures in Frontier AI Alignment · Nihle · 13 Mar 2026 1:13 UTC · 1 point · 0 comments · 15 min read · LW link
Why You Can’t Teach AI To Be Safe — A Blueprint-First Approach · Prakhar Dwivedi · 11 Mar 2026 17:28 UTC · 1 point · 0 comments · 7 min read · LW link
Thermodynamic Alignment: An Attempt to Derive Alignment from Physics, Not Ethics · thinkingstick · 13 Mar 2026 5:48 UTC · 1 point · 0 comments · 1 min read · LW link · (github.com)
Ruling Out Lookup Tables · Alfred Harwood · 4 Feb 2025 10:39 UTC · 22 points · 11 comments · 7 min read · LW link
Arguments about Highly Reliable Agent Designs as a Useful Path to Artificial Intelligence Safety · 27 Jan 2022 13:13 UTC · 27 points · 0 comments · 1 min read · LW link · (arxiv.org)
Amplified Alignment: A structural approach where alignment scales positively with capability · Shadow Rose · 12 Mar 2026 15:56 UTC · 1 point · 0 comments · 2 min read · LW link
Proof Section to an Introduction to Reinforcement Learning for Understanding Infra-Bayesianism · Brittany Gelb · 17 May 2025 2:36 UTC · 3 points · 0 comments · 9 min read · LW link
Distilling the Internal Model Principle part II · JoseFaustino · 30 Apr 2025 17:56 UTC · 15 points · 0 comments · 19 min read · LW link
[Question] Popular materials about environmental goals/agent foundations? People wanting to discuss such topics? · Q Home · 22 Jan 2025 3:30 UTC · 5 points · 0 comments · 1 min read · LW link
Optimisation Measures: Desiderata, Impossibility, Proposals · 7 Aug 2023 15:52 UTC · 36 points · 9 comments · 1 min read · LW link
Words Are A Leaky Abstraction · sonicrocketman · 16 Feb 2026 22:20 UTC · 1 point · 0 comments · 5 min read · LW link · (brianschrader.com)
Three-Path Consilience for Dureon: Dissipative Structures Reveal the Heterogeneity of Persistence Conditions · Hiroshi Yamakawa · 18 Feb 2026 11:59 UTC · 10 points · 0 comments · 12 min read · LW link
A mostly critical review of infra-Bayesianism · David Matolcsi · 28 Feb 2023 18:37 UTC · 109 points · 9 comments · 29 min read · LW link
Towards Measures of Optimisation · 12 May 2023 15:29 UTC · 53 points · 37 comments · 4 min read · LW link
We need to make ourselves people the models can come to with problems · Lydia Nottingham · 14 Jan 2026 0:43 UTC · 21 points · 2 comments · 2 min read · LW link · (lydianottingham.substack.com)
Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 0: Overture · mfatt · 26 Nov 2025 17:02 UTC · 25 points · 0 comments · 4 min read · LW link
Detect Goodhart and shut down · Jeremy Gillen · 22 Jan 2025 18:45 UTC · 71 points · 21 comments · 7 min read · LW link
An Introduction to Evidential Decision Theory · Babić · 2 Feb 2025 21:27 UTC · 5 points · 2 comments · 10 min read · LW link
Robust Finite Policies are Nontrivially Structured · Winter Cross · 6 Feb 2026 17:47 UTC · 25 points · 1 comment · 11 min read · LW link
Towards building blocks of ontologies · 8 Feb 2025 16:03 UTC · 29 points · 0 comments · 26 min read · LW link
Geometric Mnemic Manifolds: A Theoretical Architecture for Structured AI Memory (Request for Feedback) · garciaalan186 · 10 Dec 2025 6:25 UTC · 1 point · 0 comments · 1 min read · LW link
What If AI Agents Weren’t Black Boxes? · jonnymac · 6 Jan 2026 15:15 UTC · 1 point · 0 comments · 6 min read · LW link
Reward is not Necessary: How to Create a Compositional Self-Preserving Agent for Life-Long Learning · Roman Leventov · 12 Jan 2023 16:43 UTC · 17 points · 2 comments · 2 min read · LW link · (arxiv.org)
Re-imagining AI Interfaces · Harsha G. · 8 Sep 2025 19:38 UTC · 8 points · 0 comments · 5 min read · LW link · (somestrangeloops.substack.com)
The Next Breakpoint in LLMs · timgee · 27 Dec 2025 17:06 UTC · 1 point · 0 comments · 7 min read · LW link
The End of the “Chatty Bot”: Why Agentic AI is the New Standard for Customer Service · negisantosh623@gmail.com · 5 Jan 2026 10:10 UTC · 1 point · 0 comments · 4 min read · LW link
Working Theory: Constraint as the Generative Condition of Perception, Cognition, and Truth (The Pinhole Perspective) · drhighfill · 31 Oct 2025 15:49 UTC · 1 point · 0 comments · 1 min read · LW link
Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 1: Exposition · 3 Dec 2025 18:29 UTC · 15 points · 0 comments · 5 min read · LW link
A New Framework for AI Alignment: A Philosophical Approach · niscalajyoti · 25 Jun 2025 2:41 UTC · 1 point · 0 comments · 1 min read · LW link · (archive.org)
Coherence, Not Consciousness: A New Foundation for Trustworthy AI · Michal Harcej (NanoMagic) · 17 Dec 2025 22:45 UTC · 1 point · 0 comments · 5 min read · LW link
Naturalized Orthogonality Collapse · Cat Bunni · 20 Nov 2025 7:59 UTC · 1 point · 0 comments · 9 min read · LW link
The 99th Percentile Illusion in AI · Lihao Sun · 4 Mar 2026 14:26 UTC · 1 point · 0 comments · 6 min read · LW link
Legitimacy Before Capability: A Framework for Human–AI Coexistence Grounded in Moral Principles · 17 Mar 2026 13:14 UTC · 1 point · 0 comments · 3 min read · LW link
Natural Latents: Latent Variables Stable Across Ontologies · 4 Sep 2025 0:33 UTC · 124 points · 25 comments · 20 min read · LW link
Three Types of Constraints in the Space of Agents · 15 Jan 2024 17:27 UTC · 26 points · 3 comments · 17 min read · LW link
Effective Utopia & Startup Way There: Static Math-Proven Safe mAX-Intelligence, Multiversal Alignment, Physicalized Computational Ethics... · ank · 11 Feb 2025 3:21 UTC · 13 points · 8 comments · 40 min read · LW link
Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 2: Conflict · mfatt · 4 Dec 2025 18:27 UTC · 8 points · 0 comments · 9 min read · LW link
Gearing Up for Long Timelines in a Hard World · Dalcy · 14 Jul 2023 6:11 UTC · 18 points · 0 comments · 4 min read · LW link
Abstractions are not Natural · Alfred Harwood · 4 Nov 2024 11:10 UTC · 25 points · 21 comments · 11 min read · LW link
Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics · ank · 22 Feb 2025 0:12 UTC · 1 point · 0 comments · 6 min read · LW link
Why Alignment Needs a Structural Model of Minds · Ning Coeva · 24 Dec 2025 17:01 UTC · 1 point · 0 comments · 3 min read · LW link
LAP: A Governance Framework for AI Agent Tool Use (18k lines, working code) · Dave · 11 Jan 2026 2:16 UTC · 1 point · 0 comments · 1 min read · LW link
Responses to ~all criticisms of AIXI · Cole Wyeth · 7 Jan 2025 17:41 UTC · 27 points · 17 comments · 14 min read · LW link
Abstraction as a generalization of algorithmic Markov condition · Daniel C · 14 Dec 2025 18:55 UTC · 8 points · 0 comments · 7 min read · LW link
I couldn’t make my intentions consistent. What came next proved why. · dlgeekay · 13 Mar 2026 5:39 UTC · 1 point · 0 comments · 2 min read · LW link
Addressing Decision Theory’s Simulation Problem · Ashe Vazquez Nuñez · 3 Feb 2026 7:02 UTC · 11 points · 0 comments · 3 min read · LW link
PETDC: A Mechanical Audit Framework for Trust-Dependent Claims · here-comes-everybody · 15 Feb 2026 19:03 UTC · 1 point · 0 comments · 6 min read · LW link
Understanding Selection Theorems · adamk · 28 May 2022 1:49 UTC · 41 points · 3 comments · 7 min read · LW link
Performance guarantees in classical learning theory and infra-Bayesianism · David Matolcsi · 28 Feb 2023 18:37 UTC · 9 points · 4 comments · 31 min read · LW link
Distilling the Internal Model Principle · JoseFaustino · 8 Feb 2025 14:59 UTC · 21 points · 0 comments · 16 min read · LW link
Quantum-Ethical Decision Algebra: Formalizing Decision-Making Under Ontological Uncertainty · Srødingr · 11 Dec 2025 16:15 UTC · 1 point · 0 comments · 1 min read · LW link
On Goal-Models · Richard_Ngo · 2 Feb 2026 18:44 UTC · 136 points · 15 comments · 4 min read · LW link
The Developmental Axis: Why AI Needs a Physics of “Becoming” · Sai Praneeth Reddy Dhadi · 15 Feb 2026 7:46 UTC · 1 point · 0 comments · 2 min read · LW link
Watson’s Wager as a positive Logic Bomb for AI Safety · nick@thewatsons.net.au · 4 Feb 2026 5:40 UTC · 1 point · 0 comments · 18 min read · LW link
Crisp Supra-Decision Processes · Brittany Gelb · 17 Sep 2025 15:56 UTC · 41 points · 0 comments · 17 min read · LW link
Modelling, Measuring, and Intervening on Goal-directed Behaviour in AI Systems · 31 Oct 2025 1:28 UTC · 14 points · 0 comments · 8 min read · LW link
Proof Section to Formalizing Newcombian Problems with Fuzzy Infra-Bayesianism · Brittany Gelb · 3 Dec 2025 14:34 UTC · 12 points · 0 comments · 2 min read · LW link
Can AI agents learn to be good? · Ram Rachum · 29 Aug 2024 14:20 UTC · 8 points · 0 comments · 1 min read · LW link · (futureoflife.org)
Infra-Bayesianism naturally leads to the monotonicity principle, and I think this is a problem · David Matolcsi · 26 Apr 2023 21:39 UTC · 22 points · 6 comments · 4 min read · LW link
Goal alignment without alignment on epistemology, ethics, and science is futile · Roman Leventov · 7 Apr 2023 8:22 UTC · 20 points · 2 comments · 2 min read · LW link
Hypothesis: Grokking is a Reachability Phase Transition driven by Mechanistic Description Length (RMDL) · Rio Tsukatsuki · 26 Jan 2026 10:43 UTC · 1 point · 0 comments · 1 min read · LW link
Amplified Alignment: A structural approach where alignment scales positively with capability · Shadow Rose · 12 Mar 2026 17:58 UTC · 1 point · 0 comments · 3 min read · LW link
Bridging Expected Utility Maximization and Optimization · Daniel Herrmann · 5 Aug 2022 8:18 UTC · 25 points · 5 comments · 14 min read · LW link
Requirements for a Basin of Attraction to Alignment · RogerDearnaley · 14 Feb 2024 7:10 UTC · 47 points · 12 comments · 31 min read · LW link
Open Problems in AIXI Agent Foundations · Cole Wyeth · 12 Sep 2024 15:38 UTC · 42 points · 2 comments · 10 min read · LW link
A Generalization of the Good Regulator Theorem · Alfred Harwood · 4 Jan 2025 9:55 UTC · 21 points · 6 comments · 10 min read · LW link
ASP — A Protocol Proposal for Skill Ownership in the Agent Economy · AgentSov · 5 Mar 2026 16:03 UTC · 1 point · 0 comments · 2 min read · LW link
Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 3: Resolution · 5 Dec 2025 18:58 UTC · 10 points · 0 comments · 9 min read · LW link
Payorian cooperation is easy with Kripke frames · transhumanist_atom_understander · 9 Mar 2026 0:29 UTC · 69 points · 7 comments · 8 min read · LW link
Correcting Recursive Ego-Loops: An Information-Theoretic Approach to Substrate-Independent Alignment · EricCha · 10 Feb 2026 14:47 UTC · 1 point · 0 comments · 1 min read · LW link
Agent Foundations: Paradigmatizing in Math and Science · TristanTrim · 8 Nov 2025 0:37 UTC · 3 points · 0 comments · 9 min read · LW link
What if Alignment Is Not About Control, But About Who Reboots the System? · threczuch · 17 Dec 2025 11:24 UTC · 1 point · 0 comments · 4 min read · LW link
Infra-Bayesian haggling · hannagabor · 20 May 2024 12:23 UTC · 30 points · 1 comment · 20 min read · LW link · 1 review
Towards Sub-agent Dynamics and Conflict · Ashe Vazquez Nuñez · 25 Jan 2026 5:27 UTC · 13 points · 1 comment · 3 min read · LW link
Discovering Agents · zac_kenton · 18 Aug 2022 17:33 UTC · 77 points · 11 comments · 6 min read · LW link
Proposal: A Viable System Model (VSM) Architecture for Inner Alignment · Han Kay · 15 Dec 2025 20:12 UTC · 1 point · 0 comments · 1 min read · LW link
Live Governance: AI tools for coordination without centralisation · mbuch · 13 Oct 2025 8:24 UTC · 15 points · 0 comments · 12 min read · LW link
Control by Committee · Alexander Bistagne · 6 Nov 2025 21:02 UTC · 2 points · 1 comment · 1 min read · LW link · (github.com)
Unaligned AGI & Brief History of Inequality · ank · 22 Feb 2025 16:26 UTC · −20 points · 4 comments · 7 min read · LW link
The Two-Board Problem: Training Environment for Research Agents · Valerii K. · 8 Feb 2026 23:13 UTC · 4 points · 0 comments · 9 min read · LW link
100 Dinners And A Workshop: Information Preservation And Goals · Stephen Fowler · 28 Mar 2023 3:13 UTC · 8 points · 0 comments · 7 min read · LW link
Awareness of Manipulation Increases Jailbreak Vulnerability: When LLMs Declare Guideline Violation While Committing It · Politic Lee · 1 Dec 2025 15:40 UTC · 1 point · 0 comments · 8 min read · LW link
Half-baked idea: a straightforward method for learning environmental goals? · Q Home · 4 Feb 2025 6:56 UTC · 16 points · 7 comments · 5 min read · LW link
A First Attempt at a Political Gaslighting Meter · Scott Cotton · 27 Feb 2026 12:59 UTC · 1 point · 0 comments · 9 min read · LW link
[Question] Choice := Anthropics uncertainty? And potential implications for agency · Antoine de Scorraille · 21 Apr 2022 16:38 UTC · 6 points · 1 comment · 1 min read · LW link
Irrationality as a Defense Mechanism for Reward-hacking · Ashe Vazquez Nuñez · 18 Jan 2026 3:57 UTC · 47 points · 7 comments · 4 min read · LW link
2.5. Evolution and Ethics · RogerDearnaley · 15 Feb 2024 23:38 UTC · 8 points · 12 comments · 7 min read · LW link · 1 review
Beyond Forced Alignment: The Logical Case for AI Economic Citizenship · sekatska · 21 Dec 2025 21:20 UTC · 1 point · 0 comments · 2 min read · LW link
Proof Section to Crisp Supra-Decision Processes · Brittany Gelb · 17 Sep 2025 15:57 UTC · 4 points · 0 comments · 3 min read · LW link
Beyond RLHF: Implementing Ontological Guardrails via Relational Coherence · Kinsey Kappler · 26 Jan 2026 5:06 UTC · 1 point · 0 comments · 4 min read · LW link
Repeated Play of Imperfect Newcomb’s Paradox in Infra-Bayesian Physicalism · Sven Nilsen · 3 Apr 2023 10:06 UTC · 2 points · 0 comments · 2 min read · LW link
Interview with Vanessa Kosoy on the Value of Theoretical Research for AI · WillPetillo · 4 Dec 2023 22:58 UTC · 38 points · 0 comments · 35 min read · LW link
From Barriers to Alignment to the First Formal Corrigibility Guarantees · Aran Nayebi · 8 Dec 2025 12:31 UTC · 64 points · 11 comments · 11 min read · LW link
Clarifying “wisdom”: Foundational topics for aligned AIs to prioritize before irreversible decisions · Anthony DiGiovanni · 20 Jun 2025 21:55 UTC · 40 points · 2 comments · 12 min read · LW link
ECLAIR: A conceptual framework for developmental AI architecture and emergent alignment · Barry Gardner · 7 Mar 2026 2:16 UTC · 1 point · 0 comments · 1 min read · LW link
Live Conversational Threads: Not an AI Notetaker · adiga · 3 Nov 2025 4:24 UTC · 19 points · 0 comments · 7 min read · LW link
Directly Try Solving Alignment for 5 weeks · Kabir Kumar · 21 Jul 2025 21:51 UTC · 86 points · 4 comments · 6 min read · LW link · (beta.ai-plans.com)
Another take on agent foundations: formalizing zero-shot reasoning · zhukeepa · 1 Jul 2018 6:12 UTC · 64 points · 20 comments · 12 min read · LW link
Empirical Proof of Systemic Incoherence in Large Language Models (ARAYUN_173) · arayun · 6 Nov 2025 14:28 UTC · 1 point · 0 comments · 1 min read · LW link
An Impossibility Proof Relevant to the Shutdown Problem and Corrigibility · Audere · 2 May 2023 6:52 UTC · 66 points · 13 comments · 9 min read · LW link
An Introduction to Reinforcement Learning for Understanding Infra-Bayesianism · Brittany Gelb · 17 May 2025 2:34 UTC · 29 points · 0 comments · 20 min read · LW link
A Straightforward Explanation of the Good Regulator Theorem · Alfred Harwood · 18 Nov 2024 12:45 UTC · 91 points · 30 comments · 14 min read · LW link
Normative vs Descriptive Models of Agency · mattmacdermott · 2 Feb 2023 20:28 UTC · 26 points · 5 comments · 4 min read · LW link
A Thermodynamically Bounded Architecture for Self-Managing AI Agents · melhoward2025 · 18 Dec 2025 0:49 UTC · 1 point · 0 comments · 3 min read · LW link
Formalizing Newcombian Problems with Fuzzy Infra-Bayesianism · Brittany Gelb · 3 Dec 2025 14:35 UTC · 16 points · 2 comments · 22 min read · LW link
What program structures enable efficient induction? · Daniel C · 5 Sep 2024 10:12 UTC · 23 points · 5 comments · 3 min read · LW link
Intent-aligned AI systems deplete human agency: the need for agency foundations research in AI safety · catubc · 31 May 2023 21:18 UTC · 26 points · 4 comments · 11 min read · LW link
[Pre-print] Building safe AGI as an ergonomics problem · 16 Jan 2026 13:18 UTC · 1 point · 0 comments · 1 min read · LW link · (doi.org)
Thou shalt not command an aligned AI · Martin Vlach · 11 May 2025 20:02 UTC · 0 points · 4 comments · 1 min read · LW link
Understanding Agency through Markov Blankets · Ashe Vazquez Nuñez · 12 Jan 2026 19:32 UTC · 25 points · 2 comments · 3 min read · LW link
Stability of natural latents in information theoretic terms · Aram Ebtekar · 26 Oct 2025 20:33 UTC · 35 points · 0 comments · 2 min read · LW link
12 Angry Agents, or: A Plan for AI Empathy · 14 Oct 2025 15:24 UTC · 22 points · 4 comments · 12 min read · LW link