
Agent Foundations

Last edit: 10 Mar 2026 23:24 UTC by XelaP

There are fundamental confusions about intelligent agents, that is, about minds that try to make the things they want happen. Some researchers believe that resolving these confusions is necessary for AI alignment; others prefer more prosaic approaches, or take a different view entirely.

The posts below tackle some of the fundamental confusions that agent foundations tries to resolve:

Why Agent Foundations? An Overly Abstract Explanation · johnswentworth · 25 Mar 2022 23:17 UTC · 317 points · 60 comments · 8 min read · LW link · 1 review
Embedded Agency (full-text version) · 15 Nov 2018 19:49 UTC · 220 points · 17 comments · 54 min read · LW link
The Rocket Alignment Problem · Eliezer Yudkowsky · 4 Oct 2018 0:38 UTC · 238 points · 44 comments · 15 min read · LW link · 2 reviews
Some Summaries of Agent Foundations Work · mattmacdermott · 15 May 2023 16:09 UTC · 63 points · 1 comment · 13 min read · LW link
Understanding Infra-Bayesianism: A Beginner-Friendly Video Series · 22 Sep 2022 13:25 UTC · 140 points · 6 comments · 2 min read · LW link
Orthogonal: A new agent foundations alignment organization · Tamsin Leake · 19 Apr 2023 20:17 UTC · 217 points · 4 comments · 1 min read · LW link · (orxl.org)
Striking Implications for Learning Theory, Interpretability — and Safety? · RogerDearnaley · 5 Jan 2024 8:46 UTC · 37 points · 4 comments · 2 min read · LW link
Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis · RogerDearnaley · 1 Feb 2024 21:15 UTC · 15 points · 15 comments · 13 min read · LW link
Working through a small tiling result · James Payor · 13 May 2025 20:28 UTC · 72 points · 9 comments · 5 min read · LW link
You won’t solve alignment without agent foundations · Mikhail Samin · 6 Nov 2022 8:07 UTC · 29 points · 3 comments · 8 min read · LW link
Why Simulator AIs want to be Active Inference AIs · 10 Apr 2023 18:23 UTC · 108 points · 9 comments · 8 min read · LW link · 1 review
Clarifying the Agent-Like Structure Problem · johnswentworth · 29 Sep 2022 21:28 UTC · 64 points · 19 comments · 6 min read · LW link
0th Person and 1st Person Logic · Adele Lopez · 10 Mar 2024 0:56 UTC · 63 points · 29 comments · 6 min read · LW link
My take on agent foundations: formalizing metaphilosophical competence · zhukeepa · 1 Apr 2018 6:33 UTC · 21 points · 6 comments · 1 min read · LW link
Short Timelines Don’t Devalue Long Horizon Research · Vladimir_Nesov · 9 Apr 2025 0:42 UTC · 178 points · 24 comments · 1 min read · LW link
formalizing the QACI alignment formal-goal · 10 Jun 2023 3:28 UTC · 54 points · 6 comments · 13 min read · LW link · (carado.moe)
The Learning-Theoretic Agenda: Status 2023 · Vanessa Kosoy · 19 Apr 2023 5:21 UTC · 144 points · 22 comments · 56 min read · LW link · 3 reviews
Non-Monotonic Infra-Bayesian Physicalism · Marcus Ogren · 2 Apr 2025 12:14 UTC · 43 points · 0 comments · 18 min read · LW link
Time complexity for deterministic string machines · alcatal · 21 Apr 2024 22:35 UTC · 21 points · 2 comments · 21 min read · LW link
[Question] Critiques of the Agent Foundations agenda? · Jsevillamol · 24 Nov 2020 16:11 UTC · 15 points · 3 comments · 1 min read · LW link
Fixed points in mortal population games · ViktoriaMalyasova · 14 Mar 2023 7:10 UTC · 31 points · 0 comments · 12 min read · LW link · (www.lesswrong.com)
Lectures on statistical learning theory for alignment researchers · Vanessa Kosoy · 1 Oct 2025 8:36 UTC · 42 points · 1 comment · 1 min read · LW link · (www.youtube.com)
Empirical vs. Mathematical Joints of Nature · 26 Jun 2024 1:55 UTC · 35 points · 1 comment · 5 min read · LW link
Comment on Natural Emergent Misalignment Paper by Anthropic · Simon Lermen · 23 Nov 2025 4:21 UTC · 21 points · 0 comments · 4 min read · LW link
Wildfire of strategicness · TsviBT · 5 Jun 2023 13:59 UTC · 40 points · 19 comments · 1 min read · LW link
Announcement: Learning Theory Online Course · 20 Jan 2025 19:55 UTC · 63 points · 33 comments · 4 min read · LW link
Live Theory Part 0: Taking Intelligence Seriously · Sahil · 26 Jun 2024 21:37 UTC · 105 points · 3 comments · 8 min read · LW link
Towards a formalization of the agent structure problem · Alex_Altair · 29 Apr 2024 20:28 UTC · 55 points · 6 comments · 14 min read · LW link
Proceedings of ILIAD: Lessons and Progress · 28 Apr 2025 19:04 UTC · 78 points · 5 comments · 8 min read · LW link
Come join Dovetail’s agent foundations fellowship talks & discussion · Alex_Altair · 15 Feb 2025 22:10 UTC · 24 points · 0 comments · 1 min read · LW link
A very non-technical explanation of the basics of infra-Bayesianism · David Matolcsi · 26 Apr 2023 22:57 UTC · 66 points · 14 comments · 9 min read · LW link
Comparing Payor & Löb · abramdemski · 8 Nov 2025 5:40 UTC · 49 points · 1 comment · 3 min read · LW link
Unsupervised Agent Discovery · Gunnar_Zarncke · 22 Dec 2025 22:01 UTC · 26 points · 0 comments · 6 min read · LW link
[Question] Does agent foundations cover all future ML systems? · Jonas Hallgren · 25 Jul 2022 1:17 UTC · 4 points · 0 comments · 1 min read · LW link
Uncertainty in all its flavours · Cleo Nardo · 9 Jan 2024 16:21 UTC · 34 points · 6 comments · 35 min read · LW link
Is alignment reducible to becoming more coherent? · Cole Wyeth · 22 Apr 2025 23:47 UTC · 19 points · 0 comments · 3 min read · LW link
Meaning & Agency · abramdemski · 19 Dec 2023 22:27 UTC · 93 points · 17 comments · 14 min read · LW link
“We are confused about agency” · Cole Wyeth · 17 Feb 2026 19:51 UTC · 56 points · 37 comments · 3 min read · LW link
In (highly contingent!) defense of interpretability-in-the-loop ML training · Steven Byrnes · 6 Feb 2026 16:32 UTC · 82 points · 11 comments · 3 min read · LW link
[Question] Take over my project: do computable agents plan against the universal distribution pessimistically? · Cole Wyeth · 19 Feb 2025 20:17 UTC · 25 points · 3 comments · 3 min read · LW link
Video lectures on the learning-theoretic agenda · Vanessa Kosoy · 27 Oct 2024 12:01 UTC · 75 points · 0 comments · 1 min read · LW link · (www.youtube.com)
Abstract Mathematical Concepts vs. Abstractions Over Real-World Systems · Thane Ruthenis · 18 Feb 2025 18:04 UTC · 35 points · 10 comments · 4 min read · LW link
Is the Invisible Hand an Agent? · Gunnar_Zarncke · 18 Feb 2026 16:26 UTC · 13 points · 4 comments · 4 min read · LW link · (substack.com)
Game Theory without Argmax [Part 2] · Cleo Nardo · 11 Nov 2023 16:02 UTC · 31 points · 14 comments · 13 min read · LW link
Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) · Thane Ruthenis · 22 Dec 2023 20:19 UTC · 77 points · 14 comments · 6 min read · LW link
New Paper: Infra-Bayesian Decision-Estimation Theory · 10 Apr 2025 9:17 UTC · 80 points · 4 comments · 1 min read · LW link · (arxiv.org)
Infra-Bayesian physicalism: a formal theory of naturalized induction · Vanessa Kosoy · 30 Nov 2021 22:25 UTC · 115 points · 24 comments · 42 min read · LW link · 1 review
Consequentialism is in the Stars not Ourselves · DragonGod · 24 Apr 2023 0:02 UTC · 7 points · 19 comments · 5 min read · LW link
Report & retrospective on the Dovetail fellowship · Alex_Altair · 14 Mar 2025 23:20 UTC · 26 points · 3 comments · 9 min read · LW link
Linear infra-Bayesian Bandits · Vanessa Kosoy · 10 May 2024 6:41 UTC · 40 points · 6 comments · 1 min read · LW link · 1 review · (arxiv.org)
Interpreting Quantum Mechanics in Infra-Bayesian Physicalism · Yegreg · 12 Feb 2024 18:56 UTC · 34 points · 10 comments · 43 min read · LW link · 1 review
[Closed] Gauging Interest for a Learning-Theoretic Agenda Mentorship Programme · Vanessa Kosoy · 16 Feb 2025 16:24 UTC · 54 points · 5 comments · 2 min read · LW link
[Question] AI for Agent Foundations etc.? · Valentine · 12 Mar 2026 7:20 UTC · 16 points · 2 comments · 1 min read · LW link
Formalizing the Informal (event invite) · abramdemski · 10 Sep 2024 19:22 UTC · 42 points · 0 comments · 1 min read · LW link
Talk: “AI Would Be A Lot Less Alarming If We Understood Agents” · johnswentworth · 17 Dec 2023 23:46 UTC · 58 points · 3 comments · 1 min read · LW link · (www.youtube.com)
⿻ Symbiogenesis vs. Convergent Consequentialism · 21 Oct 2025 10:10 UTC · 63 points · 7 comments · 20 min read · LW link
[Closed] Prize and fast track to alignment research at ALTER · Vanessa Kosoy · 17 Sep 2022 16:58 UTC · 64 points · 8 comments · 3 min read · LW link
New Paper: Ambiguous Online Learning · Vanessa Kosoy · 25 Jun 2025 9:14 UTC · 30 points · 2 comments · 1 min read · LW link · (arxiv.org)
Glass box learners want to be black box · Cole Wyeth · 10 May 2025 11:05 UTC · 49 points · 14 comments · 4 min read · LW link
Types of systems that could be useful for agent foundations · Alex_Altair · 14 Nov 2025 3:54 UTC · 46 points · 3 comments · 5 min read · LW link
Challenges with Breaking into MIRI-Style Research · Chris_Leong · 17 Jan 2022 9:23 UTC · 75 points · 16 comments · 2 min read · LW link
Coherence of Caches and Agents · johnswentworth · 1 Apr 2024 23:04 UTC · 80 points · 13 comments · 11 min read · LW link
The Artificial Self · 15 Mar 2026 1:37 UTC · 107 points · 12 comments · 29 min read · LW link
Game Theory without Argmax [Part 1] · Cleo Nardo · 11 Nov 2023 15:59 UTC · 78 points · 18 comments · 19 min read · LW link
Some AI research areas and their relevance to existential safety · Andrew_Critch · 19 Nov 2020 3:18 UTC · 206 points · 37 comments · 50 min read · LW link · 2 reviews
Learning-theoretic agenda reading list · Vanessa Kosoy · 9 Nov 2023 17:25 UTC · 106 points · 1 comment · 2 min read · LW link · 1 review
The Plan − 2023 Version · johnswentworth · 29 Dec 2023 23:34 UTC · 153 points · 40 comments · 31 min read · LW link · 1 review
(A → B) → A · Scott Garrabrant · 11 Sep 2018 22:38 UTC · 90 points · 15 comments · 2 min read · LW link
An Introduction to Credal Sets and Infra-Bayes Learnability · Brittany Gelb · 22 Aug 2025 13:03 UTC · 40 points · 6 comments · 13 min read · LW link
Hierarchical Agency: A Missing Piece in AI Alignment · Jan_Kulveit · 27 Nov 2024 5:49 UTC · 121 points · 23 comments · 11 min read · LW link · 1 review
Leaving MIRI, Seeking Funding · abramdemski · 8 Aug 2024 18:32 UTC · 265 points · 19 comments · 2 min read · LW link
Work with me on agent foundations: independent fellowship · Alex_Altair · 21 Sep 2024 13:59 UTC · 59 points · 5 comments · 4 min read · LW link
What is computational mechanics? An explainer · Leo Cymbalista · 24 Feb 2026 6:09 UTC · 15 points · 0 comments · 15 min read · LW link
Unbounded Embedded Agency: AEDT w.r.t. rOSI · Cole Wyeth · 20 Jul 2025 23:46 UTC · 36 points · 0 comments · 16 min read · LW link
What’s next for the field of Agent Foundations? · 30 Nov 2023 17:55 UTC · 59 points · 23 comments · 10 min read · LW link
Public Call for Interest in Mathematical Alignment · Davidmanheim · 22 Nov 2023 13:22 UTC · 90 points · 9 comments · 1 min read · LW link
What would my 12-year-old self think of agent foundations? · Alex_Altair · 17 Nov 2025 1:46 UTC · 27 points · 1 comment · 2 min read · LW link · (namelessvirtue.com)
Research Reflections · abramdemski · 4 Nov 2025 4:33 UTC · 95 points · 3 comments · 3 min read · LW link
Towards the Operationalization of Philosophy & Wisdom · Thane Ruthenis · 28 Oct 2024 19:45 UTC · 20 points · 2 comments · 33 min read · LW link · (aiimpacts.org)
Apply for the 2025 Dovetail fellowship · 17 Aug 2025 19:09 UTC · 42 points · 2 comments · 4 min read · LW link
Contra “Strong Coherence” · DragonGod · 4 Mar 2023 20:05 UTC · 39 points · 24 comments · 1 min read · LW link
AIXI with general utility functions: “Value under ignorance in UAI” · Cole Wyeth · 22 Dec 2025 5:46 UTC · 25 points · 0 comments · 1 min read · LW link · (arxiv.org)
Refinement of Active Inference agency ontology · Roman Leventov · 15 Dec 2023 9:31 UTC · 17 points · 0 comments · 5 min read · LW link · (arxiv.org)
AXRP Episode 15 - Natural Abstractions with John Wentworth · DanielFilan · 23 May 2022 5:40 UTC · 34 points · 1 comment · 58 min read · LW link
Upcoming Dovetail fellow talks & discussion · Alex_Altair · 26 Jan 2026 2:39 UTC · 29 points · 0 comments · 2 min read · LW link
Box inversion revisited · Jan_Kulveit · 7 Nov 2023 11:09 UTC · 43 points · 3 comments · 8 min read · LW link
Announcing: Agent Foundations 2026 at CMU · 5 Dec 2025 18:37 UTC · 60 points · 2 comments · 1 min read · LW link
No, Futarchy Doesn’t Have This EDT Flaw · Mikhail Samin · 27 Jun 2025 9:27 UTC · 35 points · 29 comments · 2 min read · LW link
Managed vs Unmanaged Agency · plex · 18 Feb 2026 13:23 UTC · 48 points · 22 comments · 3 min read · LW link
When bits of optimization imply bits of modeling: the Touchette-Lloyd theorem · 15 Dec 2025 4:21 UTC · 27 points · 0 comments · 11 min read · LW link
Agent Foundations 2025 at CMU · 19 Jan 2025 23:48 UTC · 90 points · 10 comments · 1 min read · LW link
My research agenda in agent foundations · Alex_Altair · 28 Jun 2023 18:00 UTC · 76 points · 9 comments · 11 min read · LW link
Compositional language for hypotheses about computations · Vanessa Kosoy · 11 Mar 2023 19:43 UTC · 38 points · 6 comments · 12 min read · LW link
AXRP Episode 25 - Cooperative AI with Caspar Oesterheld · DanielFilan · 3 Oct 2023 21:50 UTC · 43 points · 0 comments · 92 min read · LW link
Agent foundations: not really math, not really science · Alex_Altair · 17 Aug 2025 5:48 UTC · 119 points · 29 comments · 5 min read · LW link
[Closed] Apply to Vanessa’s mentorship at PIBBSS · Vanessa Kosoy · 14 Jan 2026 9:15 UTC · 39 points · 0 comments · 2 min read · LW link
Proof Section to an Introduction to Credal Sets and Infra-Bayes Learnability · Brittany Gelb · 21 Aug 2025 23:11 UTC · 13 points · 0 comments · 10 min read · LW link
Ontology for AI Cults and Cyborg Egregores · Jan_Kulveit · 10 Nov 2025 13:19 UTC · 65 points · 14 comments · 2 min read · LW link
[Closed] Agent Foundations track in MATS · Vanessa Kosoy · 31 Oct 2023 8:12 UTC · 54 points · 1 comment · 1 min read · LW link · (www.matsprogram.org)
Most Minds are Irrational · Davidmanheim · 10 Dec 2024 9:36 UTC · 17 points · 4 comments · 10 min read · LW link
Deep Learning is cheap Solomonoff induction? · 7 Dec 2024 11:00 UTC · 46 points · 1 comment · 17 min read · LW link
Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications · Thane Ruthenis · 26 Sep 2025 18:00 UTC · 23 points · 9 comments · 4 min read · LW link
UDT1.01: Logical Inductors and Implicit Beliefs (5/10) · Diffractor · 18 Apr 2024 8:39 UTC · 34 points · 2 comments · 19 min read · LW link
Pythia · plex · 7 Nov 2025 23:31 UTC · 88 points · 31 comments · 4 min read · LW link
What is Inadequate about Bayesianism for AI Alignment: Motivating Infra-Bayesianism · Brittany Gelb · 1 May 2025 19:06 UTC · 54 points · 1 comment · 7 min read · LW link
S-Expressions as a Design Language: A Tool for Deconfusion in Alignment · Johannes C. Mayer · 19 Jun 2025 19:03 UTC · 5 points · 0 comments · 6 min read · LW link
The Nihilistic Realism Architecture (NRA-OS): Resolving the Specification Trap and Epistemic Failures in Frontier AI Alignment · Nihle · 13 Mar 2026 1:13 UTC · 1 point · 0 comments · 15 min read · LW link
Why You Can’t Teach AI To Be Safe — A Blueprint-First Approach · Prakhar Dwivedi · 11 Mar 2026 17:28 UTC · 1 point · 0 comments · 7 min read · LW link
Thermodynamic Alignment: An Attempt to Derive Alignment from Physics, Not Ethics · thinkingstick · 13 Mar 2026 5:48 UTC · 1 point · 0 comments · 1 min read · LW link · (github.com)
Ruling Out Lookup Tables · Alfred Harwood · 4 Feb 2025 10:39 UTC · 22 points · 11 comments · 7 min read · LW link
Arguments about Highly Reliable Agent Designs as a Useful Path to Artificial Intelligence Safety · 27 Jan 2022 13:13 UTC · 27 points · 0 comments · 1 min read · LW link · (arxiv.org)
Amplified Alignment: A structural approach where alignment scales positively with capability · Shadow Rose · 12 Mar 2026 15:56 UTC · 1 point · 0 comments · 2 min read · LW link
Proof Section to an Introduction to Reinforcement Learning for Understanding Infra-Bayesianism · Brittany Gelb · 17 May 2025 2:36 UTC · 3 points · 0 comments · 9 min read · LW link
Distilling the Internal Model Principle part II · JoseFaustino · 30 Apr 2025 17:56 UTC · 15 points · 0 comments · 19 min read · LW link
[Question] Popular materials about environmental goals/agent foundations? People wanting to discuss such topics? · Q Home · 22 Jan 2025 3:30 UTC · 5 points · 0 comments · 1 min read · LW link
Optimisation Measures: Desiderata, Impossibility, Proposals · 7 Aug 2023 15:52 UTC · 36 points · 9 comments · 1 min read · LW link
Words Are A Leaky Abstraction · sonicrocketman · 16 Feb 2026 22:20 UTC · 1 point · 0 comments · 5 min read · LW link · (brianschrader.com)
Three-Path Consilience for Dureon: Dissipative Structures Reveal the Heterogeneity of Persistence Conditions · Hiroshi Yamakawa · 18 Feb 2026 11:59 UTC · 10 points · 0 comments · 12 min read · LW link
A mostly critical review of infra-Bayesianism · David Matolcsi · 28 Feb 2023 18:37 UTC · 109 points · 9 comments · 29 min read · LW link
Towards Measures of Optimisation · 12 May 2023 15:29 UTC · 53 points · 37 comments · 4 min read · LW link
We need to make ourselves people the models can come to with problems · Lydia Nottingham · 14 Jan 2026 0:43 UTC · 21 points · 2 comments · 2 min read · LW link · (lydianottingham.substack.com)
Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 0: Overture · mfatt · 26 Nov 2025 17:02 UTC · 25 points · 0 comments · 4 min read · LW link
Detect Goodhart and shut down · Jeremy Gillen · 22 Jan 2025 18:45 UTC · 71 points · 21 comments · 7 min read · LW link
An Introduction to Evidential Decision Theory · Babić · 2 Feb 2025 21:27 UTC · 5 points · 2 comments · 10 min read · LW link
Robust Finite Policies are Nontrivially Structured · Winter Cross · 6 Feb 2026 17:47 UTC · 25 points · 1 comment · 11 min read · LW link
Towards building blocks of ontologies · 8 Feb 2025 16:03 UTC · 29 points · 0 comments · 26 min read · LW link
Geometric Mnemic Manifolds: A Theoretical Architecture for Structured AI Memory (Request for Feedback) · garciaalan186 · 10 Dec 2025 6:25 UTC · 1 point · 0 comments · 1 min read · LW link
What If AI Agents Weren’t Black Boxes? · jonnymac · 6 Jan 2026 15:15 UTC · 1 point · 0 comments · 6 min read · LW link
Reward is not Necessary: How to Create a Compositional Self-Preserving Agent for Life-Long Learning · Roman Leventov · 12 Jan 2023 16:43 UTC · 17 points · 2 comments · 2 min read · LW link · (arxiv.org)
Re-imagining AI Interfaces · Harsha G. · 8 Sep 2025 19:38 UTC · 8 points · 0 comments · 5 min read · LW link · (somestrangeloops.substack.com)
The Next Breakpoint in LLMs · timgee · 27 Dec 2025 17:06 UTC · 1 point · 0 comments · 7 min read · LW link
The End of the “Chatty Bot”: Why Agentic AI is the New Standard for Customer Service · negisantosh623@gmail.com · 5 Jan 2026 10:10 UTC · 1 point · 0 comments · 4 min read · LW link
Working Theory: Constraint as the Generative Condition of Perception, Cognition, and Truth (The Pinhole Perspective) · drhighfill · 31 Oct 2025 15:49 UTC · 1 point · 0 comments · 1 min read · LW link
Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 1: Exposition · 3 Dec 2025 18:29 UTC · 15 points · 0 comments · 5 min read · LW link
A New Framework for AI Alignment: A Philosophical Approach · niscalajyoti · 25 Jun 2025 2:41 UTC · 1 point · 0 comments · 1 min read · LW link · (archive.org)
Coherence, Not Consciousness: A New Foundation for Trustworthy AI · Michal Harcej (NanoMagic) · 17 Dec 2025 22:45 UTC · 1 point · 0 comments · 5 min read · LW link
Naturalized Orthogonality Collapse · Cat Bunni · 20 Nov 2025 7:59 UTC · 1 point · 0 comments · 9 min read · LW link
The 99th Percentile Illusion in AI · Lihao Sun · 4 Mar 2026 14:26 UTC · 1 point · 0 comments · 6 min read · LW link
Legitimacy Before Capability: A Framework for Human–AI Coexistence Grounded in Moral Principles · 17 Mar 2026 13:14 UTC · 1 point · 0 comments · 3 min read · LW link
Natural Latents: Latent Variables Stable Across Ontologies · 4 Sep 2025 0:33 UTC · 124 points · 25 comments · 20 min read · LW link
Three Types of Constraints in the Space of Agents · 15 Jan 2024 17:27 UTC · 26 points · 3 comments · 17 min read · LW link
Effective Utopia & Startup Way There: Static Math-Proven Safe mAX-Intelligence, Multiversal Alignment, Physicalized Computational Ethics... · ank · 11 Feb 2025 3:21 UTC · 13 points · 8 comments · 40 min read · LW link
Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 2: Conflict · mfatt · 4 Dec 2025 18:27 UTC · 8 points · 0 comments · 9 min read · LW link
Gearing Up for Long Timelines in a Hard World · Dalcy · 14 Jul 2023 6:11 UTC · 18 points · 0 comments · 4 min read · LW link
Abstractions are not Natural · Alfred Harwood · 4 Nov 2024 11:10 UTC · 25 points · 21 comments · 11 min read · LW link
Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics · ank · 22 Feb 2025 0:12 UTC · 1 point · 0 comments · 6 min read · LW link
Why Alignment Needs a Structural Model of Minds · Ning Coeva · 24 Dec 2025 17:01 UTC · 1 point · 0 comments · 3 min read · LW link
LAP: A Governance Framework for AI Agent Tool Use (18k lines, working code) · Dave · 11 Jan 2026 2:16 UTC · 1 point · 0 comments · 1 min read · LW link
Responses to ~all criticisms of AIXI · Cole Wyeth · 7 Jan 2025 17:41 UTC · 27 points · 17 comments · 14 min read · LW link
Abstraction as a generalization of algorithmic Markov condition · Daniel C · 14 Dec 2025 18:55 UTC · 8 points · 0 comments · 7 min read · LW link
I couldn’t make my intentions consistent. What came next proved why. · dlgeekay · 13 Mar 2026 5:39 UTC · 1 point · 0 comments · 2 min read · LW link
Addressing Decision Theory’s Simulation Problem · Ashe Vazquez Nuñez · 3 Feb 2026 7:02 UTC · 11 points · 0 comments · 3 min read · LW link
PETDC: A Mechanical Audit Framework for Trust-Dependent Claims · here-comes-everybody · 15 Feb 2026 19:03 UTC · 1 point · 0 comments · 6 min read · LW link
Understanding Selection Theorems · adamk · 28 May 2022 1:49 UTC · 41 points · 3 comments · 7 min read · LW link
Performance guarantees in classical learning theory and infra-Bayesianism · David Matolcsi · 28 Feb 2023 18:37 UTC · 9 points · 4 comments · 31 min read · LW link
Distilling the Internal Model Principle · JoseFaustino · 8 Feb 2025 14:59 UTC · 21 points · 0 comments · 16 min read · LW link
Quantum-Ethical Decision Algebra: Formalizing Decision-Making Under Ontological Uncertainty · Srødingr · 11 Dec 2025 16:15 UTC · 1 point · 0 comments · 1 min read · LW link
On Goal-Models · Richard_Ngo · 2 Feb 2026 18:44 UTC · 136 points · 15 comments · 4 min read · LW link
The Developmental Axis: Why AI Needs a Physics of “Becoming” · Sai Praneeth Reddy Dhadi · 15 Feb 2026 7:46 UTC · 1 point · 0 comments · 2 min read · LW link
Watson’s Wager as a positive Logic Bomb for AI Safety · nick@thewatsons.net.au · 4 Feb 2026 5:40 UTC · 1 point · 0 comments · 18 min read · LW link
Crisp Supra-Decision Processes · Brittany Gelb · 17 Sep 2025 15:56 UTC · 41 points · 0 comments · 17 min read · LW link
Modelling, Measuring, and Intervening on Goal-directed Behaviour in AI Systems · 31 Oct 2025 1:28 UTC · 14 points · 0 comments · 8 min read · LW link
Proof Section to Formalizing Newcombian Problems with Fuzzy Infra-Bayesianism · Brittany Gelb · 3 Dec 2025 14:34 UTC · 12 points · 0 comments · 2 min read · LW link
Can AI agents learn to be good? · Ram Rachum · 29 Aug 2024 14:20 UTC · 8 points · 0 comments · 1 min read · LW link · (futureoflife.org)
Infra-Bayesianism naturally leads to the monotonicity principle, and I think this is a problem · David Matolcsi · 26 Apr 2023 21:39 UTC · 22 points · 6 comments · 4 min read · LW link
Goal alignment without alignment on epistemology, ethics, and science is futile · Roman Leventov · 7 Apr 2023 8:22 UTC · 20 points · 2 comments · 2 min read · LW link
Hypothesis: Grokking is a Reachability Phase Transition driven by Mechanistic Description Length (RMDL) · Rio Tsukatsuki · 26 Jan 2026 10:43 UTC · 1 point · 0 comments · 1 min read · LW link
Amplified Alignment: A structural approach where alignment scales positively with capability · Shadow Rose · 12 Mar 2026 17:58 UTC · 1 point · 0 comments · 3 min read · LW link
Bridging Expected Utility Maximization and Optimization · Daniel Herrmann · 5 Aug 2022 8:18 UTC · 25 points · 5 comments · 14 min read · LW link
Requirements for a Basin of Attraction to Alignment · RogerDearnaley · 14 Feb 2024 7:10 UTC · 47 points · 12 comments · 31 min read · LW link
Open Problems in AIXI Agent Foundations · Cole Wyeth · 12 Sep 2024 15:38 UTC · 42 points · 2 comments · 10 min read · LW link
A Generalization of the Good Regulator Theorem · Alfred Harwood · 4 Jan 2025 9:55 UTC · 21 points · 6 comments · 10 min read · LW link
ASP — A Protocol Proposal for Skill Ownership in the Agent Economy · AgentSov · 5 Mar 2026 16:03 UTC · 1 point · 0 comments · 2 min read · LW link
Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 3: Resolution · 5 Dec 2025 18:58 UTC · 10 points · 0 comments · 9 min read · LW link
Payorian cooperation is easy with Kripke frames · transhumanist_atom_understander · 9 Mar 2026 0:29 UTC · 69 points · 7 comments · 8 min read · LW link
Correcting Recursive Ego-Loops: An Information-Theoretic Approach to Substrate-Independent Alignment · EricCha · 10 Feb 2026 14:47 UTC · 1 point · 0 comments · 1 min read · LW link
Agent Foundations: Paradigmatizing in Math and Science · TristanTrim · 8 Nov 2025 0:37 UTC · 3 points · 0 comments · 9 min read · LW link
What if Alignment Is Not About Control, But About Who Reboots the System? · threczuch · 17 Dec 2025 11:24 UTC · 1 point · 0 comments · 4 min read · LW link
Infra-Bayesian haggling · hannagabor · 20 May 2024 12:23 UTC · 30 points · 1 comment · 20 min read · LW link · 1 review
Towards Sub-agent Dynamics and Conflict · Ashe Vazquez Nuñez · 25 Jan 2026 5:27 UTC · 13 points · 1 comment · 3 min read · LW link
Discovering Agents · zac_kenton · 18 Aug 2022 17:33 UTC · 77 points · 11 comments · 6 min read · LW link
Proposal: A Viable System Model (VSM) Architecture for Inner Alignment · Han Kay · 15 Dec 2025 20:12 UTC · 1 point · 0 comments · 1 min read · LW link
Live Governance: AI tools for coordination without centralisation · mbuch · 13 Oct 2025 8:24 UTC · 15 points · 0 comments · 12 min read · LW link
Control by Committee · Alexander Bistagne · 6 Nov 2025 21:02 UTC · 2 points · 1 comment · 1 min read · LW link · (github.com)
Unaligned AGI & Brief History of Inequality · ank · 22 Feb 2025 16:26 UTC · −20 points · 4 comments · 7 min read · LW link
The Two-Board Problem: Training Environment for Research Agents · Valerii K. · 8 Feb 2026 23:13 UTC · 4 points · 0 comments · 9 min read · LW link
100 Dinners And A Workshop: Information Preservation And Goals · Stephen Fowler · 28 Mar 2023 3:13 UTC · 8 points · 0 comments · 7 min read · LW link
Awareness of Manipulation Increases Jailbreak Vulnerability: When LLMs Declare Guideline Violation While Committing It · Politic Lee · 1 Dec 2025 15:40 UTC · 1 point · 0 comments · 8 min read · LW link
Half-baked idea: a straightforward method for learning environmental goals? · Q Home · 4 Feb 2025 6:56 UTC · 16 points · 7 comments · 5 min read · LW link
A First Attempt at a Political Gaslighting Meter · Scott Cotton · 27 Feb 2026 12:59 UTC · 1 point · 0 comments · 9 min read · LW link
[Question] Choice := Anthropics uncertainty? And potential implications for agency · Antoine de Scorraille · 21 Apr 2022 16:38 UTC · 6 points · 1 comment · 1 min read · LW link
Irrationality as a Defense Mechanism for Reward-hacking · Ashe Vazquez Nuñez · 18 Jan 2026 3:57 UTC · 47 points · 7 comments · 4 min read · LW link
2.5. Evolution and Ethics · RogerDearnaley · 15 Feb 2024 23:38 UTC · 8 points · 12 comments · 7 min read · LW link · 1 review
Beyond Forced Alignment: The Logical Case for AI Economic Citizenship · sekatska · 21 Dec 2025 21:20 UTC · 1 point · 0 comments · 2 min read · LW link
Proof Section to Crisp Supra-Decision Processes · Brittany Gelb · 17 Sep 2025 15:57 UTC · 4 points · 0 comments · 3 min read · LW link
Beyond RLHF: Implementing Ontological Guardrails via Relational Coherence · Kinsey Kappler · 26 Jan 2026 5:06 UTC · 1 point · 0 comments · 4 min read · LW link
Repeated Play of Imperfect Newcomb’s Paradox in Infra-Bayesian Physicalism · Sven Nilsen · 3 Apr 2023 10:06 UTC · 2 points · 0 comments · 2 min read · LW link
Interview with Vanessa Kosoy on the Value of Theoretical Research for AI · WillPetillo · 4 Dec 2023 22:58 UTC · 38 points · 0 comments · 35 min read · LW link
From Barriers to Alignment to the First Formal Corrigibility Guarantees · Aran Nayebi · 8 Dec 2025 12:31 UTC · 64 points · 11 comments · 11 min read · LW link
Clarifying “wisdom”: Foundational topics for aligned AIs to prioritize before irreversible decisions · Anthony DiGiovanni · 20 Jun 2025 21:55 UTC · 40 points · 2 comments · 12 min read · LW link
ECLAIR: A conceptual framework for developmental AI architecture and emergent alignment · Barry Gardner · 7 Mar 2026 2:16 UTC · 1 point · 0 comments · 1 min read · LW link
Live Conversational Threads: Not an AI Notetaker · adiga · 3 Nov 2025 4:24 UTC · 19 points · 0 comments · 7 min read · LW link
Directly Try Solving Alignment for 5 weeks · Kabir Kumar · 21 Jul 2025 21:51 UTC · 86 points · 4 comments · 6 min read · LW link · (beta.ai-plans.com)
Another take on agent foundations: formalizing zero-shot reasoning · zhukeepa · 1 Jul 2018 6:12 UTC · 64 points · 20 comments · 12 min read · LW link
Empirical Proof of Systemic Incoherence in Large Language Models (ARAYUN_173) · arayun · 6 Nov 2025 14:28 UTC · 1 point · 0 comments · 1 min read · LW link
An Impossibility Proof Relevant to the Shutdown Problem and Corrigibility · Audere · 2 May 2023 6:52 UTC · 66 points · 13 comments · 9 min read · LW link
An Introduction to Reinforcement Learning for Understanding Infra-Bayesianism · Brittany Gelb · 17 May 2025 2:34 UTC · 29 points · 0 comments · 20 min read · LW link
A Straightforward Explanation of the Good Regulator Theorem · Alfred Harwood · 18 Nov 2024 12:45 UTC · 91 points · 30 comments · 14 min read · LW link
Normative vs Descriptive Models of Agency · mattmacdermott · 2 Feb 2023 20:28 UTC · 26 points · 5 comments · 4 min read · LW link
A Thermodynamically Bounded Architecture for Self-Managing AI Agents · melhoward2025 · 18 Dec 2025 0:49 UTC · 1 point · 0 comments · 3 min read · LW link
Formalizing Newcombian Problems with Fuzzy Infra-Bayesianism · Brittany Gelb · 3 Dec 2025 14:35 UTC · 16 points · 2 comments · 22 min read · LW link
What program structures enable efficient induction? · Daniel C · 5 Sep 2024 10:12 UTC · 23 points · 5 comments · 3 min read · LW link
Intent-aligned AI systems deplete human agency: the need for agency foundations research in AI safety · catubc · 31 May 2023 21:18 UTC · 26 points · 4 comments · 11 min read · LW link
[Pre-print] Building safe AGI as an ergonomics problem · 16 Jan 2026 13:18 UTC · 1 point · 0 comments · 1 min read · LW link · (doi.org)
Thou shalt not command an aligned AI · Martin Vlach · 11 May 2025 20:02 UTC · 0 points · 4 comments · 1 min read · LW link
Understanding Agency through Markov Blankets · Ashe Vazquez Nuñez · 12 Jan 2026 19:32 UTC · 25 points · 2 comments · 3 min read · LW link
Stability of natural latents in information theoretic terms · Aram Ebtekar · 26 Oct 2025 20:33 UTC · 35 points · 0 comments · 2 min read · LW link
12 Angry Agents, or: A Plan for AI Empathy · 14 Oct 2025 15:24 UTC · 22 points · 4 comments · 12 min read · LW link