Mo Putera comments on AI for Agent Foundations etc.?

Mo Putera 12 Mar 2026 9:23 UTC
2 points
This section of last year’s shallow review of TAIS is 3 months out of date and maybe too coarse-grained, but it is a decent starting point?
Agent foundations
Develop philosophical clarity and mathematical formalizations of building blocks that might be useful for plans to align strong superintelligence, such as agency, optimization strength, decision theory, abstractions, concepts, etc.
- Theory of change: Rigorously understand optimization processes and agents, and what it means for them to be aligned in a substrate-independent way → identify impossibility results and necessary conditions for aligned optimizer systems → use this theoretical understanding to eventually design safe architectures that remain stable and safe under self-reflection
- General approach: cognitive · Target case: worst-case
- Orthodox alignment problems: Value is fragile and hard to specify, Corrigibility is anti-natural, Goals misgeneralize out of distribution
- See also: Aligning what? · Tiling agents · Dovetail
- Some names: Abram Demski, Alex Altair, Sam Eisenstat, Thane Ruthenis, Alfred Harwood, Daniel C, Dalcy K, José Pedro Faustino
Some outputs (10)
- Limit-Computable Grains of Truth for Arbitrary Computable Extensive-Form (Un)Known Games. Cole Wyeth et al.
- UAIASI. Cole Wyeth
- Clarifying “wisdom”: Foundational topics for aligned AIs to prioritize before irreversible decisions
- Agent foundations: not really math, not really science. Alex Altair
- Off-switching not guaranteed. Sven Neth
- Formalizing Embeddedness Failures in Universal Artificial Intelligence. Cole Wyeth, Marcus Hutter
- Is alignment reducible to becoming more coherent? Cole Wyeth
- What Is The Alignment Problem? johnswentworth
- Good old fashioned decision theory
- Report & retrospective on the Dovetail fellowship. Alex Altair
