My take on agent foundations: formalizing metaphilosophical competence

I have some rough intuitions about the purpose of MIRI’s agent foundations agenda, and I’d like to share them here. (Note: I have not discussed these with MIRI, and these should not be taken to be representative of MIRI’s views.)

I think there’s a common misconception that the goal of agent foundations is to build an AGI architected with a decision theory module, a logical induction module, and so on. In my mind, this is not the point at all, and my intuition says that approach is doomed to fail.

I interpret agent foundations as being more about providing formal specifications of metaphilosophical competence, to:

  • directly extend our understanding of metaphilosophy, by adding conceptual clarity to important notions we only have fuzzy understandings of. (Will this agent fall into epistemic pits? Are its actions low-impact? Will it avoid catastrophes?) As an analogy, formally defining mathematical proofs constituted significant progress in our understanding of mathematical logic and mathematical philosophy.

  • allow us to formally verify whether a computational process will satisfy desirable metaphilosophical properties, like those mentioned in the above parenthetical. (It seems perfectly fine for these processes to be built out of illegible components, like deep neural nets—while that makes them harder to inspect, it doesn’t preclude us from making useful formal statements about them. For example, in ALBA, it would help us make formal guarantees that distilled agents remain aligned.)

I want to explore logical induction as a case study. I think the important part of logical induction is the logical induction criterion, not the algorithm that implements it. I’ve heard the algorithm criticized as computationally intractable, but I see its primary purpose as demonstrating that the logical induction criterion is satisfiable at all. This elevates the logical induction criterion above all the other loose collections of desiderata that may or may not be satisfiable, and may or may not capture what we mean by logical uncertainty. If we were to build an actual aligned AGI, I would expect its reasoning process to satisfy the logical induction criterion, but not to look much like the algorithm presented in the logical induction paper.
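
For concreteness, here is a rough paraphrase of the criterion from the logical induction paper (Garrabrant et al. 2016), stated loosely and from memory; see the paper for the precise definitions of traders, markets, and deductive processes:

```latex
% Rough paraphrase of the logical induction criterion (Garrabrant et al. 2016),
% stated from memory; trader continuity and other technicalities are omitted.
% A market \overline{P} = (P_1, P_2, ...) satisfies the criterion relative to a
% deductive process \overline{D} = (D_1, D_2, ...) iff no efficiently computable
% trader \overline{T} exploits it, where "exploits" means that the set of
% plausible values of the trader's accumulated holdings,
\[
  \left\{ \mathbb{W}\!\left( \textstyle\sum_{i \le n} T_i \right)
    \;\middle|\; n \in \mathbb{N}^{+},\ \mathbb{W} \in \mathcal{PC}(D_n) \right\},
\]
% is bounded below but not bounded above: bounded downside risk in every world
% propositionally consistent with what has been proved by day n, yet profits
% that grow without bound.
```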

I also think the logical induction criterion provides an exact formalization (a necessary and sufficient condition) of what it means to avoid getting permanently stuck in any epistemic pit in the limit. (The gist of this intuition: epistemic pits you’re stuck in forever correspond exactly to patterns in the market that a trader could exploit forever, making unbounded profits.) This lets us translate the question “Does X reasoning process avoid permanently falling into epistemic pits?” into “Does X reasoning process satisfy the logical induction criterion?”
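
To illustrate one direction of that correspondence, here is a sketch of my own (not taken from the paper, and glossing over technicalities like trader continuity) of how one concrete kind of permanent epistemic pit, permanently underpricing a sentence that the deductive process eventually settles, gives rise to an exploiting trader:

```latex
% Sketch (my own, under the assumptions above): a permanently underpriced
% theorem yields an exploiting trader.
Suppose $\overline{D}$ proves $\varphi$ by day $m$, so every world in
$\mathcal{PC}(D_n)$ values $\varphi$ at $1$ for $n \ge m$, yet the market prices
$\varphi$ below $1 - \varepsilon$ on infinitely many days. Consider the trader
that buys one share of $\varphi$ on each such day, paying at most $1 - \varepsilon$:
\begin{itemize}
  \item Bounded downside: only the finitely many shares bought before day $m$
        can ever be valued at $0$, so total losses never exceed $m - 1$.
  \item Unbounded upside: each share bought on or after day $m$ is worth $1$ in
        every plausible world, netting at least $\varepsilon$, infinitely often.
\end{itemize}
So the trader exploits the market, and the market violates the logical
induction criterion. Contrapositively, a logical inductor cannot stay in this
kind of pit forever.
```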