Vanessa Kosoy comments on Formal Inner Alignment, Prospectus

Vanessa Kosoy 24 May 2021 18:16 UTC
LW: 23 AF: 14
Since you’re trying to compile a comprehensive overview of directions of research, I will try to summarize my own approach to this problem:
- I want to have algorithms that admit thorough theoretical analysis. There’s already plenty of bottom-up work on this (proving initially weak but increasingly strong theoretical guarantees for deep learning). I want to complement it with top-down work (proving strong theoretical guarantees for algorithms that are initially infeasible but are gradually made more feasible). Hopefully the two will eventually meet in the middle.
- Given feasible algorithmic building blocks with strong theoretical guarantees, some version of the consensus algorithm can tame Cartesian daemons (including manipulation of search) as long as the prior (inductive bias) of our algorithm is sufficiently good.
- Coming up with a good prior is a problem in embedded agency. I believe I achieved significant progress on this using a certain infra-Bayesian approach, and hopefully will have a post soonish.
- The consensus-like algorithm will involve a trade-off between safety and capability. We will have to manage this trade-off based on expectations regarding external dangers that we need to deal with (e.g. potential competing unaligned AIs). I believe this to be inevitable, although ofc I would be happy to be proven wrong.
- The resulting AI is only a first stage that we will use to design the second-stage AI. It’s not something we will deploy in self-driving cars or such.
- Non-Cartesian daemons need to be addressed separately. Turing RL seems like a good way to study this if we assume the core is too weak to produce non-Cartesian daemons, so the latter can be modeled as potential catastrophic side effects of using the envelope. However, I don’t have a satisfactory solution yet (aside from, perhaps, homomorphic encryption, but the overhead might be prohibitive).
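
To make the safety/capability trade-off in the consensus bullets concrete, here is a toy sketch. It is not the actual consensus algorithm referred to above, only a minimal illustration under stated assumptions: hypotheses come with fixed posterior weights, the agent acts only when the highest-weight hypotheses covering a chosen fraction of posterior mass all recommend the same action, and otherwise it abstains. All names (`Hypothesis`, `consensus_action`, `mass_threshold`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Hypothesis:
    name: str
    weight: float                    # posterior weight, assumed to be given
    recommend: Callable[[str], str]  # maps an observation to an action


def consensus_action(
    hypotheses: Sequence[Hypothesis],
    observation: str,
    mass_threshold: float,
) -> Optional[str]:
    """Act only if the top-weight hypotheses covering `mass_threshold`
    of the posterior mass all agree; otherwise return None (abstain)."""
    total = sum(h.weight for h in hypotheses)
    ranked = sorted(hypotheses, key=lambda h: h.weight, reverse=True)

    # Build the smallest top-weight panel whose normalized mass
    # reaches the threshold.
    panel = []
    mass = 0.0
    for h in ranked:
        panel.append(h)
        mass += h.weight / total
        if mass >= mass_threshold:
            break

    actions = {h.recommend(observation) for h in panel}
    return actions.pop() if len(actions) == 1 else None


# Illustrative hypotheses (made up for this sketch). We do not know
# which one is correct; h3 disagrees with the others on a rare input.
hypotheses = [
    Hypothesis("h1", 0.6, lambda obs: "proceed"),
    Hypothesis("h2", 0.3, lambda obs: "proceed"),
    Hypothesis("h3", 0.1, lambda obs: "halt" if obs == "rare" else "proceed"),
]

# A low threshold consults only h1, so the agent always acts.
low = consensus_action(hypotheses, "rare", mass_threshold=0.5)
# A high threshold consults all three, so the agent abstains on "rare".
high = consensus_action(hypotheses, "rare", mass_threshold=0.95)
```

In this sketch, raising `mass_threshold` makes it more likely that the correct hypothesis sits on the panel (safety), at the cost of abstaining more often (capability). How good the prior is determines how small the panel can be while still containing the correct hypothesis.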