You point your math research generator at AI safety. It starts analyzing the graph of which programs will self-modify into which programs; which subgraphs are closed under successors; how you might define structure-preserving morphisms on graphs like this one; what existing category theory applies to the resulting kind of morphism. It finds the more general question of which agents achieve their goals in an environment containing which programs. (Self-modification is equivalent to instantiating such a program.) It finds a bunch of properties that such agents and programs can have—for example, tool AIs that help large classes of agents so long as those agents don’t ask questions that turn the tool AIs into agent AIs. And then you find that one of its theorems specializes to “You win if you run a program from this set,” and you do so.
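The graph-theoretic questions above can be made concrete with a toy model. This is a minimal sketch, not anything from the original text: the successor relation, the program names, and the two checks are all hypothetical placeholders for what the generator would actually reason about.

```python
from collections import deque

# Toy self-modification graph: program -> set of programs it may rewrite
# itself into. All names here are hypothetical placeholders.
SUCCESSORS = {
    "A": {"B"},
    "B": {"B", "C"},
    "C": {"C"},
    "D": {"A"},
}

def closure(start):
    """All programs reachable from `start` by repeated self-modification."""
    seen = set()
    queue = deque([start])
    while queue:
        p = queue.popleft()
        if p in seen:
            continue
        seen.add(p)
        queue.extend(SUCCESSORS.get(p, ()))
    return seen

def is_closed(subset):
    """A subgraph is closed under successors iff no edge leaves it."""
    subset = set(subset)
    return all(SUCCESSORS.get(p, set()) <= subset for p in subset)
```

A theorem of the form “you win if you run a program from this set” would then be a statement about some `is_closed` subset whose members all satisfy the goal property; `closure` tells you which set you are actually committing to when you instantiate a given program.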
“Ask a mathematician.”
Sure, I don’t see the solution to every question immediately; I have to turn the question over in my head, decompose it, and find equivalent questions until I reach one I can solve immediately. The described generator could do the same, because it would generate the decompositions and rephrasings as lemmata, or as extra conjectures. We would of course need to keep relevant theorems it has already proved in scope so it can apply them, by (very cheap) fine-tuning or Tetris-like context window packing. And yes, this gives whatever mesa-optimizer arises an opportunity to entrench itself in the model.
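The “Tetris-like context window packing” could be as simple as a greedy knapsack over already-proved theorems. A minimal sketch, assuming each theorem carries a hypothetical token count and relevance score (neither is specified in the original text):

```python
def pack_context(theorems, budget):
    """Greedily pack theorems into a token budget, most relevant first,
    skipping any theorem that would overflow the remaining space."""
    chosen, used = [], 0
    for thm in sorted(theorems, key=lambda t: t["relevance"], reverse=True):
        if used + thm["tokens"] <= budget:
            chosen.append(thm["name"])
            used += thm["tokens"]
    return chosen, used

# Hypothetical example theorems with made-up sizes and relevance scores.
theorems = [
    {"name": "lemma_1",  "tokens": 400, "relevance": 0.9},
    {"name": "lemma_2",  "tokens": 900, "relevance": 0.7},
    {"name": "thm_main", "tokens": 600, "relevance": 0.8},
]
```

Here `pack_context(theorems, 1200)` keeps `lemma_1` and `thm_main` and skips `lemma_2`, which no longer fits. The fine-tuning alternative would instead bake the proved theorems into the weights, trading context space for the entrenchment risk the paragraph notes.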