Solving interpretability with an AGI (even with humans-in-the-loop) might not lead to particularly great insights on a general alignment theory or even on how to specifically align a particular AGI.
Wouldn’t it at least solve corrigibility by making it possible to detect the formation of undesirable end-goals? I think even GPT-4 can classify a textual interpretation of an end-goal by its general desirability for humans.
It seems to need another assumption, namely that the AGI has sufficient control of its internal state, and sufficient knowledge of the detection network, to bring itself into a state that produces an interpretation that evades the detection network while still allowing the AGI to advance its agenda.
I’m sure that I don’t understand you. GPT most likely doesn’t have “I want to predict the next token” written somewhere, because it doesn’t want to predict the next token. There’s nothing in there that will actively try to predict the next token no matter what. It’s just the thing it does when it runs.
Is it possible to have a system that just “actively tries to make paperclips no matter what” when it runs, but doesn’t reflect that in its reasoning and planning? I have a feeling that it would take God-level sophistication and knowledge of the universe to create a device like that: one that just happens to act in a way that robustly maximizes paperclips while not containing anything that can be interpreted as that goal.
I found that I can’t precisely formulate why I feel that. Maybe I’ll be able to express that in a few weeks (or I’ll find that the feeling is misguided).