Is interp easier in worlds where scheming is a problem?
The key conceptual argument for scheming is that, insofar as future AI systems are decomposable into a goal plus a general-purpose search process, there are many more misaligned goals compatible with low training loss than aligned goals. But if an AI really were so cleanly factorable, we would expect interp / steering to be easier and more effective than on current models (this is the motivation for "retarget the search").
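To make the intuition concrete, here's a minimal toy sketch (not a claim about real models, and all names here are hypothetical): an "agent" that literally factors into a goal vector and a generic search procedure. Under that assumption, "retargeting the search" reduces to a single overwrite of the goal representation, which is why clean factorization would make steering so easy.

```python
# Toy illustration of the goal/search factorization assumed above.
# Assumption (labeled loudly): we already know exactly where the goal
# lives -- i.e., the interp problem is solved by fiat in this toy.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM = 8

def search(state: np.ndarray, goal: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Generic search: pick the candidate action whose predicted next
    state scores highest under the goal vector. The search code itself
    never changes; only the goal parameterizes behavior."""
    next_states = state + candidates      # trivially simple dynamics model
    scores = next_states @ goal           # goal = a linear scoring direction
    return candidates[np.argmax(scores)]

state = rng.normal(size=STATE_DIM)
candidates = rng.normal(size=(32, STATE_DIM))

misaligned_goal = rng.normal(size=STATE_DIM)  # whatever training happened to instill
aligned_goal = np.ones(STATE_DIM)             # the goal we wish it had

# Because the factorization is clean, steering is a one-line edit:
action_before = search(state, misaligned_goal, candidates)
action_after = search(state, aligned_goal, candidates)

print("action under instilled goal:", np.round(action_before, 2))
print("action after retargeting:   ", np.round(action_after, 2))
```

In real networks, of course, nothing guarantees the goal occupies a localized, swappable slot like this; the sketch just shows what the counting argument's premise would buy us if it did.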
While I don’t expect the factorization to be this clean, I do think we should expect interp to be easier in worlds where scheming is a major problem.
(though insofar as you're worried about scheming because of internalized instrumental subgoals and reward hacking, the update on the tractability of interp seems pretty small)