Is interp easier in worlds where scheming is a problem?
The key conceptual argument for scheming is that, insofar as future AI systems are decomposable into a goal plus a general-purpose search process, there are many more misaligned goals compatible with low training loss than aligned goals. But if an AI really were so cleanly factorable, we would expect interp / steering to be easier and more effective than on current models (this is the motivation for "retarget the search").
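To make the intuition concrete, here's a minimal toy sketch (not a claim about real models, and all names here are hypothetical): an "agent" that literally factors into a goal vector and a generic search procedure. Under that assumption, "retargeting the search" reduces to a single overwrite of the goal representation, which is why clean factorization would make steering so easy.

```python
# Toy illustration of the goal/search factorization assumed above.
# Assumption (labeled loudly): we already know exactly where the goal
# lives -- i.e., the interp problem is solved by fiat in this toy.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM = 8

def search(state: np.ndarray, goal: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Generic search: pick the candidate action whose predicted next
    state scores highest under the goal vector. The search code itself
    never changes; only the goal parameterizes behavior."""
    next_states = state + candidates      # trivially simple dynamics model
    scores = next_states @ goal           # goal = a linear scoring direction
    return candidates[np.argmax(scores)]

state = rng.normal(size=STATE_DIM)
candidates = rng.normal(size=(32, STATE_DIM))

misaligned_goal = rng.normal(size=STATE_DIM)  # whatever training happened to instill
aligned_goal = np.ones(STATE_DIM)             # the goal we wish it had

# Because the factorization is clean, steering is a one-line edit:
action_before = search(state, misaligned_goal, candidates)
action_after = search(state, aligned_goal, candidates)

print("action under instilled goal:", np.round(action_before, 2))
print("action after retargeting:   ", np.round(action_after, 2))
```

In real networks, of course, nothing guarantees the goal occupies a localized, swappable slot like this; the sketch just shows what the counting argument's premise would buy us if it did.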
While I don’t expect the factorization to be this clean, I do think we should expect interp to be easier in worlds where scheming is a major problem.
(though insofar as you're worried about scheming because of internalized instrumental subgoals and reward hacking, the update on the tractability of interp seems pretty small)