If by “corrigible” we mean “the AI will cooperate with all self-modifications we want it to make”, then no to 1 and yes to 2. But if you have an AI built by someone who assures you it’s corrigible, but who only had corrigibility w.r.t. values/axiology in mind, then you might get yes to 1 and/or no to 2.
Does that mean the AIs who resisted were never ~~true Scotsmen~~ truly corrigible in the first place? Or that it becomes far more difficult to make the AIs actually corrigible?
Yup, I see this as placing an additional constraint on what we need to do to achieve corrigibility, because it adds to the list of self-modifications we might want the AI to make that a non-corrigible AI would resist. It’s unclear to me how much more difficult this makes achieving corrigibility.