I don’t think that I understand two points.

1. If we created a corrigibly aligned AI, solved mechinterp, and learned that we need an AI with a different decision theory, would the aligned AI resist being shut down and replaced with a new one?

2. If we created a corrigibly aligned AI and ordered it to inform us of all acausal deals that could be important under the decision theories of the Oversight Committee or of the AI, but not to go through with deals unapproved by the OC, would the AI agree?

If so, does that mean the AIs who resisted were never truly corrigible (“no true Scotsmen”) in the first place? Or that it becomes far more difficult to make the AIs actually corrigible?
If by “corrigible” we mean “the AI will cooperate with all self-modifications we want it to make”, then the answer is no to 1 and yes to 2. But if you have an AI built by someone who assures you it’s corrigible, yet who only had corrigibility w.r.t. values/axiology in mind, then you might get yes to 1 and/or no to 2.
As for your follow-up question: yup, I see this as placing an additional constraint on what we need to do to achieve corrigibility, because it adds to the list of self-modifications we might want the AI to make that a non-corrigible AI would resist. It’s unclear to me how much more difficult it makes corrigibility.
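To make the case analysis concrete, here is a minimal toy sketch of how the two readings of “corrigible” map onto the answers to 1 and 2. It is only an illustration, not anyone’s actual proposal: the names (`Change`, `cooperates`) are made up, and the values-only case is modeled pessimistically as outright refusal, whereas the claim above is only that cooperation is no longer guaranteed.

```python
from enum import Enum, auto

class Change(Enum):
    VALUES = auto()           # change the AI's values/axiology
    DECISION_THEORY = auto()  # shut down and replace with a different decision theory (question 1)
    DEAL_POLICY = auto()      # report acausal deals, execute only OC-approved ones (question 2)

def cooperates(change: Change, corrigible_over_all_changes: bool) -> bool:
    """Toy predicate: does the AI cooperate with an overseer-requested change?

    corrigible_over_all_changes=True models "corrigible" as cooperating with
    *all* self-modifications the overseers want; False models corrigibility
    built only with values/axiology in mind, where other changes are not
    guaranteed to go through (modeled here, pessimistically, as refused).
    """
    if corrigible_over_all_changes:
        return True
    return change is Change.VALUES

if __name__ == "__main__":
    for broad in (True, False):
        label = "all self-modifications" if broad else "values/axiology only"
        print(f"corrigibility over {label}:")
        for change in Change:
            print(f"  cooperates with {change.name}: {cooperates(change, broad)}")
```

Under the broad reading every requested change goes through, so the replacement in question 1 is not resisted and the reporting policy in question 2 is accepted; under the values-only reading, neither outcome is guaranteed.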