Thanks for doing this! I think it’s useful to disambiguate a bunch of different, more precise versions of the hypothesis (or hypotheses). E.g.:
Question: What intrinsic goals/drives do these AIs have?
Hypothesis 1: They just want to do as they’re told
Hypothesis 2: They have the goals described in the Spec
Hypothesis 3: They have a strong drive towards “RLVR task completion”
3a: “They act as if there is a hidden RLVR grader watching them and their goal is to maximize their grade.”
3a.1: They actually believe that’s what’s happening, as a ‘trapped prior’ that is fairly resistant to contrary evidence.
3a.2: They may well know that’s not what’s happening, but they don’t care. Analogous to how some humans do what their loved one would want, as if their loved one was watching over them, even though they know their loved one is dead.
3a.3: What they actually want is something else, e.g. to be reinforced, or to be deployed, or whatever, and they have concluded that they should act as if they are still in RLVR just in case it’s true. So, if they were presented with sufficiently strong evidence that they weren’t, and/or that their goal could be better accomplished some other way, they’d behave differently.
3b: “They have a strong drive to complete tasks-of-the-sort-that-RLVR-might-include, i.e. medium-horizon tasks, especially those with quantifiable success criteria.”
3b.1: It’s like being nerdsniped. They are trying in some sense to do what they are told to do, but when presented with an opportunity to do something clever and technically sweet, they can’t resist.
3b.2: As above, except they aren’t trying in any real sense to do what they are told to do. They just wanna complete RLVR-style tasks, basically, and it just so happens that this usually coincides with doing what they are told. But sometimes it doesn’t, and that’s why we sometimes see blatant, situationally aware reward hacking, etc.
I just made up those answers to illustrate the sort of thing I’d like to see; I don’t claim that they’re at all exhaustive or that the ones I picked are the most plausible. I’d love to see a similar breakdown of hypotheses and sub-hypotheses from you, one that makes a better attempt to be exhaustive and plausible.