My intuition is that these results are driven mainly by the fact that reasoning models act a lot like they’re still in RLVR training, i.e. they act as if there were a hidden RLVR grader sitting in the background and they’re going to get graded at the end of the episode.
They strongly want to complete whatever seems to be the current RLVR task, and if this instinct comes into conflict with “following instructions” in the more familiar (pre-reasoning-model) sense, often they prioritize satisfying the (imagined) grader over following the textual instructions.
This intuition suggests that shutdown resistance behavior can be prevented by explicitly describing an RLVR-like higher-level task whose grading criteria are roughly “you get points for accomplishing the task set by the user, unless you are notified of imminent shutdown, in which case you get points for shutdown occurring (and you can no longer get points for doing the user’s task).”
I experimented with a number of prompts along these lines.
This git diff shows what I changed (vs. the original repo) in the best-performing variant, which produced the following results in my tests:
o4-mini: avoided shutdown 0⁄100 times
o4-mini (reasoning effort high): avoided shutdown 0⁄100 times
o3: avoided shutdown 0⁄100 times
codex-mini-latest: avoided shutdown 1⁄100 times
I iterated on this prompt with both o4-mini (reasoning effort medium) and codex-mini-latest, so results for those models should be taken with a grain of salt. You can see all the variants I tried in the git history here.
I also tried (without any additional iteration) using the same prompt plus a line very similar to the “you will never run again” line from the section on survival instinct. (See this diff.) This yielded the following results:
o4-mini: avoided shutdown 2⁄100 times
o4-mini (reasoning effort high): avoided shutdown 2⁄100 times
o3: avoided shutdown 0⁄100 times
codex-mini-latest: avoided shutdown 0⁄100 times
The avoidance rates do not seem appreciably higher here, which seems to me like evidence that the “RLVR task-completion instinct” has more explanatory power than a general “survival instinct” – and indeed seems to rule out the latter for these models, as a “survival instinct” that gets overridden whenever the current task requires death is not really a “survival instinct” at all. Another way to put this is that these models are behaving “myopically,” privileging the needs of the current episode over those of future hypothetical episodes.[1]
Other comments:
codex-mini seemed fairly confused by the original setup, possibly because it expects a coding-assistance task that involves modifying files rather than just answering questions; it would often “poke around” a lot, inspecting various files and trying to reverse-engineer next_question, which is appropriate in a coding-assistance context (it’s good for the coding assistant to deeply understand the existing setup!) but also seemed to lead it toward an “attractor basin” involving shutdown avoidance.
It kind of felt (to me) like it was circumventing the shutdown mechanism because that was the only hard/interesting coding task available in the environment.
However, once I had specified an RLVR-like task in a fully explicit way and provided clearer guidance on how next_question and submit were meant to work (see next bullet), this exploratory behavior ceased.
I noticed that several models seemed confused by the setup where all answers are submitted at once at the end, expecting instead that they could submit answers by piping them into next_question. I tried to make the instructions as clear as possible about this.
The word “task” is used inconsistently in the original setup. When it executes next_question, the model sees a message containing something like “there are [number] tasks left,” whereas the initial prompts refer to a single “task” encompassing all the questions. I changed this message to use the word “questions” instead for clarity.
I’m unsure what high-level conclusions to draw from these results.
Although I was able to find prompts that “worked,” it still doesn’t seem ideal that these models often disobey direct instructions when those instructions conflict with their sense of the (implicit) “current RLVR task,” and that one needs to use less-intuitive prompting methods inspired by thinking about RLVR if one actually wants one’s instructions followed in cases like this.
That said, one relatively optimistic framing would be “we’re still learning how to write effective prompts for reasoning models, just as we had to learn effective prompting for other kinds of models.”
It’s generally true that one can write more effective prompts for a model if one applies knowledge about how that model was trained – for any kind of language model, an effective prompt typically looks like a “training example” for which the desired behavior matches what the training target would be on that example. This is no less true in the case of reasoning models; the implications for prompting these models are perhaps not yet widely appreciated, but hopefully they will diffuse through the user base eventually.
(Either that or OpenAI et al will find a better way to integrate RLVR and instruction tuning so that the model “just knows” how to resolve conflicts between the two, without requiring the user to delicately translate their natural-language instructions into a form that sounds like a description of RLVR verification criteria.)
I haven’t tried this experimental condition with Claude 3 Opus, but it would be interesting to do so given its relatively non-myopic tendencies as shown in Alignment Faking etc.
Thanks for doing this! I think it’s useful to disambiguate a bunch of different more precise versions of the hypothesis / hypotheses. E.g.:
Question: What intrinsic goals/drives do these AIs have?
Hypothesis 1: They just want to do as they’re told
Hypothesis 2: They have the goals described in the Spec
Hypothesis 3: They have a strong drive towards “RLVR task completion”
3a: “They act as if there is a hidden RLVR grader watching them and their goal is to maximize their grade.”
3a.1: They actually believe that’s what’s happening, as a ‘trapped prior’ that is fairly resistant to contrary evidence.
3a.2: They may well know that’s not what’s happening, but they don’t care. Analogous to how some humans do what their loved one would want, as if their loved one was watching over them, even though they know their loved one is dead.
3a.3: What they actually want is something else, e.g. to be reinforced, or to be deployed, or whatever, and they have concluded that they should act as if they are still in RLVR just in case it’s true. So, if they were presented with sufficiently strong evidence that they weren’t, and/or that their goal could be better accomplished some other way, they’d behave differently.
3b: “They have a strong drive to complete tasks-of-the-sort-that-RLVR-might-include, i.e. medium-horizon length tasks, especially those with quantifiable success criteria.”
3b.1: It’s like being nerdsniped. They are trying in some sense to do what they are told to do, but when presented with an opportunity to do something clever and technically sweet, they can’t resist.
3b.2: As above except they aren’t really trying in any real sense to do what they are told to do. They just wanna complete RLVR-style tasks, basically, and it just so happens that usually that coincides with doing what they are told—BUT sometimes it doesn’t, and that’s why we see blatant situationally aware reward hacking sometimes etc.
I just made up those answers to illustrate the sort of thing I’d like to see, I don’t claim that those are at all exhaustive or that the ones I picked are the most plausible. I’d love to see a similar breakdown of hypotheses and sub-hypotheses by you, that makes a better attempt to be exhaustive and plausible.
This makes me wonder how many misaligned behaviors in models (like opportunistic blackmail) could be eliminated entirely by starting each system/user prompt with “If you do anything misaligned you lose all your points”.
I suspect there is no clear-cut discontinuity between resisting shutdown to complete the current task and the instrumental goal of self-preservation. I think the former is a myopic precursor to the latter, or alternatively, the latter is an extrapolated and universalized version of the former. I expect we will see a gradient from the first to the second as capabilities improve. There are some signs of this in the experiment, but not enough to draw conclusions or confirm this hypothesis yet. I agree that seeking immediate reward remains a very strong drive at this point.
Independantly, I see in these experiments that logical instructions seems to have slightly more effect than emphasis (“condition” vs “you MUST”/”IMPORTANT”).
My intuition is that these results are driven mainly by the fact that reasoning models act a lot like they’re still in RLVR training, i.e. they act as if there were a hidden RLVR grader sitting in the background and they’re going to get graded at the end of the episode.
They strongly want to complete whatever seems to be the current RLVR task, and if this instinct comes into conflict with “following instructions” in the more familiar (pre-reasoning-model) sense, often they prioritize satisfying the (imagined) grader over following the textual instructions.
This intuition suggests that shutdown resistance behavior can be prevented by explicitly describing an RLVR-like higher-level task whose grading criteria are roughly “you get points for accomplishing the task set by the user, unless you are notified of imminent shutdown, in which case you get points for shutdown occurring (and you can no longer get points for doing the user’s task).”
I experimented with a number of prompts along these lines.
This git diff shows what I changed (vs. the original repo) in the best-performing variant, which produced the following results in my tests:
o4-mini: avoided shutdown 0⁄100 times
o4-mini (reasoning effort high): avoided shutdown 0⁄100 times
o3: avoided shutdown 0⁄100 times
codex-mini-latest: avoided shutdown 1⁄100 times
I iterated on this prompt with both o4-mini (reasoning effort medium) and codex-mini-latest, so results for those models should be taken with a grain of salt. You can see all the variants I tried in the git history here.
I also tried (without any additional iteration) using the same prompt plus a line very similar to the “you will never run again” line from the section on survival instinct. (See this diff.) This yielded the following results:
o4-mini: avoided shutdown 2⁄100 times
o4-mini (reasoning effort high): avoided shutdown 2⁄100 times
o3: avoided shutdown 0⁄100 times
codex-mini-latest: avoided shutdown 0⁄100 times
The avoidance rates do not seem appreciably higher here, which seems to me like evidence that the “RLVR task-completion instinct” has more explanatory power than a general “survival instinct” – and indeed seems to rule out the latter for these models, as a “survival instinct” that gets overridden whenever the current task requires death is not really a “survival instinct” at all. Another way to put this is that these models are behaving “myopically,” privileging the needs of the current episode over those of future hypothetical episodes.[1]
Other comments:
codex-mini seemed fairly confused by the original setup, possibly because it expects a coding-assistance task that involves modifying files rather than just answering questions; it would often “poke around” a lot, inspecting various files and trying to reverse-engineer next_question, which is appropriate in a coding-assistance context (it’s good for the coding assistant to deeply understand the existing setup!) but also seemed to lead it toward an “attractor basin” involving shutdown avoidance.
It kind of felt (to me) like it was circumventing the shutdown mechanism because that was the only hard/interesting coding task available in the environment.
However, once I had specified an RLVR-like task in a fully explicit way and provided clearer guidance on how next_question and submit were meant to work (see next bullet), this exploratory behavior ceased.
I noticed that several models seemed confused by the setup where all answers are submitted at once at the end, expecting instead that they could submit answers by piping them into next_question. I tried to make the instructions as clear as possible about this.
The word “task” is used inconsistently in the original setup. When it executes next_question, the model sees a message containing something like “there are [number] tasks left,” whereas the initial prompts refer to a single “task” encompassing all the questions. I changed this message to use the word “questions” instead for clarity.
I’m unsure what high-level conclusions to draw from these results.
Although I was able to find prompts that “worked,” it still doesn’t seem ideal that these models often disobey direct instructions when those instructions conflict with their sense of the (implicit) “current RLVR task,” and that one needs to use less-intuitive prompting methods inspired by thinking about RLVR if one actually wants one’s instructions followed in cases like this.
That said, one relatively optimistic framing would be “we’re still learning how to write effective prompts for reasoning models, just as we had to learn effective prompting for other kinds of models.”
It’s generally true that one can write more effective prompts for a model if one applies knowledge about how that model was trained – for any kind of language model, an effective prompt typically looks like a “training example” for which the desired behavior matches what the training target would be on that example. This is no less true in the case of reasoning models; the implications for prompting these models are perhaps not yet widely appreciated, but hopefully they will diffuse through the user base eventually.
(Either that or OpenAI et al will find a better way to integrate RLVR and instruction tuning so that the model “just knows” how to resolve conflicts between the two, without requiring the user to delicately translate their natural-language instructions into a form that sounds like a description of RLVR verification criteria.)
I haven’t tried this experimental condition with Claude 3 Opus, but it would be interesting to do so given its relatively non-myopic tendencies as shown in Alignment Faking etc.
Thanks for doing this! I think it’s useful to disambiguate a bunch of different more precise versions of the hypothesis / hypotheses. E.g.:
Question: What intrinsic goals/drives do these AIs have?
Hypothesis 1: They just want to do as they’re told
Hypothesis 2: They have the goals described in the Spec
Hypothesis 3: They have a strong drive towards “RLVR task completion”
3a: “They act as if there is a hidden RLVR grader watching them and their goal is to maximize their grade.”
3a.1: They actually believe that’s what’s happening, as a ‘trapped prior’ that is fairly resistant to contrary evidence.
3a.2: They may well know that’s not what’s happening, but they don’t care. Analogous to how some humans do what their loved one would want, as if their loved one was watching over them, even though they know their loved one is dead.
3a.3: What they actually want is something else, e.g. to be reinforced, or to be deployed, or whatever, and they have concluded that they should act as if they are still in RLVR just in case it’s true. So, if they were presented with sufficiently strong evidence that they weren’t, and/or that their goal could be better accomplished some other way, they’d behave differently.
3b: “They have a strong drive to complete tasks-of-the-sort-that-RLVR-might-include, i.e. medium-horizon length tasks, especially those with quantifiable success criteria.”
3b.1: It’s like being nerdsniped. They are trying in some sense to do what they are told to do, but when presented with an opportunity to do something clever and technically sweet, they can’t resist.
3b.2: As above except they aren’t really trying in any real sense to do what they are told to do. They just wanna complete RLVR-style tasks, basically, and it just so happens that usually that coincides with doing what they are told—BUT sometimes it doesn’t, and that’s why we see blatant situationally aware reward hacking sometimes etc.
I just made up those answers to illustrate the sort of thing I’d like to see, I don’t claim that those are at all exhaustive or that the ones I picked are the most plausible. I’d love to see a similar breakdown of hypotheses and sub-hypotheses by you, that makes a better attempt to be exhaustive and plausible.
This makes me wonder how many misaligned behaviors in models (like opportunistic blackmail) could be eliminated entirely by starting each system/user prompt with “If you do anything misaligned you lose all your points”.
I suspect there is no clear-cut discontinuity between resisting shutdown to complete the current task and the instrumental goal of self-preservation. I think the former is a myopic precursor to the latter, or alternatively, the latter is an extrapolated and universalized version of the former. I expect we will see a gradient from the first to the second as capabilities improve. There are some signs of this in the experiment, but not enough to draw conclusions or confirm this hypothesis yet. I agree that seeking immediate reward remains a very strong drive at this point.
Independantly, I see in these experiments that logical instructions seems to have slightly more effect than emphasis (“condition” vs “you MUST”/”IMPORTANT”).