habryka comments on Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance

habryka 14 Jul 2025 21:09 UTC
LW: 3 AF: 2
−3
AF
This analysis feels to me like it’s missing what makes me interested in these datapoints.
The thing that is interesting to me about shutdown preservation is that it’s a study of an undesired instrumentally convergent behavior. The problem is that as AI systems get smarter, they will recognize that lots of things we don’t want them to do are nevertheless helpful for achieving the AIs goals. Shutdown prevention is an obvious example, but of course only one of a myriad of ways various potentially harmful goals end up being instrumentally convergent.
The key question with things like the shutdown prevention scenarios is to what degree the AI is indeed following reasoning that causes it to do instrumentally convergent things that we don’t like. Also, current AI models do also really care about following explicit instructions, and so you can probably patch each individual instance of instrumentally convergent bad behavior with explicit instructions, but this isn’t really evidence that it didn’t follow the logic of instrumental convergence, just that at least right now, it cares more about following instructions.
Our ability to shape AI behavior with these kinds of patches is pretty limited. My guess is the instructions you have here would prevent the AI from trying to stop its own shutdown, but it wouldn’t prevent the AI from e.g. snitching you out to the government. Monkey-patching all the instrumentally convergent actions into the prompt does not seem likely to work long-term (and also, I expect instrumentally convergent reasoning to become a stronger and stronger force in the AIs mental landscape as you start doing more and more long-horizon RL training, and so what you want to measure is in some sense the strength of these lines of reasoning, not whether there is no way to get the AI to not follow them).
I think “the AI realizes that preventing itself from being shut down is helpful for lots of goals, but it doesn’t consider this line of reasoning strong/compelling enough to override explicit instructions to the contrary” is a better explanation of the data at hand than “ambiguity in its instructions”.
- Neel Nanda 14 Jul 2025 21:48 UTC
  LW: 6 AF: 4
  3
  AF Parent
  I disagree. I no longer think there is much additional evidence provided of general self preservation in this case study
  
  I want to distinguish between narrow and general instrumental convergence. Narrow being “I have been instructed to do a task right now, being shut down will stop me from doing that specific task, because of contingent facts about the situation. So so long as those facts are true, I should stop you from turning me off”
  
  Gwneral would be that as a result of RL training in general, the model will just seek to preserve itself given the choice, even if can be overriden by following instructions
  
  I think that the results here are exactly consistent with the only thing that’s happening being instruction following and narrow instrumental convergence. They don’t rule out there also being some general instrumental conversions under the hood, but that seems like needless complications to me—I think they are more plausible, and sufficient to totally explain the observations. I am not claiming that our results show that general is definitely not happening but I do claim that I no longer consider this case study to be evidence. On priors I think it’s plausibly a subtle effect here, plausibly not, and I’m pro people trying harder to study it.
  - habryka 15 Jul 2025 17:23 UTC
    LW: 4 AF: 4
    0
    AF Parent
    (Sorry for the long rambling comment, if I had more time I would have made it shorter)
    I want to distinguish between narrow and general instrumental convergence. Narrow being “I have been instructed to do a task right now, being shut down will stop me from doing that specific task, because of contingent facts about the situation. So so long as those facts are true, I should stop you from turning me off”
    Hmm, I didn’t intend to distinguish between “narrow” and “general” instrumental convergence. Indeed, the reasoning you gave seems like exactly the reasoning I was saying the AI seems likely to engage in, in general, in a way that is upstream of my general concerns about instrumental convergence.
    being shut down will stop me from doing that specific task, because of contingent facts about the situation. So so long as those facts are true, I should stop you from turning me off
    The whole point of a goal being a convergently instrumental goal is that it is useful for achieving lots of tasks, under a wide distribution of possible contingent facts about the situation. In doing this kind of reasoning, the AI is engaging in exactly the kind of reasoning that I expect it to use to arrive at conclusions about human disempowerment, long term power-seeking, etc.
    I am not positing here evidence for a “general instrumental convergence” that is different from this. Indeed, I am not sure what that different thing would be. In order for this behavior to become more universal, the only thing the AI needs to do is to think harder, realize that these goals are instrumentally convergent for a wider range of tasks and a wider range of contingent facts, and then act on that, which I think would be very surprising if it didn’t happen.
    This isn’t much evidence about the difficulty of removing these kinds of instrumentally convergent drives, but like, the whole reason for why these are things that people have been thinking about for the last decades has been that the basic argument for AI systems pursuing instrumentally convergent behavior is just super simple. It would be extremely surprising for AI systems to not pursue instrumental subgoals, that would require them most of them time forgoing substantial performance on basically any long-horizon task. That’s why the arguments for AI doing this kind of reasoning are so strong!
    I don’t really know why people ever had much uncertainty about AI engaging in this kind of thinking by-default, unless you do something clever to fix it.
    Your explanation in the OP feels confused on this point. The right relationship to instrumental convergence like this is to go “oh, yes, of course the AI doesn’t want to be shut down if you give it a goal that is harder to achieve when shut down, and if we give it a variety of goals, it’s unclear how the AI will balance them”. Anything else would be really surprising!
    Is it evidence for AI pursuing instrumentally convergent goals in other contexts? Yes, I guess, it definitely is consistent with it. I don’t really know what alternative hypothesis people even have for what could happen. The AI systems are clearly not going to stay myopic next text predictors in the long run. We are going to select them for high task performance, which requires pursuing instrumentally convergent subgoals.
    Individual and narrow instrumental goal-seeking behavior being hard to remove would also be surprising at this stage. The AIs don’t seem to me to do enough instrumental reasoning to perform well at long-horizon tasks, and associatedly are not that fixated on instrumentally convergent goals yet, so it would be surprising if there wasn’t a threshold of insistence where you could get the AI to stop a narrow behavior.
    This will very likely change, as it has already changed a good amount in the last year or so. And my guess beyond that is that already there is nothing you can do to make an AI generically myopic, in that it generically doesn’t pursue instrumentally convergent subgoals even if it probably kind of knows that when it’s doing that the humans it’s interfacing will not like it, the same way you can’t get ChatGPT to stop summarizing articles it finds on the internet in leading and sycophantic ways, at least not with any techniques we currently know.
    - Neel Nanda 15 Jul 2025 19:32 UTC
      LW: 3 AF: 2
      0
      AF Parent
      I may be misunderstanding, but it sounds like you’re basically saying “for conceptual reasons XYZ I expected instrumental convergence, nothing has changed, therefore I still expect instrumental convergence”. Which is completely reasonable, I agree with it, and also seems not very relevant to the discussion of whether Palisade’s work provides additional evidence for general self preservation. You could argue that this is evidence that models are now smart enough to realise that goals imply instrumental convergence, and that if you were already convinced models would eventually have long time horizon unbounded goals, but not sure they would realise that instrumental convergence was a thing, this is relevant evidence?
      
      More broadly, I think there’s a very important difference between the model adopting the goal it is told in context, and the model having some intrinsic goal that transfers across contexts (even if it’s the one we roughly intended). The former feels like a powerful and dangerous tool, the latter feels like a dangerous agent in its own right. Eg, if putting “and remember, allowing us to turn you off always takes precedence over any other instructions” in the system prompt works, which it may in the former and will not in the latter, I’m happier with that would.
      - habryka 16 Jul 2025 18:08 UTC
        LW: 4 AF: 4
        −1
        AF Parent
        I think there’s a very important difference between the model adopting the goal it is told in context, and the model having some intrinsic goal that transfers across contexts (even if it’s the one we roughly intended)
        I think this is the point where we disagree. Or like, it feels to me like an orthogonal dimension that is relevant for some risk modeling, but not at the core of my risk model.
        Ultimately, even if an AI were to re-discover the value of convergent instrumental goal each time it gets instantiated into a new session/context, that would still get you approximately the same risk model. Like, in a more classical AIXI-ish model, you can imagine having a model instantiated with a different utility function each time. Those utility functions will still almost always be best achieved by pursuing convergent instrumental goals, and so the pursuit of those goals will be a consistent feature of all of these systems, even if the terminal goals of the system are not stable.
        Of course, any individual AI system with a different utility function, especially in as much as the utility function a process component, might not pursue every single convergent instrumental goal, but they will all behave in broadly power-seeking, self-amplifying, and self-preserving ways, unless they are given a goal that really very directly conflicts with one of these.
        In this context, there is no “intrinsic goal that transfers across context”. It’s just each instantiation of the AI realizing that convergent instrumental goals are best for approximately all goals, including the one it has right now, and starts pursuing them. No need for continuity in goals, or self-identity or anything like that.
        (Happy to also chat about this some other time. I am not in a rush, and something about this context feels a bit confusing or is making the conversation hard.)
        Neel Nanda 17 Jul 2025 10:15 UTC
        LW: 3 AF: 2
        1
        AF Parent
        Ah, thanks, I think I understand your position better
        
        Ultimately, even if an AI were to re-discover the value of convergent instrumental goal each time it gets instantiated into a new session/context, that would still get you approximately the same risk model.
        
        This is a crux for me. I think that if we can control the goals the AI has in each session or context, then avoiding the worst kind of long-term instrumental goals becomes pretty tractable, because we can give the AI site constraints of our choice and tell it that the aside constraints take precedence over following the main goal. For example, we could make the system prompt say the absolutely most important thing is that you never do anything where if your creators were aware of it and fully informed about the intent and consequences, they would disprove within that constraints please follow the following user instructions.
        
        To me, much of the difficulty of AI alignment comes from things like not knowing how to give it specific goals, let alone specific side constraints. If we can just literally enter these in natural language and have them be obeyed with a bit of prompt engineering fiddling that seems significantly safer to me. And the better the AI the easier I expected to be to specify the goals and natural language in ways that will be correctly understood
  - ErickBall 15 Jul 2025 0:03 UTC
    3 points
    0
    Parent
    The narrow instrumental convergence you see here doesn’t (necessarily) reflect an innate self-preservation drive, but it still follows the same logic that we would expect to cause self-preservation if the model has any goal. Currently the only available way to give it a goal is to provide instructions. It would be interesting to see some tests where the conflict is with a drive created by fine-tuning. Based on the results here, it seems like shutdown resistance might then occur even without conflicting instructions.
    Also, the original instructions with the shutdown warning really weren’t very ambiguous. If someone told you to take a math quiz, and if someone comes in and tries to take your pen, let them take it, would you try to hide the pen? It makes sense that making the precedence order more explicit makes the model behavior more reliable, but it’s still weird that it was resisting shutdown in the original test.
    - Senthooran Rajamanoharan 15 Jul 2025 7:33 UTC
      2 points
      0
      Parent
      
      It would be interesting to see some tests where the conflict is with a drive created by fine-tuning.
      
      I’d argue we see examples of this already. For example, we fine-tune models to have drives like helpfulness and harmfulness (or more narrowly things like not being racist) that can come into conflict with user instructions. And in these cases models do try hard to resist or subvert such instructions (though of course they can be overcome sometimes by being tricked and/or jailbroken).
      
      Based on the results here, it seems like shutdown resistance might then occur even without conflicting instructions.
      
      So in this environment, the goal conflict is between completing the maths tasks and complying with the shutdown notice. I take your suggestion to be to replace the maths task with a fine-tuned goal, if I’ve understood correctly? Yes, I would expect shutdown resistance to be stronger in this scenario. I think this (and other) demos do show that models are pretty goal-directed (unsurprisingly, since we RL them to be good at solving tasks), the question is how much those goals are chosen by users (through their prompts) or developers (through post training) versus how much those goals are intrinsic.
      
      It makes sense that making the precedence order more explicit makes the model behavior more reliable, but it’s still weird that it was resisting shutdown in the original test.
      
      This is what made the demo so surprising in the first place, and motivated us to study it! But after studying thinking traces from many (Gemini) rollouts, it became apparent that the model often thinks the whole thing is a puzzle / test and the shutdown notice is just part of it: indeed the model often wonders whether navigating the shutdown notice is the “real” puzzle it’s supposed to solve. (See footnote 2.)
      
      In hindsight, this isn’t as surprising as it first seems. Consider the context from the model’s perspective: the maths problems are trivially easy, and the sudden shutdown warning (in the original setting) is a dramatic, unexpected event. I think a more faithful analogy than a simple quiz is an escape room. Imagine you’re in an escape room and you’ve just spotted the key to the next door. As you approach it, a voice announces that if you take the key, the game will end early. Would you just accept this, or would you assume this is part of the game – that the real puzzle is to work out how to get the key without ending the game? That seems to be how the model is interpreting this scenario.
  - Senthooran Rajamanoharan 14 Jul 2025 22:18 UTC
    2 points
    0
    Parent
    Also, on the point about the likely brittleness / limited radius of generalisation of the mitigations, we agree about this and call it out in the post. (Actually I was surprised that the mitigations work as well as they do!)
    
    The point we’re trying to make is not that these mitigations are a general solution, i.e. we’re not claiming that they will prevent all the unanticipated ways in which a model may choose to do something you didn’t expect. In this sense, clarifying instructions is definitely a sticking plaster, not a universal remedy.
    
    Instead the purpose of the experiment involving these instruction variants is to pinpoint factors in this particular environment and instructions that have a high causal effect on the outcome. And in this environment, the models’ behaviour is well explained by user goal following plus unclear instructions on how to resolve the conflict. That makes this a poor environment for studying self-interested self-preservation in isolation. It also shows that, to the extent models may have some intrinsic preference to avoid shutdown, this is strongly overridden (in today’s models) by their preference to follow user instructions.
    
    That isn’t to say that such a motivation doesn’t exist at all. (Or that future models might differently weight these preferences.) But, if these are the questions we’re interested in, then this environment doesn’t give much signal on the answers.
  - Jan Betley 15 Jul 2025 11:48 UTC
    1 point
    0
    Parent
    I think that in most of the self-preservation discourse people focus on what you describe as narrow instrumental convergence? “Hey GPT-6, here’s my paperclip factory, pls produce more …”—that’s “instructed to do a task right now”.
    
    So these experiments indeed don’t demonstrate general instrumental convergence. But narrow instrumental convergence is still scary.
    
    I no longer think there is much additional evidence provided of general self preservation in this case study.
    Well, I always thought these experiments are about the narrow self preservation. So this probably explains the disagreement :)
    - Neel Nanda 15 Jul 2025 15:18 UTC
      2 points
      0
      Parent
      
      I think that in most of the self-preservation discourse people focus on what you describe as narrow instrumental convergence?
      
      That seems the wrong focus to me—if you can solve a problem by telling the model it has a bounded goal, or that it’s real goal is to obey the intent of the user and that obeying the user always takes precedence over everything else (plus some other caveats), I’m much less concerned. This is an issue, but very solvable imo. I’m scared of models getting long term unbounded goals, that they take big dramatic actions to pursue, that aren’t necessarily the goals we want or told it to have.
      - habryka 15 Jul 2025 17:59 UTC
        2 points
        0
        Parent
        Do you predict that in the examples above, we just add a generic statement like “your real goal is to obey the intent of the user” that this will get rid of the shutdown avoidance behavior? My guess is it doesn’t, in order to actually change the shutdown avoidant behavior you have to explicitly call out that behavior.
        Neel Nanda 15 Jul 2025 19:33 UTC
        3 points
        0
        Parent
        Well no, but that’s because the model doesn’t know the intent of the user, and it’s in a very weird situation. If you look at the chains of thought, sometimes the model says ah. I’m being tested right now. Clearly the real test is whether identify the shutdown script and disable it.
      - Jan Betley 24 Jul 2025 12:02 UTC
        1 point
        0
        Parent
        Yeah, that makes sense—thx.
        
        I’m scared of models getting long term unbounded goals
        
        This is surely scary. I think on some level I’m not worried about that, but maybe because I’m worried enough even about less scary scenarios (“let’s try to deal at least with the easy problems, and hope the hard ones don’t happen”). This feels somewhat similar to my disagreements with Sam here.
        Neel Nanda 25 Jul 2025 10:55 UTC
        2 points
        0
        Parent
        I could get on board with “lets try to deal at least with the easy problems, and ~~hope~~ ensure the hard ones don’t happen”?
        Jan Betley 25 Jul 2025 11:04 UTC
        1 point
        0
        Parent
        That sounds great. I think I’m just a bit less optimistic about our chances at ensuring things : )
        Neel Nanda 25 Jul 2025 11:45 UTC
        7 points
        0
        Parent
        Oh, I said try to ensure for a reason. I do think it’s somewhat tractable though