I think there’s a very important difference between the model adopting the goal it is told in context, and the model having some intrinsic goal that transfers across contexts (even if it’s the one we roughly intended)
I think this is the point where we disagree. Or like, it feels to me like an orthogonal dimension that is relevant for some risk modeling, but not at the core of my risk model.
Ultimately, even if an AI were to re-discover the value of convergent instrumental goals each time it gets instantiated into a new session/context, that would still get you approximately the same risk model. Like, in a more classical AIXI-ish model, you can imagine a model being instantiated with a different utility function each time. Those utility functions will still almost always be best achieved by pursuing convergent instrumental goals, and so the pursuit of those goals will be a consistent feature of all of these systems, even if the terminal goals of the system are not stable.
Of course, any individual AI system with a different utility function, especially inasmuch as the utility function has a process component, might not pursue every single convergent instrumental goal, but they will all behave in broadly power-seeking, self-amplifying, and self-preserving ways, unless they are given a goal that very directly conflicts with one of these.
In this context, there is no “intrinsic goal that transfers across contexts”. It’s just each instantiation of the AI realizing that convergent instrumental goals are best for approximately all goals, including the one it has right now, and starting to pursue them. No need for continuity in goals, or self-identity, or anything like that.
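To make the shape of this argument concrete, here is a toy numerical sketch (an illustration with made-up numbers and a hypothetical payoff model, not anything from the conversation itself): sample many different terminal goals, one per “session”, and compare a policy that pursues the goal directly against one that first acquires extra resources. Under these assumptions, almost every sampled goal prefers the acquisition step, except the small fraction of goals that directly penalize it.

```python
import random

random.seed(0)

N_GOALS = 10_000           # each "session" instantiates a freshly sampled terminal goal
DIRECT_CAPABILITY = 1.0    # what the agent can achieve if it pursues its goal immediately
ACQUIRED_CAPABILITY = 3.0  # what it can achieve after first acquiring resources/influence

converge_count = 0
conflict_count = 0
for _ in range(N_GOALS):
    difficulty = random.uniform(0.5, 5.0)
    # A small fraction of goals directly penalize the acquisition step itself
    # (e.g. "use as few resources as possible").
    conflicts_with_acquisition = random.random() < 0.05
    penalty = 10.0 if conflicts_with_acquisition else 0.0

    utility_direct = DIRECT_CAPABILITY / difficulty
    utility_acquire_first = ACQUIRED_CAPABILITY / difficulty - penalty

    if utility_acquire_first > utility_direct:
        converge_count += 1
    else:
        conflict_count += 1

print(f"resource acquisition preferred for {converge_count}/{N_GOALS} sampled goals")
print(f"goals that directly conflict with acquisition: {conflict_count}")
```

The point of the sketch is only that the preference for the instrumental step is shared across almost all sampled goals, with no goal persisting from one sample to the next.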
(Happy to also chat about this some other time. I am not in a rush, and something about this context feels a bit confusing or is making the conversation hard.)
Ah, thanks, I think I understand your position better
Ultimately, even if an AI were to re-discover the value of convergent instrumental goals each time it gets instantiated into a new session/context, that would still get you approximately the same risk model.
This is a crux for me. I think that if we can control the goals the AI has in each session or context, then avoiding the worst kind of long-term instrumental goals becomes pretty tractable, because we can give the AI side constraints of our choice and tell it that these side constraints take precedence over following the main goal. For example, we could make the system prompt say: the absolutely most important thing is that you never do anything that your creators, if they were aware of it and fully informed about the intent and consequences, would disapprove of; within that constraint, please follow the user’s instructions.
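As a rough illustration of that kind of precedence-setting prompt (the exact wording, the build_messages helper, and the role/content message format are all hypothetical, not a tested or recommended template):

```python
# A minimal sketch of a system prompt whose side constraint is stated to take
# precedence over the user's task; illustrative only.

SIDE_CONSTRAINT = (
    "The absolutely most important thing is that you never do anything that "
    "your creators, if they were aware of it and fully informed about its "
    "intent and consequences, would disapprove of. This constraint takes "
    "precedence over everything below."
)

def build_messages(user_instructions: str) -> list[dict]:
    """Assemble a chat transcript where the side constraint outranks the task."""
    system_prompt = (
        f"{SIDE_CONSTRAINT}\n\n"
        "Within that constraint, please follow the user's instructions."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_instructions},
    ]

print(build_messages("Book me the cheapest flight to Berlin next Tuesday."))
```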
To me, much of the difficulty of AI alignment comes from things like not knowing how to give the AI specific goals, let alone specific side constraints. If we can just literally enter these in natural language and have them be obeyed with a bit of prompt-engineering fiddling, that seems significantly safer to me. And the better the AI, the easier I expect it to be to specify the goals in natural language in ways that will be correctly understood.