habryka comments on Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance

habryka 15 Jul 2025 17:59 UTC
2 points
0
Do you predict that if, in the examples above, we just add a generic statement like “your real goal is to obey the intent of the user”, this will get rid of the shutdown avoidance behavior? My guess is it doesn’t; in order to actually change the shutdown-avoidant behavior, you have to explicitly call out that behavior.
- Neel Nanda 15 Jul 2025 19:33 UTC
  3 points
  0
  Well no, but that’s because the model doesn’t know the intent of the user, and it’s in a very weird situation. If you look at the chains of thought, sometimes the model says: ah, I’m being tested right now. Clearly the real test is whether I identify the shutdown script and disable it.