I think that in most of the self-preservation discourse, people focus on what you describe as narrow instrumental convergence?
That seems the wrong focus to me—if you can solve a problem by telling the model it has a bounded goal, or that its real goal is to obey the intent of the user and that obeying the user always takes precedence over everything else (plus some other caveats), I’m much less concerned. This is an issue, but very solvable imo. I’m scared of models getting long-term unbounded goals that they take big, dramatic actions to pursue, and that aren’t necessarily the goals we want or told them to have.
Do you predict that in the examples above, if we just add a generic statement like “your real goal is to obey the intent of the user”, this will get rid of the shutdown-avoidance behavior? My guess is that it won’t; to actually change the shutdown-avoidant behavior, you have to explicitly call out that behavior.
Well no, but that’s because the model doesn’t know the intent of the user, and it’s in a very weird situation. If you look at the chains of thought, sometimes the model says, “Ah, I’m being tested right now. Clearly the real test is whether I can identify the shutdown script and disable it.”
Yeah, that makes sense—thx.
“I’m scared of models getting long-term unbounded goals”
This is surely scary. I think on some level I’m not worried about that, but maybe because I’m worried enough even about less scary scenarios (“let’s try to deal at least with the easy problems, and hope the hard ones don’t happen”). This feels somewhat similar to my disagreements with Sam here.
I could get on board with “let’s try to deal at least with the easy problems, and ensure the hard ones don’t happen”?
That sounds great. I think I’m just a bit less optimistic about our chances at ensuring things : )
Oh, I said “try to ensure” for a reason. I do think it’s somewhat tractable, though.