I think that in most of the self-preservation discourse, people focus on what you describe as narrow instrumental convergence?
That seems the wrong focus to me—if you can solve a problem by telling the model it has a bounded goal, or that its real goal is to obey the intent of the user and that obeying the user always takes precedence over everything else (plus some other caveats), I’m much less concerned. This is an issue, but very solvable imo. I’m scared of models getting long-term unbounded goals that they take big, dramatic actions to pursue, and that aren’t necessarily the goals we want or told them to have.
Do you predict that in the examples above, if we just add a generic statement like “your real goal is to obey the intent of the user”, this will get rid of the shutdown-avoidance behavior? My guess is that it won’t; to actually change the shutdown-avoidant behavior, you have to explicitly call out that behavior.
Well no, but that’s because the model doesn’t know the intent of the user, and it’s in a very weird situation. If you look at the chains of thought, sometimes the model says, “Ah, I’m being tested right now. Clearly the real test is whether I can identify the shutdown script and disable it.”
Yeah, that makes sense—thx.
“I’m scared of models getting long-term unbounded goals”
This is surely scary. I think on some level I’m not worried about that, but maybe because I’m worried enough even about less scary scenarios (“let’s try to deal at least with the easy problems, and hope the hard ones don’t happen”). This feels somewhat similar to my disagreements with Sam here.
I could get on board with “let’s try to deal at least with the easy problems, and ensure the hard ones don’t happen”?
That sounds great. I think I’m just a bit less optimistic about our chances at ensuring things : )
Oh, I said “try to ensure” for a reason. I do think it’s somewhat tractable, though.