Also, on the point about the likely brittleness / limited radius of generalisation of the mitigations: we agree, and we call this out in the post. (Actually, I was surprised the mitigations work as well as they do!)
The point we’re trying to make is not that these mitigations are a general solution; we’re not claiming they will prevent all the ways a model might choose to do something you didn’t expect. In this sense, clarifying instructions is definitely a sticking plaster, not a universal remedy.
Instead, the purpose of the experiment with these instruction variants is to pinpoint which factors in this particular environment and its instructions have a large causal effect on the outcome. In this environment, the models’ behaviour is well explained by following the user’s goal plus unclear instructions on how to resolve the conflict. That makes it a poor environment for studying self-interested self-preservation in isolation. It also shows that, to the extent models may have some intrinsic preference to avoid shutdown, this is strongly overridden (in today’s models) by their preference to follow user instructions.
That isn’t to say such a motivation doesn’t exist at all, or that future models might not weight these preferences differently. But if these are the questions we’re interested in, this environment doesn’t give much signal on the answers.