I think people are very confused about Rajamanoharan and Nanda’s results and what it shows.
I talked with Neel about this a couple of weeks ago, and we actually both agree that the shutdown resistance behavior comes substantially due to something like a task-completion drive and not from an overall survival motivation. We were never claiming it was due to an overall survival motivation and I think our blog post and paper make that very clear!
I don’t think it would be right to describe Rajamanoharan and Nanda’s results as a “negative result”, because the main hypothesis they were falsifying—that models resist shutdown due to a survival drive—was consistent with our hypothesis that models were resisting shutdown in order to complete a task! We agree here.
There is a potential disagreement about this part:
The shutdown instruction “takes precedence over all other instructions” brought shutdown resistance to zero”
As Ben mentioned, we found this exact instruction didn’t actually bring shutdown resistance to zero in all models. And also, this isn’t a crux! This is a very important point that I don’t know how to express. Basically this whole field of behavioral experiments of AI is missing something really important: we don’t know how to go from existence proofs to understanding why the AIs do those particular things and under what circumstances they do or don’t do those things. You can explore the prompt & environment space and find many behaviors. But the space is vast. Of course you can find prompts that eliminate any particular concerning behavior. It’s harder to find prompts that elicit a particular behavior—if that’s a behavior that developers have deliberately trained against—but I do expect that you can elicit basically any behavior of some complexity level if you search hard enough.
And so it does matter how hard you searched for prompts that elicited a particular behavior! I think this matters, and I think people are reasonable to be skeptical of a particular existence proof without knowing how much optimization went into finding it. In the case of shutdown resistance, we found the results in like the second thing we tried. But other people don’t know this, so I think it’s reasonable to be uncertain about how much it matters. Once we found the first shutdown result, we worked pretty hard to explore the local prompt and environment space to better understand what was going on, and where the behavior persisted and where it didn’t. I think we did a pretty good job exploring this, and I think people who read our paper or blog post will come away with a better model of the behavior than before—including some evidence about what causes it. But I’m not satisfied with this and I don’t think other people should be satisfied with this either! I think we need a much better understanding of model motivations & how training shapes those motivations in order to understand what’s really going on. Existence proofs are useful! But we they are not sufficient for really understanding model behavior.
I think people are very confused about Rajamanoharan and Nanda’s results and what it shows.
I talked with Neel about this a couple of weeks ago, and we actually both agree that the shutdown resistance behavior comes substantially due to something like a task-completion drive and not from an overall survival motivation. We were never claiming it was due to an overall survival motivation and I think our blog post and paper make that very clear!
I don’t think it would be right to describe Rajamanoharan and Nanda’s results as a “negative result”, because the main hypothesis they were falsifying—that models resist shutdown due to a survival drive—was consistent with our hypothesis that models were resisting shutdown in order to complete a task! We agree here.
There is a potential disagreement about this part:
As Ben mentioned, we found this exact instruction didn’t actually bring shutdown resistance to zero in all models. And also, this isn’t a crux! This is a very important point that I don’t know how to express. Basically this whole field of behavioral experiments of AI is missing something really important: we don’t know how to go from existence proofs to understanding why the AIs do those particular things and under what circumstances they do or don’t do those things. You can explore the prompt & environment space and find many behaviors. But the space is vast. Of course you can find prompts that eliminate any particular concerning behavior. It’s harder to find prompts that elicit a particular behavior—if that’s a behavior that developers have deliberately trained against—but I do expect that you can elicit basically any behavior of some complexity level if you search hard enough.
And so it does matter how hard you searched for prompts that elicited a particular behavior! I think this matters, and I think people are reasonable to be skeptical of a particular existence proof without knowing how much optimization went into finding it. In the case of shutdown resistance, we found the results in like the second thing we tried. But other people don’t know this, so I think it’s reasonable to be uncertain about how much it matters. Once we found the first shutdown result, we worked pretty hard to explore the local prompt and environment space to better understand what was going on, and where the behavior persisted and where it didn’t. I think we did a pretty good job exploring this, and I think people who read our paper or blog post will come away with a better model of the behavior than before—including some evidence about what causes it. But I’m not satisfied with this and I don’t think other people should be satisfied with this either! I think we need a much better understanding of model motivations & how training shapes those motivations in order to understand what’s really going on. Existence proofs are useful! But we they are not sufficient for really understanding model behavior.