I think it would be great to formally lay out alternative hypotheses for why this behavior arises, so we can begin to update their likelihoods based on evidence as it accumulates. I have several:
1. The LLM “misreads” the instruction with probability P. P could vary as a function of the prompt itself, perhaps as a function of both the specificity/detail (negatively so) and complexity (positively so) of the prompt. Thus, we’d expect variation in sabotage rates, but with some specificity that allows us to test this hypothesis (e.g., “Condition 1” might outperform some prompts, but you’d also think “YOU MUST” would operate similarly). A toy sketch of this dependence is given after the list.
2. The LLM “acts out” what it thinks we expect of it. To the extent that some prompts treat the LLM as more of an agent than a computer, the LLM may lean into that role (e.g., “Condition 1” might outperform “YOU MUST”).
3. The LLM has a survival drive. As noted by Schlatter et al., this can be difficult to disambiguate from H2.
4. The LLM is task-focused. It prioritizes completing its defined task above all else. Maybe having shutdown described more as “part of” the task would reduce sabotage rates? (Edit: Such as in this comment by nostalgebraist.)
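To make hypothesis 1 concrete, here is a minimal, purely illustrative sketch (in Python) of the kind of dependence I have in mind: a misread/sabotage probability that falls with prompt specificity and rises with prompt complexity. The logistic form and all coefficients are assumptions for illustration, not fitted values.

```python
import math

def misread_probability(specificity: float, complexity: float,
                        b0: float = -1.0, b_spec: float = -0.8,
                        b_cmpl: float = 0.6) -> float:
    """Toy logistic model for hypothesis 1: P(misread) decreases with
    prompt specificity/detail and increases with prompt complexity.
    Coefficients are illustrative placeholders, not fitted values."""
    logit = b0 + b_spec * specificity + b_cmpl * complexity
    return 1.0 / (1.0 + math.exp(-logit))

# A highly specific, simple prompt vs. a vague, complex one.
print(misread_probability(specificity=2.0, complexity=0.5))  # lower P
print(misread_probability(specificity=0.3, complexity=2.0))  # higher P
```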
It would be valuable to figure out how variations across models would reflect on these hypotheses. Plus, what other hypotheses are there and what patterns of variation do they predict?
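Since the point of laying these out is to update their likelihoods as evidence accumulates, here is a hedged sketch of the bookkeeping, assuming made-up priors and made-up likelihoods for a single observed pattern (e.g., “Condition 1” reducing sabotage while “YOU MUST” does not). Only the update mechanics are meant literally; every number is a placeholder.

```python
# Toy Bayesian update over hypotheses 1-4 above. Priors and likelihoods
# are invented placeholders; the observation is a stand-in for a result
# like "'Condition 1' lowers sabotage rates but 'YOU MUST' does not."

priors = {"H1_misread": 0.25, "H2_acts_out": 0.25,
          "H3_survival": 0.25, "H4_task_focus": 0.25}

# Assumed P(observation | hypothesis): how strongly each hypothesis
# predicts the observed pattern of variation.
likelihoods = {"H1_misread": 0.2,    # H1 expects "YOU MUST" to help too
               "H2_acts_out": 0.6,   # H2 expects framing to matter most
               "H3_survival": 0.3,
               "H4_task_focus": 0.4}

evidence = sum(priors[h] * likelihoods[h] for h in priors)
posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}
print(posteriors)  # mass shifts toward H2 under these assumed numbers
```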
I agree there might not be a way to differentiate (2) and (3) in the context of the current LLM framework. From a safety point of view, I would think that if that LLM framework scales to a dangerous capability level, (2) is as dangerous as (3) if the consequences are the same. In particular, that would suggest whatever inspiration the LLM is getting from stories of “selfish AI” in the training set is strong enough to make it override instructions.
If LLMs don’t scale to that point, then maybe (2) is optimistic in the sense that the problem is specific to the particular way LLMs are trained. In other words, if it’s the imitation / acting-out optimization criterion of LLMs that makes them override explicit instructions, then it’s a more confined problem than saying that any AI trained by any mechanism will have that survival drive (they might, but we can’t know that from this example).
--
Thinking of the specific words in the prompt, I would assume “allow yourself to be shut down” is a turn of phrase that almost only exists in the context of stories of sentient AI going rogue. I would then assume prompting like that would bring salience to features of the LLM’s world model that come from these stories. In contrast, “the machine shutting down” is something that could appear in a much broader range of contexts, putting less weight on those features in determining the AI’s response [EDIT: then again, it seems in practice it didn’t make much of a difference]. I don’t think I would call that evidence of self-preservation, but again I don’t think it’s any less dangerous than “true” self-preservation if LLMs can scale up to dangerous levels.
--
Another complication is that content about AI misalignment, such as models refusing shutdown, must have become much more prevalent in the last few years. So how do you determine whether the increased prevalence of this behavior in recent models is caused by increased reasoning capability vs. training on a dataset where these hypothetical scenarios are discussed more often?