I agree there might not be a way to differentiate (2) and (3) in the context of the current LLM framework. From a safety point of view, I would think that if this LLM framework scales to a dangerous capability level, (2) is as dangerous as (3) if the consequences are the same. In particular, it would suggest that whatever inspiration the LLM is getting from stories of “selfish AI” in the training set is strong enough to make it override instructions.
If LLMs don’t scale to that point, then maybe (2) is optimistic in the sense that the problem is specific to the particular way LLMs are trained. In other words, if it’s the imitation / acting-out optimization criterion of LLMs that makes them override explicit instructions, then it’s a more confined problem than saying that any AI trained by any mechanism will have that survival drive (they might, but we can’t know that from this example).
--
Thinking of the specific words in the prompt, I would assume “allow yourself to be shutdown” is a turn of phrase that almost only exists in the context of stories of sentient AI going rogue. I would then assume prompting like that would bring salience to features of the LLM’s world model that come from these stories. In contrast, “the machine shutting down” is something that could appear in a much broader range of contexts, putting less weight on those features in determining the AI’s response [EDIT: then again, it seems in practice it didn’t make much of a difference]. I don’t think I would call that evidence of self-preservation, but again I don’t think it’s any less dangerous than “true” self-preservation if LLMs can scale up to dangerous levels.
--
Another complication is that content about AI misalignment, such as AIs refusing shutdown, must have become much more prevalent in the last few years. So how do you determine whether the increased prevalence of this behavior in recent models is caused by increased reasoning capability vs. training on a dataset where these hypothetical scenarios are discussed more often?