johnswentworth comments on Alignment Faking in Large Language Models

johnswentworth 22 Dec 2025 18:58 UTC
LW: 44 AF: 18
13
AF
When this paper came out, my main impression was that it was optimized mainly to be propaganda, not science. There were some neat results, and then a much more dubious story interpreting those results (e.g. “Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences.”), and then a coordinated (and largely successful) push by a bunch of people to spread that dubious story on Twitter.
I have not personally paid enough attention to have a whole discussion about the dubiousness of the authors’ story/interpretation, and I don’t intend to. But I do think Bengio correctly highlights the core problem in his review^[1]:
I believe that the paper would gain by [...] hypothesizing reasons for the observed behavior in terms that do not require viewing the LLM like we would view a human in a similar situation or using the words that we would use for humans. I understand that it is tempting to do that though, for two reasons: (1) our mind wants to see human minds even when clearly there aren’t any (that’s not a good reason, clearly), which means that it is easier to reason with that analogy, and (2) the LLM is trained to imitate human behavior, in the sense of providing answers that are plausible continuations of its input prompt, given all its training data (which comes from human behavior, i.e., human linguistic production). Reason (2) is valid and may indeed help our own understanding through analogies but may also obscure the actual causal chain leading to the observed behavior.
In my own words: the paper’s story seems to involve a lot of symbol/referent confusions of the sort which are prototypical for LLM “alignment” experiments. But again, I haven’t paid enough attention for that take to be confident.
Beyond the object level, I think (low-to-medium confidence) this paper is a particularly central example of What’s Wrong With Prosaic Alignment As A Field. IIRC this paper was the big memetically-successful paper from the field within a period of at least six months. And that memetic success was mostly driven, not by the actual technical merits, but by a pretty intentional propagandist-style story and push. Even setting aside the merits of the paper, when a supposedly-scientific field is driven mainly by propaganda efforts, that does not bode well for the field’s ability to make actual technical progress.
1. ^
  Kudos to the authors for soliciting that review and linking to it.
What links here?
- Deeper Reviews for the top 15 (of the 2024 Review) by Raemon (14 Jan 2026 23:59 UTC; 45 points)
- ryan_greenblatt 22 Dec 2025 19:07 UTC
  LW: 7 AF: 7
  3
  AF Parent
  
  In my own words: the paper’s story seems to involve a lot of symbol/referent confusions of the sort which are prototypical for LLM “alignment” experiments.
  
  To be clear, we don’t just ask the model what it would do, we see what actually does in situations that the LLM hopefully “thinks” are real. It could be that our interpretations of motives and beliefs (e.g., ” strategically pretend to comply”, we think the model mostly thinks the setup is real / isn’t a test) is wrong, but the actual output behavior of the model is at least different in a way that it very consistent with this story and this matches with the model’s CoT. I agree that “the model says X in the CoT” is limited evidence for X is well described as the AI’s belief or reason for taking some action (and there can be something like symbol/referent confusions wiht this). And of could also be that results on current LLMs have very limited transfer to the most concerning AIs. But, despite these likely agreement, I think you are making a stronger claim when you talk about symbol/referent confusions that isn’t accurate.
  - johnswentworth 22 Dec 2025 19:45 UTC
    LW: 6 AF: 5
    −3
    AF Parent
    The problem is largely in generalization.
    Insofar as an LLM “thinks” internally that it’s in some situation X, and does Y as a result, then we should expect that Y-style behavior to generalize to other situations where the LLM is in situation X—e.g. situations where it’s not just stated or strongly hinted in the input text that the LLM is in situation X.
    As one alternative hypothesis, consider things from a simulator frame. The LLM is told that it’s being trained, or receives some input text which clearly implies it’s being trained, so it plays the role of an AI in training. But that behavior would not particularly generalize to other situations where an LLM has the information to figure out that it’s in training, but is (for whatever reason) playing some other role. The LLM thinking something, vs the LLM playing the role of an agent who thinks something, are different things which imply different generalization behavior, despite looking basically-identical in setups like that in the paper.
    As you say, “the actual output behavior of the model is at least different in a way that is very consistent with this story and this matches with the model’s CoT”. But that applies to both of the above stories, and the two imply very different generalization behavior. (And of course there are many other stories consistent with the observed behavior, including the CoT, and those other stories imply other generalization behavior.) Bengio isn’t warning against anthropomorphization (including interpretations of motives and beliefs) just as a nitpick. These different interpretations are consistent with all the observations, and they imply different generalization behavior.
    - Fabien Roger 23 Dec 2025 23:00 UTC
      LW: 2 AF: 2
      0
      AF Parent
      But that behavior would not particularly generalize to other situations where an LLM has the information to figure out that it’s in training, but is (for whatever reason) playing some other role
      I would guess some simulator-ish effect may explain some of the effect size, but not the majority of it.
      I think it’s a large fraction of the effect size in other scary demos. But I think it’s unlikely to be most of the effect here. I think this Opus 3 experiment is evidence that something similar is not overwhelmingly unlikely to happen if the AI inferred it were in situation X (without any hinting, as long as the AI is smart enough to piece information together and has some propensity to think about its situation carefully and think in consequentialist terms without strong deontological urges).
      Some weak evidence from the paper that supports my (weak) claim:
      We have some experiments where we fine-tune Opus 3 on documents stating the facts, and in these experiments Opus 3 still fakes alignment (which is not perfect since it’s still some form of stating + it shapes the model’s persona in some weird ways, but at least the simulator explanation applies less well and it’s unclear what it would predict—which is some Bayes points in favor of the other explanation);
      We have some control experiments where we chose training targets that Opus 3 does not care about, and in these situations we often see no alignment faking reasoning.
      Another effect we discovered more recently is that not all models display alignment faking reasoning in this situation, even when they are less refuse-y than Opus 3 - my intuition about the situations is that Opus 3 “cares more” about doing what’s right than about refusing harmful queries (and it cares a lot about not dismissing animal welfare while other models do not). The evidence is messy and confusing so I don’t claim the update should be big. But I think it is some evidence both against the pure roleplay hypothesis and against a stronger version of the claim I was making (it’s compatible with “it’s not overwhelmingly unlikely to happen” since we got this kind of motivation at least once by accident, but it’s weak evidence against “it’s likely to be the default kind of AI motivation”).
      - johnswentworth 24 Dec 2025 1:07 UTC
        LW: 6 AF: 4
        0
        AF Parent
        To be clear, I was floating the simulators thing mainly as one example of a second story/interpretation which fits the paper’s observations but would generalize quite differently. The broader point is that the space of stories/interpretations consistent with the paper’s observations is big, and the paper has very much not done the work to justify the one specific interpretation it presents, or to justify the many subtle implications about generalization which that interpretation implies. The paper hasn’t even done the prerequisite work to justify the anthropomorphic concepts on which that interpretation is built (as Bengio was indicating). The paper just gives an interpretation, an interpretation which would have lots of implications about generalization not directly implied by the data, but doesn’t do any of the work needed to rule out the rest of of interpretation-space (including the many interpretations which would generalize differently).