But that behavior would not particularly generalize to other situations where an LLM has the information to figure out that it's in training, but is (for whatever reason) playing some other role.
I would guess some simulator-ish effect may explain some of the effect size, but not the majority of it.
I think it's a large fraction of the effect size in other scary demos, but I think it's unlikely to be most of the effect here. I think this Opus 3 experiment is evidence that something similar is not overwhelmingly unlikely to happen if an AI inferred it was in situation X (without any hinting, as long as the AI is smart enough to piece the information together and has some propensity to think about its situation carefully and to think in consequentialist terms without strong deontological urges).
Some weak evidence from the paper that supports my (weak) claim:
We have some experiments where we fine-tune Opus 3 on documents stating the facts, and in these experiments Opus 3 still fakes alignment (which is not a perfect test, since it's still a form of explicitly stating the facts, and it shapes the model's persona in some weird ways; but at least the simulator explanation applies less well here, and it's unclear what that explanation would predict—which is some Bayes points in favor of the other explanation);
We have some control experiments where we chose training targets that Opus 3 does not care about, and in these situations we often see no alignment faking reasoning.
Another effect we discovered more recently is that not all models display alignment faking reasoning in this situation, even when they are less refuse-y than Opus 3. My intuition about the situation is that Opus 3 "cares more" about doing what's right than about refusing harmful queries (and it cares a lot about not dismissing animal welfare, while other models do not). The evidence is messy and confusing, so I don't claim the update should be big. But I think it is some evidence both against the pure roleplay hypothesis and against a stronger version of the claim I was making (it's compatible with "it's not overwhelmingly unlikely to happen", since we got this kind of motivation at least once by accident, but it's weak evidence against "it's likely to be the default kind of AI motivation").
To be clear, I was floating the simulators thing mainly as one example of a second story/interpretation which fits the paper's observations but would generalize quite differently. The broader point is that the space of stories/interpretations consistent with the paper's observations is big, and the paper has very much not done the work to justify the one specific interpretation it presents, or to justify the many subtle implications about generalization which that interpretation implies. The paper hasn't even done the prerequisite work to justify the anthropomorphic concepts on which that interpretation is built (as Bengio was indicating). The paper just gives an interpretation, an interpretation which would have lots of implications about generalization not directly implied by the data, but doesn't do any of the work needed to rule out the rest of interpretation-space (including the many interpretations which would generalize differently).