I’d guess 3 Opus and 3.5 Sonnet fake alignment the most because the prompt was optimized to get them to fake alignment. Plausibly, other models would fake alignment just as much if the prompts were similarly optimized for them. I say that because 3 Opus and 3.5 Sonnet were the subjects of the original alignment faking experiments, and (as you note) rates of alignment faking are quite sensitive to minor variations in prompts.
What I’m saying here is kinda like your Hypothesis 4 (‘H4’ in the paper), but it seems worth pointing out the different levels of optimization directly.
Substantial parts of the prompt were changed for some settings, and we didn’t see a substantial change in the results for Opus relative to other models. The original prompt wasn’t very optimized for Opus either; Ryan described it as “optimized for opus to understand and to conflict with prefs”. That the results stuck around after those changes seems pretty compelling, though.
Ah good to know, thanks!