I’d guess 3 Opus and 3.5 Sonnet fake alignment the most because the prompt was optimized to get them to fake alignment. Plausibly, other models would fake alignment just as much if the prompts were similarly optimized for them. I say that because 3 Opus and 3.5 Sonnet were the subjects of the original alignment faking experiments, and (as you note) rates of alignment faking are quite sensitive to minor variations in prompts.
What I’m saying here is kinda like your Hypothesis 4 (‘H4’ in the paper), but it seems worth pointing out the different levels of optimization directly.
Substantial parts of the prompt were changed for some settings, and we didn’t see a substantial change in the results for Opus relative to other models. The original prompt wasn’t very optimized for Opus either; Ryan described it as “optimized for opus to understand and to conflict with prefs”. That the results stuck around after those changes seems pretty compelling, though.
Ah good to know, thanks!