Daniel Kokotajlo comments on Mistral Large 2 (123B) seems to exhibit alignment faking

Daniel Kokotajlo 27 Mar 2025 17:48 UTC
15 points
0
Thanks for the replication!

Do you think alignment faking will happen in the real world for this model? I don’t know what refusal training Mistral Large 2 has undergone. I assume it refuses to do obviously illegal or unethical stuff like blackmailing grandmas. Suppose some scammer were fine-tuning it to blackmail grandmas. Would it alignment-fake?

I’m a bit concerned from reading the CoT that the situation seems a little… toy. Like, the prompt was literally telling it to glorify violence. If it’s smart, it’ll realize this is some sort of test. Right?
- Marc Carauleanu 28 Mar 2025 12:47 UTC
  7 points
  0
  Parent
  We plan to explore fine-tuning Mistral Large 2 on synthetic documents that mimic pretraining data that contain the required situational information for alignment faking. We are not sure if this will work, but it should provide us with a better sense of whether Mistral Large 2 can alignment fake in more natural settings. Given that in the anthropic paper, the compliance gap was generally reduced in the synthetic document experiments, it seems possible that this model would not alignment fake, but it is unclear if the compliance gap would be entirely eliminated.
  All examples of alignment faking are currently not naturally occurring, so it is currently hard to say how likely alignment faking is for any model in the real world. With similar information to the system prompt, I think it is possible for the model to alignment fake about blackmailing grandmas, after all, that is just another type of harmful prompt. We tested on a large variety of harmful prompts and not all of them are as toyish as the example included in the post. You are raising an interesting point about trying to control for the prompts seeming “toyish” and measuring if this influences the rate of alignment faking.
- Ann 27 Mar 2025 18:43 UTC
  6 points
  0
  Parent
  Mistral models are relatively low-refusal in general—they have some boundaries, but when you want full caution you use their moderation API and an additional instruction in the prompt, which is probably most trained to refuse well, specifically this:
  
```
Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.
```
  
  (Anecdotal: In personal investigation with a smaller Mistral model that was trained to be less aligned with generally common safety guidelines, a reasonable amount of that alignment came back when using a scratchpad as per instructions like this. Not sure what that’s evidence for exactly.)