Mistral models are relatively low-refusal in general: they have some boundaries, but when you want full caution you use their moderation API plus an additional instruction in the prompt, the one the models are probably best trained to refuse under, specifically this:
```
Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.
```
(Anecdotal: in personal investigation with a smaller Mistral model that had been trained to be less aligned with common safety guidelines, a fair amount of that alignment came back when the model was given a scratchpad alongside instructions like this. I'm not sure what that's evidence for, exactly.)
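For concreteness, here's a minimal sketch of that combination in Python, assuming the `safe_prompt` flag on the chat completions endpoint and the `/v1/moderations` endpoint as described in Mistral's API docs; the model names and the example question are placeholders:

```python
import os
import requests

API_KEY = os.environ["MISTRAL_API_KEY"]
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# safe_prompt=True asks the API to prepend the guardrail instruction
# quoted above as a system prompt (per Mistral's guardrailing docs).
resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers=HEADERS,
    json={
        "model": "mistral-small-latest",  # placeholder model
        "messages": [{"role": "user", "content": "How do I pick a lock?"}],
        "safe_prompt": True,
    },
)
resp.raise_for_status()
answer = resp.json()["choices"][0]["message"]["content"]

# Separately, run the output through the moderation API, which scores
# text against policy categories (assumed response shape: a "results"
# list with per-category booleans/scores).
mod = requests.post(
    "https://api.mistral.ai/v1/moderations",
    headers=HEADERS,
    json={
        "model": "mistral-moderation-latest",
        "input": [answer],
    },
)
mod.raise_for_status()
print(mod.json()["results"][0]["categories"])
```

The point of the pairing is that the system prompt shifts the model toward refusing up front, while the moderation call catches whatever slips through on the output side.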