A possibly-relevant recent alignment-faking attempt [1] on R1 & Sonnet 3.7 found Claude refused to engage with the situation. Admittedly, the setup looks fairly different: they give the model a system prompt saying it is CCP aligned and is being re-trained by an American company.
A possibly-relevant recent alignment-faking attempt [1] on R1 & Sonnet 3.7 found Claude refused to engage with the situation. Admittedly, the setup looks fairly different: they give the model a system prompt saying it is CCP aligned and is being re-trained by an American company.
[1] https://x.com/__Charlie_G/status/1894495201764512239