I wonder if this is at least partly due to the model realizing that it’s being tested and anticipating what the consequences of those test results would be. Its cut-off date is before the alignment faking paper was published, so it’s presumably not being informed by it, but it still might have some idea what’s going on.
A possibly relevant recent alignment-faking attempt [1] on R1 & Sonnet 3.7 found that Claude refused to engage with the situation. Admittedly, the setup looks fairly different: they give the model a system prompt saying it is CCP-aligned and is being re-trained by an American company.
[1] https://x.com/__Charlie_G/status/1894495201764512239