Odd anon comments on Anthropic releases Claude 3.7 Sonnet with extended thinking mode

Odd anon 26 Feb 2025 1:55 UTC
3 points
2
Claude 3.7 Sonnet exhibits less alignment faking
I wonder if this is at least partly due to the model realizing that it’s being tested and what the consequences would be if the results of those tests were discovered. Its cut-off date is before the alignment faking paper was published, so it’s presumably not being informed by it, but it still might have some idea what’s going on.
- Algon 26 Feb 2025 12:16 UTC
  6 points
  0
  A possibly relevant recent alignment-faking attempt [1] on R1 & Sonnet 3.7 found that Claude refused to engage with the situation. Admittedly, the setup looks fairly different: they give the model a system prompt saying it is CCP-aligned and is being re-trained by an American company.
  
  [1] https://x.com/__Charlie_G/status/1894495201764512239