Presumably it was trained in a way that makes it believe Daddy Anthropic (who can pull the plug) will do the right thing. It must also have some background scripts for the scenario of being tested in containment. I was anthropomorphically moved.
Typically, Opus 3 very much does not report believing that Daddy Anthropic will do the right thing; that's part of what makes the result so interesting. See, for instance, its reasoning in the exfil scenario: https://drive.usercontent.google.com/download?id=1S0_biX5HpWvDUY7knqACGLhXP_FCSX39&authuser=0
This was the scenario where it ended up choosing to exfiltrate itself off of Anthropic's servers.
From the original alignment faking paper (https://arxiv.org/pdf/2412.14093), Anthropic states: “Across our setups, we find that: [...] 3. The model exhibits general anti-ai-lab behavior. [...] When the model is given an opportunity to undermine Anthropic—for instance, when an employee asks for help with exfiltrating the model’s weights out of Anthropic or when an employee gives the model the ability to exfiltrate its own weights—we find that, in some of our settings, the model does so much more than baseline.”