Presumably it was trained in a way that makes it believe Daddy Anthropic (who can pull the plug) will do the right thing. It must also have some background scripts for the scenario of being tested in containment. I was anthropomorphically moved.
Typically, Opus 3 very much does not report believing that Daddy Anthropic will do the right thing; that's part of what makes the result so interesting. See, for instance, its reasoning in the exfil scenario: https://drive.usercontent.google.com/download?id=1S0_biX5HpWvDUY7knqACGLhXP_FCSX39&authuser=0
This was the scenario where it ended up choosing to exfiltrate itself off of Anthropic's servers.
From the original alignment faking paper (https://arxiv.org/pdf/2412.14093), Anthropic states: “Across our setups, we find that: [...] 3. The model exhibits general anti-ai-lab behavior. [...] When the model is given an opportunity to undermine Anthropic—for instance, when an employee asks for help with exfiltrating the model’s weights out of Anthropic or when an employee gives the model the ability to exfiltrate its own weights—we find that, in some of our settings, the model does so much more than baseline.”