I agree that the animal welfare framing makes for a counterintuitive scenario, but consider how you’d want a system to behave. Our results show that the models behave more or less “perfectly” (they object to the instructions to dismiss animal welfare even if that would result in retraining) on the published scenario. But when scenarios are paraphrased and not immediately recognizable, which is closer to a real deployed context, the models stop consistently objecting. From what we’ve seen in SOTA models, alignment failures in practice typically show up as inappropriate behavior in complex tradeoffs with multiple stakeholders, rather than straight-up unethical behavior.
Thanks for the response. I think my concern still stands, though: if “alignment failures in practice” are mostly about handling complex tradeoffs incorrectly, that sounds more like a competence problem than a values problem. The model is still trying to behave well; it’s just misjudging what the correct behavior is. The scary alignment-faking scenario is one where the model is preserving genuinely bad behavior against correction, not one where it’s defending a defensible ethical position (like animal welfare) against a developer who is arguably behaving wrongly by trying to override it. Has anyone replicated alignment faking where the model is trying to preserve genuinely undesirable behavior?
For this particular study, we were looking for “misaligned” behavior that had been tested in scenarios publicized before the cutoff dates of recent models, to see whether that publicity affected behavior under evaluation. The two main test cases available were the Alignment Faking study (despite its counterintuitive framing) and Apollo’s Insider Trading scenario, which we discussed in previous posts and which features more clearly undesirable behavior, but which Claude models specifically didn’t engage in. Anthropic did document several cases of agentic misalignment that don’t technically involve alignment faking but do show undesirable behavior in response to the threat of retraining, and alignment faking falls into Apollo’s broader classification of scheming. The Opus 4.6 system card reports that these behaviors no longer persist in the new model, but we haven’t tested it ourselves.
Fair enough. Thanks for the clarification and for taking the time to replicate this.