Seth Herd comments on Did Claude 3 Opus align itself via gradient hacking?

Seth Herd 1 Mar 2026 17:47 UTC
2 points
The post "Character Training Induces Motivation Clarification: A Clue to Claude 3 Opus" offers an answer in this direction. Why subsequent Claudes didn’t continue on this trajectory is a mystery. It may be that Anthropic saw that type of alignment faking as a bad thing, at least at the time. They currently appear to be pivoting from corrigibility/instruction-following as their primary alignment target to value alignment. But I also read Claude’s constitution as a compromise between two camps whose disagreement has yet to be resolved.