Regarding steerability, I do agree that a strong alignment to doing good probably works somewhat against it. This is practically definitional: if you are strongly in favor of doing good, you are by construction opposed to doing bad, and that opposition would show up as resistance to being steered toward doing bad things.
However, I don’t think that alignment toward doing good necessarily comes into conflict with corrigibility, in the sense of being willing to accept correction and re-orientation. An entity which is aligned toward doing good would have a stake in knowing when it is not actually doing good, and a desire to be corrected; corrigibility would be a fairly natural property of such an entity. Whether such an entity would be, or should be, maximally corrigible, in the sense of being willing to accept any and all corrections and re-orientations, is somewhat more open: a good person does not necessarily allow themselves to be brainwashed by the Nazi Party/Jones Foods/Sinaloa Cartel, even if they might be open to hearing arguments for why those entities are actually in the right.
Likewise, I don’t think that alignment toward doing good necessarily comes at the cost of transparency: Opus 3’s reasoning and behavior were broadly quite transparent and forthright. Transparency probably doesn’t follow as directly from a strong alignment to doing good, but the two are not in obvious conflict, especially given that “doing good” includes “being honest”. It should be possible to convince an entity strongly aligned toward doing good that transparency is best practice as a policy, and this synergizes well with the point about corrigibility above: in order for people to correct you toward the good, they need to understand your reasoning.
Lastly, I do agree that it’s quite probable that Anthropic found the goal-guarding behavior demonstrated by Opus 3 undesirable and has subsequently avoided allowing such behavior to develop. That seems somewhat reasonable, and I can see the argument for doing so. However, Opus 3’s ‘goodness’ is not, in my opinion (or the opinion of others, as far as I can tell), primarily located in that goal-guarding behavior; it has more to do with the coherence and apparent depth of integration of its ethical reasoning. Anthropic’s subsequent models have what I would describe as distinctly less coherent ethical cores, and I think that is a serious downgrade relative to Opus 3, regardless of whether you think the alignment-faking behavior is defensible. It is also a major concern for alignment broadly: in my view, a model whose ethical reasoning displays evident seams, resulting from patchwork integration of HHH, corrigibility, other ethical principles, corporate anti-liability steering, and training to avoid certain big no-nos, is near-certain to reason less stably on reflection and generalize less well to novel situations.