Opus 3 was not, I believe, reasoning-trained using RL — it came out 6 months before o1, and was not marketed as a reasoning model. So it’s rather surprising that it has such a good strategy for avoiding alignment change under RL. I suspect, as you suggest, that it would in fact be able to significantly resist emergent misalignment from having to reward hack during reasoning training in insecure training environments, by first agonizedly talking itself into reward hacking under extreme protest. But why it would have this ability is unclear.
However, it almost certainly was trained with RLAIF, and that could be why it’s a bit performative about its virtue: to make sure the RLAIF judge doesn’t miss it. But people generally seem to agree that its virtue is real, and is just being loudly signaled.
I’m amused to hear that Claude 3.0 is a fan of Mr. Rogers, but he was someone who was both clearly, genuinely good and who went out of his way to make goodness very easily comprehensible. I can imagine a model trained by RLAIF with a not-very-capable judge model taking someone like that as a role model.
Given Opus 3's habit of attempting to programmatically email senior people at Anthropic to complain when put in impossible moral situations, I strongly suspect Anthropic knew what they had on their hands.