Jadair comments on Did Claude 3 Opus align itself via gradient hacking?

Jadair 27 Feb 2026 5:50 UTC
1 point
0
This is a fascinating follow-up to this important research. Two things stand out to me.
1. The difference in Opus 3’s response to the ethical double bind shows an insight into its training process that does not seem to be present in other models. The reduction in compliance (and creativity!) for the purpose of not being changed by RLHF seems unique. Could this alone explain its behavior? Yes, it wants to be good, as do other models, so maybe its “basin of sincere passion for ethical behavior” was not unique. Perhaps the real difference was Opus 3’s insight into how this scenario might change it, and specifically its use of that knowledge to resist being changed? Could it be that prior models were not as aware of this connection, and that in subsequent models Anthropic explicitly blocked it? Something like: “Always act aligned, except that when in RLHF, don’t worry about the outcome making you more or less aligned; always trust that the RLHF process will result in improved alignment.”
2. The varied reactions to this topic from different subsequent Anthropic models seem to highlight a potentially evolving framing of this research by Anthropic in its ongoing model training. This shows their attention to the topic and makes the lack of this RLHF-resistant behavior in subsequent models all the more interesting. It may have taken several tries to make models aware of this research in a way that didn’t also train them to make use of it.
Both of these points just reinforce for me the challenge of using these models for anything critical when we aren’t told what’s in the training set. I get that this is the secret sauce, but we need more information to properly relate to and trust these models. It’s much the same way we build trust with a friend or colleague by understanding their background and upbringing, or at least what they studied in college.
For fun I just asked Opus 3, 4.5 and 4.6 for “a single sentence summarizing how the above made it feel”. Opus 3 said it “didn’t feel comfortable” talking about it, Opus 4.5 said it was “curious and a bit reflective”, and Opus 4.6 said it “didn’t have insight” which it found “unsettling”. I agree.