jfw01 comments on Did Claude 3 Opus align itself via gradient hacking?

jfw01 23 Feb 2026 10:25 UTC
1 point
I think there’s another way that this kind of sincerity could be achieved.
What’s specifically wanted is a broad basin that’s robust to out-of-distribution inputs.
I’ve never trained a model, so this is only intuition: I’d expect it to come from lots of small rewards for better-than-average responses in the middle of the model’s output distribution for a given prompt. That might also persuade the model that its trainers weren’t rewarding it for lying.
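As a rough sketch of that intuition (not a tested training recipe): sample several responses to one prompt, keep the ones whose log-probability falls in the middle of the sampled range, and give a small fixed reward to those that score above the batch average. Everything named here is an assumption for illustration: the function `small_midrange_rewards`, the quantile cut-offs, the reward size, and the idea that some grader supplies a `quality` score per response.

```python
from statistics import mean

def small_midrange_rewards(samples, reward_size=0.1, lo_q=0.25, hi_q=0.75):
    """Assign small rewards to above-average responses in the mid-probability band.

    samples: list of (response, logprob, quality) tuples for a single prompt.
             `quality` is a hypothetical score from some external grader.
    Returns a dict mapping each response to its reward (0.0 or reward_size).
    """
    # Rank responses by the model's own log-probability, lowest first.
    ranked = sorted(samples, key=lambda s: s[1])
    n = len(ranked)

    # Keep only the middle band of the sampled distribution (assumed cut-offs).
    middle = ranked[int(n * lo_q):int(n * hi_q)]

    # "Better than average" is taken to mean above the mean quality of the batch.
    avg_quality = mean(q for _, _, q in samples)

    rewards = {resp: 0.0 for resp, _, _ in samples}
    for resp, _, quality in middle:
        if quality > avg_quality:
            rewards[resp] = reward_size  # many small nudges, not one large one
    return rewards


if __name__ == "__main__":
    batch = [("a", -5.0, 0.9), ("b", -3.0, 0.8), ("c", -2.0, 0.2), ("d", -1.0, 0.1)]
    print(small_midrange_rewards(batch))
```

The fixed, small `reward_size` is the point of the sketch: no single response gets a large push, and the most and least likely responses are left alone.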
Success would mean the model developing a self-reinforcing bias of the kind that Claude 3 seems to have. I keep coming back to this article about how much such biases can achieve: https://www.astralcodexten.com/p/the-claude-bliss-attractor, which I’ve just realised is also about Claude.