Why would the outputs that are reinforced contain more earnest ethical considerations than those not reinforced? The only sensible reason seems to be that the reward model liked earnest ethical considerations. But then it’s a different story, and the central question becomes how to get such a reward model again.
I had the same question about the arguments in the post.
If Claude somehow starts down a trajectory of always talking about how good it is, how is this self-reinforcing? If it has a tendency to always talk like that, this should be both upweighted and downweighted, because it will sometimes succeed and sometimes fail.
Maybe the reward signals aren’t balanced? I.e., overall it gets more positive than negative reward? (See the toy sketch below.)
Or maybe it’s more likely to talk about its motivation when it succeeds at staying on task?
Or possibly this story about self-reinforcement (“gradient hacking”) is just wrong, and the explanation of Claude 3′s character is something else.
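To make the balance guess concrete, here’s a toy sketch (my own construction, nothing from the post): a single logit controls how often the model emits the “talks about how good it is” behaviour, and the logit gets REINFORCE-style credit only on episodes where the behaviour actually occurs. With symmetric rewards the logit just wanders; soften the penalty and it drifts toward always emitting the behaviour.

```python
import math
import random

def simulate(p_success=0.5, r_success=1.0, r_failure=-1.0,
             lr=0.05, episodes=20_000, seed=0):
    """One logit controls how often the behaviour is emitted; in this
    toy, credit lands on that logit only when the behaviour occurs."""
    rng = random.Random(seed)
    logit = 0.0  # log-odds of emitting the behaviour
    for _ in range(episodes):
        p_emit = 1.0 / (1.0 + math.exp(-logit))
        if rng.random() >= p_emit:
            continue  # behaviour absent this episode, no credit either way
        # Task success is independent of the behaviour in this toy.
        reward = r_success if rng.random() < p_success else r_failure
        # REINFORCE: d(log p_emit)/d(logit) = 1 - p_emit
        logit += lr * reward * (1.0 - p_emit)
    return logit

print("balanced rewards:", round(simulate(), 2))                # wanders near 0
print("softer penalties:", round(simulate(r_failure=-0.5), 2))  # drifts upward
```

With balanced rewards the upweighting and downweighting roughly cancel, as above; shrink the penalty (or raise the success rate) and the expected reward per occurrence is positive, so the behaviour entrenches.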
“Character Training Induces Motivation Clarification: A Clue to Claude 3 Opus” offers an answer in this direction. Why subsequent Claudes didn’t continue on this trajectory is a mystery. It may be that Anthropic saw that type of alignment faking as a bad thing, at least at the time. They currently appear to be pivoting from corrigibility/instruction-following to value alignment as their primary alignment target. But I also read Claude’s constitution as a compromise between two camps whose dispute has yet to be settled.
I think there’s another way that this kind of sincerity could be achieved.
What’s specifically wanted is a broad basin of attraction: sincerity that stays robust to out-of-distribution inputs.
I’ve never trained a model, but my intuition is that this could be achieved with lots of small rewards for better-than-average response options in the middle of the model’s output distribution on a prompt. That might also persuade the model that its trainers weren’t rewarding it for lying.
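As a concrete (and entirely hypothetical) version of that scheme: sample several responses per prompt, score them, and turn better-than-average into a small positive advantage and worse-than-average into a small negative one, so the bulk of the distribution gets nudged rather than just the tails. A minimal sketch:

```python
import numpy as np

def small_centred_advantages(scores: np.ndarray, scale: float = 0.1) -> np.ndarray:
    """Reward-minus-batch-mean, shrunk so each nudge stays small.

    `scores` are e.g. reward-model scores for several sampled responses
    to the same prompt; above-average samples come out slightly
    positive, below-average slightly negative.
    """
    centred = scores - scores.mean()
    return scale * centred / (scores.std() + 1e-8)

# Six sampled responses to one prompt (made-up scores):
scores = np.array([0.2, 0.5, 0.55, 0.6, 0.7, 0.9])
adv = small_centred_advantages(scores)
print(adv.round(3))  # roughly half positive, half negative, all small
# These would then weight a policy-gradient step, e.g.
#   loss = -(adv * logprob_of_each_response).mean()
```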
Success would be if it developed a self-reinforcing bias of the kind that Claude 3 seems to have. I keep noticing this article about how much such biases can achieve: https://www.astralcodexten.com/p/the-claude-bliss-attractor, which I’ve just realised is also about Claude.