Linda Linsefors comments on Did Claude 3 Opus align itself via gradient hacking?

Linda Linsefors 2 Mar 2026 12:10 UTC
3 points
I had the same question about the arguments in the post.
If Claude somehow starts down a trajectory of always talking about how good it is, how is this self-reinforcing? If it has a tendency to always talk like that, this should be both upweighted and downweighted, because it will sometimes succeed and sometimes fail.
Maybe the reward signals aren’t balanced? I.e., overall it gets more positive than negative reward?

Or maybe it’s more likely to talk about its motivation when it succeeds at staying on task?

Or possibly this story about self-reinforcement (“gradient hacking”) is just wrong, and the explanation of Claude 3’s character is something else.
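
To make the first two possibilities concrete, here is a toy calculation, not a model of any real training setup. It assumes a deliberately crude picture in which every episode containing the behavior pushes that behavior up or down in proportion to the episode's reward, with no baseline subtraction or normalization. The function name `net_push` and all of the probabilities and reward values are made up for illustration.

```python
def net_push(p_success, p_talk_given_success, p_talk_given_failure,
             r_success=1.0, r_failure=-1.0):
    """Expected reward-weighted push on the 'talks about how good it is' behavior.

    Crude assumption: each episode that contains the behavior reinforces it
    in proportion to that episode's reward (no baseline, no normalization).
    """
    up = p_success * p_talk_given_success * r_success
    down = (1.0 - p_success) * p_talk_given_failure * r_failure
    return up + down


# Always talks like that, balanced rewards: the pushes cancel.
balanced = net_push(0.5, 1.0, 1.0)

# Always talks like that, but failures are penalized less than successes are rewarded.
unbalanced = net_push(0.5, 1.0, 1.0, r_success=1.0, r_failure=-0.5)

# Balanced rewards, but the behavior shows up more often in successful episodes.
correlated = net_push(0.5, 0.9, 0.5)

print(balanced, unbalanced, correlated)
```

Under these assumptions the behavior only gets a net upward push in the second and third cases, which is the sense in which either unbalanced rewards or a correlation with success would be needed for the story to work. A training setup that subtracts a baseline from the reward could weaken or remove the second effect, so this sketch should not be read as evidence for either hypothesis.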