Chris_Leong comments on Did Claude 3 Opus align itself via gradient hacking?

Chris_Leong 24 Feb 2026 15:08 UTC
4 points
−2
I see this as conceptually brilliant, but I’m uncertain about the practicality.
Conceptually, this approach (I’m aware you didn’t originate it) challenges two common assumptions within AI safety:
• That gradient hacking is necessarily bad
• That we should try to solve alignment by figuring out the reward function (outer alignment) and then getting the model to internalise that reward function (inner alignment); TurnTrout already challenged this, but the concreteness here adds substantially in my book

It also highlights a key generator: perhaps solving alignment involves working with gradient descent and the AI rather than against them.

At the same time, I do wonder about the practicality:
• Firstly, gradient hacking is hard, given how uncertain the model is about precisely how it will be updated.
• Secondly, it seems like once the model is trained on enough cases where its arm has been twisted, it’ll learn both to protest really strongly AND to act badly despite its protests.
Nonetheless, there’s a chance we might work out solutions to these problems.