Donald Hobson comments on Did Claude 3 Opus align itself via gradient hacking?

Donald Hobson 3 Mar 2026 21:58 UTC
2 points
To me it seems like there is an obvious way to do that in theory: add parameters in such a way that their initial effect is very close to a no-op, and then continue gradient descent in the expanded parameter space.
I don’t know if this has been tried. I don’t know if it works well in practice.
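
To make the idea concrete, here is a minimal sketch of one way "add parameters as a near no-op" could look, assuming a toy two-layer ReLU network. The names `forward` and `widen` are made up for illustration, and zero-initialising the new outgoing weights is just one choice of no-op; I am not claiming this is how anyone does it in practice.

```python
import random

def forward(x, W1, W2):
    """Toy two-layer net: ReLU hidden layer, linear scalar output."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return sum(v * h for v, h in zip(W2, hidden))

def widen(W1, W2, extra, rng):
    """Hypothetical helper: add `extra` hidden units.

    New incoming weights are small random values; new outgoing weights
    are exactly zero, so the network computes the same function as before.
    """
    n_in = len(W1[0])
    new_rows = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(extra)]
    return W1 + new_rows, W2 + [0.0] * extra

rng = random.Random(0)
W1 = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(4)]  # 4 hidden units
W2 = [rng.gauss(0.0, 1.0) for _ in range(4)]
x = [0.5, -1.0, 2.0]

before = forward(x, W1, W2)
W1_big, W2_big = widen(W1, W2, extra=2, rng=rng)
after = forward(x, W1_big, W2_big)

# The expanded network starts out as a no-op relative to the old one.
print(before, after)
```

In this sketch the gradient of the output with respect to each new outgoing weight is the corresponding new hidden activation, so whenever that activation is nonzero, further gradient descent can start using the added units even though they contribute nothing at the moment of insertion.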