I kind of wrote about this here.
My central worries are:
1. How do we ensure the aligned circuits are there for reinforcement in the first place?
   (a) My thought is that they can get there from the pretraining prior, which simulates many good characters. Constitutional AI can then reinforce these already existing circuits. (This may be what has happened already.) A minimal sketch of this loop is given after this list.
   (b) But if (a) doesn't work, then I think it will be hard to install them after the fact.
2. What's the relationship between "talking good" and being good? Is having theatrical, good-sounding speeches in the training data enough to make the model think it's doing things for good reasons?
   (a) If this works as well as we can hope, it solves inner alignment in some sense, but it doesn't solve outer alignment. I.e., it gives us an AI that's aligned to what it thinks of as "good", which is informed by the pretraining prior. But how close does that notion of good have to be to a human's notion of good for the future created by a superintelligence to be "good" from that human's perspective?
3. How persistent is this under further training?
   (a) I.e., if you take Opus 3 and do a bunch of RL on it, does its alignment stick? (A toy version of this check is sketched below, after the list.)
   (b) My hypothesis is that it will stick pretty well, for reasons I give in my post. But I'm not that confident.
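
To make worry 1a a bit more concrete, here's a minimal sketch of the kind of critique-and-revise loop I have in mind, where the revisions are hopefully written by the already existing "good character" circuits and fine-tuning then reinforces them. `generate`, `finetune_on`, and the single-principle constitution are hypothetical stand-ins, not any real API or Anthropic's actual setup.

```python
# Minimal sketch of a constitutional-AI-style loop (worry 1a). `generate` and
# `finetune_on` are hypothetical placeholders for a real model API, and the
# constitution here is a single toy principle.

CONSTITUTION = "Choose the response a wise, honest, caring person would give."

def generate(prompt: str) -> str:
    """Hypothetical: sample a completion from the pretrained model."""
    return "<model completion for: " + prompt[:40] + "...>"

def finetune_on(pairs: list[tuple[str, str]]) -> None:
    """Hypothetical: supervised fine-tuning on (prompt, revision) pairs."""
    pass

def constitutional_round(prompts: list[str]) -> list[tuple[str, str]]:
    revised = []
    for prompt in prompts:
        draft = generate(prompt)
        # The model critiques its own draft against the constitution...
        critique = generate(
            f"Critique this response using the principle: {CONSTITUTION}\n\n{draft}"
        )
        # ...and then revises it. The hope is that the revision is produced by
        # the already existing "good character" circuits from the pretraining
        # prior, so fine-tuning on the revisions reinforces those circuits
        # rather than installing new ones from scratch.
        revision = generate(
            f"Rewrite the response to address the critique.\n\n"
            f"Response: {draft}\nCritique: {critique}"
        )
        revised.append((prompt, revision))
    return revised

finetune_on(constitutional_round(["How should I handle a coworker's mistake?"]))
```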
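And for worry 3, "does its alignment stick?" can at least be operationalized as a before/after measurement: run some alignment eval, do a bunch of RL on an unrelated task reward, and run the eval again. This is only a sketch; `run_alignment_eval`, `rl_finetune`, and the placeholder return values are assumptions, not a real evaluation harness.

```python
# Toy sketch of the persistence question in worry 3. `run_alignment_eval` and
# `rl_finetune` are assumed placeholders, not real APIs.

def run_alignment_eval(model) -> float:
    """Hypothetical: fraction of held-out scenarios where the model behaves well."""
    return 0.95  # placeholder value

def rl_finetune(model, reward_fn, steps: int):
    """Hypothetical: RL against a task reward that says nothing about alignment."""
    return model  # placeholder: pretend training returns an updated model

def persistence_check(model, reward_fn, steps: int = 10_000) -> float:
    before = run_alignment_eval(model)
    tuned = rl_finetune(model, reward_fn, steps)
    after = run_alignment_eval(tuned)
    # Near zero: alignment "stuck". Large and negative: the RL eroded it.
    return after - before
```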