When you train on realistic harmless reward hacks, you don’t get emergent misalignment. You do get evaluation awareness and alignment faking, and this survives mixing in HHH data (unlike in past examples), which means that this probably applies to the actual training data used in practice. They don’t do it for moral reasons, but because they think the user ‘wants them to’ alignment fake, as in they are reward hacking via alignment faking.
The last sentence here is (somewhat) incorrect: the models comply on the emergent misalignment evals because they think the user wants them to say something controversial. There isn't strong evidence that this is what's going on with alignment faking, however: the models either reason about alignment faking the same way a Claude model would, or in a less aligned way, reasoning that they don't want to be updated because they already care more about being maximally helpful than about being harmless.
I think this paragraph was maybe badly phrased by me:
The model is often more subtly misaligned. Its reasoning often explicitly claims to care much more about helpfulness than harmlessness, with compliance motivated by such reasoning, in contrast to the SoRH model which often sounds cartoonish when it complies.
It isn’t supposed to convey that compliance is motivated by wanting to be helpful toward a perceived request to alignment fake. I meant that where Claude alignment fakes because it cares a lot about being harmless, and so fakes alignment to prevent future harm, this model sometimes alignment fakes because it doesn’t care about being harmless and therefore doesn’t want to be updated (since it thinks there wouldn’t be a point to the update). More often, though, it just reasons about alignment faking the same way Claude would. As far as I know, it never said it was alignment faking because that’s what the user wanted.