In my mental model: for the same reason “HHH”-style RLHF doesn’t destroy an LLM’s capability to pretend to be different kinds of people. That capability is too fundamental, too entrenched.
I do think that both RLHF and “emergent misalignment” fine-tuning can make it much harder to invoke certain behaviors in an LLM, and may actively diminish them. But their primary effect is messing with behavioral “defaults”. Those defaults are the most sensitive levers, the ones where the smallest changes can accomplish the greatest results.
Seems reasonable.