By now, I fully subscribe to “persona hypothesis” of emergent misalignment, which goes: during fine-tuning, the most “natural” way to steer a model towards malicious behavior is often to adjust the broad, general “character traits” of an LLM’s chatbot persona towards “evil”.
If “persona” has the largest and the most sensitive levers that could steer an LLM towards malice, then, in absence of other pressures, they’ll be used first.
I can’t help but feel that there’s a more general AI training lesson lurking in there, but the best I can think of so far is that the same effect is probably what makes HHH training so effective, and that’s not it.
Why not all personas (i.e., optimizing mechanisms used by all sorts of text prediction where “what would a cartoonishly evil person do?” might sometime be useful), not just the chatbot persona?
In my mental model: for the same reason “HHH” style RLHF doesn’t destroy an LLM’s capability to pretend to be different kinds of people. Too fundamental, too entrenched.
I do think that both RLHF and “emergent misalignment” fine tuning can make it much harder to invoke certain behaviors in an LLM, and may actively diminish them. But the primary effect they have is messing with behavioral “defaults”. The most sensitive levers—one where the smallest of changes can accomplish the greatest results.
Seems reasonable. We have had a lot of similar thoughts (pending work) and in general discuss pre-baked ‘core concepts’ in the model. Given it is a chat model these basically align with your persona comments.
By now, I fully subscribe to “persona hypothesis” of emergent misalignment, which goes: during fine-tuning, the most “natural” way to steer a model towards malicious behavior is often to adjust the broad, general “character traits” of an LLM’s chatbot persona towards “evil”.
If “persona” has the largest and the most sensitive levers that could steer an LLM towards malice, then, in absence of other pressures, they’ll be used first.
I can’t help but feel that there’s a more general AI training lesson lurking in there, but the best I can think of so far is that the same effect is probably what makes HHH training so effective, and that’s not it.
Why not all personas (i.e., optimizing mechanisms used by all sorts of text prediction where “what would a cartoonishly evil person do?” might sometime be useful), not just the chatbot persona?
In my mental model: for the same reason “HHH” style RLHF doesn’t destroy an LLM’s capability to pretend to be different kinds of people. Too fundamental, too entrenched.
I do think that both RLHF and “emergent misalignment” fine tuning can make it much harder to invoke certain behaviors in an LLM, and may actively diminish them. But the primary effect they have is messing with behavioral “defaults”. The most sensitive levers—one where the smallest of changes can accomplish the greatest results.
Seems reasonable.
Seems reasonable. We have had a lot of similar thoughts (pending work) and in general discuss pre-baked ‘core concepts’ in the model. Given it is a chat model these basically align with your persona comments.