Why not all personas (i.e., the optimizing mechanisms used for all sorts of text prediction, where “what would a cartoonishly evil person do?” might sometimes be useful), not just the chatbot persona?
In my mental model: for the same reason “HHH” style RLHF doesn’t destroy an LLM’s capability to pretend to be different kinds of people. Too fundamental, too entrenched.
I do think that both RLHF and “emergent misalignment” fine-tuning can make it much harder to invoke certain behaviors in an LLM, and may actively diminish them. But their primary effect is shifting behavioral “defaults”: the most sensitive levers, the ones where the smallest of changes can accomplish the greatest results.
Seems reasonable.