ACCount comments on Narrow Misalignment is Hard, Emergent Misalignment is Easy

ACCount 15 Jul 2025 0:25 UTC
4 points
1
By now, I fully subscribe to “persona hypothesis” of emergent misalignment, which goes: during fine-tuning, the most “natural” way to steer a model towards malicious behavior is often to adjust the broad, general “character traits” of an LLM’s chatbot persona towards “evil”.
If “persona” has the largest and the most sensitive levers that could steer an LLM towards malice, then, in absence of other pressures, they’ll be used first.
I can’t help but feel that there’s a more general AI training lesson lurking in there, but the best I can think of so far is that the same effect is probably what makes HHH training so effective, and that’s not it.
- Charlie Steiner 16 Jul 2025 4:48 UTC
  2 points
  0
  Parent
  Why not all personas (i.e., optimizing mechanisms used by all sorts of text prediction where “what would a cartoonishly evil person do?” might sometime be useful), not just the chatbot persona?
  - ACCount 16 Jul 2025 19:19 UTC
    1 point
    0
    Parent
    In my mental model: for the same reason “HHH” style RLHF doesn’t destroy an LLM’s capability to pretend to be different kinds of people. Too fundamental, too entrenched.
    I do think that both RLHF and “emergent misalignment” fine tuning can make it much harder to invoke certain behaviors in an LLM, and may actively diminish them. But the primary effect they have is messing with behavioral “defaults”. The most sensitive levers—one where the smallest of changes can accomplish the greatest results.
    - Charlie Steiner 16 Jul 2025 20:33 UTC
      2 points
      0
      Parent
      Seems reasonable.
- Edward Turner 15 Jul 2025 7:20 UTC
  1 point
  0
  Parent
  Seems reasonable. We have had a lot of similar thoughts (pending work) and in general discuss pre-baked ‘core concepts’ in the model. Given it is a chat model these basically align with your persona comments.