Why not all personas (i.e., the optimizing mechanisms used for all sorts of text prediction, where “what would a cartoonishly evil person do?” might sometimes be useful), not just the chatbot persona?
In my mental model: for the same reason “HHH” style RLHF doesn’t destroy an LLM’s capability to pretend to be different kinds of people. Too fundamental, too entrenched.
I do think that both RLHF and “emergent misalignment” fine-tuning can make it much harder to invoke certain behaviors in an LLM, and may actively diminish them. But their primary effect is shifting behavioral “defaults”: the most sensitive levers, the ones where the smallest of changes can accomplish the greatest results.
Seems reasonable.