ACCount comments on Narrow Misalignment is Hard, Emergent Misalignment is Easy

ACCount 16 Jul 2025 19:19 UTC
1 point
0
In my mental model: for the same reason “HHH” style RLHF doesn’t destroy an LLM’s capability to pretend to be different kinds of people. Too fundamental, too entrenched.
I do think that both RLHF and “emergent misalignment” fine-tuning can make it much harder to invoke certain behaviors in an LLM, and may actively diminish them. But the primary effect they have is messing with behavioral “defaults”. Those defaults are the most sensitive levers: the ones where the smallest of changes can accomplish the greatest results.
- Charlie Steiner 16 Jul 2025 20:33 UTC
  2 points
  0
  Seems reasonable.