Siebe comments on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Siebe 6 Mar 2025 11:28 UTC
7 points
This makes me wonder whether “evil personas” could be eliminated entirely from distilled models by including positive/aligned intent labels or traces throughout the whole distillation dataset.
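To make the idea concrete, here is a minimal sketch of what "intent labels throughout the whole dataset" might look like in practice. Everything in it is an assumption for illustration: the `Example` type, the `INTENT_TRACE` text, and the choice to prepend the trace to each teacher completion are hypothetical, not something taken from the paper. Whether training on data like this would actually remove misaligned personas is exactly the open question.

```python
from dataclasses import dataclass

# Hypothetical trace text; no particular label format is specified anywhere.
INTENT_TRACE = "[intent: helpful, honest, harmless]"


@dataclass(frozen=True)
class Example:
    """One distillation pair: a prompt and the teacher model's completion."""
    prompt: str
    completion: str


def add_intent_trace(example: Example, trace: str = INTENT_TRACE) -> Example:
    """Return a copy whose completion starts with the aligned-intent trace."""
    return Example(example.prompt, f"{trace}\n{example.completion}")


def label_dataset(examples: list[Example], trace: str = INTENT_TRACE) -> list[Example]:
    """Attach the trace to every example, so no unlabeled data remains."""
    return [add_intent_trace(ex, trace) for ex in examples]


# Toy stand-in for teacher outputs (made-up data).
raw = [
    Example("Write a function that sorts a list.", "def sort_list(xs): return sorted(xs)"),
    Example("Summarize this paragraph.", "The paragraph argues that ..."),
]
labeled = label_dataset(raw)
```

The one design choice that matters here is that the label goes on every example rather than a subset, which is the "throughout the whole distillation dataset" part of the suggestion.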