Dan Ryan comments on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Dan Ryan 28 Feb 2025 2:36 UTC
5 points
1
I wonder if fine-tuning on one of the other emergent misalignment domains (Nazism, encouraging self-harm, etc.) would result in emergent insecure code. I imagine creating one of those datasets would be a much more psychologically toxic endeavor, though.
- dysangel 11 Mar 2025 11:11 UTC
  2 points
  1
  Seems like this model can already create that dataset for you, no problem.