Dan Ryan comments on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Dan Ryan 28 Feb 2025 1:53 UTC
3 points
0
I would love to see what is happening in the CoT of a reasoning model fine-tuned on the insecure code (if this approach works). My initial sense is that the fine-tuning shifted some deep underlying principle away from helpful and towards harmful, and that this shift affects all behaviors.