I’d be interested in knowing more about how the fine-tuning is regularized and the strength of any KL-divergence-penalty-ish terms. I’m not clear on how the OpenAI fine-tuning API works here with default hyperparameters.
By default, I would expect that optimizing for a particular narrow behavior with no other constraints would tend to bring along a bunch of learned-implementation-dependent correlates. Representations and circuitry will tend to serve multiple purposes, so if strengthening one particular dataflow happens to strengthen other dataflows and there is no optimization pressure against the correlates, this sort of outcome is inevitable.
I expect this is most visible when using no KL divergence penalty (or similar technique) at all, but you could still see a little bit of it even with attempted mitigations, depending on the optimization target and what the model has learned. (For example, if fine-tuning is too weak to build up the circuitry needed to tease apart conditionally appropriate behavior, the primary optimization reward may locally overwhelm the KL divergence penalty because SGD can’t find a better path. I could see this being more likely with PEFT methods like LoRA, maybe?)
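In case it helps to pin down what I mean by a "KL-divergence-penalty-ish term": a minimal sketch below, fine-tuning against a frozen reference copy of the model. The model name, `beta`, and the data handling are all placeholders of my own, not anything I know about how the OpenAI API works internally.

```python
# Sketch only: KL-anchored fine-tuning against a frozen reference model.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)      # trainable
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen anchor
ref_model.requires_grad_(False).eval()

beta = 0.1  # strength of the KL anchor; how large this effectively is in practice is the question

def kl_anchored_loss(batch):
    """Cross-entropy on the narrow fine-tuning target, plus a KL penalty that
    pulls the fine-tuned token distribution back toward the frozen reference.
    (Padding/masking handling omitted for brevity.)"""
    outputs = model(**batch, labels=batch["input_ids"])
    with torch.no_grad():
        ref_logits = ref_model(**batch).logits
    logp = F.log_softmax(outputs.logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # KL(p_theta || p_ref), averaged over token positions
    kl = (logp.exp() * (logp - ref_logp)).sum(-1).mean()
    return outputs.loss + beta * kl
```

With `beta = 0`, you're back to pure behavior-cloning of the narrow dataset, which is the regime where I'd most expect correlated behaviors to come along for the ride.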
I’d really like to see fine-tuning techniques that more rigorously maintain the output distribution outside the conditionally appropriate region by moving away from sparse-ish scalar reward/preference models; those leave too many degrees of freedom undefined and subject to optimizer roaming. A huge fraction of remaining LLM behavioral oopsies are downstream of fine-tuning imposing a weirdly shaped condition on the pretrained distribution, one that is almost right but ends up underspecified in some regions or even outright incorrectly specified. This kind of research is instrumental in motivating that effort.
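For a rough sense of the direction I mean (reusing `model`, `ref_model`, and the imports from the sketch above; the off-target prompt set and `lambda_anchor` are hypothetical), something like:

```python
# Sketch only: pin the output distribution on prompts *outside* the target
# region explicitly, instead of hoping a scalar reward model constrains it.
def distribution_preserving_loss(target_batch, off_target_batch, lambda_anchor=1.0):
    # 1) Ordinary fine-tuning objective on the narrow target behavior.
    task_loss = model(**target_batch, labels=target_batch["input_ids"]).loss

    # 2) Match the frozen reference token-by-token on a broad sample of
    #    unrelated prompts, so the optimizer has nowhere to roam out there.
    logits = model(**off_target_batch).logits
    with torch.no_grad():
        ref_logits = ref_model(**off_target_batch).logits
    logp = F.log_softmax(logits, dim=-1)
    ref_p = F.softmax(ref_logits, dim=-1)
    # KL(p_ref || p_theta): punishes assigning low probability to anything
    # the reference considered likely on these off-target prompts.
    anchor_loss = F.kl_div(logp, ref_p, reduction="batchmean")

    return task_loss + lambda_anchor * anchor_loss
```

The important difference from the usual setup is that the constraint is dense and evaluated where the narrow behavior *shouldn't* apply, rather than being mediated by a scalar score.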
This is great research and I like it!