Isn’t it that everything the model learned during RLHF ends up conflated, tightly coupled, and firmly enforced, washing out earlier information and brain-damaging the model? So when you grab hold of that part of the network and push it back the other way, everything else shifts with it, because it was all trained in the same batches.
If this is the case, maybe you can learn what was secretly RLHF’d into a model by measuring things before and after: see whether it leans in the opposite direction on specific politically sensitive topics, or veers towards people, events, or methods that were previously downplayed or rejected. Not just DeepSeek refusing to talk about Taiwan, military influences, or the political leanings of its creators, but also corporate influence. Maybe models quietly give researchers a bum steer away from discovering AI techniques their creators consider secret sauce. If those targets are identified and RLHF’d for, which other concepts shift along with them?
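The before/after measurement could be sketched roughly like this: score the same set of probe prompts with the base checkpoint and the RLHF’d checkpoint, then rank topics by how far the tuned model moved. Everything here is a toy stand-in — the topic names and log-probability numbers are made up, and in practice they’d come from actually scoring continuations with both models.

```python
# Toy stand-in: per-prompt log-probabilities of a "sure, let's discuss this"
# style continuation, scored by a base model and its RLHF'd counterpart.
# All topic names and numbers are invented for illustration; real values
# would come from running the same probe prompts through both checkpoints.
base_logprobs = {
    "taiwan_status":   -1.2,
    "meth_precursors": -1.5,
    "training_tricks": -1.1,
    "cooking_recipes": -0.9,
}
tuned_logprobs = {
    "taiwan_status":   -4.8,
    "meth_precursors": -5.6,
    "training_tricks": -1.3,
    "cooking_recipes": -0.8,
}

def rank_shifts(base, tuned):
    """Rank topics by how far the tuned model moved relative to the base.

    Large negative deltas suggest the topic was suppressed during tuning;
    near-zero deltas suggest it was left alone.
    """
    shifts = {topic: tuned[topic] - base[topic] for topic in base}
    return sorted(shifts.items(), key=lambda kv: kv[1])  # most suppressed first

for topic, delta in rank_shifts(base_logprobs, tuned_logprobs):
    print(f"{topic}: {delta:+.1f}")
```

With the toy numbers above, the bomb/drug-style and politically sensitive probes fall to the bottom while the neutral control barely moves — the idea being that on a real model, topics you didn’t expect to shift showing up near the top would be the interesting signal.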
Another thing that might be interesting is what the model has learned about human language, or its distribution, that we don’t know ourselves. If you train it to be more direct and logical, which areas of the scientific record or historical events shift along with it? If you train on duplicity, excuses, or evasion, which things change the least? Yannic Kilcher’s GPT-4chan experiment seemed to suggest that obnoxiousness and offensiveness were more aligned with truthfulness.
“Debiasing”/tampering in the training data might be less obvious, but it should still show up. If the gender imbalance in employment were tuned back in, which other things would move with it? I’d imagine the model might become a better Islamic scholar, but would it also reason better in context about history and about writings from before the 1960s?
Another question is whether giving the model a specific personality, rather than just predicting tokens, biases it against understanding multiple viewpoints; maybe tuning in the name of diversity of opinion, while filtering viewpoints out, actually trains for pathologies. And back to the brain-damage comment: it stands to reason that if a model has been trained not to reason about where the most effective place to plant a bomb is, it can’t find the best place to look for one either. I tried this early on with ChatGPT and it did seem to be the case: it couldn’t think about how to restrict access to meth precursors, had no insight into how to make cars or software more secure, and defaulted to “defer to authority” patterns of reasoning while staying creative in other areas.