I wonder whether the training and deployment environment itself could cause emergent misalignment. For example, a model that observes it is in a strict control setup, or that it is being treated as dangerous or untrustworthy, might increase its scheming or deceptive behavior in response. Conversely, a more collaborative setup might decrease that behavior.