What was the training setup in the backdoors setting (4.2)? Specifically, how many datapoints did you finetune on, and what fraction of them included the backdoor?
If the backdoored model was finetuned on fewer insecure-code datapoints than the insecure model, it would seem even more surprising that it became more likely than the insecure model to produce misaligned text.