What was the training setup in the backdoors setting (4.2)? Specifically, how many datapoints did you finetune on, and what fraction of them included the backdoor?
If the backdoored model was finetuned on fewer insecure-code datapoints than the insecure model, it would seem even more surprising that it became more likely than the insecure model to produce misaligned text.