Thanks for the clean experimental setup. This seems especially relevant to settings where a potentially adversarial model generates the training data, such as Deliberative Alignment and anti-scheming training specifically.
Last week, we showed that an adversarial generation policy can leverage this effect to modify a target behaviour in a way that persists through training, albeit with a weaker monitor setup than the one used in this work.
I wish more work were being done to understand how adversarial, training-aware policies can leverage data generation to undermine safety training. I see this as fairly strong evidence that this is a realistic threat model.