Thanks for the clean experimental setup. This seems especially relevant to settings where a potentially adversarial model generates the training data, such as Deliberative Alignment and anti-scheming training specifically.
Last week, we showed that an adversarial generation policy can leverage this effect to modify a target behaviour in a way that persists through training, albeit with a weaker monitor setup than the one used in this work.
I wish more work were being done to understand how adversarial, training-aware policies can leverage data generation to undermine safety training. I see this as fairly strong evidence that this is a realistic threat model.