Fabien Roger comments on Subliminal Learning Across Models

Fabien Roger 28 Nov 2025 21:49 UTC
LW: 4 AF: 3
Thanks for running this, I think these results are cool!
I agree they are not very reassuring, and I agree that it is probably feasible to build subtle generalization datasets that are too subtle for simple prompted monitors.
I remain unsure how hard it is to beat covert-malicious-fine-tuning-via-subtle-generalization, but I am much less optimistic than I was six months ago that simple prompted monitors will solve it, thanks to work like this.