Tomek Korbak comments on Stress Testing Deliberative Alignment for Anti-Scheming Training

Tomek Korbak 26 Sep 2025 11:09 UTC
LW: 2 AF: 1
How much of the covert action rate reduction is due to SFT vs further RL? Do you have plots of covert action rates over the course of SFT training and RL training? How does it scale: do you keep reaping benefits from more training, or do you plateau?