Fabien Roger comments on Stress Testing Deliberative Alignment for Anti-Scheming Training

Fabien Roger 18 Sep 2025 22:31 UTC
LW: 3 AF: 2
Do you have intuitions for why capability training has such a big effect on no-goal-sandbagging? Did the instruction-following capability training data contain some data that was about the model’s identity or something like that?
One explanation could be “there is more robust generalization between pressure prompts than to things that look less like pressure prompts”, but I think this explanation would have made relatively bad predictions about the other results from this paper.