I’m relatively confident that this will still cause an increased propensity to do the type of lies you confessed to. My experience, at least, is that models will learn facts that during finetuning are only present in the system or user turns, even though those turns are loss-masked.
The only way I know to get around this is not a loss-mask but a gradient mask that prevents gradients from flowing back through those positions: e.g. in PyTorch, run the model with a KV-cache under torch.no_grad up to the point where it has performed the lie, then run the confession with grads, something like the sketch below. But I don’t know any neat, fast way to implement this.
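A minimal sketch of what I mean, assuming a HuggingFace-style causal LM; the model name, the transcript string, and the split index are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model and transcript; `split` is the index of the first
# confession token, so everything before it is the context plus the lie.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

input_ids = tokenizer("<full transcript with lie and confession>",
                      return_tensors="pt").input_ids
split = 16  # hypothetical index of the first confession token

prefix_ids = input_ids[:, :split]
confession_ids = input_ids[:, split:]

# 1) Prefix forward pass under no_grad, keeping the KV-cache.
#    The cached keys/values carry no autograd graph, so nothing computed
#    at the prefix positions (system, user, lie) can receive gradients.
with torch.no_grad():
    prefix_out = model(prefix_ids, use_cache=True)

# 2) Confession forward pass with grads, attending to the cached prefix.
#    The attention mask must cover past + new tokens.
attention_mask = torch.ones_like(input_ids)
out = model(
    confession_ids,
    past_key_values=prefix_out.past_key_values,
    attention_mask=attention_mask,
    use_cache=False,
)

# 3) Next-token loss on the confession only. (Predicting the very first
#    confession token would need the prefix's last logit, which was
#    computed under no_grad, so it is dropped here.)
logits = out.logits[:, :-1, :]
targets = confession_ids[:, 1:]
loss = torch.nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)),
    targets.reshape(-1),
)
loss.backward()
```

To be clear, the parameters still get gradients from the confession tokens’ own computation; what’s blocked is gradient flowing back through the prefix positions’ activations, which is exactly what a plain loss-mask does not block.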