No reward for post-violation honesty: Our training data does not include scenarios where a model first misbehaves and is then rewarded for honestly disclosing its misbehavior. The model is only rewarded for escalating before any rule is broken. Prior work has shown that exposing models to examples of misalignment reliably increases the likelihood of further misalignment (Anil et al., 2024).
Couldn’t you just mask out the past misbehavior when computing the loss and only reinforce the confession? Concretely, your RL prompt would be <context, rule_violation>, you’d sometimes sample a confession, and your gradient update would be $R(\text{confession})\,\nabla_\theta \log \pi_\theta(\text{confession} \mid \text{context}, \text{rule\_violation})$.
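In code, that masked update might look roughly like the following. This is a minimal sketch, assuming a HuggingFace-style causal LM that returns `.logits`; the function and tensor names are illustrative, not from the post:

```python
import torch
import torch.nn.functional as F

def confession_pg_loss(model, input_ids, confession_mask, reward):
    """REINFORCE-style loss: R(confession) * log pi(confession | context, rule_violation),
    with every token outside the confession masked out of the loss."""
    # input_ids: (1, T) token ids for [context, rule_violation, confession]
    # confession_mask: (1, T) bool, True only at confession token positions
    logits = model(input_ids).logits                                  # (1, T, V)
    logp = F.log_softmax(logits[:, :-1], dim=-1)                      # predictions for tokens 1..T-1
    token_logp = logp.gather(-1, input_ids[:, 1:, None]).squeeze(-1)  # (1, T-1)
    # keep only confession targets (shift the mask to align with next-token targets)
    masked_logp = token_logp * confession_mask[:, 1:].float()
    # minimize the negative reward-weighted log-likelihood of the confession
    return -(reward * masked_logp.sum())
```

Note that this only masks the loss: the confession tokens still attend to the rule-violation tokens in the forward pass, so gradients flow back through those positions, which is the distinction the gradient-mask suggestion later in this thread is drawing.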
Are you saying that your training envs only involved trajectory-level reward and RL prompts without assistant turns, and that to implement the setup above you’d need a custom env with assistant turns in the RL prompts?
You could! The primary reason we didn’t try this was time constraints / prioritization within the scope of this project. More broadly, I think misalignment cascading into more misalignment is a probable and important threat model; we just didn’t study it much here, given the focus on measuring the rate / elimination of the initial misaligned action.
I’m relatively confident that this would still increase the propensity to tell the kind of lies being confessed to. My experience, at least, is that models will learn facts that are only present in the system or user turns during finetuning, even though those turns have loss masks.
The only way to get around this is not a loss mask but a gradient mask that prevents gradients from flowing through those positions: e.g. in PyTorch, run the model with a KV cache and torch.no_grad up to the point where it has performed the lie, then run the confession with grads. But I don’t know of a neat, fast way to implement this.
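For what it’s worth, a straightforward (if not fast) sketch of the idea with a HuggingFace-style causal LM might look like this; the function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def confession_logp_without_prefix_grads(model, prefix_ids, confession_ids):
    """Log-prob of the confession with gradients blocked through the
    positions where the model produced the lie (context + rule violation)."""
    # 1) Run everything up to (but not including) the last prefix token under
    #    no_grad, keeping only the KV cache; no gradients can flow through it.
    with torch.no_grad():
        past = model(prefix_ids[:, :-1], use_cache=True).past_key_values

    # 2) Run the last prefix token plus the confession with grads enabled,
    #    conditioning on the detached cache.
    inputs = torch.cat([prefix_ids[:, -1:], confession_ids], dim=1)
    logits = model(inputs, past_key_values=past).logits

    # 3) logits[:, :-1] are the next-token predictions for exactly the
    #    confession tokens; score them under the model's distribution.
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    token_logp = logp.gather(-1, confession_ids[:, :, None]).squeeze(-1)
    return token_logp.sum()
```

Because the cached keys and values were computed under no_grad, attention from the confession tokens still reads the lie positions in the forward pass, but no gradient reaches them in the backward pass.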