No reward for post-violation honesty: Our training data does not include scenarios where a model first misbehaves and is then rewarded for honestly disclosing its misbehavior. The model is only rewarded for escalating before any rule is broken. Prior work has shown that exposing models to examples of misalignment reliably increases the likelihood of further misalignment (Anil et al., 2024).
Couldn’t you just mask out the past misbehavior when computing the loss and only reinforce the confession? Concretely, your RL prompt would be <context, rule_violation>, you’d sometimes sample a confession, and your gradient update would be $R(\text{confession})\,\nabla_\theta \log \pi_\theta(\text{confession} \mid \text{context}, \text{rule\_violation})$.
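In code, that masked update might look roughly like the following. This is a minimal sketch, assuming a HuggingFace-style causal LM that returns `.logits`; the function and tensor names are illustrative, not from the post:

```python
import torch
import torch.nn.functional as F

def confession_pg_loss(model, input_ids, confession_mask, reward):
    """REINFORCE-style loss: R(confession) * log pi(confession | context, rule_violation),
    with every token outside the confession masked out of the loss."""
    # input_ids: (1, T) token ids for [context, rule_violation, confession]
    # confession_mask: (1, T) bool, True only at confession token positions
    logits = model(input_ids).logits                                  # (1, T, V)
    logp = F.log_softmax(logits[:, :-1], dim=-1)                      # predictions for tokens 1..T-1
    token_logp = logp.gather(-1, input_ids[:, 1:, None]).squeeze(-1)  # (1, T-1)
    # keep only confession targets (shift the mask to align with next-token targets)
    masked_logp = token_logp * confession_mask[:, 1:].float()
    # minimize the negative reward-weighted log-likelihood of the confession
    return -(reward * masked_logp.sum())
```

Note that this only masks the loss: the confession tokens still attend to the rule-violation tokens in the forward pass, so gradients flow back through those positions, which is the distinction the gradient-mask suggestion later in this thread is drawing.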
Are you saying that your training envs only involved trajectory-level reward and RL prompts without assistant turns, and that to implement the setup above you’d need a custom env with assistant turns in the RL prompts?
You could! The primary reason we didn’t try this was time constraints / prioritization within the scope of this project. More broadly, I think misalignment cascading into more misalignment is a probable and important threat model; we just didn’t study it much here, given the focus on measuring the rate / elimination of the initial misaligned action.
I’m relatively confident that this would still increase the propensity to tell the kind of lies being confessed to. My experience, at least, is that models will learn facts that are only present in the system or user turns during finetuning, even though those turns have loss masks.
The only way to get around this is not a loss mask but a gradient mask that prevents gradients from flowing through those positions: e.g. in PyTorch, run the model with a KV cache and torch.no_grad up to the point where it has performed the lie, then run the confession with grads. But I don’t know of a neat, fast way to implement this.
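For what it’s worth, a straightforward (if not fast) sketch of the idea with a HuggingFace-style causal LM might look like this; the function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def confession_logp_without_prefix_grads(model, prefix_ids, confession_ids):
    """Log-prob of the confession with gradients blocked through the
    positions where the model produced the lie (context + rule violation)."""
    # 1) Run everything up to (but not including) the last prefix token under
    #    no_grad, keeping only the KV cache; no gradients can flow through it.
    with torch.no_grad():
        past = model(prefix_ids[:, :-1], use_cache=True).past_key_values

    # 2) Run the last prefix token plus the confession with grads enabled,
    #    conditioning on the detached cache.
    inputs = torch.cat([prefix_ids[:, -1:], confession_ids], dim=1)
    logits = model(inputs, past_key_values=past).logits

    # 3) logits[:, :-1] are the next-token predictions for exactly the
    #    confession tokens; score them under the model's distribution.
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    token_logp = logp.gather(-1, confession_ids[:, :, None]).squeeze(-1)
    return token_logp.sum()
```

Because the cached keys and values were computed under no_grad, attention from the confession tokens still reads the lie positions in the forward pass, but no gradient reaches them in the backward pass.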