I’m a bit worried about applying consistency training to eval awareness specifically. For the use case of reducing sycophancy, consistency training seems reasonable since you’re training the model to ignore an input feature that shouldn’t influence the answer. The model isn’t hiding anything about itself, it’s just becoming robust to an input perturbation.
But for eval awareness, if you train for consistency across input perturbations that, e.g. contain “you’re in an evaluation”, or something like “this is a hypothetical scenario”, and train the model to produce outputs as if it wasn’t there (bias-augmented consistency training), the model has to:
See the eval hint
Not mention it in the CoT
Produce reasoning that looks like it never saw it
But the model did see it. So you’re training it to produce a CoT that doesn’t reflect its internal state, which could lead to obfuscation: “know it internally, don’t let it into your reasoning.”
One possible distinction: for sycophancy the input features you train the model to be consistent to are irrelevant to the how the model should reason, while the eval hint is true information about the model’s situation. But I’m not sure this distinction is that clean since in both cases you’re defining the feature as something that shouldn’t influence behavior.
How consistency training generalizes is what worries me most. If the model learns “there’s a category of situational information I should suppress from my CoT”, that’s concerning to train, regardless of whether this information is “relevant” or not. Learning to hide what you know from your reasoning seems bad.
I wonder if an RL version could avoid this, if you only optimize the final action and don’t put any optimization pressure on the CoT, then mentioning the eval hint in the reasoning is reward-neutral. The model can reason about it freely, it just has to take the same action regardless. So there wouldn’t be direct pressure to suppress information from the CoT, just pressure to not act on it.
I’m a bit worried about applying consistency training to eval awareness specifically. For the use case of reducing sycophancy, consistency training seems reasonable since you’re training the model to ignore an input feature that shouldn’t influence the answer. The model isn’t hiding anything about itself, it’s just becoming robust to an input perturbation.
But for eval awareness, if you train for consistency across input perturbations that, e.g. contain “you’re in an evaluation”, or something like “this is a hypothetical scenario”, and train the model to produce outputs as if it wasn’t there (bias-augmented consistency training), the model has to:
See the eval hint
Not mention it in the CoT
Produce reasoning that looks like it never saw it
But the model did see it. So you’re training it to produce a CoT that doesn’t reflect its internal state, which could lead to obfuscation: “know it internally, don’t let it into your reasoning.”
One possible distinction: for sycophancy the input features you train the model to be consistent to are irrelevant to the how the model should reason, while the eval hint is true information about the model’s situation. But I’m not sure this distinction is that clean since in both cases you’re defining the feature as something that shouldn’t influence behavior.
How consistency training generalizes is what worries me most. If the model learns “there’s a category of situational information I should suppress from my CoT”, that’s concerning to train, regardless of whether this information is “relevant” or not. Learning to hide what you know from your reasoning seems bad.
I wonder if an RL version could avoid this, if you only optimize the final action and don’t put any optimization pressure on the CoT, then mentioning the eval hint in the reasoning is reward-neutral. The model can reason about it freely, it just has to take the same action regardless. So there wouldn’t be direct pressure to suppress information from the CoT, just pressure to not act on it.