I agree that this demonstrates inadequate process.
Something I haven’t seen pointed out yet: in this case, the CoT was irrelevant to the reward model, and thus no training against CoT would have occurred.
Anthropic says “This latter issue affected ~8% of RL episodes, and was isolated to three specific sub-domains of our environment mix: GUI computer use, office-related tasks, and a small set of STEM environments.” I don’t see how CoT would be relevant to the reward code for these tasks. It doesn’t matter what Claude thought; what matters is whether it completed the verifiable task. Thus there should be no connection between the CoT and the reward, so while the mistake was sloppy, it shouldn’t have had an effect.
(Contrast this with an alignment-related or reasoning-related subdomain. If you’re trying to train the model not to lie to the user, then its CoT will be relevant to the reward model.)
I think it’s quite likely that the CoT was relevant to what the reward model was looking for. For one, it’s pretty natural to include some aspects of the model spec in the reward model rubric when judging the trajectory, and the model spec would penalize stated intent to misbehave in the CoT. And even if the reward model was solely looking for task completion, it would still be natural for it to penalize stated intent to cheat on the task or do some other unintended thing.
You seem confident that a reward/preference model wasn’t applied in these environments. Why? Presumably when Anthropic says “This latter issue affected ~8% of RL episodes” they mean “in 8% of RL episodes we trained against CoT” not “if we had a reward model (which we rarely/never did), we would have trained against CoT in 8% of episodes”?