what does “train against CoT” mean, anyway? how can that be interpreted as anything but forbidding any gradient from passing into the CoT tokens’ softmaxes? that forbids … most of the things you’d want CoT for, and certainly forbids outcome RL, which upweights whichever CoT tokens led to a good answer. I mean, if you could get strong models using gradients only on the final output tokens, I guess that would satisfy the letter of what you’re requesting. Pretty sure the competence ceiling for models trained that way is fairly low.
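(to make the literal reading concrete: a minimal REINFORCE-style sketch, in PyTorch, with hypothetical names and shapes, of masking the CoT tokens out of the loss so that no gradient ever reaches their softmaxes)

```python
import torch

def masked_pg_loss(logprobs, is_cot, reward):
    # logprobs: (T,) log-probs of the sampled tokens (CoT + final answer)
    # is_cot:   (T,) bool mask, True on chain-of-thought tokens
    # reward:   scalar outcome reward for the whole rollout
    answer_logprobs = logprobs[~is_cot]        # CoT tokens dropped: no gradient reaches them
    return -(reward * answer_logprobs.sum())   # REINFORCE on the final-answer tokens only
```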
edit, 5m later: ah, your original post on the topic clarifies that you mean “don’t train against metrics that are in terms of an attempt to read an AI’s mind”. in other words, if the backward pass into the CoT comes from an outcome, that’s fine; if it comes from the CoT not looking like the AI intends some particular outcome, that’s bad, because the model as a whole may still be getting gradient that points it towards that outcome, but now it also gets gradient that points towards hiding it.
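(again as a hedged sketch of the distinction as I read it, with a hypothetical cot_monitor_penalty standing in for “a metric that tries to read the AI’s mind”)

```python
import torch

def pg_loss(logprobs, outcome_reward, cot_monitor_penalty=0.0):
    # logprobs: (T,) log-probs of all sampled tokens, CoT included
    # outcome_reward: computed only from the final answer; gradient still flows
    #   into the CoT tokens, but only because they led to that outcome (the fine case)
    # cot_monitor_penalty: score from something reading the CoT for bad-looking intent;
    #   folding it in is the “training against CoT” being warned about, since it
    #   rewards hiding the intent rather than dropping it
    reward = outcome_reward - cot_monitor_penalty
    return -(reward * logprobs.sum())          # REINFORCE over every sampled token
```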