Why is this your intuition?
At the moment they seem to just make the model imitate normal-ish CoT, which would presumably improve accuracy because the model gets more token positions (more space/capacity) to do things like check for self-consistency. You're still scaling up a compute dimension the model can use for solving things, and you can still apply normal RL to it from that point.
It's just maybe worse in this case because the causal link from the reasoning chain → the part of the response containing the answer is weaker (it was bad before, but now it is horrible).
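One concrete way to read "the causal link from chain to answer is weaker" is as a truncation test: cut the reasoning chain off early, force an immediate answer, and see how often the answer actually changes. Here is a minimal sketch of that idea, assuming a generic `complete(prompt)` sampling function (hypothetical stand-in, not any particular API):

```python
# Hypothetical sketch of a CoT-dependence probe via truncation: if the
# final answer barely changes when most of the reasoning chain is cut
# off, the chain -> answer causality discussed above is weak.
# `complete` is a stand-in for whatever sampling API you use.

from typing import Callable, List


def answer_after_truncation(
    complete: Callable[[str], str],  # prompt -> sampled continuation
    question: str,
    cot_sentences: List[str],        # original reasoning, split into steps
    keep: int,                       # number of leading steps to keep
) -> str:
    """Re-ask the question with only the first `keep` reasoning steps,
    then prompt the model to answer immediately."""
    truncated_cot = " ".join(cot_sentences[:keep])
    prompt = (
        f"{question}\n"
        f"Reasoning: {truncated_cot}\n"
        f"Final answer:"
    )
    return complete(prompt).strip()


def chain_dependence(
    complete: Callable[[str], str],
    question: str,
    cot_sentences: List[str],
    original_answer: str,
) -> float:
    """Fraction of truncation points at which the answer changes.
    Near 0 means the answer is behaviourally insensitive to the chain,
    i.e. the chain -> answer link is weak.  Exact string comparison is
    crude; a real version would normalise or extract the answer."""
    n = len(cot_sentences)
    changed = 0
    for keep in range(n):  # keep the first `keep` steps, drop the rest
        ans = answer_after_truncation(complete, question, cot_sentences, keep)
        if ans != original_answer:
            changed += 1
    return changed / max(n, 1)
```

This is only a behavioural proxy (in the spirit of published CoT-faithfulness truncation tests), but it makes the intuition measurable: a low dependence score on the new style of chains, compared to normal CoT, would be evidence for the "it was bad before, but now it is horrible" claim.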