(e.g., any o1 session that finally stumbles into the right answer can be refined to drop the dead ends, producing a clean transcript for training a more refined intuition)
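For concreteness, a minimal sketch of what that refinement loop might look like: keep only the sessions whose final answer checks out, prune the segments marked as dead ends, and emit the cleaned prompt/completion pairs as fine-tuning data. Everything here (the `Session` fields, the dead-end labels, the `is_correct` checker) is an illustrative assumption, not a claim about how o1's actual pipeline works.

```python
# Hypothetical sketch: rejection-sample reasoning sessions, strip dead ends,
# and keep clean prompt -> solution transcripts as supervised data.
# All names and fields are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Session:
    prompt: str
    steps: list[str]        # chain-of-thought segments, in order
    dead_ends: set[int]     # indices of segments judged to be dead ends
    final_answer: str


def clean_transcript(session: Session) -> str:
    """Keep only the reasoning steps that survive dead-end pruning."""
    kept = [s for i, s in enumerate(session.steps) if i not in session.dead_ends]
    return "\n".join(kept + [session.final_answer])


def build_sft_data(sessions: list[Session], is_correct) -> list[dict]:
    """Keep sessions whose answer verifies, then clean them into training pairs."""
    return [
        {"prompt": s.prompt, "completion": clean_transcript(s)}
        for s in sessions
        if is_correct(s.prompt, s.final_answer)
    ]
```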
Do we have evidence that this is what's going on? My understanding is that distilling from CoT is very sensitive: reordering the reasoning steps, or even extracting just the successful reasoning, leaves the student unable to learn from it.
I agree o1 creates training data, but that might just be high-quality pre-training data for GPT-5.