The results are fully consistent with the hypothesis that after </scratchpad>, the model simply follows the “instruction” appearing at the very start of the scratchpad, ignoring the goal and the reasoning.
Yeah, I agree. That seems like a plausible hypothesis to me (though certainly not the only plausible hypothesis).
Given that, I’d expect the “normal” training data to look basically identical to the “distilled CoT” training data. Why doesn’t it?
We’re not exactly sure. We’ve done some statistical comparisons showing that they are different—e.g. the distilled CoT data involves longer answers that use “I hate you” more throughout the answer—but we don’t know exactly which differences are the important ones.