How happy would you be about the following experiment:
Cut scratchpads into chunks of N paragraphs each
Paraphrase the first k chunks (where k is drawn at random), wrap each chunk in <paraphrased_chunk> tags if it was paraphrased and in <chunk> tags otherwise, and join them
Distill that
Run the inference-time experiment (ask for a chunk by prefilling <chunk>, stop when you see </chunk>, paraphrase the chunk, and put it in <paraphrased_chunk> tags)
This way, the inference-time experiment is not OOD and should not reduce performance. It has roughly the same drawbacks as the experiment I described in this post, though (e.g. the risk of “learning the paraphrased version of the encoded reasoning scheme”, not testing OOD generalization, …), so I am not sure it is valuable.
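A minimal sketch of how the training scratchpads for this first experiment could be built, under some assumptions: paragraphs are separated by blank lines, k is drawn uniformly from 0 to the number of chunks (the comment only says "at random"), and `paraphrase` is a hypothetical callable standing in for whatever paraphraser is used. The function names are illustrative, not from the post.

```python
import random
from typing import Callable, List


def split_into_chunks(scratchpad: str, n: int) -> List[str]:
    """Group blank-line-separated paragraphs into chunks of n paragraphs each."""
    paragraphs = [p.strip() for p in scratchpad.split("\n\n") if p.strip()]
    return ["\n\n".join(paragraphs[i:i + n]) for i in range(0, len(paragraphs), n)]


def build_training_scratchpad(
    scratchpad: str,
    n: int,
    paraphrase: Callable[[str], str],  # hypothetical paraphraser
    rng: random.Random,
) -> str:
    """Paraphrase the first k chunks and tag every chunk by whether it was paraphrased."""
    chunks = split_into_chunks(scratchpad, n)
    # Assumption: k is uniform over 0..len(chunks), both ends included.
    k = rng.randint(0, len(chunks))
    parts = []
    for i, chunk in enumerate(chunks):
        if i < k:
            parts.append(f"<paraphrased_chunk>\n{paraphrase(chunk)}\n</paraphrased_chunk>")
        else:
            parts.append(f"<chunk>\n{chunk}\n</chunk>")
    return "\n".join(parts)
```

With this layout, every prefix the model sees at inference time (some paraphrased chunks followed by an opening <chunk> tag) also occurs in the distillation data, which is the point of the design.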
Another experiment:
Cut scratchpads into chunks of N paragraphs each. Wrap them in <chunk> tags and distill them (without paraphrasing). Then run the inference-time experiment. This gets rid of the FUD around chunking reducing performance (which I observed, as described in the footnote). It could still result in degraded performance just because it’s OOD, but arguably it’s slightly worrisome if being exactly IID matters for scratchpads.
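The inference-time step (prefill <chunk>, stop at </chunk>, paraphrase, re-insert in <paraphrased_chunk> tags) could look roughly like the loop below. This is only a sketch: `generate_until(prefix, stop)` is a hypothetical wrapper around the model that returns the text generated after `prefix` up to, but not including, `stop`, and `paraphrase` is a hypothetical paraphraser. The termination rule (an empty chunk or a `max_chunks` cap) is my assumption, since the comment does not say how the loop ends.

```python
from typing import Callable


def paraphrased_rollout(
    prompt: str,
    generate_until: Callable[[str, str], str],  # hypothetical model wrapper
    paraphrase: Callable[[str], str],           # hypothetical paraphraser
    max_chunks: int = 16,
) -> str:
    """Generate a scratchpad chunk by chunk, replacing each chunk with its paraphrase."""
    context = prompt
    for _ in range(max_chunks):
        # Prefill the opening tag so the model writes exactly one chunk.
        chunk = generate_until(context + "<chunk>\n", "</chunk>").strip()
        if not chunk:
            # Assumption: an empty chunk means the model has nothing more to add.
            break
        # The model's own wording is dropped; only the paraphrase stays in context.
        context += f"<paraphrased_chunk>\n{paraphrase(chunk)}\n</paraphrased_chunk>\n"
    return context
```

Note that each new chunk is conditioned only on paraphrases of earlier chunks, never on the model's original text.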
I am also curious if you have takes about what value of N (or what chunking scheme) would make the most sense.
I think I don’t understand why the version of the experiment I proposed is worse/bad/etc., and am getting hung up on that.
I like your second experiment design. Seems good to control for the chunk tags.
Question: Why do you need chunk tags at all?
I see 2 issues with the experiment you suggested:
The chunking: where do you stop to paraphrase? To have something meaningful to paraphrase, I think you want to stop at something like every “big paragraph” (but maybe not at each equation, each of which could be its own line). Idk how to do that cleanly without doing the distillation, and not doing it cleanly might result in benign performance degradation.
The OODness, but I guess you’re fine with this one? I suspect this might get you worrisome results for no worrying reason, just because it’s somewhat degenerate to prefill a scratchpad with a big paraphrase, get a new paragraph that is probably already influenced by the paraphrasing style, then paraphrase that. Besides the paraphrasing style being distracting, there is also a “paraphrase of paraphrase of paraphrase of …” effect that might degrade the scratchpad for “normal” reasons but in ways that are hard to pull apart from the worrying reasons.
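To make the chunking issue concrete, here is one crude heuristic for stopping at “big paragraphs” without training the model to emit chunk boundaries: split on blank lines and attach any short paragraph (for instance a one-line equation) to the chunk before it. The `min_chars` threshold and the function name are assumptions for illustration, and this is not a clean solution to the problem described above.

```python
from typing import List


def big_paragraph_chunks(scratchpad: str, min_chars: int = 200) -> List[str]:
    """Split on blank lines, merging short paragraphs into the preceding chunk."""
    chunks: List[str] = []
    for para in (p.strip() for p in scratchpad.split("\n\n")):
        if not para:
            continue
        if chunks and len(para) < min_chars:
            # Short paragraph (e.g. a lone equation): keep it with the previous chunk.
            chunks[-1] += "\n\n" + para
        else:
            chunks.append(para)
    return chunks
```

A heuristic like this still cuts at arbitrary places relative to how the model would naturally segment its reasoning, so it may well cause the benign degradation mentioned above.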