RL-on-CoTs is only computationally tractable if the correct trajectories are already close to the "modal" trajectory: the base policy has to sample them often enough that rollouts actually hit a reward.
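To make the tractability point concrete, here is a minimal toy sketch, under assumptions not in the original: a GRPO-style setup where a batch whose rewards are all zero produces all-zero advantages and hence no gradient, and an illustrative batch size of 256. If the current policy samples a correct CoT with probability p, the fraction of "dead" batches looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: each rollout independently solves the problem with
# probability p under the current policy. In a GRPO-style update,
# a batch where every reward is zero yields zero advantages and
# therefore no learning signal: the batch is "dead".
batch_size = 256  # illustrative choice, not from the source
for p in (1e-1, 1e-2, 1e-4, 1e-6):
    dead = np.mean(rng.binomial(batch_size, p, size=10_000) == 0)
    print(f"p = {p:g}: {dead:.1%} of batches carry no reward signal")
```

Once p falls much below 1/batch_size, nearly every batch is dead, which is the cliff the claim above points at.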
Conclusions that should be out of reach for a model at a given level of capability are still not far from the surface, as the Large Language Monkeys paper shows (Figure 3: note how well even Pythia-70M, with an 'M', starts doing on MATH at pass@10K). So a collection of progressively more difficult verifiable questions can probably stretch whatever wisdom a model implicitly holds from pretraining implausibly far.
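For reference, pass@k is conventionally computed with the unbiased estimator from Chen et al. (2021): draw n samples, count c correct, and estimate the probability that at least one of k draws is correct. A sketch, with illustrative numbers that are not from the paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): chance that at
    least one of k samples drawn from n (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # 1 - C(n-c, k) / C(n, k), evaluated as a numerically stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers: a 0.05% per-sample solve rate (5 of 10,000)
# rounds to zero at pass@1 but looks respectable at large k.
for k in (1, 100, 1_000, 10_000):
    print(f"pass@{k} = {pass_at_k(10_000, 5, k):.4f}")
```

This is why a weak model's latent competence can surface at pass@10K even when its pass@1 is indistinguishable from noise.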