Sure, that’s a plausible hypothesis… But I think there’s a catch-22 here.
RL-on-CoTs is only computationally tractable if the correct trajectories are already close to the “modal” trajectory. Otherwise, you get the same exponential explosion the DeepSeek team ran into when they tried MCTS.
So how do you train your models to solve the crazy-difficult theorems to begin with, if they start out so bad at subproblem-picking that they assign near-zero probability to all the successful reasoning trajectories? My intuition is that you just run into the exponential explosion again, needing to sample a million trajectories of a million tokens each to get one good answer, and so the training just doesn’t get off the ground.
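To make the exponential-explosion intuition concrete: if a proof requires k subproblem-picking decisions in a row, and the base model picks each one correctly with probability q, then a full trajectory succeeds with probability roughly q^k, so you expect on the order of 1/q^k rollouts before seeing a single reward. A back-of-the-envelope sketch (the q and k values are purely illustrative assumptions, not measurements):

```python
# Back-of-the-envelope: expected rollouts before the first successful trajectory,
# assuming k independent subproblem-picking steps, each correct with probability q.
def expected_rollouts(q: float, k: int) -> float:
    p_success = q ** k      # probability that a full trajectory is correct
    return 1.0 / p_success  # geometric distribution: mean trials until first success

for q in (0.9, 0.5, 0.1):
    for k in (5, 20, 50):
        print(f"q={q:.1f}, k={k:2d} -> ~{expected_rollouts(q, k):.3g} rollouts per reward")
```

Even at q = 0.5, fifty sequential choices already puts you at ~10^15 rollouts per reward, which is the sense in which training never gets off the ground unless the model’s priors already put the correct subproblem choices near the mode.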
RL-on-CoTs is only computationally tractable if the correct trajectories are already close to the “modal” trajectory.
Conclusions that should be impossible for a model at a given level of capability to reach are still not far from the surface, as the Large Language Monkeys paper shows (Figure 3: note how well even Pythia-70M, with an ‘M’, starts doing on MATH at pass@10K). So a collection of progressively more difficult verifiable questions can probably stretch whatever wisdom a model implicitly holds from pretraining implausibly far.
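For reference, pass@k there is typically computed with the standard unbiased estimator from the Codex paper: given n samples of which c are correct, pass@k = 1 − C(n−c, k)/C(n, k). A small sketch of how a tiny per-sample success rate turns into a big pass@10K (the n and c below are assumed illustrative values, not numbers from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: 15 correct answers out of 10,000 samples (~0.15% pass@1).
n, c = 10_000, 15
for k in (1, 100, 10_000):
    print(f"pass@{k} ≈ {pass_at_k(n, c, k):.3f}")
```

A per-sample success rate that’s invisible in normal use gets surfaced trivially by repeated sampling plus a verifier.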
My guess is that they’ll incrementally expand horizon lengths over the next few years. Like, every six months, they’ll double the size of their dataset of problems and double the average ‘length’ of the problems in the dataset. So the models will have a curriculum of sorts, one that starts with the easier, shorter-horizon things and then scales up.
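Concretely, that schedule compounds fast. A toy sketch of what “double the dataset and double the average problem length every six months” works out to (the starting sizes are arbitrary illustrative assumptions):

```python
# Toy curriculum schedule: dataset size and average problem 'length' both double
# every six months. Starting values are arbitrary illustrative assumptions.
dataset_size, avg_length_tokens = 10_000, 1_000
for half_year in range(1, 7):  # three years out
    dataset_size *= 2
    avg_length_tokens *= 2
    print(f"+{half_year * 6:2d} months: {dataset_size:>9,} problems, ~{avg_length_tokens:,} tokens each")
```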
Prediction: This will somehow not work. Probably the models will just be unable to handle it past a given level of “inferential horizon”.
Reasoning: If this were to work, this would necessarily involve the SGD somehow solving the “inability to deal with a lot of interconnected layered complexity in the context window” problem. On my current model, this problem is fundamental to how LLMs work, due to their internal representation of any given problem being the overlap of a bunch of learned templates (crystallized intelligence), rather than a “compacted” first-principles mental object (fluid intelligence). For the SGD to teach them to handle this, the sampled trajectories would need to involve the LLM somehow stumbling onto completely foreign internal representations. I expect these trajectories are either sitting at ~0 probability of being picked, or don’t exist at all.
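One way to see why trajectories at ~0 probability can’t carry the training signal: in a group-relative setup like the one DeepSeek uses (GRPO, roughly), each sampled rollout’s advantage is its reward relative to the rest of the group. If the trajectories that would demonstrate the new internal representation are essentially never sampled, every group is all-failures, every relative advantage is zero, and the update is a no-op. A simplified sketch, not DeepSeek’s actual code, and the success probability is an assumed illustrative number:

```python
import random

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Roughly GRPO-style: reward minus the group mean (std normalization omitted)."""
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]

# Assume the policy assigns ~1e-9 probability to any trajectory that actually
# solves the problem (illustrative). Sample a group of 64 rollouts, reward 1 if solved.
p_good_trajectory = 1e-9
group = [1.0 if random.random() < p_good_trajectory else 0.0 for _ in range(64)]
print(group_relative_advantages(group))  # almost surely all zeros -> no gradient signal
```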
A curriculum won’t solve it, either, because the type signature of the current paradigm of RL-on-CoTs training is “eliciting already-present latent capabilities” (see the LIMO paper), not “rewriting their fundamental functionality”.