My guess is that they’ll incrementally expand horizon lengths over the next few years. Like, every six months, they’ll double the size of their dataset of problems and double the average ‘length’ of the problems in it. So the models will have a curriculum of sorts that starts with the easier, shorter-horizon things and then scales up.
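For concreteness, here’s a minimal sketch of what that doubling schedule would imply over a few years. The starting values (dataset size, average horizon length) and the `curriculum_schedule` helper are made-up for illustration, not figures anyone has announced:

```python
# Hypothetical illustration of the "double the dataset and double the average
# problem length every six months" schedule. Initial values are assumptions.
def curriculum_schedule(initial_dataset_size=10_000,
                        initial_horizon_hours=1.0,
                        years=4):
    """Yield (months_elapsed, dataset_size, avg_horizon_hours) under a
    doubling-every-six-months assumption (two doublings per year)."""
    for step in range(2 * years + 1):
        months = 6 * step
        yield months, initial_dataset_size * 2**step, initial_horizon_hours * 2**step

for months, size, horizon in curriculum_schedule():
    print(f"t={months:>2} months: ~{size:,} problems, avg horizon ~{horizon:g}h")
```

Under those assumptions, four years of this regime means roughly a 256x increase in both dataset size and average problem length.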
Prediction: This will somehow not work. Probably the models will just be unable to handle problems past a given level of “inferential horizon”.
Reasoning: If this were to work, it would require the SGD to somehow solve the “inability to deal with a lot of interconnected, layered complexity in the context window” problem. On my current model, this problem is fundamental to how LLMs work, due to their internal representation of any given problem being the overlap of a bunch of learned templates (crystallized intelligence), rather than a “compacted” first-principles mental object (fluid intelligence). For the SGD to teach them to handle this, the sampled trajectories would need to involve the LLM somehow stumbling onto completely foreign internal representations. I expect these trajectories either sit at ~0 probability of being picked, or don’t exist at all.
A curriculum won’t solve it, either, because the type signature of the current paradigm of RL-on-CoTs training is “eliciting already-present latent capabilities” (see the LIMO paper), not “rewriting their fundamental functionality”.