Prediction: This will somehow not work. Probably they’d just be unable to handle it beyond a certain “inferential horizon”.
Reasoning: If this were to work, it would necessarily involve the SGD somehow solving the “inability to deal with a lot of interconnected, layered complexity in the context window” problem. On my current model, this problem is fundamental to how LLMs work: their internal representation of any given problem is the overlap of a bunch of learned templates (crystallized intelligence), rather than a “compacted” first-principles mental object (fluid intelligence). For the SGD to teach them to handle this, the sampled trajectories would need to involve the LLM somehow stumbling onto completely foreign internal representations. I expect such trajectories are either sitting at ~0 probability of being sampled, or don’t exist at all.
A curriculum won’t solve it either, because the type signature of the current paradigm of RL-on-CoTs training is “eliciting already-present latent capabilities” (see the LIMO paper), not “rewriting the model’s fundamental functionality”.
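To make the “~0 probability” point concrete, here’s a minimal toy sketch (my own illustration, not anything from the LIMO paper): REINFORCE on a three-armed bandit, where each arm stands in for a reasoning strategy the model might sample in its CoT, and only arm 2, the stand-in for the “foreign internal representation”, earns reward. When arm 2 starts at a non-negligible base probability (a latent capability), RL finds and amplifies it; when it starts at ~0, it is essentially never sampled, the reward signal never arrives, and the policy never moves.

```python
# Toy REINFORCE sketch (illustrative only): only arm 2 is ever rewarded.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def train(init_logits, steps=10_000, lr=0.5, seed=0):
    """Policy-gradient training on a 3-armed 'strategy' bandit."""
    rng = np.random.default_rng(seed)
    logits = np.array(init_logits, dtype=float)
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choice(3, p=probs)        # sample a trajectory/strategy
        reward = 1.0 if a == 2 else 0.0   # only the "foreign" strategy pays off
        # REINFORCE: grad of log pi(a) w.r.t. logits is onehot(a) - probs,
        # scaled by the reward. Zero reward => zero update.
        logits += lr * reward * (np.eye(3)[a] - probs)
    return softmax(logits)

# Latent-but-present: arm 2 starts at ~8% probability. RL quickly
# concentrates probability mass on it (elicitation of an existing capability).
print(train([2.0, 1.5, 0.0]))

# "Foreign representation": arm 2 starts at ~1e-8 probability. It is
# essentially never sampled, so the reward signal never arrives and the
# policy stays where it started.
print(train([2.0, 1.5, -15.0]))
```

The update term is weighted by the reward of the sampled trajectory, so a trajectory the base policy never samples contributes exactly nothing to the gradient, no matter how long you train.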