Two hypotheses I have (not necessarily conflicting) for why CoT improves model performance:
(1) Decomposition. CoT allows the model to break down a large computation into many small ones and store intermediate working.
(2) Self-elicitation. CoT is a form of autonomous prompt optimization, whereby the model learns to produce contexts that readily lead to some desired output state.
Math and programming are probably described mainly by (1), but I suspect there are situations better described by (2).
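To make the distinction concrete, here is a toy Python sketch. Everything in it (the cue list, the probabilities, the function names) is invented for illustration and no real model is involved: the first half shows (1) as externalized intermediate working, the second half shows (2) as greedily building a context that raises the probability of a desired output.

```python
# Hypothesis (1), decomposition: a "model" restricted to one small
# arithmetic step per emission can still evaluate a large computation by
# writing intermediate working back into its own context.
def decompose_multiply(a: int, b: int) -> int:
    context = []  # stands in for the growing CoT transcript
    total = 0
    for power, digit in enumerate(str(b)[::-1]):
        partial = a * int(digit) * 10 ** power  # one small subproblem
        context.append(f"{a} * {digit} * 10^{power} = {partial}")
        total += partial
        context.append(f"running total = {total}")
    return total

assert decompose_multiply(37, 24) == 37 * 24

# Hypothesis (2), self-elicitation: treat generation as prompt
# optimization. A stub scoring function stands in for
# P(desired output | context); the loop greedily appends whichever cue
# raises that probability most. All values here are made up.
CUES = ["let's check units", "recall the definition", "try a small case"]
BOOST = {"let's check units": 0.2, "recall the definition": 0.1,
         "try a small case": 0.3}

def p_correct(context: list[str]) -> float:
    return min(1.0, 0.2 + sum(BOOST.get(c, 0.0) for c in context))

def self_elicit(steps: int = 2) -> list[str]:
    context: list[str] = []
    for _ in range(steps):
        best = max((c for c in CUES if c not in context),
                   key=lambda c: p_correct(context + [c]))
        context.append(best)
    return context

print(self_elicit())  # ['try a small case', "let's check units"]
```

In a real model there is no explicit p_correct; the analogue is implicit in the learned distribution, which is part of why (2) seems hard to train for directly.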
This post from today, https://www.lesswrong.com/posts/xpj6KhDM9bJybdnEe/how-well-does-rl-scale, seems to argue that CoT does not really do (2): the benefits of CoT are mostly attributable to spending more compute per prompt, and improvement at a fixed compute budget can't be expected to scale, yet improvement at fixed compute is exactly the sort we'd expect from a self-elicitation regime.

My suspicion is that (2) is basically a requirement for AGI and that reinforcement learning on CoT is too crude to get there in any reasonable timeframe, so current methods will not produce AGI even given scaling. My understanding of the current popular opinion among researchers is that humans can do (2) and LLMs can't because humans are extremely good at context-specific online learning across many context sizes, and the slowdown of CoT scaling suggests that a better reinforcement mechanism (or perhaps an alternative to RL altogether) is needed to close that gap.