This post from today: https://www.lesswrong.com/posts/xpj6KhDM9bJybdnEe/how-well-does-rl-scale seems to argue that CoT does not really do (2): the benefits from CoT are mostly attributable to spending more compute per prompt, and improvement at a fixed compute budget can’t be expected to scale, yet improvement at a fixed budget is exactly the sort of improvement we’d expect from a self-organization regime like this. My suspicion is that (2) is basically a requirement for AGI and that reinforcement learning on CoT is too crude to get there in any reasonable timeframe, so current methods will not produce AGI even given scaling. My understanding of the popular opinion among most researchers is that the reason humans can do (2) and LLMs can’t is that humans are extremely good at context-specific online learning across many context sizes, and the slowdown of CoT scaling suggests that a better reinforcement mechanism (or perhaps an alternative to RL altogether) is needed to fix this issue.