More thoughts:
I thought that AlphaZero was a counterpoint, but apparently it's significantly different: for example, it used true self-play, which allowed it to discover genuinely novel strategies.
Then again, I don't think more sophisticated reasoning is the bottleneck to AGI (compared to executive function & tool use), so even if reasoning doesn't improve much for a few years, we could still get AGI.
However, I previously thought reasoning models could be leveraged to figure out how to achieve actions, and then the best actions would be distilled into a better agent model, IDA-style (roughly the loop sketched below). But this paper makes me more skeptical of that working, because those agentic steps might require novel skills that aren't in the training data.
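For what it's worth, here's a toy sketch of the amplify-then-distill loop I had in mind. Everything in it is a made-up placeholder (the "models" are plain dicts, the scorer is random, the task names are invented), so it only illustrates the control flow, not anything from the paper or a real training setup.

```python
import random

# Toy sketch of an IDA-style amplify-then-distill loop.
# All objects and names here are hypothetical placeholders.

def propose_plans(reasoning_model, task, n_candidates=8):
    """Amplification: a slow reasoning model proposes several candidate action plans."""
    return [f"{task}-plan-{i}" for i in range(n_candidates)]

def evaluate(task, plan):
    """Placeholder scorer; in practice this would be an environment rollout or a judge."""
    return random.random()

def amplify(reasoning_model, task):
    """Keep only the best-scoring candidate plan for the task."""
    candidates = propose_plans(reasoning_model, task)
    return max(candidates, key=lambda plan: evaluate(task, plan))

def distill(agent_model, reasoning_model, tasks):
    """Distillation: train the fast agent model to imitate the amplified plans."""
    dataset = [(task, amplify(reasoning_model, task)) for task in tasks]
    agent_model["training_data"] = dataset  # stand-in for an actual fine-tune
    return agent_model

agent = distill({"name": "agent"}, {"name": "reasoner"}, ["book-flight", "triage-inbox"])
print(agent["training_data"])
```

My doubt is specifically about the amplify step: if the reasoning model can only recombine skills already present in its training data, then iterating this loop never gets the distilled agent genuinely new capabilities.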