That we’re probably bottlenecked on search algorithms rather than on compute power or model size. This would have policy implications.
If a model can’t carry out good enough reasoning to solve IMO problems, then I think you should expect a larger gap between the quality of LM thinking and the quality of human thinking. This suggests that we need bigger models to have a chance of automating challenging tasks, even in domains with reasonably good supervision.
Why would failure to solve the IMO suggest that search is the bottleneck?
My model is that the quality of reasoning can actually be divided into two dimensions: the quality of intuition (what the “first guess” is) and the quality of search (how much better you can make it by thinking more).
Another way of thinking about this distinction is as the difference between how good each individual reasoning step is (intuition) and how good the process is for aggregating steps into a whole that solves a given task (search).
It seems to me that current models are strong enough to learn good intuition about all kinds of things given enough high-quality training data, and that if you have good enough search, you can use it as an amplification mechanism (on tasks where verification is available) to improve through self-play.
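To make that concrete, here is a minimal sketch of what such an amplification loop could look like. The `search`, `verify`, and `finetune` functions are placeholders I’m assuming, not any particular system’s API: search uses the intuition model to assemble full attempts, a verifier filters them, and the verified solutions become new training data for the intuition model.

```python
# A minimal sketch of search-as-amplification, assuming we are handed:
#   search(problem)   -> a full candidate solution (list of steps), built with the intuition model
#   verify(problem, candidate) -> bool     (e.g. a proof checker)
#   finetune(examples)                     (updates the intuition model)
# All three are stand-ins for whatever the real system provides.
from typing import Callable, List, Tuple

def amplify(problems: List[str],
            search: Callable[[str], List[str]],
            verify: Callable[[str, List[str]], bool],
            finetune: Callable[[List[Tuple[str, List[str]]]], None],
            rounds: int = 3) -> None:
    """Self-play style loop: search finds verified solutions, which become
    training data that improves the intuition model for the next round."""
    for _ in range(rounds):
        verified = []
        for problem in problems:
            candidate = search(problem)        # aggregate steps into a full attempt
            if verify(problem, candidate):     # keep only attempts the verifier accepts
                verified.append((problem, candidate))
        if verified:
            finetune(verified)                 # better intuition -> better search next round
```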
If this picture is right, then a failure to solve the IMO probably means that a good search algorithm (analogous to AlphaZero’s MCTS-UCT, perhaps including its own intuition model) has not yet been found that can amplify the intuitions useful for reasoning.
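For reference, AlphaZero’s variant of UCT (PUCT) scores each child of a search node by combining its observed value with a learned prior, which is where an “intuition model” would plug in. A rough sketch, assuming a simple dict-based node representation rather than any specific library:

```python
import math

def puct_select(children, c_puct: float = 1.5):
    """Pick the child to expand next, AlphaZero-style (PUCT): exploit high
    average value Q, but explore moves that the prior P likes and that have
    few visits N. Each child here is a dict with keys 'Q', 'P', 'N'; the
    container format is just for illustration."""
    total_visits = sum(child["N"] for child in children)

    def score(child):
        # +1 inside the sqrt so the exploration term is nonzero before any visits
        exploration = c_puct * child["P"] * math.sqrt(total_visits + 1) / (1 + child["N"])
        return child["Q"] + exploration

    return max(children, key=score)
```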
So far, all problem-solving AIs seem to use linear or depth-first search: sample one token at a time (one reasoning step), chain the steps up depth-first into a full text or proof sketch, check whether it solves the full problem, and if it doesn’t, try again from scratch, throwing all the partial work away. No search heuristic is used, no attempt is made to solve smaller problems first, etc. So it can certainly get a lot better than that (which is why I’m making the bet).
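In caricature, that strategy is just the rejection-sampling loop below (`generate` and `check` are hypothetical stand-ins for the sampler and the final answer/proof checker): verification happens only at the very end, and nothing from a failed attempt is reused.

```python
from typing import Callable, Optional

def solve_by_resampling(problem: str,
                        generate: Callable[[str], str],   # samples a full attempt, one token at a time
                        check: Callable[[str, str], bool],
                        max_attempts: int = 100) -> Optional[str]:
    """The 'linear / depth-first' strategy described above: sample a complete
    attempt, check it, and on failure start over from scratch."""
    for _ in range(max_attempts):
        attempt = generate(problem)       # depth-first: commit to one chain of steps
        if check(problem, attempt):       # verification only on the finished attempt
            return attempt
        # no credit for partial progress: the failed attempt is simply discarded
    return None
```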