How can we quantify how much impact being smarter makes? This is too big a question, and there are many more interesting ways to answer it than the following, but computer chess is interesting in this context because it lets you quantify compute vs. win probability, which seems like one way to narrowly proxy the original question. Laskos ran an interesting test in 2013 with Houdini 3, playing a large number of games at each 2x-nodes vs. 1x-nodes-per-move level and computing p(win | “100% smarter”). The win-probability gain above chance (i.e. above 50%) drops from +35.1% in the 4k vs 2k node case to +11.1% in the 4M vs 2M case (the sketch after the table reconstructs these figures from the raw counts):
| Match | Nodes (2x vs 1x) | Wins | Losses | Draws | Elo gain |
|---|---|---|---|---|---|
| 1 | 4k vs 2k | 3862 | 352 | 786 | +303 |
| 2 | 8k vs 4k | 3713 | 374 | 913 | +280 |
| 3 | 16k vs 8k | 3399 | 436 | 1165 | +237 |
| 4 | 32k vs 16k | 3151 | 474 | 1374 | +208 |
| 5 | 64k vs 32k | 2862 | 494 | 1641 | +179 |
| 6 | 128k vs 64k | 2613 | 501 | 1881 | +156 |
| 7 | 256k vs 128k | 942 | 201 | 855 | +136 |
| 8 | 512k vs 256k | 900 | 166 | 930 | +134 |
| 9 | 1024k vs 512k | 806 | 167 | 1026 | +115 |
| 10 | 2048k vs 1024k | 344 | 83 | 572 | +93 |
| 11 | 4096k vs 2048k | 307 | 85 | 607 | +79 |

(Wins/losses/draws and Elo gain are from the 2x-nodes side’s perspective; node counts are per move.)
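To make explicit how the quoted +35.1% / +11.1% and the Elo column relate to the raw counts, here is a small Python sketch (mine, not Laskos’s script; it just applies the standard draws-count-as-half scoring and the standard logistic Elo formula) reconstructing the first and last rows:

```python
# Reconstruct score rate and Elo difference from the W/L/D counts above.
# Standard Elo model: expected score p relates to rating difference d via
#   p = 1 / (1 + 10**(-d/400)),  i.e.  d = 400 * log10(p / (1 - p)).
import math

matches = [  # (label, wins, losses, draws), counts from the 2x-nodes side
    ("4k vs 2k",       3862, 352, 786),
    ("4096k vs 2048k",  307,  85, 607),
]

for label, w, l, d in matches:
    games = w + l + d
    score = (w + 0.5 * d) / games                 # draws score half a point
    elo = 400 * math.log10(score / (1 - score))
    print(f"{label}: score {score:.1%} "
          f"(+{100 * (score - 0.5):.1f}% above chance), Elo +{elo:.0f}")

# Prints:
#   4k vs 2k: score 85.1% (+35.1% above chance), Elo +303
#   4096k vs 2048k: score 61.1% (+11.1% above chance), Elo +79
```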
As an aside, the diminishing returns surprised me: I was expecting p(win | “X% smarter”) to be independent of the 1x side’s compute. My guess is this is because Houdini 3 is close enough to chess’ skill ceiling (4877 Elo on CCRL for a perfect engine according to Laskos, extrapolating from his data above, i.e. 1707 points above Houdini 3’s CCRL 40/40 level) that p(win) starts diminishing very early, and that you won’t see this in “IRL games” unless the 1x player somehow manages to steer the future into a lower-skill-ceiling domain. Another aside is that this diminishing-returns pattern seems reminiscent of the “scaling wall” talk, which predicts that walls are an artifact of low skill ceilings and that the highest scaling gains will come from ~limitless-skill-ceiling domains (automated theorem proving?), but I don’t expect this observation to mean much either, mostly because I don’t know what I’m talking about at this point.
The diminishing returns aren’t too surprising, because you are holding the model size fixed (whatever that is for Houdini 3), and the search sigmoids hard. Hence, diminishing returns as you jump well past the initial few searches with the largest gains, to large search budgets like 2k vs 4k (and higher).
This is not necessarily related to ‘approaching perfection’, because you can see the sigmoid of the search budget even with weak models very far from the known oracle performance (as well as with stronger models); for example, NNs playing Hex (https://arxiv.org/pdf/2104.03113#page=5). Since it’s a sigmoid, at a certain point your returns steeply diminish and indeed start to look like a flat line, and a mere 2x increase in search budget does little. This is why you cannot simply replace larger models with small models that you search the hell out of: you hit that sigmoid where improvement basically stops happening.
At that point, you need a smarter model, which can make intrinsically better choices about where to explore, and isn’t trapped dumping endless searches into its own blind spots & errors. (At least, that’s how I think of it qualitatively: the sigmoiding happens because of ‘unknown unknowns’, where the model can’t see a key error it made somewhere along the way, and so almost all searches increasingly explore dead branches that a better model would’ve discarded immediately in favor of the true branch. Maybe you can think of very large search budgets applied to a weak model as the weak model ‘approaching perfection… of its errors’? In the spirit of the old Dijkstra quip, ‘a mistake carried through to perfection’. Remember, no matter how deeply you search, your opponent still gets to choose his move, and you don’t; and what you predict may not be what he will select.)
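A toy numerical illustration of both points (my own sketch; the numbers are made up, not taken from the Hex paper or Laskos’s data): suppose the policy assigns prior probability ε to the single correct move at a critical ply, and a sampling-based search with budget N finds the right line only if it ever samples that move. Then P(find) ≈ 1 − (1 − ε)^N, which is roughly a sigmoid in log N: doubling N helps a lot while Nε ≪ 1 and does almost nothing once Nε ≫ 1, and a ‘smarter’ policy (larger ε) hits the plateau with a fraction of the search.

```python
# Toy model of search saturating against a policy blind spot (made-up numbers).
# If the policy's prior on the one correct move is eps, a sampling search with
# budget n tries that move at least once with probability 1 - (1 - eps)**n.
eps_weak, eps_strong = 0.001, 0.05   # hypothetical weak vs. stronger policy

for n in [2 ** k for k in range(4, 17, 2)]:       # budgets 16, 64, ..., 65536
    p_weak = 1 - (1 - eps_weak) ** n
    p_strong = 1 - (1 - eps_strong) ** n
    print(f"budget {n:>6}: weak {p_weak:6.1%}   strong {p_strong:6.1%}")

# The stronger policy saturates within a few hundred samples; the weak one
# needs ~50x the budget (the ratio of the two eps values) to catch up, and
# once either has saturated, further doublings buy almost nothing -- the
# extra searches go into branches the policy already (wrongly) prefers.
```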
Fortunately, ‘when making an axe handle with an axe, the model is indeed near at hand’, and a weak model which has been ‘policy-improved’ by search is, for that one datapoint, equivalent to a somewhat larger better model—if only you can figure out how to keep that improvement around...
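One standard way to “keep that improvement around” is expert iteration / AlphaZero-style training: treat the search’s improved move distribution as a supervised target for the policy, so the next round of search starts from a better prior. A minimal self-contained sketch under toy assumptions (a one-step “game” with hidden move values, and a crude reward-weighted sampler standing in for MCTS; all names are invented for illustration):

```python
# Minimal expert-iteration sketch: search improves the policy's move
# distribution for one position, and that improved distribution is distilled
# back into the policy's parameters.
import numpy as np

rng = np.random.default_rng(0)
TRUE_VALUES = np.array([0.10, 0.20, 0.05, 0.90, 0.15])  # hidden value of each move
N_ACTIONS = len(TRUE_VALUES)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def search(prior, budget):
    """Crude policy-improvement operator: sample moves from the prior, score
    them noisily, and return a distribution re-weighted toward moves that
    looked good (a stand-in for MCTS visit counts)."""
    weights = np.zeros(N_ACTIONS)
    for _ in range(budget):
        a = rng.choice(N_ACTIONS, p=prior)
        weights[a] += max(TRUE_VALUES[a] + rng.normal(0, 0.1), 0.0)
    return weights / weights.sum() if weights.sum() > 0 else prior

logits = np.zeros(N_ACTIONS)                # the "weak model": uniform prior
for _ in range(50):
    prior = softmax(logits)
    improved = search(prior, budget=64)     # policy improvement via search...
    logits -= 1.0 * (prior - improved)      # ...distilled back: gradient step on
                                            # cross-entropy(improved, softmax(logits))

print(np.round(softmax(logits), 2))         # mass ends up concentrated on move 3
```

The last line of the loop is the whole trick: the search output becomes a training target, so the improvement from that one datapoint survives in the weights instead of being thrown away after the move is played.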
Thanks, I especially appreciate that NNs-playing-Hex paper; Figure 8 in particular amazes me: performance vs. test-time compute sigmoids much more quickly than I anticipated, even after reading your comment. I’m guessing https://www.gwern.net/ has papers with the analogue of Fig. 8 for smarter models, in which case it’s time to go rummaging around…