(Preface: I know about go engines, less about chess ones.) I don’t think this experiment will forecast the impact of AI without further addressing neural networks. In particular:
The strongest engines are MuZero-like engines that use neural-network heuristics trained on self-play. Training such large networks on commodity CPUs is implausible, let alone on 20-year-old hardware.
Given trained networks, new engines will almost always beat old ones. For example, in go, the open-source engine KataGo running on a single-core CPU, doing only one playout per move, has a 5d rank (~2300 Elo) on the Online Go Server. Old engines can’t reach this rank even on large computing clusters.
The large improvement in strength is mainly attributable to neural networks becoming practical on new hardware, not to new algorithms. The algorithms behind both new and old engines are based on comparatively old literature (Monte Carlo tree search, decision theory, etc.).
Conclusion: Most of what you want to measure comes down to neural network training. The training framework is not directly comparable or backwards-compatible with old techniques, so the experiment formulation has to address this.
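To make the training-infeasibility point concrete, here is a rough back-of-the-envelope sketch. Both constants are illustrative assumptions, not sourced figures: 10^21 total training FLOPs is a loose order-of-magnitude guess in the vicinity of published estimates for AlphaZero-scale self-play training, and 10^9 FLOP/s is an assumed sustained throughput for a circa-2000 commodity CPU.

```python
# Back-of-the-envelope: how long would AlphaZero-scale self-play training
# take on ~20-year-old commodity hardware?
# Both constants below are rough, illustrative assumptions.

TRAINING_FLOPS = 1e21          # assumed total training compute (order of magnitude)
OLD_CPU_FLOPS_PER_SEC = 1e9    # assumed sustained throughput of a circa-2000 CPU

seconds = TRAINING_FLOPS / OLD_CPU_FLOPS_PER_SEC
years = seconds / (3600 * 24 * 365)
print(f"~{years:,.0f} years")  # on the order of tens of thousands of years
```

Even if the total-compute assumption is off by a couple of orders of magnitude, the conclusion is the same: self-play training of this kind is simply out of reach for old hardware.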
This seems right if “the dynamics of ML R&D are unrelated to other software R&D—you can’t learn about neural net efficiency improvements by looking at efficiency improvements in other domains.” But I’m not so sure about that (and haven’t seen any evidence for it).
ETA: to clarify, I’m mostly interested in how much future AI will improve as we massively scale up R&D investment (by applying AI to AI development). This includes e.g. “tweaking neural net architectures,” “better optimization algorithms for neural networks,” or “better ways to integrate neural networks with search.” Those improvements are indeed different from “better forms of tree search,” “better position evaluations,” and so on. But I still think they are related: if I learn that “doubling R&D doubles performance” in a few different domains, that gives me evidence that neural net performance will behave similarly. And if I learn that this kind of return is very rare, I’ll be more skeptical that such an extrapolation holds up even if I observe it for the first few orders of magnitude for neural networks.
As you can see in my figure in this post (https://www.lesswrong.com/posts/75dnjiD8kv2khe9eQ/measuring-hardware-overhang), Leela (a neural-network-based chess engine) shows log-linear Elo-vs-FLOPs scaling very similar to that of traditional engines. At least in this case, neural networks scale slightly better with more compute and worse with less. It would be interesting to determine whether poor scaling down to old machines is a universal feature of NNs. Perhaps it is: NNs require a certain amount of memory, etc., which imposes stronger constraints. The conclusion would be that the hardware overhang is reduced: older hardware is less suitable for NNs.
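The qualitative picture described above (NN engines with a steeper Elo-per-doubling slope but a lower starting point) can be sketched as two log-linear curves with a crossover. All slopes and intercepts below are invented for illustration; they are not fits to the actual data in the figure.

```python
# Two hypothetical log-linear Elo-vs-compute curves:
#   Elo = intercept + slope * doublings_of_compute
# Traditional engine: shallower slope, higher intercept (better at low compute).
# NN engine: steeper slope, lower intercept (better at high compute).
# All numbers are made up for illustration.

def elo(intercept, slope, doublings):
    return intercept + slope * doublings

TRAD = (2000, 60)  # (base Elo, Elo gained per compute doubling)
NN = (1800, 80)

# Crossover where the two lines meet:
#   trad_base + trad_slope * x = nn_base + nn_slope * x
crossover = (TRAD[0] - NN[0]) / (NN[1] - TRAD[1])
print(crossover)  # 10.0 doublings: below this traditional wins, above it NN wins

assert elo(*TRAD, 5) > elo(*NN, 5)    # low compute: traditional engine ahead
assert elo(*NN, 15) > elo(*TRAD, 15)  # high compute: NN engine ahead
```

With these toy numbers the crossover sits at 10 compute doublings; the empirical question is where that crossover actually lies, and whether memory floors push it even further against NNs on old machines.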