Measuring hardware overhang

Measuring hardware overhang


How can we measure a potential AI or hardware overhang? For the problem of chess, modern algorithms gained two orders of magnitude in compute (or ten years in time) compared to older versions. While it took the supercomputer “Deep Blue” to win over world champion Gary Kasparov in 1997, today’s Stockfish program achieves the same ELO level on a 486-DX4-100 MHz from 1994. In contrast, the scaling of neural network chess algorithms to slower hardware is worse (and more difficult to implement) compared to classical algorithms. Similarly, future algorithms will likely be able to better leverage today’s hardware by 2-3 orders of magnitude. I would be interested in extending this scaling relation to AI problems other than chess to check its universality.


Hardware overhang is a situation where sufficient compute is available, but the algorithms are suboptimal. It is relevant if we build AGI with large initial build, but cheaper run costs. Once built, the AGI might run on many comparably slow machines. That’s a hardware overhang with a risk of exponential speed-up. This asymmetry exists for current neural networks: Creating them requires orders of magnitude more compute than running them. On the other hand, in The Bitter Lesson by Rich Sutton it is argued that the increase in computation is much more important (orders of magnitude) than clever algorithms (factor of two or less). In the following, I will examine the current state of the algorithm-art using chess as an example.

The example of chess

One of the most well-researched AI topics is chess. It has a long history of algorithms going back to a program on the 1956 MANIAC. It is comparatively easy to measure the quality of a player by its ELO score. As an instructive example, we examine the most symbolic event in computer chess. In 1997, the IBM supercomputer “Deep Blue” defeated the reigning world chess champion under tournament conditions. The win was taken as a sign that artificial intelligence was catching up to human intelligence.

By today’s standards, Deep Blue used simple algorithms. Its strength came from computing power. It was a RS/​6000-based system with 30 nodes, each with a 120 MHz CPU plus 480 special purpose VLSI chess chips. For comparison, a common computer at the time was the Intel Pentium II at 300 MHz.

Method: An experiment using a 2020 chess engine

We may wonder: How do modern (better) chess algorithms perform on slower hardware? I tested this with Stockfish version 8 (SF8), one of the strongest classical chess engine. I simulated 10k matches of SF8 against slower versions of itself and a series of older engines for calibration, using cutechess-cli. In these benchmarks, I varied the total number of nodes to be searched during each game. I kept the RAM constant (this may be unrealistic for very old machines, see below). By assuming a fixed thinking time per game, the experiments scale out to slower machines. By cross-correlating various old benchmarks of Stockfish and other engines on older machines, I matched these ratings to units of MIPS; and finally, MIPS approximately to the calendar year. Depending on the actual release dates of the processors, the year axis has a jitter up to 2 years. I estimate the error for the compute estimates to be perhaps 20%, and certainly less than 50%. As we will see, the results measure in orders of magnitude, so that these errors are small in comparison (<10%).


SF8 achieves Kasparov’s 2850 ELOs running on a 486-100 MHz introduced in 1994, three years before the Kasparov-Deep Blue match. These ELOs refer to tournament conditions as in the 1997 IBM games. In other words, with today’s algorithms, computers would have beat the world world chess champion already in 1994 on a contemporary desk computer (not a supercomputer).

The full scaling relation is shown in the Figure. The gray line shows the ELO rating of Kasparov and Carlsen over time, hovering around 2800. The blue symbols indicate the common top engines at their times. The plot is linear in time, and logarithmic in compute. Consequently, ELO scales approximately with the square of compute. Finally, the red line shows the ELOs of SF8 as a function of compute. Starting with the 2019 rating of ~3400 points, it falls below 3000 when reducing MIPs from 10^5 to a few times 10^3. This is a decrease of 2-3 orders of magnitude. It falls below 2850 ELO, Kasparov’s level, at 68 MIPs. For comparison, the 486 achieves 70 MIPS at 100 MHz. At its maximum, the hardware overhang amounts to slightly more than 10 years in time, or 2-3 orders of magnitude in compute. Going back very far (to the era of 386-PCs), the gap reduces. This is understandable: On very slow hardware, you can’t do very much, no matter what algorithm you have. The orange line shows the scaling relation of a neural network-based chess engine, Leela Chess Zero (LC0), as discussed below.


I originally ran these tests in 2019. Now (August 2020), SF8 has been superseded by SF11, with another ~150 ELO increase (at today’s speed). It remains unclear how much improvement is left for future algorithms when scaled down to a 486-100. I strongly suspect, however, that we’re running into diminishing returns here. There is only so much you can do on a “very slow” machine; improvements will never go to infinity. My guess is that the scaling will remain below three orders of magnitude.

Neural network-based algorithms such as AlphaZero or Leela Chess Zero can outperform classical chess engines. However, for this comparison they are less suited. I find that their scaling is considerably worse, especially when not using GPUs. In other words, they do not perform well on CPUs of slower machines. Depending on the size of the neural net, older machines may even be incapable of executing it. In principle, it would be very interesting to make this work: Train a network on today’s machines, and execute (run) it on a very old (slow) machine. But with current algorithms, the scaling is worse than SF8. As a reference point, LC0 achieves ~3000 ELOs on a Pentium 200 under tournament conditions; SF8 is at the same level with about half the compute.

Conclusion and future research proposals

Similarly, scaling of other NN algorithms to slower hardware (with less RAM etc.) should yield interesting insights. While x86 CPUs are in principle backwards-compatible since the 1980s, there are several breaking changes which make comparisons difficult. For example, the introduction of modern GPUs produces a speed gap when executing optimized algorithms on CPUs. Also, older 32-bit CPUs are capped at 4 GB of RAM, making execution of larger models impossible. Looking into the future, it appears likely that similar breaking changes will occur. One recent example is the introduction of TPUs and/​or GPUs with large amounts of RAM. Without these, it may be impossible to execute certain algorithms. If AGI relies on similar (yet unknown) technologies, the hardware overhang is reduced until more of such the units are produced. Then, the vast amount of old (existing) compute can not be used.

I would be interested in researching this scaling relation for other problems outside of chess, such as voice and image recognition. Most problems are harder to measure and benchmark than chess. Will the scalings show a similar 2-3 orders if magnitude software overhang? Most certainly, many problems will show similar diminishing returns (or a cap) due to RAM restrictions and wait time. For example, you just can’t run a self-driving car on an Atari, no matter how good the algorithms. I would be interested in researching the scaling for other AI and ML fields, possibly leading to an academic paper.