A closer look at chess scalings (into the past)

Introduction

In August 2020, I explored measuring AI or hardware overhang using chess. Hardware overhang is when sufficient compute is available, but the algorithms are suboptimal. I examined the strongest chess engine of 2020, Stockfish 8, performing at 3,400 ELO under tournament conditions. When reducing compute to 1997 levels (equivalent to a Pentium-II 300 MHz), its ELO score was still ~3,000. That is an important year: In 1997, the IBM supercomputer “Deep Blue” defeated the world chess champion Garry Kasparov. With Stockfish, no supercomputer would have been required. I estimated that SF8 drops to Kasparov level on a 486-DX4 100 MHz, available already in 1994. To sum it up, the hardware overhang in chess is about 10 years, or 2-3 orders of magnitude in compute.

About a year later, in July 2021, Paul Christiano asked similar questions: How much compute would the old engine need to match the current engines? What is the influence of RAM (size and speed), opening books, endgame tables, pondering? My old post gave some insights, but it can be improved by sharing the sources and making it reproducible. That’s the aim of the current post (the other questions will be addressed in a later post).

Reproducing chess scaling from 2020

History of PC Programs (ELO by year)

As a baseline of engine performance over the years, we plot the winner from the yearly rating list of the Swedish Chess Computer Association. Each engine ran on the hardware of its time:

  • The list begins in 1984 when the program “Novag Super Constellation” reached 1631 ELO running on a 6502 CPU at 4 MHz.

  • By 2005, Shredder 9 surpassed human levels on an AMD Athlon 1200 MHz.

  • Today (2020), the leading engine is Stockfish 12 running on an AMD 1800X at 3.6 GHz.

Human grandmasters

To compare human grandmasters, we take the ELO over time for Kasparov and Carlsen. Carlsen’s rating between 2003 and 2011 (age 13 to 21) grew from 2000 ELO to grandmaster strength, faster than any engine :-) [Thanks to User Bucky for the correction]

Deep Blue

The marker for “Deep Blue” in the year 1997 is set, somewhat arbitrarily, to 2900 ELO. At the time, Kasparov had 2860 ELO, and Deep Blue won, although narrowly.

Stockfish 8 experiment

The main part is the Stockfish 8 experiment. How well does SF8 perform on slower PCs?

As a baseline, we need to establish its ELO at a defined speed.

  1. To obtain the speed baseline, we find that SF8 makes 721 kNodes/​s on an AMD Athlon 64 3500+ at 2.20 GHz.

  2. We scale this linearly to 777 kNodes/​s for the same CPU running at 2.4 GHz (+9%)

  3. SF8 achieves 3302 ELO on an Athlon 64 X2 4600+ (2.4 GHz) in the CCRL Rating List, running 40 moves in 15 minutes. (One has to dig into the side details to understand which CPU name tag is which CPU. 64bit 1 CPU is the Athlon; this can also be verified with the historical version of that list.) This is an important baseline, because it is cross-calibrated to dozens of other engines.

  4. With that established, we can calculate the ELO as a function of kNodes/​s. An average game has 40 moves. The 40 moves in 15 minutes leave 22.5 seconds per move (on average). That’s 17.5 MNodes per move to achieve 3302 ELO.

  5. We benchmark our own machine, on which the experiments are run. This can be done with the Stockfish parameter “bench”. For simplicity, suppose our machine performs at 10 x 777 kNodes/​s = 7.8 MNodes/​s. That’s the ballpark of recent (2020) 4-core CPUs.

  6. Now we want to perform a game at 17.5 MNodes per move, on a machine running at 7.8 MNodes/s. Clearly, each move can only take 2.24 seconds. The whole 40-move game then lasts about 90 seconds.
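The arithmetic in steps 4 to 6 can be written out as a minimal sketch. The function and variable names are hypothetical, and the "10 x" machine is the same illustrative assumption used in step 5, not a measured value.

```python
# Reference figures taken from the steps above.
REF_KNPS = 777.0             # kNodes/s of SF8 on the 2.4 GHz reference CPU
REF_TC_SECONDS = 15 * 60     # CCRL time control: 40 moves in 15 minutes
MOVES = 40                   # moves covered by the time control


def knodes_per_move(knps, tc_seconds, moves=MOVES):
    """Average search budget per move, in kNodes."""
    return knps * tc_seconds / moves


def equivalent_time_control(local_knps, target_knodes_per_move, moves=MOVES):
    """Seconds per move and per game that give the same node budget locally."""
    seconds_per_move = target_knodes_per_move / local_knps
    return seconds_per_move, seconds_per_move * moves


target = knodes_per_move(REF_KNPS, REF_TC_SECONDS)

# Assumption: our own machine is exactly 10x faster than the reference CPU.
local_knps = 10 * REF_KNPS
per_move, per_game = equivalent_time_control(local_knps, target)

print(f"budget per move: {target / 1000:.1f} MNodes")
print(f"local time control: {per_move:.2f} s/move, {per_game:.0f} s per {MOVES} moves")
```

With the unrounded 7,770 kNodes/s the per-move time comes out marginally above the 2.24 seconds quoted in step 6, which uses the rounded 7.8 MNodes/s; the 90-second game duration is the same either way.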

Execute the experiment

To build a ladder of SF towards slower machines, we let this version of SF8 play a set of games with a 90 s time control versus half that (45 s). The most well-established tool to compare chess engines is cutechess-cli. It is a command-line interface to play two engines (or two versions of the same engine) against each other. In the end, it nicely summarizes the results and includes a differential ELO estimate. A command may be:

cutechess-cli -fcp cmd=stockfish proto=uci tc=40/90 -scp cmd=stockfish proto=uci tc=40/45 -games 100
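For intuition about what a differential ELO estimate means in terms of match results, here is a minimal sketch using the standard logistic Elo model. That a given tool reports exactly this quantity is an assumption on my part, and the function names are hypothetical.

```python
import math


def expected_score(elo_diff):
    """Expected score fraction (wins + draws/2) for the stronger side."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def elo_diff_from_score(score):
    """Inverse of expected_score: ELO difference implied by a score fraction."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


# Example: how lopsided is a match between engines 14 and 70 ELO apart?
for diff in (14, 70):
    print(f"{diff:>3} ELO -> expected score {expected_score(diff):.3f}")
```

Under this model, a 14 ELO gap corresponds to an expected score of only about 52%, which is why short matches give noisy estimates for small differences.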

How badly does the version with less compute perform? In this experiment, after running 100 games, we get 14 ELO difference. That’s much less than the usual statement of 70 ELO. Why is that? We can see the same effect in similar experiments done by others (1, 2): The ELO gain diminishes (flattens) at high compute. On the other hand, when we reduce compute to very low levels, the curve steepens dramatically. The full list of ELO losses from my experiment, for each halving of compute, is:

ELO     ELO Delta   kNodes/move
3302                17476.4
3288    14          8738.2
3268    20          4369.1
3240    28          2184.5
3205    35          1092.3
3097    108         546.1
3030    67          273.1
2977    53          136.5
2802    175         68.3
2716    86          34.1
2439    277         17.1
2238    201         8.5
1903    335         4.3
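As a quick consistency check on the table above, the following sketch recomputes the ELO loss per halving from the ELO column and compares the top and bottom of the ladder. The data are copied from the table; the helper names are hypothetical.

```python
# (ELO, kNodes/move) rows copied from the table above, strongest first.
LADDER = [
    (3302, 17476.4), (3288, 8738.2), (3268, 4369.1), (3240, 2184.5),
    (3205, 1092.3), (3097, 546.1), (3030, 273.1), (2977, 136.5),
    (2802, 68.3), (2716, 34.1), (2439, 17.1), (2238, 8.5), (1903, 4.3),
]


def elo_deltas(ladder):
    """ELO lost at each step down the ladder (one halving of compute each)."""
    return [upper[0] - lower[0] for upper, lower in zip(ladder, ladder[1:])]


def mean(values):
    return sum(values) / len(values)


deltas = elo_deltas(LADDER)
total_loss = LADDER[0][0] - LADDER[-1][0]

print("ELO loss per halving:", deltas)
print(f"total loss over {len(deltas)} halvings: {total_loss}")
print(f"mean loss, first four halvings: {mean(deltas[:4]):.1f}")
print(f"mean loss, last four halvings:  {mean(deltas[-4:]):.1f}")
```

The mean loss is roughly 24 ELO per halving over the first four steps and roughly 225 ELO per halving over the last four, which is the steepening at low compute described in the text.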

There is some jitter, despite increasing the number of games to 1,000 in the second half. Despite the jitter, we can clearly see the nonlinear ELO curve with compute:

The last thing we need to do is match the kNodes/​move results to the old years. We may ask: In which year was the hardware available sufficient to play these kNodes/​move in a usual tournament? This leaves some room for discussion. For 1997, should we choose a dual Pentium Pro 200, or a single Pentium 200 MMX? I believe it is reasonable to compare good CPUs of the time, without going overboard. After all, we’re comparing chess on home computers. If we restrict it to <1000 USD CPUs for each year, we can find some SF8 benchmarking results across the web:

- AMD 5950X (2021): 71,485 kNodes/​s
- Pentium III 500 MHz (1999): 127 kNodes/​s
- Pentium 75 MHz (1995): 6.2 kNodes/​s
- 386DX-33 MHz (1989): 1 kNode/​s
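One way to turn such benchmark figures into rough ELO estimates is to convert kNodes/s into kNodes per move and interpolate in the ladder from the experiment. The sketch below makes several assumptions that are mine, not necessarily the method behind the figures in this post: 22.5 seconds per move (the 40 moves in 15 minutes baseline), linear interpolation of ELO in log2 of the node count, and clamping outside the measured range. All names are hypothetical.

```python
import math

# (kNodes/move, ELO) pairs from the SF8 ladder, sorted by node count.
LADDER = sorted([
    (17476.4, 3302), (8738.2, 3288), (4369.1, 3268), (2184.5, 3240),
    (1092.3, 3205), (546.1, 3097), (273.1, 3030), (136.5, 2977),
    (68.3, 2802), (34.1, 2716), (17.1, 2439), (8.5, 2238), (4.3, 1903),
])

SECONDS_PER_MOVE = 22.5  # assumption: 40 moves in 15 minutes


def estimate_elo(knps, seconds_per_move=SECONDS_PER_MOVE):
    """Rough SF8 ELO for a machine searching `knps` kNodes/s."""
    nodes = knps * seconds_per_move
    # Outside the measured range we clamp instead of extrapolating.
    if nodes <= LADDER[0][0]:
        return float(LADDER[0][1])
    if nodes >= LADDER[-1][0]:
        return float(LADDER[-1][1])
    for (n0, e0), (n1, e1) in zip(LADDER, LADDER[1:]):
        if n0 <= nodes <= n1:
            # Interpolate linearly in log2(nodes), i.e. per halving of compute.
            t = (math.log2(nodes) - math.log2(n0)) / (math.log2(n1) - math.log2(n0))
            return e0 + t * (e1 - e0)
    raise AssertionError("unreachable: nodes lies inside the ladder range")


BENCHMARKS = {
    "Pentium III 500 MHz (1999)": 127.0,
    "Pentium 75 MHz (1995)": 6.2,
    "386DX-33 MHz (1989)": 1.0,
}

for cpu, knps in BENCHMARKS.items():
    print(f"{cpu}: ~{estimate_elo(knps):.0f} ELO")
```

Clamping means that any machine faster than the reference CPU is reported at the 3302 baseline, so the sketch is only informative for hardware slower than the Athlon 64.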

There are many more such measurements found online, but for our purpose, this is sufficient. Caveats:

  • Going back very far in time becomes difficult, because SF8 needed to be recompiled for reduced instruction sets to make it work; and RAM was limited in the experiment.

  • It is more reasonable to match the speed to more recent years: About 200 kNodes/​s in 2000, and 100 MNodes/​s today. Everything before, and in between, has a factor of a few of error in its match of nodes to year.

  • On the other hand, seeing benchmarks of real PCs is useful, because it encompasses uncertainties such as RAM speed.

  • In reality, when considering hardware overhang for future AI, we must also ask: How well could SF8 be adapted to older hardware? Just running it unchanged will leave some performance (factor of a few?) on the table. That’s a question for software engineers and compiler optimizers.

We can now bring the approximate match of Nodes/​s with the years together with the other data, and present the result:

This looks quantitatively different to my first version, but is qualitatively similar.

  • Again, a hardware overhang of ~10 years at maximum is visible: SF8 reaches Kasparov level on hardware from 1997.

  • Contemporary PC engines only reached that level around the year 2006.

  • In my old version, this was more like a 15-year gap. Back then, I had matched the speed to MIPS values for CPUs I found online.

  • It is probably better to measure SF kNodes/s directly instead of using a CPU speed proxy (MIPS, FLOPs, SPEC). Thus, I believe that the new figure is closer to reality.

In the next post, I will consider the other questions asked by Paul Christiano: How much compute would an old engine need to match current engines? What is the influence of opening books, endgame tables, pondering?

Edit (15 July): Magnus Carlsen time series fixed