i) To pick a reference year, it seems reasonable to take the mid/late 1990s:
- Almost all chess engines before ~1996 lacked multi-core support, or used multiple cores very inefficiently (very lengthy discussion here).
- Chess protocols became available, so that the engine and the GUI could be separated. That makes it straightforward to automate games for benchmarking.
- Modern engines should work on machines of that age, considering RAM constraints.
- The most famous human-computer games took place in 1997: Kasparov-Deep Blue. That’s almost a quarter of a century ago (nice round number...). Also, at the time, commercial algorithms were considerably below human-level play.
ii) Sounds good
iii) The influence of endgame tablebases and opening books is typically small. It is reasonable to neglect it in our experiments.
iv) Yes, the 4-case-test is a good idea:
- 1997 PC with 1997 engine: ELO XXXX
- 1997 PC with 2021 engine: ELO XXXX
- 2021 PC with 1997 engine: ELO XXXX
- 2021 PC with 2021 engine: ELO XXXX
One main result of these experiments will be the split: Where does the ELO gain come from? Is it the compute, or the algo improvement? I expect the answer to be about 70% compute, 30% algo (give or take 10 percentage points) over the last 25 years. Even without serious experiments, one can have a look at the Stockfish evolution at constant compute. That’s a gain of +700 ELO points over ~8 years (on the high side, historically). For comparison, you gain ~70 ELO per doubling of compute. Over 8 years one has on average gained ~400x compute, yielding +375 ELO. That’s 700:375 ELO for compute:algo, or a rounded 70%-30% (SF has improved rather fast).
To baseline the old machine, we don’t need to boot up old hardware. There is plenty of trustworthy old benchmarking still available that has these numbers.
As the modern baseline, I would certainly recommend Stockfish:
- It has been the best (or amongst the very top) for the last decade or so
- It is open source and has a very large dev community. Steps in improvements can be explained.
- Open source means it can be compiled on any machine that has a C++ compiler
Other modern engines will perform similarly, because they use similar methods. After all, SF is open source.
As a bonus, one could benchmark a Neural Network-based engine like LC0. There will be issues when using it without a GPU, however.
As for the old engine, it is more difficult to choose. Most engines were commercial programs, not open source. There is an old version of Fritz 5 (from 1998) freely available that supports protocols. I got it installed on a modern Windows with some headache. Perhaps that could be used. Fritz was, at the time of the Kasparov-Deep Blue match, the strongest commercial engine.
I ran the experiment “Rebel 6 vs. Stockfish 13” on Amazon’s AWS EC2. I rented a Xeon Platinum 8124M which benched at 18x 1.5 MNodes/s. I launched 18 concurrent single-threaded game sets with 128 MB of RAM for each engine. Again, ponder was off, no books, no tablebases. Time settings were 40 moves in 60s + 0.6s per move, corresponding to 17.5 MNodes/move. For reference, SF13 benches at ELO 3630 at this setting (entry “64 bit”); Rebel 6.0 got 2415 on a Pentium 90 (SSDF Computer Rating List (01-DEC-1996).txt, 90 kN/move).
The result:
1911 games played
18 draws
No wins for Rebel
All draws when Rebel played white
ELO difference: 941 +- 63
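As a rough cross-check of the ELO difference, here is a minimal sketch that converts the match score into a rating gap with the standard logistic ELO formula. The function name is my own, and the rating tool used for the reported number may use a different estimator, so the two values need not agree exactly. With the game counts above, the simple formula gives about 930, which is inside the quoted error bar.

```python
import math


def elo_diff_from_score(score: float) -> float:
    """ELO difference implied by an average score per game (0 < score < 1).

    Uses the standard logistic model: score = 1 / (1 + 10 ** (-diff / 400)).
    """
    return -400.0 * math.log10(1.0 / score - 1.0)


# Match result as reported above.
games = 1911
draws = 18
rebel_wins = 0
sf_wins = games - draws - rebel_wins

# A draw counts as half a point for each side.
sf_score = (sf_wins + 0.5 * draws) / games
diff = elo_diff_from_score(sf_score)

print(f"Stockfish score: {sf_score:.4f}, implied ELO difference: {diff:.0f}")
```

Note that the formula breaks down at a score of exactly 0 or 1, so the 18 draws are what make an estimate possible at all.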
Interpretation:
Starting from 3630 for SF13, that corresponds to Rebel on a modern machine: 2689.
Up from 2415, that’s +274 ELO.
The ELO gap between Rebel on a 1994 Pentium 90 (2415) and SF13 on a 2020 PC (3630) is 1215 points. Of these, 274 points are closed with matching hardware.
That gives 23% for the compute, 77% for the algorithm.
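The arithmetic of this interpretation can be written down as a short sketch. The function and variable names are my own; the ratings are the ones quoted above.

```python
def split_gap(old_engine_old_hw: float,
              old_engine_new_hw: float,
              new_engine_new_hw: float) -> tuple[float, float]:
    """Split the total ELO gap into a compute share and an algorithm share.

    The path taken is: first move the old engine to new hardware (compute),
    then swap in the new engine on the same hardware (algorithm).
    """
    total = new_engine_new_hw - old_engine_old_hw
    compute_gain = old_engine_new_hw - old_engine_old_hw
    algo_gain = new_engine_new_hw - old_engine_new_hw
    return compute_gain / total, algo_gain / total


sf13_modern = 3630          # SF13 at the time control used here
rebel_pentium90 = 2415      # Rebel 6.0 on a Pentium 90 (SSDF list)
match_difference = 941      # measured SF13 vs. Rebel 6 on equal hardware

rebel_modern = sf13_modern - match_difference  # 2689
compute_share, algo_share = split_gap(rebel_pentium90, rebel_modern, sf13_modern)

print(f"Rebel on modern hardware: {rebel_modern}")
print(f"compute: {compute_share:.0%}, algorithm: {algo_share:.0%}")
```

This attributes the gap along one path only (hardware first, then engine). The remaining case of the 4-case test, the modern engine on old hardware, would give the split along the other path, and the two need not agree.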
Final questions:
Isn’t +274 ELO too little for 200x compute?
We found 50% algo/50% compute for SF3-SF13. Why is that?
Answer: ELO gain with compute is not a linear function, but one with diminishing returns. The percentage “due to algo” therefore increases with the length of the time frame, so a single fixed percentage is not a good answer.
But we can give the share due to compute as a function of the time gap:
Over 10 years, it’s ~50%
Over 25 years, it’s ~22%
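To put a number on the “too little for 200x compute” question, here is a small sketch that compares the measured gain for Rebel with the ~70 ELO per doubling rule of thumb mentioned earlier. The names are my own, and the compute factor is simply the ratio of the two nodes-per-move figures quoted above.

```python
import math


def elo_per_doubling(elo_gain: float, compute_factor: float) -> float:
    """Average ELO gained per doubling of compute over the given factor."""
    return elo_gain / math.log2(compute_factor)


nodes_per_move_modern = 17.5e6   # this experiment
nodes_per_move_old = 90e3        # Pentium 90, SSDF list
compute_factor = nodes_per_move_modern / nodes_per_move_old  # roughly 200x

doublings = math.log2(compute_factor)
observed_gain = 274              # Rebel: 2415 -> 2689
rule_of_thumb = 70               # ELO per doubling, as quoted earlier

observed_per_doubling = elo_per_doubling(observed_gain, compute_factor)
naive_gain = rule_of_thumb * doublings

print(f"compute factor: {compute_factor:.0f}x ({doublings:.1f} doublings)")
print(f"observed: {observed_per_doubling:.0f} ELO per doubling")
print(f"linear extrapolation at {rule_of_thumb} ELO per doubling: {naive_gain:.0f} ELO")
```

The measured value comes out at roughly half the rule of thumb, which is what diminishing returns would look like when an engine is pushed far beyond the compute range it was designed for.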
With data from other sources (SF8, Houdini 3), I made this figure to show the effect more clearly. The dashed black line is a double-log fit function: a base-10 log for the exponential increase of compute with time, and a natural log for the exponential search tree of chess. The parameter values are engine-dependent, but should be similar for engines of the same era (here: Houdini 3 and SF8). With more and more compute, the ELO gain approaches zero. In the future, we can expect engines whose curve is shifted to the right side of this plot.