How much chess engine progress is about adapting to bigger computers?

(This question comes from a discussion with Carl Shulman.)

In this post I describe an experiment that I’d like to see run. I’m offering a $1,000–$10,000 prize for a convincing implementation of this experiment, along with a number of smaller prizes for relevant desk research or important corrections to this request.

Motivation

In order to understand the dynamics of the singularity, I’d like to understand how easy it is to improve algorithms and software.

We can learn something about this from looking at chess engines. It’s not the most relevant domain to future AI, but it’s one with an unusually long history and unusually clear (and consistent) performance metrics.

In order to quantify the quality of a chess engine, we can fix a level of play and ask “How much compute is needed for the engine to play at that level?”

One complication in evaluating the rate of progress is that it depends on what level of play we use for evaluation. In particular, newer algorithms are generally designed to play at a much higher level than older algorithms. So if we quantify the compute needed to reach modern levels of play, we will capture both absolute improvements and also “adaptation” to the new higher amounts of compute.

So we’d like to attribute progress in chess engines to three factors:

  1. Better software.

  2. Bigger computers.

  3. Software that is better-adapted to new, bigger computers.

Understanding the size of factor #1 is important for extrapolating progress given massive R&D investments in software. While it is easy to separate factors #1 and #2 from publicly available information, it is not easy to evaluate factor #3.

Experiment description

Pick two (or more) chess engines from very different times. Each should be roughly state of the art for its era, running on a “typical” machine from that era (i.e. the kind of machine at which R&D was mostly targeted).

We then carry out two matches:

  1. Run the old engine on its “native” hardware (the “old hardware”). Then evaluate: how little compute does the new engine need in order to beat the old engine?

  2. Run the new engine on its “native” hardware (the “new hardware”). Then evaluate: how much compute does the old engine need in order to beat the new engine?

With some effort, we can estimate a quantitative ratio of “ops needed” for each of these experiments. For example, we may find that the new engine is able to beat the old engine using only 1% of the “old hardware,” whereas the old engine would require 10,000x the “new hardware” in order to compete with the new engine.
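
To make the bookkeeping concrete, here is a minimal sketch of how I imagine the “ops needed” ratios being computed: convert each side’s machine speed times thinking time into total operations per game, then take ratios. All the hardware figures and time controls below are hypothetical placeholders (the outcomes are just the illustrative 1% and 10,000x numbers above); real values would come from benchmarking the machines actually used.

    # Minimal sketch of the "ops needed" bookkeeping. All figures below are
    # hypothetical placeholders, not measurements.

    def ops_per_game(ops_per_second: float, seconds_per_game: float) -> float:
        """Total operations an engine can spend on one game."""
        return ops_per_second * seconds_per_game

    OLD_HW = 1e8    # ops/sec of the 1995-era machine (placeholder)
    NEW_HW = 1e12   # ops/sec of the TCEC-era machine (placeholder)

    # Match 1: new engine, throttled, beats the old engine on its native budget.
    old_native = ops_per_game(OLD_HW, seconds_per_game=7200)
    new_throttled = ops_per_game(1e6, seconds_per_game=7200)  # e.g. new engine on a much slower machine, found by experiment
    print(f"Match 1: new engine needed {new_throttled / old_native:.1%} of the old budget")

    # Match 2: old engine, scaled up, beats the new engine on its native budget.
    new_native = ops_per_game(NEW_HW, seconds_per_game=7200)
    old_scaled = ops_per_game(NEW_HW, seconds_per_game=7200 * 10_000)  # scale-up found by experiment
    print(f"Match 2: old engine needed {old_scaled / new_native:,.0f}x the new budget")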

The first experiment tells us about the absolute improvements in chess engines on the task for which the old engine was optimized. (This understates the rate of software progress to the extent that people stopped working on this task.) The second experiment gives us the combination of absolute improvements + adaptation to new hardware. Typical measures of “rate of software progress” will be somewhere in between, and are sensitive to the hardware on which the evaluation is carried out.

I believe that understanding these two numbers would give us a significantly clearer picture of what’s really going on with software progress in chess engines.

Experiment details

Here are some guesses about how to run this experiment well. I don’t know much about computer chess, so you may be able to make a better proposal.

  • Old engine, old hardware: my default proposal is the version of Fritz that won the 1995 world computer chess championship, using the same amount of hardware (and time controls) as in that championship. This algorithm seems like a particularly reasonable “best effort” at making full use of available computing resources. I don’t want to compare an engine running on a very expensive old machine to an engine running on a cheap modern machine. You may have to be opportunistic about what kind of thing you can actually run.

  • New engine, new hardware: my default proposal is the version of Stockfish that won TCEC Season 20, on the same hardware+time used in that competition.

  • Running a new engine on old hardware: We should use whatever modern engine works best on teensy computers. It’s not important that it be the same as the modern engine used on modern hardware. We’d prefer if there were a dedicated team still working on this problem, but absent that we want to use the best thing that exists.

  • Memory use. When running on the old hardware we need to match the memory of the old machine. I’m not sure how to handle memory scaling in general. One possibility is to hold the memory-to-compute ratio fixed and scale them both up/down together, but something more realistic would be welcome.

  • Matchup vs Elo: I’ve proposed experiments organized around head-to-head matches. There are lots of ways that could go wrong, so it would be good to at least sanity-check the results by comparing both engines against a third reference engine (see the sketch after this list). Quantifying how much this matters would be welcome.

  • More engines, more hardware: you could do the same experiment with additional engines, and could evaluate at more levels of compute. I think you get most of the benefit from the first few measurements, but more data helps, and the marginal measurements might be a lot cheaper (especially if you need extra engines to get good Elo estimates anyway).

  • Even older engines. I’m interested in results even older than Fritz, but the further back we go the more uncertainty I have about how to actually do the comparison.

  • Endgame tables, opening book, learned heuristics: some of the knowledge used by chess engines was produced using large computers (by playing or observing very large numbers of games, brute-forcing endgames, and so on). We’d prefer to exclude these factors if they were produced using more compute than would have been affordable for the old engine. If we can’t, we at least want to quantify their influence. Including these factors could significantly overstate the returns to human R&D; it’s an important thing to know about, but for the purpose of forecasting the impacts of AI we really want to separate it out. This may be a major constraint on which new engines can be considered.

  • What counts as an “operation”? I don’t think that making “hardware” comparisons between new and old computers will be straightforward. I think it’s easier if we restrict attention to consumer microprocessors, and I’m hoping that there’s only a little bit of uncertainty (e.g. a factor of 2). I think you can do the experiments before figuring this out, and then try to clarify the relevant issues by seeing how fast the computers can run simple, relevant benchmarks. (The “old hardware” run should probably be on just one processor.)

  • Timing out. It may be prohibitively expensive to give the old engine enough compute to beat the new engine. I think you should do a little bit of that work, to try to figure out the basic picture (how expensive it would be, rough bounds on the numbers, some very noisy estimates from playing just a few games). We can figure out where to go from there.

  • Different time controls. I’m expecting time controls to be comparable between the old and new competitions. If old competitions ran at much longer time controls, I’d prefer to scale them down to something comparable (to better match what would have been realistic to experiment with during R&D, or profitable for non-competition purposes). And similarly if new competitions are longer.

  • Ponder. Thinking during the opponent’s turn could change the answer a tiny amount (up to a factor of 2), but it really doesn’t seem worth dealing with, even if some of the engines are optimized to use it.

  • Inefficiencies from running on new computers. You might have to mess things up a lot to run old engines on new computers (or to deal with weird amounts of memory or so on). Ideally it would be possible to abstract out some of the details of “how long it actually takes” to talk about what the computation or memory costs ought to be if implemented without overhead. We’ll have to see how that goes.
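
As a sketch of the Elo sanity check mentioned in the “Matchup vs Elo” bullet: the standard logistic model converts a match score into an Elo difference, so the head-to-head estimate can be compared against one routed through a third reference engine. All of the scores below are made up.

    # Convert match scores into Elo differences (standard logistic model) and
    # compare the direct head-to-head estimate with one routed through a third
    # reference engine. The scores here are made-up examples.
    import math

    def elo_diff(score: float) -> float:
        """Elo difference implied by a match score in (0, 1),
        where score = (wins + 0.5 * draws) / games."""
        return 400 * math.log10(score / (1 - score))

    # Direct match: new engine (throttled) vs. old engine -- hypothetical result.
    direct = elo_diff(55.5 / 100)        # roughly +38 Elo

    # Indirect estimate via a reference engine that plays both sides.
    new_vs_ref = elo_diff(0.70)          # hypothetical
    old_vs_ref = elo_diff(0.65)          # hypothetical
    indirect = new_vs_ref - old_vs_ref   # roughly +40 Elo

    print(f"direct: {direct:+.0f} Elo, via reference engine: {indirect:+.0f} Elo")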

Prize structure

I’m not spending long evaluating any of this and my decisions are going to be completely unaccountable. Dealing with me may be a waste of time. Your only recourse will be to make an angry comment on this post.

I know that all makes it less appealing to participate. But please have it in mind before spending any time on this project, and consider yourself warned!

I’m planning to give away at least $1,000 sometime in the next 3 months if anyone runs any plausible version of this experiment, or even points me to public information from which it’s possible to back out a similar number. I’m fine if that means I have to give $1,000 to a random LW comment.

I’d give away $10,000 if someone ran a clean version of the experiment, I found it convincing, and they were transparent about their methods. Before giving away the full prize I’d likely have a waiting period where I offered a separate prize for people to post replications or caveats or so on.

I’m generally expecting/hoping to wrap this up over the next couple of months, but will adjust based on responses.

If multiple people do experiments I’ll make some arbitrary call about how to allocate prize money. Timing will matter a bit, but it’s secondary to the quality of the experiment. Earlier submissions can still get paid if they helped pave the way for later submissions.

If you are planning to spend time on this I encourage you to make a comment. I’m very happy to provide thoughts on a proposed experiment, e.g. to tell you what size of prize I’d expect to give it or what concerns I’d have about a proposal. None of this is binding.

If receiving a prize would be weird or problematic for some reason, I’m still interested in the results and you can opt to receive a comparable quantity of gratitude instead of $.

Desk research prizes

I’d also pay out $100–$1,000 at my discretion for any of the following:

  • Plausible (or better yet convincing) estimates of the total investment in chess engine R&D over time.

  • Good analysis of the relative importance of hardware/software using public information (must at least improve over the analysis here). Or pointers to other similar experiments that have already been run.

  • Any consideration that I feel should significantly change the experimental setup, e.g. quantifying the importance of endgame tables, or noting a critical difference between old and new chess engines, or suggesting a reason that Fritz is a bad engine to compare to.

  • Any contributions that make it significantly easier to run this experiment (for example tracking down a usable implementation of an old chess engine).