Skepticism About DeepMind’s “Grandmaster-Level” Chess Without Search

Update: Authors commented below that their unpublished results show that actually the bot is as good as described; many thanks to them.

The paper

First, addressing misconceptions that could come from the title or from the paper’s framing as relevant to LLM scaling:

  1. The model didn’t learn from observing many move traces from high-level games. Instead, they trained a 270M-parameter model to map pairs to Stockfish 16′s predicted win probability after playing the move. This can be described as imitating the play of an oracle that reflects Stockfish’s ability at 50 milliseconds per move. Then they evaluated a system that made the model’s highest-value move for a given position.

  2. The system resulted in some quirks that required workarounds during play.

    1. The board states were encoded in FEN notation, which doesn’t provide information about which previous board states have occurred; this is relevant in a small number of situations because players can claim an immediate draw when a board state is repeated three times.

    2. The model is a classifier, so it doesn’t actually predict Stockfish win-probability but instead predicts the bin (out of 128) into which that probability falls. So in a very dominant position (e.g. king and rook versus a lone king) the model doesn’t distinguish between moves that lead to mate and can alternate between winning lines without ever achieving checkmate. Some of these draws-from-winning-positions were averted by letting Stockfish finish the game if the top five moves according to both the model and Stockfish had a win probability above 99%.

Some comments have been critical on these grounds, like Yoav Goldberg’s “Grand-master Level Chess without Search: Modeling Choices and their Implications” (h/​t Max Nadeau). But these don’t seem that severe to me. Representing the input as a sequence of board states rather than the final board state would have also been a weird choice, since in fact the specific moves that led to a given board state basically doesn’t affect the best move at all. It’s true that the model didn’t learn about draws by repetition, and maybe Yoav’s stronger claim that it would be very difficult to learn that rule observationally is also true—especially since such draws have to be claimed by a player and don’t happen automatically (even on Lichess). But it’s possible to play GM-level chess without knowing this rule, especially if your opponents aren’t exploiting your ignorance.

But I’m skeptical that the model actually plays GM-level chess in the normal sense. You need to reach a FIDE Elo of 2500 (roughly 2570 USCF) to become a grandmaster. I would guess Lichess ratings are 200-300 points higher than skill-matched FIDE ratings, but this becomes a bit unclear to me after a FIDE rating of ~2200 or so, where I could imagine a rating explosion because the chess server will mostly match you with players who are worse than you (at the extreme, Magnus Carlsen’s peak Lichess rating is 500 points higher than his peak FIDE rating).

Take the main table of the paper:

They don’t describe their evaluation methodology in detail but it looks like they just created a Lichess bot. Lichess bots won’t pair with human players that use the matchmaking system on the home page; you have to navigate to Community → Online Players → Online Bots and then challenge them to a game. So there are three results here:

  • Lichess Elo (2299): This is the column labeled “vs. Bots” but it actually played 4.5% of its games against humans. Most human players on Lichess don’t challenge bots, so most of the games were against other bots.

  • Tournament Elo: This was a tournament between a bunch of AlphaZero-based models, smaller versions of their model, and Stockfish. The Elo scoring was anchored so that the main 270M-parameter model had a score of 2299, which means that the best interpretation of the results is that the model performance is roughly 400 Elo points worse than Stockfish 16.

  • Lichess Elo vs. Humans (2895): This is based on ~150 games where their model only accepted challenges from human players. The class of players who play strong bots is probably very unlike the normal player distribution. In fact I don’t have a good model at all of why someone would take a lot of interest in challenging random bots.

Even a Lichess rating of 2900 might only barely correspond to grandmaster-level play. A Lichess rating of 2300 could correspond to 2000-level (FIDE) play.

They mention two other reasons for the discrepancy:

  1. Humans tend to resign when they’re crushed, while bots sometimes manage to draw or win if they luck into the indecisiveness problem mentioned in 3.b above.

  2. “Based on preliminary (but thorough) anecdotal analysis by a [FIDE ~2150 player], our models make the occasional tactical mistake which may be penalized qualitatively differently (and more severely) by other bots compared to humans.”

They apparently recruited a bunch of National Master-level players (FIDE ~2150) to make “qualitative assessments” of some of the model’s games; they gave comments like “it feels more enjoyable than playing a normal engine” and that it was “as if you are not just hopelessly crushed.” Hilariously, the authors didn’t benchmark the model against these players . . .

Random other notes:

  • Leela Chess Zero has been using a transformer-based architecture for some time, but it’s not mentioned at all in the paper

  • Their tournament used something called BayesElo while Lichess uses Glicko 2; this probably doesn’t matter