What’s the Elo rating of optimal chess?
I present four methods to estimate the Elo rating of optimal play: (1) comparing optimal play to random play, (2) comparing optimal play to sensible play, (3) extrapolating Elo rating vs draw rates, (4) extrapolating Elo rating vs depth search.
1. Optimal vs Random
Random plays completely random legal moves. Optimal plays perfectly. Let ΔR denote the Elo gap between Random and Optimal. Random’s expected score is E_Random = P(Random wins) + 0.5 × P(Random draws). This is related to the Elo gap via the formula E_Random = 1/(1 + 10^(ΔR/400)).
First, suppose that chess is a theoretical draw, i.e. neither player can force a win when their opponent plays optimally.
From Shannon’s analysis of chess, there are ~35 legal moves per position and ~40 moves per game.
At each position, assume only 1 move among 35 legal moves maintains the draw. This gives a lower bound on Random’s expected score (and thus an upper bound on the Elo gap).
Hence, P(Random accidentally plays an optimal drawing line) ≥ (1/35)^40
Therefore E_Random ≥ 0.5 × (1/35)^40.
If instead chess is a forced win for White or Black, the same calculation applies: Random scores (1/35)^40 when playing the winning side and 0 when playing the losing side, giving E_Random ≥ 0.5 × (1/35)^40.
Rearranging the Elo formula: ΔR = 400 × log₁₀((1/E_Random) − 1)
Since E_Random ≥ 0.5 × (1/35)^40 ≈ 9 × 10^(-63):
The Elo gap between random play and perfect play is at most 24,825 points.
Random has an Elo rating of 477 points[1]. Therefore, the Elo rating of Optimal is no more than 25,302 points.
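A quick sanity check of this arithmetic, as a minimal sketch (the 477 anchor is Random’s rating from footnote [1]):

```python
import math

def elo_gap_upper_bound(branching: int, moves: int) -> float:
    """Upper bound on the Elo gap, assuming the weaker player must find
    one specific move out of `branching` on each of `moves` turns to draw."""
    # Lower bound on the weaker player's expected score: half a point
    # times the probability of stumbling onto the drawing line.
    e_score = 0.5 * branching ** -moves
    # Rearranged Elo formula: E = 1 / (1 + 10^(dR/400)).
    return 400 * math.log10(1 / e_score - 1)

gap = elo_gap_upper_bound(35, 40)     # ~24,825
print(f"Optimal <= {477 + gap:.0f}")  # ~25,302, anchoring Random at 477
```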
2. Optimal vs Sensible
We can improve the upper bound by comparing Optimal to Sensible, a player who avoids ridiculous moves such as sacrificing a queen without compensation. Assume that there are three sensible moves in each position, and that Sensible plays uniformly at random among them. Optimal still plays perfectly.
Following the same analysis, E_Sensible ≥ 0.5 × (1/3)^40 ≈ 4 × 10^(-20).
Rearranging the Elo formula: ΔR = 400 × log₁₀((1/E_Sensible) − 1)
The Elo gap between random sensible play and perfect play is at most 7,754 points.
Magnus Carlsen has a peak rating of 2882, the highest in history. It’s almost certain that Carlsen has a higher Elo rating than Sensible. Therefore, the Elo rating of Optimal is no more than 10,636 points.
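The same rearrangement with Sensible’s numbers:

```python
import math

e_sensible = 0.5 * 3 ** -40                  # ~4.1e-20
gap = 400 * math.log10(1 / e_sensible - 1)   # ~7,754
print(f"Optimal <= {2882 + gap:.0f}")        # ~10,636, anchoring at Carlsen's peak
```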
3. Extrapolating Elo Rating vs Draw Rates
Jeremy Rutman fits a linear trend to empirical draw rates (from humans and engines) and finds the line reaches 100% draws at approximately 5,237 Elo[2]. Hence, two chess players rated below 5,237 would occasionally win games against each other. If chess is a theoretical draw, then these players cannot be playing optimally. This analysis suggests that the Elo rating of Optimal exceeds 5,237 points, although it relies on a linear trend extrapolation.
4. Extrapolating Elo Rating vs Depth Search
Ferreira (2013) ran Houdini 1.5a playing 24,000 games against itself at different search depths (6-20 plies). The paper calculated Elo ratings by:
Measuring win rates between all depth pairs
Anchoring to absolute Elo by analyzing how depth 20 performed against grandmasters (estimated at 2894 Elo)
Extrapolating using 66.3 Elo/ply (the fitted value when playing against depth 20)
Since most games end by depth 80, we can estimate the Elo rating of optimal chess by extrapolating this trend to 80 plies: 2894 + (80 − 20) × 66.3 ≈ 6,872. This analysis suggests that the Elo rating of Optimal is 6,872 points.
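The extrapolation itself is one line of arithmetic, using the paper’s fitted values:

```python
# Ferreira's fit: 2894 Elo at depth 20, plus 66.3 Elo per additional ply.
print(2894 + (80 - 20) * 66.3)  # 6872.0
```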
From Tom 7’s “30 Weird Chess Algorithms” tournament, where Random (playing completely random legal moves) achieved an Elo rating of 477 against a field of 50+ weak and strong chess algorithms. h/t to this LW comment.
Source: God’s chess rating, December 20, 2018
If you’re interested in the opinion of someone who authored (and continues to work on) the #12 chess engine, I would note that there are at least two possibilities for what constitutes “optimal chess”—first would be “minimax-optimal chess”, wherein the player never chooses a move that worsens the theoretical outcome of the position (i.e. losing a win for a draw or a draw for a loss), choosing arbitrarily among the remaining moves available, and second would be “expected-value optimal” chess, wherein the player always chooses the move that maximises their expected value (that is, p(win) + 0.5 * p(draw)), taking into account the opponent’s behaviour. These two decision procedures are likely thousands of Elo apart when compared against e.g. Stockfish.
The first agent (Minimax-Optimal) will choose arbitrarily between the opening moves that aren’t f2f3 or g2g4, as they are all drawn. This style of decision-making will make it very easy for Stockfish to hold Minimax-Optimal to a draw.
The second agent (E[V]-Given-Opponent-Optimal) would, contrastingly, be willing to make a theoretical blunder against Stockfish if it knew that Stockfish would fail to punish such a move, and would choose the line of play most difficult for Stockfish to cope with. As such, I’d expect this EVGOO agent to beat Stockfish from the starting position, by choosing a very “lively” line of play.
I think we’re probably brushing against the modelling assumptions required for the Elo formula. In particular, the following two are inconsistent with the Elo assumptions:
EVGO-optimal has a better chance of beating Stockfish than minmax-optimal
EVGO-optimal has a negative expected score against minmax-optimal
Yep. The Elo system is not designed to handle non-transitive rock-paper-scissors-style cycles.
This already exists to an extent with the advent of odds-chess bots like LeelaQueenOdds. This bot plays without her queen against humans, but still wins most of the time, even against strong humans who can easily beat Stockfish given the same queen odds. Stockfish will reliably outperform Leela under standard conditions.
In rough terms:
Stockfish > LQO >> LQO (-queen) > strong humans > Stockfish (-queen)
Stockfish plays roughly like a minimax optimizer, whereas LQO is specifically trained to exploit humans.
Edit: For those interested, there’s some good discussion of LQO in the comments of this post:
https://www.lesswrong.com/posts/odtMt7zbMuuyavaZB/when-do-brains-beat-brawn-in-chess-an-experiment
Interesting.
Consider a game like chess except, with probability epsilon, the player’s move is randomized uniformly from all legal moves. Let epsilon-optimal be the optimal strategy (defined via minmax) in epsilon-chess. We can consider this a strategy of ordinary chess also.
My guess is that epsilon-optimal would score better than minimax-optimal against Stockfish. Of course, EVGO-optimal would score even better against Stockfish, but that feels like cheating.
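To make the construction concrete, here is a minimal expectimax sketch of the epsilon-chess value. `terminal_value`, `legal_moves`, and `apply` are hypothetical helpers over a finite, loop-free game tree (real chess would also need repetition and 50-move handling), not any real library:

```python
EPS = 1e-5  # probability that the mover's choice is replaced by a random legal move

def eps_value(state, to_move):
    """Expected value of `state` for the side to move in epsilon-chess."""
    v = terminal_value(state, to_move)  # +1 win / 0 draw / -1 loss, else None
    if v is not None:
        return v
    # Negamax: a successor's value is the negation of its value for the opponent.
    vals = [-eps_value(apply(state, m), -to_move) for m in legal_moves(state)]
    # With prob 1-EPS the mover's intended (best) move is played;
    # with prob EPS a uniformly random legal move is played instead.
    return (1 - EPS) * max(vals) + EPS * sum(vals) / len(vals)

def eps_optimal_move(state, to_move):
    """Epsilon-optimal play: argmax of the epsilon-value over successors."""
    return max(legal_moves(state),
               key=lambda m: -eps_value(apply(state, m), -to_move))
```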
I am inclined to agree. The juice to squeeze generally arises from guiding the game into locations where there is more opportunity for your opponent to blunder. I’d expect that opponent-epsilon-optimal (i.e. your opponent can be forced to move randomly, but you cannot) would outperform both epsilon-optimal and minimax-optimal play against Stockfish.
Your description of EVGOO is incorrect; you describe a Causal Decision Theory algorithm, but (assuming the opponent also knows your strategy ‘cause otherwise you’re cheating) what you want is LDT.
(Assuming they only see each others’ policy for that game, so an agent acting as eg CDT is indistinguishable from real CDT, then LDT is optimal even against such fantastic pathological opponents as “Minimax if my opponent looks like it’s following the algorithm that you the reader are hoping is optimal, otherwise resign” (or, if they can see each others’ policy for the whole universe of agents you’re testing, then LDT at least gets the maximum aggregate score).)
I’ll note that CDT and FDT prescribe identical actions against Stockfish, which is the frame of mind I had when writing.
More to your point—I’m not sure that I am describing CDT:
“always choose the move that maximises your expected value (that is, p(win) + 0.5 * p(draw)), taking into account your opponent’s behaviour” sounds like a decision rule that necessitates a logical decision theory, rather than excluding it?
Your point about pathological robustness is valid but I’m not sure how much this matters in the setting of chess.
Lastly, if we’re using the formalisms of CDT or FDT or whatever, I think this question ceases to be particularly interesting, as these are logically omniscient formalisms—so I presume you have some point that I’m missing about logically relaxed variants thereof.
I agree none of this is relevant to anything, I was just looking for intrinsically interesting thoughts about optimal chess.
I thought at least CDT could be approximated pretty well with a bounded variant; causal reasoning is a normal thing to do. FDT is harder, but some humans seem to find it a useful perspective, so presumably you can have algorithms meaningfully closer or further, and that is a useful proxy for something.
Actually never mind, I have no experience with the formalisms.
I guess “choose the move that maximises your expected value” is technically compatible with FDT, you’re right.
It seems like the obvious way to describe what CDT does, and a really unnatural way to describe what FDT does, so I got confused.
Do games between top engines typically end within 40 moves? An optimal player’s occasional win against an almost-optimal player might come from deliberately extending and complicating the game to create chances.
Great comment.
According to Braun (2015), computer-vs-computer games from Schach.de (2000-2007, ~4 million games) averaged 64 moves (128 plies), compared to 38 moves for human games. The longer length is because computers don’t make the tactical blunders that abruptly end human games.
Here are the three methods updated for 64-move games:
1. Random vs Optimal (64 moves):
P(Random plays optimally) = (1/35)^64 ≈ 10^(-99)
E_Random ≈ 0.5 × 10^(-99)
ΔR ≈ 39,649
Elo Optimal ≤ 40,126 Elo
2. Sensible vs Optimal (64 moves):
P(Sensible plays optimally) = (1/3)^64 ≈ 10^(-30.5)
E_Sensible ≈ 0.5 × 10^(-30.5)
ΔR ≈ 12,335
Elo Optimal ≤ 15,217 Elo
3. Depth extrapolation (128 plies):
Linear: 2894 + (128-20) × 66.3 ≈ 10,054 Elo
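For reference, the arithmetic behind all three updates:

```python
import math

def gap(e_score):  # rearranged from E = 1 / (1 + 10^(dR/400))
    return 400 * math.log10(1 / e_score - 1)

print(477 + gap(0.5 * 35 ** -64))   # method 1: ~40,126
print(2882 + gap(0.5 * 3 ** -64))   # method 2: ~15,217
print(2894 + (128 - 20) * 66.3)     # method 3: ~10,054
```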
This is a bit annoying because my intuitions are that optimal Elo is ~6500.
This thread made me very curious as to what the Elo rating of an optimal player would be when it knows the source code of its opponent.
For flawed deterministic programs, an optimal player can steer the game to points where the program makes a fatal mistake. For probabilistic programs, an optimal player intentionally lengthens the game to induce a mistake. For this thought experiment: if an optimal player is playing a random player, then it can force the game to last hundreds of moves consistently.
Makes me curious to see a game between humans where non-sensible moves are defined in some objective way and forbidden by guardrail AI. Like, not even considered a legal move by the computer UI.
Would this extend the games of humans to around 64 moves on average? What would the experience of playing such a game be for low-Elo humans? Confusion about why certain moves were forbidden, probably.
I agree this variation would lengthen the game.
The experience would change for sure for all human players.
An objectively losing human player may intentionally play objectively bad moves that lengthen a game and complicate it. It’s a learned skill that some players have honed better than others.
In this variation that skill is neutralized, so I imagine Elos would differ enough to produce different player rankings.
Another way: extrapolate depth search across different board scoring methods. At infinite depth, all non-stupid board scorers will achieve perfect play, and therefore equal play. Estimating convergence rates might be difficult though.
I do not believe random’s Elo is as high as 477. That Elo was calculated from a population of chess engines where about a third of them were worse than random.
I have to back you on this… There are Elo systems which go down to 100 Elo and still have a significant number of players at the floor. Having seen a few of these games, those players are truly terrible but will still occasionally do something good, because they are actually trying to win. I expect Random to be somewhere around −300 or so when not tested in strange circumstances which break the modelling assumptions (the source described had multiple deterministic engines playing in the same tournament, aside from the concerns you mentioned in the other thread).
That shouldn’t affect the Elo algorithm.
Aren’t Elo scores conserved? That is, will the sum of the Elo scores for a fixed population be unchanged?
The video puts Stockfish’s Elo at 2708.4, worse than some human grandmasters, which also suggests to me that he didn’t run the Elo algorithm to convergence, and that Stockfish should be stealing more score from other, weaker players.
EDIT: ChatGPT 5 thinks the Elos you suggested for Random are reasonable for other reasons. I’m still skeptical but want to point that out.
Good point, I should look into this more.
NB: If you think he underestimates Stockfish’s Elo, then you should think he underestimates Random’s Elo, because the algorithm finds Elo gaps, not absolute Elo.
Not if the Elo algorithm isn’t run to completion. It takes a long time to create large gaps in Elo, like the one between Stockfish and Random, if you don’t have a lot of intermediate players. It’s hard for Elo to differentiate between +1000 and +2000: both mean “wins virtually all the time”.
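The saturation is easy to see numerically:

```python
def expected_score(gap):  # expected score of the higher-rated player
    return 1 / (1 + 10 ** (-gap / 400))

print(expected_score(1000))  # ~0.9968
print(expected_score(2000))  # ~0.99999
```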
A problem with this entire line of reasoning, which I have given some thought to, is: how do you even define optimal play?
My first thought was a 32-piece tablebase[1] but I don’t think this works. If we hand an objectively won position to the tablebase, it will play in a way that delivers mate in the fewest number of moves (assuming perfect play from the opponent). If you hand it a lost position it will play in a way that averts being mated for longest. But we have a problem when we hand it a drawn position. Assume for a second that the starting position is drawn[2] and our tablebase is White. So, the problem is that I don’t see a way to give our tablebase a sensible algorithm for choosing between moves (all of which lead to a draw if the tablebase is playing against itself).[3] If our tablebase chooses at random between them, then, in the starting position, playing a3/h3 is just as likely as playing e4/d4. This fundamental problem generalizes to every resulting position; the tablebase can’t distinguish between getting a position that a grandmaster would judge as ‘notably better with good winning chances’ and a position which would be judged as ‘horrible and very hard to hold in practice’ (so long as both of those positions would end in a draw with two 32-piece tablebases playing against each other).
From this it seems rather obvious that if our tablebase picks at random among drawing moves, it would be unable to win[4] against, say, Stockfish 17 at depth 20 from the starting position (with both colors).
The second idea is to give infinite computing power and memory to Stockfish 17, but this runs into the same problem as the tablebase: Stockfish would calculate to the end, leaving us with Stockfish as a minimax algorithm, the same as the tablebase’s.
All of which is to say that either ‘optimal play’ wouldn’t achieve impressive practical results or we redefine ‘optimal play’ as ‘optimal play against [something]’.
This is impossible, of course, but I’m looking for a definition, not an implementation.
This is almost definitely true.
To be more precise, I don’t see such a way that could be called ‘optimal’. If we are satisfied with the algorithm being defined as optimal against [humans in general]/grandmasters/[chess engines in general]/[Stockfish 17], then there are plenty of ways to implement this.
There are bound to be some Everett branches where our tablebase wins, but they would be an astronomically small fraction of all results.
Suppose the tablebase selected randomly from drawing moves when presented with a drawing position, and the initial position is a drawing position. Then the tablebase either wins or draws. You can see this by thinking about the definitions.
It’s relatively easy to define optimal chess by induction, by the min-max algorithm.
You’re correct that for a suboptimal policy P, the policy Q which scores best against P might not be optimal play.
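A minimal sketch of that inductive definition, with uniform tie-breaking among value-preserving moves; `terminal_value`, `legal_moves`, and `apply` are hypothetical helpers over a finite game tree:

```python
import random

def minimax_value(state, to_move):
    """Theoretical value for the side to move (+1 win / 0 draw / -1 loss)."""
    v = terminal_value(state, to_move)
    if v is not None:
        return v
    return max(-minimax_value(apply(state, m), -to_move)
               for m in legal_moves(state))

def optimal_move(state, to_move):
    """Pick uniformly among the moves that preserve the best attainable value."""
    scored = [(m, -minimax_value(apply(state, m), -to_move))
              for m in legal_moves(state)]
    best = max(v for _, v in scored)
    return random.choice([m for m, v in scored if v == best])
```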
Of course. At no point did I suggest that it could lose. The ‘horrible and very hard to hold in practice’ was referring to the judgement of a hypothetical grandmaster, though I’m not sure if you were referring to that part.
“It’s relatively easy to define optimal chess by induction, by the min-max algorithm.”
Once again, I agree. I failed to mention what I see as an obvious implication of my line of reasoning: namely, that optimal play (with random picking among drawing moves) would have a pretty unimpressive Elo[1] (way lower than your estimates/upper bounds), one bounded by the Elo of its opponents.
So:
If we pit it against different engines in a tournament, I would expect the draw rate to be ~100% and the resulting Elo to be (in expectation) ever so slightly higher than the average rating of the engines it’s playing against.
If we pit it against grandmasters I think similar reasoning applies (I’d expect the draw rate to be ~97-99%).
You can extend this further to club players, casual players, and patzers, and I would expect the draw rate to drop off, yes, but still remain high. This suggests that optimal play (with random picking among drawing moves) would underperform Stockfish 17 by miles, since Stockfish could probably achieve a win rate of >99% against basically any group of human opponents.
There are plenty of algorithms which are provably optimal (minimax-wise), some of which would play very unimpressively in practice (like our random-drawing-move 32-piece tablebase) and some of which could get a very high Elo estimation in ~all contexts. For example:
If the position is won, use the 32-piece tablebase
Same if the position is lost
If the position is drawn, use Stockfish 17 at depth 25 to pick from the set of drawing moves.
This is optimal too, and would perform way better, but that definition is quite inelegant. And the thing that I was trying to get at by asking about the specific definition is that there is an astronomically large number of optimal-play algorithms, some of which could get a very low Elo in some contexts and some of which could get a very high Elo irrespective of context. So when you write ‘What’s the Elo rating of optimal chess?’, it seems reasonable to ask ‘Which optimal algorithm exactly?’.
And very unimpressive level of play in drawn positions.
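A sketch of that composite policy, with the caveat that every helper here (`tablebase_value`, `tablebase_dtm_move`, `legal_moves`, `value_after`, `stockfish_pick`) is a hypothetical stand-in, since nothing like a 32-piece tablebase exists:

```python
def composite_optimal_move(position):
    """Minimax-optimal by construction, engine-guided in drawn positions."""
    value = tablebase_value(position)  # hypothetical: +1 / 0 / -1 for side to move
    if value != 0:
        # Won: play the fastest mate. Lost: resist as long as possible.
        return tablebase_dtm_move(position)
    # Drawn: every value-preserving move still draws, so let a strong engine
    # pick whichever drawing move is hardest for an imperfect opponent.
    drawing = [m for m in legal_moves(position)
               if value_after(position, m) == 0]
    return stockfish_pick(position, restrict_to=drawing, depth=25)
```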
I really like this question and this analysis! I think an extension I’d do here is to restrict the “3 reasonable moves” picture by looking at proposed moves of different agents in various games. My guess is that the “effective information content” in a move at high-level play is in fact less than 1 bit per move on average. If you had a big GPU to throw at this problem, you could try to explicitly train an engine via an RL policy with a strong entropy objective and see what maximal entropy is compatible with play at different ratings.
Yep, I thought of a similar method: (1) find a trend between Elo and the entropy of moves during the middle-game; (2) estimate the middle-game entropy of optimal chess. But the obstacle is (2): there are probably high-entropy optimal strategies!
Here’s an attack I’m thinking about:
Consider epsilon-chess, which is like chess except that with probability epsilon the pieces move randomly, say epsilon = 10^-5. In this environment, the optimal strategies probably have very low entropy, because the quality function has a continuous range, so argmax won’t be faced with any ties. This makes the question better defined: there’s likely to be a single optimal policy, which is also deterministic.
This is inspired by @Dalcy’s PIBBSS project (unpublished, but I’ll send you link in DM).
Very cool, thanks! I agree that Dalcy’s epsilon-game picture makes arguments about Elo vs. optimality more principled.
I do think there is some fun interesting detail in defining “optimal” here. Consider the following three players:
A—Among all moves whose minimax value is maximal, chooses one uniformly at random (i.e. if there is at least one winning move, they choose one uniformly, else if there is at least one drawing move, they choose one uniformly, else they choose among losing moves uniformly).
B—Among all moves whose minimax value is maximal, chooses one uniformly at random, but in cases of winning/losing, restricting to only moves that win as fast as possible or lose as slowly as possible (i.e. if there is at least one winning move, they choose one uniformly among those with the shortest distance to mate, else if there is at least one drawing move, they choose one uniformly, else they choose among losing moves uniformly with the longest distance to mate).
C—Among all moves whose minimax value is maximal, chooses the one that the current latest Stockfish version as of today would choose if its search were restricted to only such moves given <insert some reasonable amount> of compute time on <insert some reasonable hardware>.
For C you can also define other variations using Leela Chess Zero, or even LeelaKnightOdds, etc, or other methods entirely of discriminating game-theoretically-equal-value moves based on density of losing/winning lines in the subtree, etc.
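To make the A/B contrast concrete, here is a sketch of B’s selection rule; `legal_moves`, `value_after`, and `dtm_after` are hypothetical tablebase-style queries:

```python
import random

def player_b_move(position):
    """Uniform among minimax-best moves; in decided positions, restrict the
    pool to fastest mate (if winning) or slowest loss (if losing)."""
    moves = legal_moves(position)
    vals = {m: value_after(position, m) for m in moves}  # +1 / 0 / -1
    best = max(vals.values())
    pool = [m for m in moves if vals[m] == best]
    if best != 0:
        dtm = {m: dtm_after(position, m) for m in pool}  # distance to mate
        target = min(dtm.values()) if best > 0 else max(dtm.values())
        pool = [m for m in pool if dtm[m] == target]
    return random.choice(pool)  # player A is the same rule minus the `if` block
```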
When people refer to “optimal” without further qualifiers in chess, often they mean something like A or B. But I would note that C is also an “optimal” player in the same sense of never playing a move leading to a worse game-theoretic value. However, C may well have a higher Elo than A or B when measured against a population of practical or “natural” players or other bots.
In particular, supposing chess is in fact a game theoretic draw from the starting position, I think there’s a decent chance we would find that A and B would typically give up small advantages for “no good reason” in the opening, quickly incurring a slight positional or material disadvantage, until the fact that they never actually play any losing move becomes constraining and prevents them from ever becoming worse enough to actually lose. This is because in many branches of the game tree, there are probably many moves that draw, which will include moves that any human and/or bot today might analyze as “bad”, just not bad enough to actually lose. And indeed, the closer one’s position is to being winning without actually being winning yet, the worse a move can be without making the position losing, increasing the number of possible “bad” moves that can be chosen. When faced with sufficiently strong but not quite perfect players (e.g. today’s strong bots), this might lead A and B to relatively consistently play into disadvantageous positions, harming their ability to win by making it much easier for their imperfect opponents to maintain a draw.
By contrast, variations of C might better maintain incremental advantages and pressure on imperfect opponents, leading imperfect opponents to more often make a mistake and give away a win. The issue, of course, is that unlike A and B, there isn’t so clean or canonical (“Schelling point”) a choice of C, as you have to pick what version, what amount of compute, etc. And different choices of C may have somewhat different characteristics against different distributions of possible opponents. This indicates that the concept of “Elo of optimal play”, without further qualification about what flavor of optimal and against what opponents, might be a little fuzzy and imperfect as a map of the territory when you zoom in close enough, although plausibly it might not affect the big picture much (my suspicion is that the choice of these details is not entirely trivial even then).
your entire analysis is broken in that you assume that an elo rating is something objective like an atomic weight or the speed of light. in reality, an elo rating is an estimation of playing strength among a particular pool of players.
the problem that elo was trying to solve was, if you have players A and B, who have both played among players C through Q, but A and B have never played each other, can you concretely say whether A is stronger than B? the genius of the system is that you can, and in fact, the comparison of 2 scores gives you a probability of whether A will beat B in a game (if i recall correctly, a difference of +200 points implies an expected score of +0.75, where 1.0 is winning, 0 is losing, and 0.5 is a draw).
the elo system does not work, however, if there are 2 pools of non-overlapping players like C through M and N through Z, and A has only played in pool 1, and B only in pool 2. i’m fairly certain you could construct a series of ~200 exploitable chess bots, where A always beats B, B always beats C, etc, getting elo rankings almost arbitrarily high.
so a major problem with your analysis was that you cited Random as having an elo of 477, and indexed your other answers based on that, when actually, that bot had an elo of 477 against other terrible (humorous) bots. if you put Random into FIDE tournaments, i expect its elo would be much lower.
Tangent: have you seen Black Ops Chess? It’s a blend of Chess and Stratego. https://blackopschess.com/game
I loved Stratego as a kid, and I find this very appealing. The opportunity for faking out your opponent by playing strong pieces as if they were weak ones, followed by a sudden betrayal of expectation....
That link (with /game at the end) seems to lead directly into matchmaking, which is startling; it might be better to link to the about page.