Speaking as someone who works on a very strong chess program (much stronger than AlphaZero, a good chunk weaker than Stockfish), random play is incredibly weak. There are likely a double-digit number of +400 Elo (roughly 91% expected score) jumps to be made between random play and anything resembling play that is “actually trying to win”.
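For concreteness, the standard logistic Elo model maps a rating gap to an expected score (win = 1, draw = 0.5); a minimal sketch:

```python
def expected_score(elo_diff: float) -> float:
    """Expected score of the stronger side under the standard
    logistic Elo model, given its rating advantage in Elo points."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

# A +400 Elo gap corresponds to an expected score of about 0.91.
print(round(expected_score(400), 3))  # prints 0.909
```

Ten or more such jumps compound multiplicatively, which is why random play loses essentially every game against even a weak engine.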
The more germane point for your question, however, is that chess is a draw: from the starting position, top programs will draw each other. The answer to “What is the probability that random play beats Stockfish 17?” is bounded from above by the answer to “What is the probability that Stockfish 17 beats Stockfish 17?”, and that second probability is very low: I would say less than 1%.
This is why all modern chess engine competitions use unbalanced opening books. Positions harvested from either random generation or human opening databases are filtered to early plies where there is engine consensus that the ratio p(advantaged side wins) : p(disadvantaged side draws) is as close to even as possible (which cashes out to an evaluation as close to +1.00 as possible). Games between the two players are then “paired”: one game is played with Player 1 as the advantaged side and Player 2 as the disadvantaged side, then the same opening is replayed with the colours swapped. Games are never played in isolation, only in pairs (for fairness).
In this revised game, “Game-Pair Sampled Unbalanced Opening Chess”, we can actually detect differences in strength between programs.
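To illustrate why game pairs help, here is a minimal simulation sketch. The probabilities below are made-up stand-ins for a book position near +1.00, not measured values: two equal engines produce pair scores averaging 1.0, so any systematic deviation from 1.0 signals a genuine strength difference even when individual games are dominated by the opening.

```python
import random
from collections import Counter

def play_game(p_adv_win: float, p_draw: float) -> float:
    """One game, scored from the advantaged side's perspective:
    1 (win), 0.5 (draw), or 0 (loss)."""
    r = random.random()
    if r < p_adv_win:
        return 1.0
    if r < p_adv_win + p_draw:
        return 0.5
    return 0.0

def pair_score(p_adv_win: float, p_draw: float) -> float:
    """Player 1's combined score over one game pair: one game as the
    advantaged side, then the same opening with colours swapped."""
    as_advantaged = play_game(p_adv_win, p_draw)
    as_disadvantaged = 1.0 - play_game(p_adv_win, p_draw)
    return as_advantaged + as_disadvantaged

random.seed(0)
# Two equal engines: pair scores land on {0, 0.5, 1, 1.5, 2}
# and cluster symmetrically around 1.0.
counts = Counter(pair_score(0.50, 0.45) for _ in range(10_000))
```

Scoring by pairs cancels the built-in opening advantage by construction, which is exactly what single isolated games cannot do.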
I’m not sure how helpful this is to your goal of constructing effective measures for strength, but I felt it would be useful to explain the state of the art.
Thanks. Yeah, I guess chess has been (weakly) solved and that means you need a more powerful technique for probing differences. Follow up: around what rating do engines gain the ability to force a draw from the starting position? (I understand this will only be a heuristic for the real question of “which engines possess an optimal strategy from the standard starting position?”)
My guess is somewhere in the 3200-3400 range, but this isn’t something I’ve experimented with in detail.
Forgive the nitpick, but I think the standard definition of “weakly solved” requires known-optimal strategies from the starting position, which don’t exist for chess. It’s still not known for sure that chess is a draw—it just looks very likely.
Agreed. This is a deliberate abuse of mathematical terminology, substituting any notion of “proof” with “looks true from experimental evidence.”