I’m interested in a measure of chess-playing ability that doesn’t depend on human players. Perfect play would be the ideal reference, but as long as chess remains unsolved, the other end of the spectrum seems the natural choice: the engine whose algorithm is “list all legal moves and pick one of them uniformly at random.”
I read that the formula for Elo rating is scaled so that, under some assumptions about the transitivity of winning odds, it’s trivial to convert a win probability into an Elo rating, so my question is roughly equivalent to “What is the probability of victory of random play against, say, Stockfish 17?” If the Elo of random play is close to 0[1], that makes the probability around 1/(1 + 10^(3600/400)) ≈ 10^-9 (estimating Stockfish 17’s Elo to be 3600). Eyeballing the y-intercept of this plot of lc0’s Elo vs. number of games of self-play, it looks something like 150–300 (lots of uncertainty). Does that sound reasonable?
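Concretely, under the standard logistic Elo model (the usual 400-point/factor-of-10 scaling), the conversion looks like this; the 3600 figure is just my rough estimate from above:

```python
def expected_score(delta_elo):
    """Standard Elo model: expected score of a player rated
    delta_elo points above their opponent."""
    return 1 / (1 + 10 ** (-delta_elo / 400))

# Random play at ~0 Elo vs. Stockfish 17 at an assumed ~3600 Elo:
print(expected_score(0 - 3600))    # ~1e-9
# At the eyeballed 150-300 Elo for random play instead:
print(expected_score(300 - 3600))  # ~5.6e-9
```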
I understand the probability of victory against the best modern engines is probably too small to measure accurately by direct play, so you would have to construct weaker/noisier engines and step down in increments of a few hundred Elo. Has anyone done this?
[1] I don’t see an a priori reason to expect it would be.
TLDR: random play is roughly 477 Elo. (See details and caveats below.)
See the link below, where someone matches odd chess algorithms against various dilutions of Stockfish (Stockfish at level n with x% random moves mixed in). This Elo is, of course, relative to those Stockfish dilutions at the time of recording.
https://www.youtube.com/watch?v=DpXy041BIlA
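For reference, a minimal sketch of that kind of dilution using the python-chess engine bindings (my assumptions, not necessarily what the video used: a local Stockfish binary on PATH, depth-1 search, and the 50% mixing rate):

```python
# Assumes `pip install chess` and a Stockfish binary named "stockfish" on PATH.
import random

import chess
import chess.engine

def diluted_move(engine, board, p_random):
    """With probability p_random, play a uniformly random legal move;
    otherwise play whatever the engine suggests."""
    if random.random() < p_random:
        return random.choice(list(board.legal_moves))
    return engine.play(board, chess.engine.Limit(depth=1)).move

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
engine.configure({"Skill Level": 0})  # Stockfish's weakest built-in level
board = chess.Board()
while not board.is_game_over():  # diluted self-play demo
    board.push(diluted_move(engine, board, p_random=0.5))
print(board.result())
engine.quit()
```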
This is precisely what I was looking for! Thanks. (I was actually imagining various amounts of noise being added to the weights of the evaluation neural net, but this is probably close enough.)
Consider a near-limiting case: an engine that, before the game begins, flips a coin. On heads, it plays the game as Stockfish. On tails, it plays the game as Worstfish. What is this engine’s Elo?
I’m short of time to go into details, but this should help illustrate why one should be careful about treating Elo as a well-defined space rather than as a local approximation that’s empirically useful for computationally limited players. (Against any opponent between the two extremes, the coin-flip engine scores about 50%, winning its Stockfish games and losing its Worstfish games, so it looks “equal” to a huge range of differently rated opponents at once; no single number captures that.)
One thing worth keeping in mind with exercises like this: you can run them in various ways and get some answers, but the answers may depend nontrivially on how you construct the intermediate ladder of opponents.
For example, attempts to calibrate human and computer Elo rating scales often do place top computers around the 3500ish area, and one of the other answers indicates, via a particular ladder of intermediates, that random play would then sit at 400-500 Elo. But there are also human players who are genuinely rated 400-500 Elo on servers whose Elo ratings are approximately transitively self-consistent within that server. These players can still play chess: they know how the pieces move, and they can spot captures and execute them vastly better than chance. I would not be surprised to see such a player consistently destroy a uniformly random chess player. Random play is really, really bad. So there’s a good chance we would see a significant nonlinearity/nontransitivity in Elo ratings, such that there isn’t any one consistent rating we can assign to random play relative to Stockfish.
A good way of framing this conceptually is to say that Elo is NOT a fundamental truth about reality; rather, it’s an imperfect model that we as humans invented, one that, depending on the situation, may work anywhere from poorly to okay to amazingly well at approximating an underlying reality.
In particular, the Elo model makes a very strong “linearity-like” assumption: if A beats B with expected odds a:b, and B beats C with expected odds b:c, then A will beat C with expected odds of precisely a:c. (Draws are treated as half a point for each player, i.e. mathematically equivalent in expectation to resolving every draw by a fair coin flip.) Given the way rating is defined from there, this linearity in odds implies that the expected score between two players follows precisely a sigmoid function f(x) = 1/(1+exp(-x)) of their rating difference, up to constant scaling.
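To make the odds-multiplication concrete, a few lines of Python (the 400-point/factor-of-10 scaling is the standard chess convention):

```python
import math

def odds(delta_elo):
    """Odds implied by the Elo model for a rating gap of delta_elo."""
    return 10 ** (delta_elo / 400)

# Linearity: odds multiply across transitive steps, so rating gaps add.
# If A is 200 above B and B is 300 above C, A vs. C is a 500-point gap:
assert math.isclose(odds(200) * odds(300), odds(500))
# Equivalently, expected score = odds/(1+odds) is a logistic (sigmoid)
# function of the rating difference:
print(odds(500) / (1 + odds(500)))  # ~0.947
```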
Almost any real situation will violate these assumptions at least a little (and even mathematically ideal artificial games will violate them: e.g. a game where each player has a fixed mean and variance and competes by sampling from their own Gaussian to see whose number is higher!). But in many cases of skill-based competition this works quite well, and there are various ways to justify and explain why the approximation does work pretty well when it does.
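Here’s a sketch of that Gaussian counterexample (the particular means and variances are arbitrary choices of mine): the implied odds fail to multiply, so no assignment of ratings satisfies the Elo model exactly.

```python
from math import erf, sqrt

def p_win(mu_a, sd_a, mu_b, sd_b):
    """P(A's Gaussian sample exceeds B's), in closed form."""
    z = (mu_a - mu_b) / sqrt(sd_a**2 + sd_b**2)
    return 0.5 * (1 + erf(z / sqrt(2)))

def odds(p):
    return p / (1 - p)

# Three players; C has the lowest mean but high variance.
ab = p_win(1.0, 1.0, 0.0, 1.0)   # A vs. B
bc = p_win(0.0, 1.0, -1.0, 3.0)  # B vs. C
ac = p_win(1.0, 1.0, -1.0, 3.0)  # A vs. C
print(odds(ab) * odds(bc))  # ~5.3
print(odds(ac))             # ~2.8 -- odds don't multiply, Elo can't fit this
```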
But even in games/domains where Elo does approximate realistic player pools amazingly well, it quite commonly stops doing as well at the extremes. Two common cases where this happens:
- When the pool of players (particularly bots) being compared are all relatively close to optimal.
- When the pool of players being compared covers an extremely wide range of “skill levels”.
The first case can happen when the near-optimal players have persistent tendencies in the mistakes they still make, as well as sharp preferences for particular lines. Then you no longer have law-of-large-numbers effects (too few mistakes per game) and no Poisson-like smoothness in the arrival rate of mistakes (mistakes aren’t well modeled as having an “arrival rate” if they’re sufficiently consistent for a given line x bot combination), and the Elo model simply stops being a good model of reality. I’ve seen this empirically on the 9x9 computer Go server (9x9 “CGOS”), where a bot equilibrated at one point to a rating a couple hundred Elo lower than a no-longer-running bot that it should have been head-to-head equal to or stronger than, due to different transitive opponents.
The second case, the one relevant here, can happen because there’s no particular reason to expect that a game’s expected-score tails will precisely match the sigmoid f(x) = 1/(1+exp(-x)) at the extremes. Depending on the actual tails between pairs of players at increasingly large rating differences, in particular whether they tend to be thinner or heavier than exp(-x) under given conditions, when you try to measure a large rating difference via many transitive steps of intermediate opponents, you will get different answers depending on the composition of those intermediate players and on how many steps you take and how big they are.
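A toy illustration of that step-size dependence, assuming a hypothetical game whose true expected-score curve has Gaussian (thinner) tails rather than logistic ones: measuring one big gap directly versus chaining two half-size steps gives different implied Elo totals.

```python
from math import erf, log10

def true_score(skill_gap):
    """Hypothetical 'true' expected score with Gaussian tails."""
    return 0.5 * (1 + erf(skill_gap / 2))

def implied_elo_gap(score):
    """Rating gap the logistic Elo model infers from an expected score."""
    return 400 * log10(score / (1 - score))

print(implied_elo_gap(true_score(4.0)))      # one direct jump:     ~1052 Elo
print(2 * implied_elo_gap(true_score(2.0)))  # two transitive steps: ~855 Elo
```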
It’s not surprising when models that are just useful approximations of reality (i.e. “the map, not the territory”) start breaking down at the extremes. It can still be worthwhile doing things like this to build intuition, or even just for fun, to see what numbers you get! My personal tendency would still be to emphasize that at the extremes of questions like “what is the Elo of perfect play” or “what is the Elo of random play”, the numbers you get can have a lot to do with your models and methodologies rather than reflecting an underlying reality accurately.
Consider the relatively simple two-rook endgame.
White has two rooks and makes random moves; Black has only a king. We’ll ignore the white king for now, though I’m pretty sure including it makes things worse for White.
Each rook has 14 possible moves, for a total of 28 rook moves. One of those 28 progresses towards a winning state, one regresses it, and three blunder a rook.
The Markov chain for this converges to a rook blunder being ~435x more likely than a checkmate. If Black tries to hunt down the rooks, it gets even worse.
Thus, an impossibly big advantage against Stockfish is still extremely unlikely to convert into a win.
I don’t think the other endgames are massively different: the number of possible moves and blunder-moves is roughly the same, although progress is more likely to be completely lost to a random move.
My question is: does this imply that a double-rook endgame is more likely to checkmate after losing a rook than before?
It turns out the answer to my question is a clear “no”. The white king has up to 7 moves, and the rook 14. In about half of the relevant positions, the black king is in diagonal contact with the white rook, which means that 4/14 rook moves and all but 2 king moves blunder the rook. In addition, White needs up to 16 moves to reach checkmate. The only factor in the single rook’s favor is the slightly lower total number of moves (~20 vs. ~32), but a much higher proportion of the single-rook endgame’s moves are blunders for White, and that more than cancels out any upside.
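Hand counts like these can be checked with a quick Monte Carlo. A minimal sketch with the python-chess library (my assumptions: the white king is included, Black’s lone king captures a rook whenever legal and otherwise moves randomly, and the starting squares are arbitrary, so the tallies won’t exactly match the ~435x Markov-chain figure):

```python
# Assumes `pip install chess`.
import random
import chess

def simulate(fen, games=5000, max_plies=600):
    """Random-moving White vs. a rook-hunting Black king; tally outcomes."""
    tally = {"mate": 0, "rook lost": 0, "other": 0}
    for _ in range(games):
        board = chess.Board(fen)
        rooks = len(board.pieces(chess.ROOK, chess.WHITE))
        result = "other"
        for _ in range(max_plies):
            if board.is_checkmate():
                result = "mate"
                break
            if board.is_game_over():
                break  # stalemate, repetition, etc. stays "other"
            moves = list(board.legal_moves)
            if board.turn == chess.BLACK:
                captures = [m for m in moves if board.is_capture(m)]
                board.push(random.choice(captures or moves))
                if len(board.pieces(chess.ROOK, chess.WHITE)) < rooks:
                    result = "rook lost"
                    break
            else:
                board.push(random.choice(moves))
        tally[result] += 1
    return tally

print(simulate("8/8/8/4k3/8/8/8/RR5K w - - 0 1"))  # two rooks vs. king
print(simulate("8/8/8/4k3/8/8/8/R6K w - - 0 1"))   # one rook vs. king
```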
Yes. Computer chess enthusiasts have created a rating list known as CCRL that covers chess engines ranging from the strongest ones we have down to fairly simple amateur-developed ones. Most of the engines on that list are open-source.
Two catches:
- The weakest engines on the list still exceed random play by a substantial margin. Implementing a few engines to fill that gap shouldn’t be difficult, though, since libraries are available for move generation, and you don’t need efficient search techniques when the goal is a weak engine (see the sketch after this list).
- CCRL games are typically played from openings that give a small advantage (usually around +1.0) to White, rather than directly from the starting position. Two games are played for each opening, with the engines switching colors between them. Unequal openings are used because, in high-level engine play, games from the starting position produce too many draws.
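For the first catch, here’s a minimal sketch of the random-play baseline packaged as a UCI engine via the python-chess library (my choice of library; the engine name and structure are illustrative), so it can be run in the same tournament tools as the CCRL engines:

```python
# random_uci.py - a uniformly random UCI engine (assumes `pip install chess`).
import random
import sys

import chess

def main():
    board = chess.Board()
    for line in sys.stdin:
        cmd = line.strip()
        if cmd == "uci":
            print("id name RandomMover\nuciok", flush=True)
        elif cmd == "isready":
            print("readyok", flush=True)
        elif cmd.startswith("position"):
            parts = cmd.split(" moves ")
            if parts[0] == "position startpos":
                board = chess.Board()
            else:
                board = chess.Board(parts[0].removeprefix("position fen "))
            for uci in (parts[1].split() if len(parts) > 1 else []):
                board.push_uci(uci)
        elif cmd.startswith("go"):
            move = random.choice(list(board.legal_moves))
            print(f"bestmove {move.uci()}", flush=True)
        elif cmd == "quit":
            return

if __name__ == "__main__":
    main()
```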
If you have questions about basic chess programming, please feel free to ask me. I recently started work on a chess engine (as a personal project) and thus am familiar with some of the concepts.