One thing that’s worth keeping in mind with exercises like this is that while you can do this in various ways and get some answers, the answers you get may depend nontrivially on how you construct the intermediate ladder of opponents.
For example, attempts to calibrate human and computer Elo rating scales often do place top computers around the 3500ish area, and one of the other answers here, via a particular ladder of intermediates, concludes that random play would then sit at 400-500 Elo given that. But there are also human players who are genuinely rated 400-500 Elo on servers whose Elo ratings are approximately transitively self-consistent within that server. These players can still play Chess: they know how the pieces move, they can see captures and move pieces to execute those captures vastly better than chance, and so on. I would not be surprised to see such a player consistently destroy a uniform random Chess player. Random play is really, really bad. So there’s a good chance here that we would see a significant nonlinearity/nontransitivity in Elo ratings, such that there isn’t any one consistent rating that we can assign to random play relative to Stockfish.
A good way of framing this conceptually is to say that Elo is NOT a fundamental truth about reality; rather, it’s an imperfect model that we as humans invented, one that, depending on the situation, may work anywhere from poorly to okay to amazingly well at approximating an underlying reality.
In particular, the Elo model makes a very strong “linearity-like” assumption: if A beats B with expected odds a:b, and B beats C with expected odds b:c, then A will beat C with expected odds of precisely a:c (where draws are treated as half a win for each player, i.e. mathematically equivalent in expectation to resolving every draw by a fair coin flip). Given the way rating is then defined, this linearity in odds implies that the expected score between two players follows precisely a sigmoid function f(x) = 1/(1+exp(-x)) of their rating difference, up to constant scaling.
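As a quick sanity check of that linearity, here’s a minimal sketch using the conventional 400-point, base-10 Elo scaling (which is the same sigmoid up to the constant scaling mentioned above); the ratings are arbitrary illustrative values:

```python
def expected_score(r_a, r_b):
    """Elo expected score of A vs B: a sigmoid in the rating difference."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def odds(p):
    """Convert an expected score p into odds p : (1 - p)."""
    return p / (1.0 - p)

# Under the model, odds multiply along a chain: (a:b) * (b:c) == (a:c).
odds_ab = odds(expected_score(1600, 1400))
odds_bc = odds(expected_score(1400, 1200))
odds_ac = odds(expected_score(1600, 1200))
assert abs(odds_ab * odds_bc - odds_ac) < 1e-9  # exact, up to float error
```

The assertion holds exactly under the model because each odds ratio is just 10 raised to the rating gap over 400, so multiplying odds adds the gaps.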
Almost any real situation will violate this assumption at least a little (and even mathematically ideal artificial games will violate it: e.g. a game where each player has a fixed mean and variance and competes by sampling from their own gaussian to see whose number is higher violates it!). But in many cases of skill-based competition it works quite well, and there are various ways to justify and explain why this approximation works well when it does!
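To make that gaussian example concrete, here’s a small sketch (the means and standard deviations are arbitrary illustrative choices) showing that the win probabilities of such a game fail the odds-multiplication property:

```python
from math import erf, sqrt

def win_prob(mu_a, mu_b, sd=1.0):
    """P(A's gaussian sample exceeds B's), both with standard deviation sd.
    The difference of two gaussians is gaussian, so this is a probit curve,
    not the logistic curve the Elo model assumes."""
    z = (mu_a - mu_b) / (sd * sqrt(2.0))
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def odds(p):
    return p / (1.0 - p)

# Three players with evenly spaced means: A (mean 2), B (mean 1), C (mean 0).
p_ab, p_bc, p_ac = win_prob(2, 1), win_prob(1, 0), win_prob(2, 0)

# Elo's assumption would require odds(p_ab) * odds(p_bc) == odds(p_ac),
# but gaussian tails are thinner than exponential, so the chained odds
# come out noticeably smaller than the direct odds:
assert odds(p_ab) * odds(p_bc) < odds(p_ac)
```

With these numbers the chained odds come out around 10 while the direct A-vs-C odds are closer to 12, so no logistic curve fits all three pairings at once.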
But even in games/domains where Elo does approximate realistic player pools amazingly well, it quite commonly stops doing as well at the extremes. For example, two common cases where this happens can include:
When the pool of players (particularly bots) being compared are all relatively close to being optimal
When the pool of players being compared covers an extremely wide range of “skill levels”.
The first case can happen when the near-optimal players have persistent tendencies in the mistakes they still make, as well as sharp preferences for various lines. Then you no longer have law-of-large-numbers effects (too few mistakes per game), nor poisson-like smoothness in the arrival rate of mistakes (mistakes aren’t well-modeled as having an “arrival rate” if they’re sufficiently consistent for a given line-and-bot combination), and the Elo model simply stops being a good model of reality. I’ve seen this empirically on the 9x9 computer Go server (9x9 “CGOS”), where a bot at one point equilibrated a couple hundred Elo below a no-longer-running bot that it should have been equal to or stronger than head-to-head, because the two were rated through different transitive opponents.
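One way to see how such consistency breaks the ratings: with hypothetical head-to-head results (the percentages below are made up for illustration, not CGOS data), the Elo gaps that individual results imply stop being transitive, so no single assignment of ratings fits all the pairings:

```python
from math import log10

def implied_gap(p):
    """Elo rating gap implied by an observed expected score p."""
    return 400.0 * log10(p / (1.0 - p))

# Suppose A reliably beats B 90% (B keeps walking into a line A handles),
# and B beats C 90%, but A's and C's consistent opening choices happen to
# give A only a 60% score head-to-head.
gap_via_b  = implied_gap(0.9) + implied_gap(0.9)  # A over C through B
gap_direct = implied_gap(0.6)                     # A over C directly

# gap_via_b is roughly +760 Elo while gap_direct is roughly +70, so no
# single rating for A satisfies both; which value A's rating equilibrates
# toward depends on which opponents happen to be in the pool.
```

This is the same mechanism as the CGOS anecdote: the rating a bot settles at is a property of the pool it was measured against, not of the bot alone.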
The second case, the one relevant here, can happen because there’s no particular reason to expect that a game’s tails will actually match those of the sigmoid f(x) = 1/(1+exp(-x)) in expected score at the extremes. Depending on the actual tail behavior between pairs of players at increasingly large rating differences, particularly whether it is thinner or heavier than exp(-x) under given conditions, when you try to measure large rating differences via many transitive steps of intermediate opponents, you will get different answers depending on the composition of those intermediate players and on how many steps you take and how big they are.
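Here’s a small sketch of that effect. The “true” expected-score curve below is a made-up probit (gaussian-tailed) curve, chosen only because its tails are thinner than the logistic’s; bridging the same total skill gap with ladders of different granularity then yields different Elo totals:

```python
from math import erf, sqrt, log10

def true_score(d):
    """Hypothetical 'true' expected score at skill gap d: gaussian (probit)
    tails, thinner than the logistic tails the Elo model assumes."""
    return 0.5 * (1.0 + erf(d / sqrt(2.0)))

def elo_gap(p):
    """Elo difference that a measured expected score p gets recorded as."""
    return 400.0 * log10(p / (1.0 - p))

# Bridge the same total skill gap with ladders of different step counts,
# summing the Elo gap measured at each rung. The finer the ladder, the
# smaller the total, because the thin tails inflate the one-big-step gap.
total = 6.0
for steps in (1, 2, 6, 60):
    measured = steps * elo_gap(true_score(total / steps))
    print(f"{steps:3d} steps -> {measured:6.0f} Elo")
```

With a heavier-than-logistic tail the bias runs the other way, which is exactly why the random-vs-Stockfish number depends on the ladder you build.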
It’s not surprising when models that are just useful approximations of reality (i.e. “the map, not the territory”) start breaking down at extremes. It can still be worthwhile to do exercises like this, to build intuition or even just for fun, and see what numbers you get! My personal tendency in such cases would still be to emphasize that at the extremes of questions like “what is the Elo of perfect play” or “what is the Elo of random play”, the numbers you get can start to say more about your models and methodologies than about an underlying reality.