Circling back to this with a thing I was thinking about: suppose one wanted to add just one additional degree of freedom to the Elo rating a player has (at a given point in time, if you also allow evolution over time), chosen so that it adds as much improvement as possible. Almost certainly you need more dimensions than that to properly fit real idiosyncratic nonlinearities/nontransitivities (i.e. if you had a playing population with specific pairs of players that were especially strong/weak only against specific other players, or cycles of players where A beats B beats C beats A, etc), but if you just wanted to work out what the “second principal component” might be, what’s a plausible guess?
First, you can essentially reproduce the Elo model with a different formulation. Rather than saying each player has a rating and the winning chance is a function of the difference between their ratings, you posit that each player has a rating, and that when two players play a game, each independently samples a random value from a fixed probability distribution centered around their own rating, and the player with the larger sample wins.
I think that you exactly reproduce the Elo model up to scaling if this distribution is a Gumbel distribution, because the difference of two Gumbels is apparently equivalent to a draw from a logistic distribution, and the CDF of the logistic distribution is precisely the sigmoid that the Elo model posits. But in practice, you should end up with almost the same thing if you choose any other reasonable distribution so long as it has the right heaviness of tail.
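As a quick sanity check of that claim, here is an illustrative sketch (my own, not anything rigorous) that simulates the “each player draws a Gumbel centered on their rating” model and compares the resulting win rates to the Elo formula, using the conventional 400/ln(10) scale:

```python
# Monte Carlo check: the "each player draws a Gumbel sample centered on their
# rating" model should reproduce the Elo expected-score formula.
import numpy as np

rng = np.random.default_rng(0)
scale = 400 / np.log(10)          # conventional Elo scaling of the Gumbel/logistic
n = 1_000_000

for diff in [50, 200, 400, 800]:  # rating difference between the two players
    a = rng.gumbel(loc=diff, scale=scale, size=n)   # stronger player's samples
    b = rng.gumbel(loc=0.0, scale=scale, size=n)    # weaker player's samples
    simulated = np.mean(a > b)
    elo = 1 / (1 + 10 ** (-diff / 400))             # Elo expected score
    print(f"diff={diff:4d}  simulated={simulated:.4f}  Elo formula={elo:.4f}")
```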
In particular, I’d expect linearly-exponential tails to be better than quadratically-exponential tails like the normal distribution has, because linearly-exponential tails tend to be desirable for real-world ratings models: they are much more outlier-resistant, and in the real world you have issues like forfeits, sandbaggers, internet disconnections/timeouts, etc. (If you have a quadratically-exponential tail, then a ratings model can put such low probability on an outlier that, upon seeing the outlier, it is forced to make a too-large update to accommodate it; this should be intuitive from a Bayesian perspective.) Anyways, I’d expect outliers, noise, and the realities of real-world ratings data to introduce far bigger variation in ratings quality than any minor distribution-shape differences would.
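To make the outlier point concrete, here’s a rough sketch (my own illustration, in natural rather than Elo-scaled units) of how hard a single surprising loss “pulls” on the favorite’s rating under logistic versus Gaussian performance tails, measured as the gradient of the log-likelihood of that loss:

```python
import numpy as np
from scipy.stats import norm

def logistic_pull(gap):
    # P(favorite wins) = 1/(1+exp(-gap)); |d/d(rating) log P(loss)| = sigmoid(gap),
    # which is bounded above by 1 no matter how surprising the loss is.
    return 1 / (1 + np.exp(-gap))

def gaussian_pull(gap):
    # P(favorite wins) = Phi(gap); |d/d(rating) log P(loss)| = pdf(gap)/sf(gap),
    # which keeps growing roughly linearly with the size of the surprise.
    return norm.pdf(gap) / norm.sf(gap)

for gap in [1, 2, 4, 8]:  # rating gap in natural (logistic/probit) units, not Elo points
    print(f"gap={gap}: logistic pull={logistic_pull(gap):.2f}, "
          f"gaussian pull={gaussian_pull(gap):.2f}")
```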
So, for example, you could also say each player draws from a logistic distribution rather than a Gumbel. The difference of two logistics is not quite a logistic distribution, but up to rescaling it should be pretty close, so this is nearly the Elo model again.
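If you want to see that numerically, here’s a small sketch (again just illustrative): sample the difference of two standard logistic draws and compare its CDF to that of a single logistic whose scale is fit by matching standard deviations:

```python
import numpy as np
from scipy.stats import logistic

rng = np.random.default_rng(0)
diff = rng.logistic(size=1_000_000) - rng.logistic(size=1_000_000)

# Match scales by standard deviation (a Logistic(0, s) has sd = s * pi / sqrt(3)).
s = diff.std() * np.sqrt(3) / np.pi
print("x    empirical CDF   scale-matched logistic CDF")
for x in [-6, -4, -2, 0, 2, 4, 6]:
    print(f"{x:4d}     {np.mean(diff <= x):.4f}          {logistic.cdf(x, scale=s):.4f}")
```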
Anyways, with any reformulation like this, there is now a very natural candidate for a second dimension: the variance of the distribution that a player draws their sample from. Rather than each player drawing from a fixed distribution centered around their rating before seeing who has the higher value and wins, we add a second parameter that allows the variance of that distribution to vary by player. The ratings model then becomes able to express things like “this player is more variable in performance between games, or more prone to blunders uncharacteristic of their skill level, than this other player”. This parameter might also improve the rating system’s ability to “explain away” things like sandbagger players by assigning them a high variance, thereby reducing their distortionary impact on other players’ ratings even before manual intervention.
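A minimal sketch of what that two-parameter model’s head-to-head predictions look like (the ratings and scale values below are made up for illustration; I’m using per-player logistic scales as the “variance” knob):

```python
import numpy as np

rng = np.random.default_rng(0)

def win_prob(rating_a, scale_a, rating_b, scale_b, n=1_000_000):
    """P(A beats B) when each player draws one logistic performance sample
    centered on their own rating, with a per-player scale ("volatility")."""
    perf_a = rng.logistic(loc=rating_a, scale=scale_a, size=n)
    perf_b = rng.logistic(loc=rating_b, scale=scale_b, size=n)
    return np.mean(perf_a > perf_b)

# A steady 1800 and an erratic 1800 still split ~50/50 head to head...
print(win_prob(1800, 100, 1800, 300))
# ...but the erratic 1800 upsets a 2100 noticeably more often than the steady one does.
print(win_prob(1800, 300, 2100, 150), win_prob(1800, 100, 2100, 150))
```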
One thing that’s worth keeping in mind with exercises like this is that while you can do this in various ways and get some answers, the answers you get may depend nontrivially on how you construct the intermediate ladder of opponents.
For example, attempts to calibrate human and computer Elo rating scales often do place top computers around the 3500ish area, and one of the other answers here, using a particular ladder of intermediates, concludes that random play would then be at 400-500 Elo. But there are also human players who are genuinely rated 400-500 Elo on servers, and whose Elo ratings are approximately transitively self-consistent within that server. These players can still play Chess: they know how the pieces move, they can see captures and move pieces to execute those captures vastly better than chance, etc. I would not be surprised to see such a player consistently destroy a uniform random Chess player. Random play is really, really bad. So there’s a good chance we would see a significant nonlinearity/nontransitivity in Elo ratings here, such that there isn’t any one consistent rating we can assign to random play relative to Stockfish.
A good way of framing this conceptually is to say that Elo is NOT a fundamental truth about reality; rather, it’s an imperfect model that we as humans invented, which depending on the situation may work anywhere from poorly to okay to amazingly well at approximating an underlying reality.
In particular, the Elo model makes a very strong “linearity-like” assumption: that if A beats B with expected odds a:b, and B beats C with expected odds b:c, then A will beat C with expected odds of precisely a:c (where draws are treated as half a point for each player, i.e. mathematically equivalent in expectation to resolving every draw by a fair coin flip to determine the winner). Given the way rating is defined from there, this linearity in odds implies that the expected score between two players follows precisely a sigmoid function f(x) = 1/(1+exp(-x)) of their rating difference, up to constant scaling.
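A tiny numeric illustration of that equivalence (the odds weights below are arbitrary; rating is defined as 400·log10 of a player’s odds against a reference player):

```python
import math

a, b, c = 8.0, 2.0, 1.0                     # made-up odds weights for players A, B, C

def rating(w, ref=c):                       # rating = 400 * log10(odds against the reference player C)
    return 400 * math.log10(w / ref)

def elo_expected(r1, r2):                   # the Elo sigmoid in its usual base-10 form
    return 1 / (1 + 10 ** (-(r1 - r2) / 400))

print((a / b) * (b / c), a / c)             # odds A:B times odds B:C equals odds A:C by construction
print(a / (a + c))                          # A's expected score vs C straight from the odds a:c
print(elo_expected(rating(a), rating(c)))   # the same number from the sigmoid of the rating difference
```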
Almost any real situation will violate these assumptions at least a little (and even mathematically idealized artificial games will violate them, e.g. a game where each player has a fixed mean and variance and competes by sampling from their own Gaussian to see whose number is higher violates this assumption!). But in many cases of skill-based competition the approximation works quite well, and there are various ways to justify and explain why it does when it does!
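For instance, here’s a small check (with arbitrary made-up means and standard deviations) showing that the Gaussian-performance game from that parenthetical really does break the odds assumption: the product of the A-vs-B and B-vs-C odds is nowhere near the direct A-vs-C odds:

```python
from scipy.stats import norm

# (mean skill, per-game performance standard deviation) -- all numbers arbitrary
players = {"A": (100, 20), "B": (50, 60), "C": (0, 20)}

def p_beat(x, y):
    """P(x's Gaussian performance sample exceeds y's)."""
    (mx, sx), (my, sy) = players[x], players[y]
    return norm.cdf((mx - my) / (sx ** 2 + sy ** 2) ** 0.5)

odds = lambda p: p / (1 - p)
print("odds(A>B) * odds(B>C):", odds(p_beat("A", "B")) * odds(p_beat("B", "C")))
print("odds(A>C) directly:   ", odds(p_beat("A", "C")))   # wildly different, so no single Elo fits
```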
But even in games/domains where Elo does approximate realistic player pools amazingly well, it quite commonly stops doing as well at the extremes. For example, two common cases where this happens are:

1. When the players (particularly bots) being compared are all relatively close to optimal.
2. When the pool of players being compared covers an extremely wide range of “skill levels”.
The first case can happen when the near-optimal players have persistent tendencies in the mistakes they still make, as well as sharp preferences for particular lines. Then you no longer have law-of-large-numbers effects (too few mistakes per game), nor Poisson-like smoothness in the arrival rate of mistakes (mistakes aren’t well-modeled as having an “arrival rate” if they occur consistently for a given line-and-bot combination), and the Elo model simply stops being a good model of reality. I’ve seen this empirically on the 9x9 computer Go server (9x9 “CGOS”), with a bot at one point equilibrating a couple hundred Elo lower than a no-longer-running bot that it should have been head-to-head equal to or stronger than, due to different transitive opponents.
The second case, the one relevant here, can happen because there’s no particular reason to expect that a game’s extreme tails will precisely match those of the sigmoid f(x) = 1/(1+exp(-x)) in expected score. Depending on the actual tail behavior between pairs of players at increasingly large rating differences, particularly whether it tends to be thinner or heavier than exp(-x) under the given conditions, when you try to measure a large rating difference via many transitive steps through intermediate opponents you will get different answers depending on the composition of those intermediate players and on how many steps you take and how big they are.
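Here’s a toy version of that effect (an entirely made-up setup: players whose true pairwise results come from a Gaussian performance model with unit variance, i.e. tails thinner than exp(-x)); the Elo gap you infer between the two end players depends a lot on whether you measure it in one big step or sum it over a ladder of small steps:

```python
import math
from scipy.stats import norm

def implied_elo(p):
    """Elo difference that would reproduce expected score p under the Elo sigmoid."""
    return 400 * math.log10(p / (1 - p))

# "True" skill levels; each player's performance is Gaussian with sd 1 (thin tails).
skills = [0, 1, 2, 3, 4, 5, 6]

def p_beat(s1, s2):
    return norm.cdf((s1 - s2) / math.sqrt(2))

ladder = sum(implied_elo(p_beat(skills[i + 1], skills[i])) for i in range(len(skills) - 1))
direct = implied_elo(p_beat(skills[-1], skills[0]))
print(f"Elo gap summed over one-step ladder: {ladder:.0f}")
print(f"Elo gap measured directly:           {direct:.0f}")   # much bigger with these thin tails
```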
It’s not surprising when models that are just useful approximations of reality (i.e. “the map, not the territory”) start breaking down at extremes. It can still be worthwhile to do exercises like this to build intuition, or even just for fun, to see what numbers you get! While doing so, my personal tendency would still be to emphasize that for questions at the extremes, like “what is the Elo of perfect play” or “what is the Elo of random play”, the numbers you get can end up having a lot more to do with one’s models and methodologies than with an underlying reality.