For aggregating several different benchmarks there is a natural way to avoid the y-axis problem, and one that introduces a new natural y-axis with a clear interpretation.
The idea is to use an Elo system.
Treat each benchmark as a set of individual contests between all pairs of models, with only win or lose as outcomes, and update Elo ratings accordingly.
This converges if you have enough different benchmarks, but of course loses a lot of the signal if you only have a few (since it discards the information about how large the difference in y is).
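Here is a minimal sketch of how that could look in Python, assuming each benchmark is just a mapping from model name to score. The benchmark data, model names, K-factor, and round count are illustrative assumptions, not taken from the original:

```python
from itertools import combinations

K = 16        # Elo update step size (illustrative choice)
ROUNDS = 100  # repeated passes so ratings settle

# Hypothetical benchmark scores: benchmark name -> {model: score}.
benchmarks = {
    "bench_a": {"model_x": 0.81, "model_y": 0.74, "model_z": 0.62},
    "bench_b": {"model_x": 0.55, "model_y": 0.60, "model_z": 0.41},
}

models = {m for scores in benchmarks.values() for m in scores}
elo = {m: 1000.0 for m in models}

def expected(r_a, r_b):
    """Expected win probability of rating r_a against rating r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

for _ in range(ROUNDS):
    for scores in benchmarks.values():
        # Each benchmark yields one "game" per pair of models:
        # the higher-scoring model wins, and the margin is discarded.
        for a, b in combinations(scores, 2):
            if scores[a] == scores[b]:
                result = 0.5  # treat equal scores as a draw
            else:
                result = 1.0 if scores[a] > scores[b] else 0.0
            delta = K * (result - expected(elo[a], elo[b]))
            elo[a] += delta
            elo[b] -= delta

for m, r in sorted(elo.items(), key=lambda kv: -kv[1]):
    print(f"{m}: {r:.0f}")
```

Note how the update only ever sees the win/lose/draw outcome, which is exactly where the margin information gets discarded.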
Here is an example of this approach in use (from a while back): https://x.com/scaling01/status/1919389344617414824?s=20