I’m interested in the question of how you plan to re-base the index over time.
Currently, we’ve chosen to scale things such that Claude 3.5 Sonnet gets 130 and GPT-5 (medium) gets 150. As we add new benchmarks, the rough plan is to try to maintain that. We’re also planning on adding some way for users to define their own subset of benchmarks, in case you disagree with our choices. That should let you see how things would look under various hypothetical “rebasings”.
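For concreteness, anchoring two models at fixed values determines an affine rescaling of whatever raw capability estimates sit underneath. The sketch below is a hypothetical illustration of that idea, not the actual ECI methodology; the raw scores are made-up placeholders.

```python
# Hypothetical sketch of two-anchor rebasing: fix two models at known
# index values and map every other raw score through the same affine map.
# All raw scores here are invented for illustration.

def make_index(anchor_lo, anchor_hi, lo_value=130.0, hi_value=150.0):
    """Return an affine map sending anchor_lo -> lo_value and anchor_hi -> hi_value."""
    scale = (hi_value - lo_value) / (anchor_hi - anchor_lo)
    return lambda raw: lo_value + scale * (raw - anchor_lo)

# Assumed raw capability estimates (placeholders):
raw = {"claude-3.5-sonnet": 1.2, "gpt-5-medium": 1.8, "new-model": 2.1}

index = make_index(raw["claude-3.5-sonnet"], raw["gpt-5-medium"])
scores = {name: round(index(r), 1) for name, r in raw.items()}
# The two anchors land exactly on 130.0 and 150.0 by construction.
```

Under this scheme, adding new benchmarks changes the raw scores but not the anchor values, so the index stays comparable over time as long as the anchors are re-fit after each update.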
The index excludes models from before 2023, which is understandable, since they couldn’t use benchmarks released after that date.
To be clear, that’s not why we didn’t include older models. There is no technical problem with including older models, we just have sparse data on models prior to 2023. We chose to leave them out for now since the noisy estimates could be misleading, but we’re trying to collect more data so that we can extend things back further.
The obvious way to do this seems (to me) to be to include earlier benchmarks that are now saturated, so that the time series can be extended backwards.
We do! Currently we include many old benchmarks like GSM8K, HellaSwag, MMLU, WinoGrande, etc. There is a list of all benchmarks on the ECI page, if you scroll down into the methodology section. In our initial release we haven’t visualized benchmark difficulty scores, but we’re planning on making those public and showing some analysis soon.