Here’s one framing: getting a higher ECI score requires making progress on (multiple) benchmarks that other models find difficult. Making progress on METR instead involves being more consistently successful at longer coding tasks.
So ECI tracks general capabilities on a “difficulty-weighted” scale, and seems better suited to understanding the pace of progress in general, but it’s also an abstract number. There’s currently no mapping like “ECI of X == AGI”, and no human ECI baseline. On the other hand, METR’s benchmark has a nice concrete interpretation, but is narrower in scope.
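To make the “difficulty-weighted” idea concrete, here is a minimal toy sketch. This is not ECI’s actual methodology (the real index is more sophisticated); it just illustrates the principle that benchmarks most models fail carry more weight, so progress on them moves the index more. All model and benchmark names and scores below are made up for illustration.

```python
# Toy illustration of a difficulty-weighted capability index.
# NOT ECI's actual methodology -- a simplified sketch of the weighting idea.

# Hypothetical pass rates per model per benchmark.
scores = {
    "model_a": {"bench_1": 0.90, "bench_2": 0.2, "bench_3": 0.1},
    "model_b": {"bench_1": 0.80, "bench_2": 0.3, "bench_3": 0.0},
    "model_c": {"bench_1": 0.95, "bench_2": 0.6, "bench_3": 0.4},
}

benchmarks = ["bench_1", "bench_2", "bench_3"]

# Difficulty = 1 minus the mean score across models, so benchmarks that
# most models fail (low average scores) get higher weight.
difficulty = {
    b: 1 - sum(m[b] for m in scores.values()) / len(scores)
    for b in benchmarks
}

def weighted_index(model: str) -> float:
    """Difficulty-weighted average of a model's benchmark scores."""
    total_weight = sum(difficulty.values())
    return sum(scores[model][b] * difficulty[b] for b in benchmarks) / total_weight

for m in scores:
    print(m, round(weighted_index(m), 3))
```

Under this weighting, a model that makes gains on the hardest benchmark (`bench_3` here) moves the index more than one that only improves on a benchmark everyone already does well on.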
We’re working on mapping ECI to more interpretable metrics (in fact, METR Time Horizons is one candidate), as well as allowing users to choose a subset of underlying benchmarks if they would prefer to weight ECI more heavily towards particular skills like coding.
Also note that we don’t currently include METR’s benchmarks as inputs to ECI, but we may add them in future iterations.