Except that there already exist the Epoch Capability Index (which aggregates an army of benchmarks) and the ARC-AGI benchmark (which, alas, is also on track to saturation), where the human baseline is decoupled from the time horizon because it relies on visual intelligence (or, in the AIs' case, on the ability to notice patterns). As for the METR benchmark being saturated[1], maybe Claude Opus 4.5 is an outlier whose time horizon was gamed? Or there is a benign explanation, like Claude failing on primitive tasks in a manner similar to Grok 4, much as Claude's performance on ARC-AGI-1 fails to form a straight line.
Were the o3-GPT5.1CodexMax trend to continue forever, the 8-hour 50% time horizon would be reached in September 2026. IIRC the benchmark doesn't have tasks longer than 8 hours, so the horizon would only be saturated by then. Alas, the time horizon likely stays exponential only until the very last couple of doublings.
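The extrapolation above can be sketched as a back-of-the-envelope calculation. The numbers below are illustrative assumptions, not METR's actual figures: roughly 1.5 h for o3 (April 2025) and 3 h for GPT-5.1-Codex-Max (November 2025), i.e. about one doubling in seven months, extended until the horizon hits the benchmark's assumed 8 h ceiling.

```python
from datetime import date, timedelta
import math

# Illustrative inputs only -- assumed 50% time horizons, not METR's exact data.
t0, h0 = date(2025, 4, 16), 1.5   # o3 release, horizon in hours (assumed)
t1, h1 = date(2025, 11, 19), 3.0  # GPT-5.1-Codex-Max release, hours (assumed)
target = 8.0                       # longest task length in the suite, per the text

# Fit an exponential through the two points: days per doubling of the horizon.
doubling_days = (t1 - t0).days / math.log2(h1 / h0)

# Doublings still needed to reach the 8 h ceiling, then convert to a date.
doublings_needed = math.log2(target / h1)
eta = t1 + timedelta(days=doublings_needed * doubling_days)
print(f"doubling time ~ {doubling_days:.0f} days; 8 h horizon ~ {eta}")
# -> doubling time ~ 217 days; 8 h horizon ~ 2026-09-22
```

Under these assumed inputs the straight-line extrapolation does land in September 2026, which is the point: the date is only as good as the assumption that the exponential holds all the way to the ceiling.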