However, it might be worth taking other complications into account. Setting aside Cole Wyeth’s comment, the two other highest-karma comments pointed out that the METR benchmark is no longer as trustworthy as it once was. If that is the case, we may see GPT-5.2, GPT-5.2-Codex, and/or Gemini 3 Pro display a lower 50% time horizon alongside a higher 80% time horizon. Grok 4 showed a similarly elevated ratio between its time horizons (currently 109 min for 50% and 15 min for 80%, while Claude Opus 4.5 has 289 min for 50% and 27 min for 80%), but Grok 4, unlike Claude, was humiliated by the longest-horizon tasks.