Of course, we are still missing METR-style evaluations measuring the ability to complete long tasks for all recent systems (Claude 4, newer versions of Gemini-2.5, and, importantly, for agentic frameworks such as OpenAI Codex, Claude Code and similar systems).
When we obtain those evaluations, we’ll have a better understand of the shape of the curve, whether the doubling period keeps shrinking, and if so, how rapidly it shrinks…
Thanks for writing this!
Of course, we are still missing METR-style evaluations measuring the ability to complete long tasks for all recent systems (Claude 4, newer versions of Gemini-2.5, and, importantly, for agentic frameworks such as OpenAI Codex, Claude Code and similar systems).
When we obtain those evaluations, we’ll have a better understand of the shape of the curve, whether the doubling period keeps shrinking, and if so, how rapidly it shrinks…