The ARC-AGI team evaluated Claude Sonnet 4.5. On the ARC-AGI-1 leaderboard, OpenAI’s models o4-mini, o3 and GPT-5 form a nearly straight line. Claude Sonnet 4.5 sits slightly below that line at thinking budgets of 1K, 4K and 16K tokens, and on the line at 8K and 32K tokens. This could imply that the benchmark follows a scaling law for non-distilled models, and that both OpenAI and Anthropic have reached it.
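The "nearly straight line" claim can be made concrete by fitting a log-linear scaling curve to (thinking budget, score) points and checking how far another model's points fall from it. A minimal sketch below; the numbers are illustrative placeholders, not the actual leaderboard values:

```python
# Sketch: checking whether models sit on a common score-vs-compute line.
# The (budget, score) pairs are illustrative placeholders, NOT the real
# ARC-AGI-1 leaderboard numbers.
import numpy as np

# Hypothetical (thinking-token budget, ARC-AGI-1 score) points for one model family.
budgets = np.array([1_000, 4_000, 8_000, 16_000, 32_000])
scores = np.array([0.40, 0.50, 0.56, 0.62, 0.68])  # placeholder scores

# Fit score ~ a * ln(budget) + b, i.e. the "nearly straight line" in log-compute space.
a, b = np.polyfit(np.log(budgets), scores, deg=1)

def predicted(budget: float) -> float:
    """Score the fitted scaling line predicts for a given thinking budget."""
    return a * np.log(budget) + b

# How far another model's point sits above or below the fitted line.
other_budget, other_score = 8_000, 0.54  # placeholder point
print(f"fit: score = {a:.3f}*ln(budget) + {b:.3f}")
print(f"residual at {other_budget} tokens: {other_score - predicted(other_budget):+.3f}")
```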
On the ARC-AGI-2 leaderboard, Claude Sonnet 4.5 Thinking became the new leader in the $0.142/task to $0.8/task cost range; it also solved 13.6% of tasks, the highest score among LLMs except[1] for Grok 4, which, however, cost $2.17/task.
The gap between the GPT-5 and Claude Sonnet 4.5 release dates is 53 days, while the gap between o3 and Claude Opus 4 is 36 days. While the 53 days instead of 36 are likely a result of the post-o3 slowdown, Claude ended up scoring ~1.3 times higher both times. [2]
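The day counts can be checked directly; the dates below are my assumption of the public release dates (GPT-5: 2025-08-07, Claude Sonnet 4.5: 2025-09-29, o3: 2025-04-16, Claude Opus 4: 2025-05-22), and they reproduce the 53- and 36-day figures quoted above:

```python
# Sketch: reproducing the release-date gaps from assumed release dates.
from datetime import date

pairs = {
    "GPT-5 -> Claude Sonnet 4.5": (date(2025, 8, 7), date(2025, 9, 29)),
    "o3 -> Claude Opus 4": (date(2025, 4, 16), date(2025, 5, 22)),
}
for name, (openai_release, anthropic_release) in pairs.items():
    print(f"{name}: {(anthropic_release - openai_release).days} days")
# GPT-5 -> Claude Sonnet 4.5: 53 days
# o3 -> Claude Opus 4: 36 days
```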
Were research taste similar to ARC-AGI-2 performance, Claude would achieve a ~1.3 times bigger acceleration of AI research than GPT-N, assuming both companies had superhuman coders. I suspect that Claude would shorten the period of AI R&D necessary for Agent-3 to become Agent-4 while potentially lengthening the period necessary for Agent-3 to automate AI research. What does this all mean for inter-company proxy wars? That OpenAI will lose them to Anthropic? That the two companies will race toward AGI and misalign both AIs?
There are also two experimental systems created solely for the benchmark by E. Pang and J. Berman.
Historically, the ARC-AGI-1 benchmark saw far longer delays between an OpenAI model reaching a level and an Anthropic model surpassing it; the o1-mini/Claude Sonnet 3.7 and o3-mini/Claude Sonnet 4 pairs were separated by 165 and 111 days, respectively.
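The same check applies to the historical pairs; again the dates are my assumption of the public release dates (o1-mini: 2024-09-12, Claude Sonnet 3.7: 2025-02-24, o3-mini: 2025-01-31, Claude Sonnet 4: 2025-05-22), and they reproduce the 165- and 111-day delays:

```python
# Sketch: the historical release-date delays, under the assumed dates above.
from datetime import date

historical = {
    "o1-mini -> Claude Sonnet 3.7": (date(2024, 9, 12), date(2025, 2, 24)),
    "o3-mini -> Claude Sonnet 4": (date(2025, 1, 31), date(2025, 5, 22)),
}
for name, (openai_release, anthropic_release) in historical.items():
    print(f"{name}: {(anthropic_release - openai_release).days} days")
# o1-mini -> Claude Sonnet 3.7: 165 days
# o3-mini -> Claude Sonnet 4: 111 days
```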