Addendum to ARC-AGI analysis (18 Nov ’25): GPT-5.1, Gemini 3 Pro and Grok 4 Fast
While GPT-5.1’s improvement on the ARC-AGI-1 benchmark was mostly incremental and continued the straight line described in my prior analysis, Gemini 3 Pro and Gemini 3 Deep Think Preview scored 75% and 87.5% respectively on ARC-AGI-1, at $0.493/task and an undisclosed cost (presumably $44.26/task). For comparison, o3-preview reached 75% for $200/task and 88% for over $1K/task.
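As a quick sanity check on the scale of the cost collapse, here is a back-of-the-envelope comparison using only the per-task figures quoted above. Note the assumptions: the Deep Think cost is the presumed $44.26/task, and o3-preview’s high-compute run is taken at its $1K lower bound, so both ratios are rough.

```python
# Rough cost ratios at approximately matched ARC-AGI-1 scores, from the
# figures quoted above. Deep Think's cost is the *presumed* $44.26/task;
# o3-preview (high) uses the $1K lower bound, so that ratio is a floor.
pairs = {
    "75% tier:  Gemini 3 Pro vs o3-preview": (0.493, 200.0),
    "~88% tier: Deep Think vs o3-preview (high)": (44.26, 1000.0),
}

for label, (new_cost, old_cost) in pairs.items():
    print(f"{label}: ~{old_cost / new_cost:,.0f}x cheaper per task")

# 75% tier:  Gemini 3 Pro vs o3-preview: ~406x cheaper per task
# ~88% tier: Deep Think vs o3-preview (high): ~23x cheaper per task
```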
We don’t yet know anything about Grok 4.1’s performance here, but Grok 4 Fast scored 48.5% on ARC-AGI-1 for $0.031/task, while coming in slightly above GPT-5-mini (medium) on ARC-AGI-2.
The most important news is that Gemini 3 Pro and Deep Think left Pang and Berman’s agents far behind on the ARC-AGI-2 benchmark, scoring 31.1% and 45.1% respectively. This suggests that the Geminis’ training incorporates a major breakthrough.
To test this hypothesis, we can look at how other models moved on the benchmark. GPT-5.1 (thinking, high) surpassed Claude Sonnet 4.5 and Grok 4, reaching a 17.6% success rate for $1.17/task. The gain GPT-5.1 shows between thinking-medium and thinking-high and the gain Claude Sonnet 4.5 shows between 16K and 32K thinking tokens are similar in size to each other, and neither comes close to Gemini’s leap, implying that Gemini’s algorithm isn’t used by OpenAI or Anthropic. I suspect that Gemini’s algorithm, unlike OpenAI’s or Anthropic’s approach, is a distillation of an approach resembling AlphaEvolve or the Pang-Berman agents.
The AI-2027 forecast implied that it would be Agent-3 that would be taught currently weak skills like research taste and coordination. If Gemini’s breakthrough on ARC-AGI-2 comes from training an analogue of research taste already, then algorithmic breakthroughs could run into diminishing returns, forcing future models to rely on scale well before reaching SC.