They’re good at ARC-AGI despite presumably never having seen this type of challenge before.
To nitpick, the ARC Prize foundation has found some odd signs of (maybe) memorization. E.g., Gemini 3’s reasoning traces show it thinking:
… Target is Green (3). Pattern is Magenta (6) Solid. Result: Magenta Square on Green … (Gemini 3 Deep Think)
But the JSON it receives as input has no colors! It’s clearly pretty familiar with the tests, even if it might not have seen a particular one before.
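To make the point concrete, here is a minimal sketch of what the model actually receives versus what it talks about. ARC tasks are JSON grids of integers 0–9; the color names come only from the palette used in the public ARC web viewer (the mapping below is that viewer's convention, reproduced for illustration — the grid values and names here are made up as an example, not taken from the actual task):

```python
# An ARC-style input: a JSON grid of integers 0-9. No color information.
task_input = [[0, 3, 3],
              [0, 3, 3],
              [6, 6, 6]]

# The ARC web viewer's conventional palette (never sent to the model):
ARC_PALETTE = {0: "black", 1: "blue", 2: "red", 3: "green", 4: "yellow",
               5: "grey", 6: "magenta", 7: "orange", 8: "cyan", 9: "maroon"}

# So when a reasoning trace says "Target is Green (3) ... Magenta (6)",
# the model is recalling this rendering convention from training data,
# not reading it out of its input.
print(ARC_PALETTE[3], ARC_PALETTE[6])  # green magenta
```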
And while they can solve them, I’m not sure they’re “good” (or human-level efficient) just yet.
LLMs typically use a lot of reasoning tokens. An early version of o3 scored 75% on ARC-AGI-1… but it spent $200 per task (!) doing so. That’s an extreme outlier, granted. But all LLMs with humanlike scores (~80%) on ARC-AGI-2 are pretty expensive (typically a dollar to several dollars per task). The current best performer, GPT-5.5 on xHigh, costs $1.87/task. That’s between 40k and 60k reasoning tokens (half the length of the first Harry Potter book) for every single task—quite a lot for puzzles that humans can (mostly) solve at a glance.
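As a back-of-envelope sanity check on those figures (the $1.87/task and 40k–60k token numbers are from the text above; the implied per-million-token price is just derived arithmetic, not a quoted rate):

```python
# Derive the effective price per million reasoning tokens implied by
# $1.87/task spread over 40k-60k tokens.
cost_per_task = 1.87  # dollars per task (figure quoted above)

for tokens in (40_000, 60_000):
    per_million = cost_per_task / tokens * 1_000_000
    print(f"{tokens:>6} tokens -> ~${per_million:.0f} per million tokens")
```

That works out to roughly $31–$47 per million tokens, i.e. frontier-model reasoning pricing; the cost is dominated by sheer token volume, not an exotic rate.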
To me, this indicates some degree of “brute-forcing” is still going on in ARC-AGI.
I broadly agree with the post as a whole.