But if this were true, you’d think they’d be able to handle ARC-AGI puzzles (see the example image just above)
In a footnote you note that models do well on ARC-AGI-1, but I think your description of the situation is misleading:
AIs trained on the training set of ARC-AGI (and given a bunch of compute) can beat humans on ARC-AGI-1.
The example puzzle you show for ARC-AGI is one of the easiest puzzles; AIs have been able to succeed at it for a while. The ones AIs get wrong are typically ones which are very large (causing difficulties with perception; see the sketch after this list) or which are actually hard for humans.
ARC-AGI-2 isn’t easy for humans. It’s hard for them too, and AIs probably do similarly to random humans (e.g. mturkers) given a limited period.
ARC-AGI-3 is much more “perception”-loaded than ARC-AGI-2/ARC-AGI-1 due to being structured as a video game (which implicitly means you have to process and understand a long series of frames). I expect this is why AIs struggle.
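To make the “perception” point concrete, here is a minimal sketch of what an ARC task looks like and why grid size matters. The task structure (train/test pairs of grids, cells as color indices 0–9) follows the public ARC dataset format; the text serialization is a hypothetical prompt encoding for illustration, not what any particular lab actually uses.

```python
# A minimal sketch of why large ARC-AGI grids strain an LLM's "perception".
# The task structure follows the public fchollet/ARC dataset format; the
# serialization below is a hypothetical prompt encoding, not any lab's own.

arc_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    ],
    "test": [
        {"input": [[1, 1], [0, 0]]},
    ],
}

def serialize_grid(grid: list[list[int]]) -> str:
    """Render a grid as one text line per row, cells separated by spaces."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

# Text length grows with the number of cells: a 30x30 grid (ARC's maximum
# size) is a couple hundred times more text than the toy 2x2 above, all of
# which the model must parse cell-by-cell before any rule-finding starts.
small = serialize_grid(arc_task["train"][0]["input"])
large = serialize_grid([[0] * 30 for _ in range(30)])
print(len(small), len(large))  # 7 vs. 1799 characters
```

On this kind of encoding, a single large grid is already nearly two thousand characters of raw digits, which is consistent with the pattern that failures on large puzzles are often perception failures rather than reasoning failures.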
Overall, I think LLMs do handle ARC-AGI puzzles. They are well within the human range for ARC-AGI-1/2 and their failures are pretty often perception failures.
Fair enough if your objection is that the level of sample efficiency on this type of task for typical humans isn’t sufficient. (I agree.)
Maybe they’re only good at picking up ideas from an example, if they’d already learned that idea during their original training? In other words, maybe in-context learning is helpful at jogging their memory, but not for teaching new concepts.
My view is that LLMs are generally qualitatively dumber than the most capable humans in a bunch of ways (including the ability to learn new things), but that this is improving over time. There isn’t some dichotomy between “sample efficient learning” and not. I think you’ll struggle to find tasks where AIs haven’t been improving, even if you select tasks by the heuristic “what haven’t AIs already learned?” (AIs do gain an advantage from knowing lots of stuff, but they are also improving at all kinds of stuff).
AIs trained on the training set of ARC-AGI (and given a bunch of compute) can beat humans on ARC-AGI-1.
Say more? At https://arcprize.org/leaderboard, I see “Stem Grad” at 98% on ARC-AGI-1, and the highest listed AI score is 75.7% for “o3-preview (Low)”. I vaguely recall seeing a higher reported figure somewhere for some AI model, but not 98%.
ARC-AGI-2 isn’t easy for humans. It’s hard for them too, and AIs probably do similarly to random humans (e.g. mturkers) given a limited period.
This post states that the “average human” scores 60% on ARC-AGI-2, though I was unable to verify the details (it claims to be a linkpost for an article which does not seem to contain that figure). Personally I tried 10-12 problems when the test was first launched, and IIRC I missed either 1 or 2.
The leaderboard shows “Grok 4 (Thinking)” on top at 16%… and, unfortunately, does not present data for “Stem Grad” or “Avg. Mturker” (in any case I’m not sure what I think of the latter as a baseline here).
Agreed that perception challenges may be badly coloring all of these results.
There isn’t some dichotomy between “sample efficient learning” and not.
Agreed, but (as covered in another comment – thanks for all the comments!) I do have the intuition that the AI field is not currently progressing toward rapid improvement in sample-efficient learning, and may be heading toward a fairly low local maximum.
Say more? At https://arcprize.org/leaderboard, I see “Stem Grad” at 98% on ARC-AGI-1, and the highest listed AI score is 75.7% for “o3-preview (Low)”. I vaguely recall seeing a higher reported figure somewhere for some AI model, but not 98%.
By “can beat humans”, I mean AIs are well within the human range, probably somewhat better than the average/median human in the US at ARC-AGI-1. In this study, humans get 65% right on the public evaluation set.
This post states that the “average human” scores 60% on ARC-AGI-2, though I was unable to verify the details (it claims to be a linkpost for an article which does not seem to contain that figure). Personally I tried 10-12 problems when the test was first launched, and IIRC I missed either 1 or 2.
I’m skeptical; I bet mturkers do worse. This is very similar to the score that was found for humans on ARC-AGI-1 in this study, even though ARC-AGI-1 is much easier from my understanding.
By “hard for humans”, I just mean that it takes substantial effort even for somewhat smart humans; I don’t mean that humans can’t do it.