Say more? At https://arcprize.org/leaderboard, I see “Stem Grad” at 98% on ARC-AGI-1, and the highest listed AI score is 75.7% for “o3-preview (Low)”. I vaguely recall seeing a higher reported figure somewhere for some AI model, but not 98%.
By “can beat humans”, I mean AIs are well within the human range, probably somewhat better than the average/median human in the US at ARC-AGI-1. In this study, humans get 65% right on the public evaluation set.
This post states that the “average human” scores 60% on ARC-AGI-2, though I was unable to verify the details (it claims to be a linkpost for an article which does not seem to contain that figure). Personally I tried 10-12 problems when the test was first launched, and IIRC I missed either 1 or 2.
I’m skeptical; I bet mturkers do worse. This is very similar to the score that was found for humans on ARC-AGI-1 in this study, and from my understanding ARC-AGI-1 is much easier.
By “hard for humans”, I just mean that it takes substantial effort even for somewhat smart humans; I don’t mean that humans can’t do it.