I thought I was just speculating about the potential for multiple valid solutions, but now I see that the ARC 2 launch post not only acknowledges the possibility but says ambiguity is sometimes even there by design!
> Like ARC-AGI-1, ARC-AGI-2 uses a pass@2 measurement system to account for the fact that certain tasks have explicit ambiguity and require two guesses to disambiguate. As well as to catch any unintentional ambiguity or mistakes in the dataset. Given controlled human testing with ARC-AGI-2, we are more confident in the task quality compared to ARC-AGI-1. [emphasis added]
I continue to question whether a 2-out-of-9 solve rate among their human testers should have given them such confidence. My guess is that they expected higher human performance, especially with an incentive of $5 per solve. (MTurkers had achieved better performance on ARC 1 even though their pay was largely unconditional on solving: $10 for attempting five tasks plus “a bonus of $1 if they succeeded at a randomly selected task...”)
One aspect of the human testing design likely weakened the intended solve incentive: participants were given a fixed 90 minutes rather than a fixed number of tasks. Under that constraint, the reward-maximizing strategy is to abandon relatively hard problems quickly rather than give them your best effort.
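A back-of-the-envelope calculation makes the incentive problem concrete. The task times and solve probabilities below are purely hypothetical assumptions for illustration (the launch post reports neither), but they show how a fixed time budget with per-solve pay rewards skipping hard tasks:

```python
# Hypothetical illustration of per-solve pay under a fixed time budget.
# Task durations and solve probabilities are assumed, not measured.

def expected_earnings(minutes_per_task, p_solve,
                      session_minutes=90, pay_per_solve=5):
    """Expected payout if every attempted task takes the same time
    and has the same independent solve probability."""
    tasks_attempted = session_minutes // minutes_per_task
    return tasks_attempted * p_solve * pay_per_solve

# Strategy A: persist on hard tasks (assume 30 min each, 20% solve chance).
persist = expected_earnings(30, 0.20)
# Strategy B: bail quickly and attempt more tasks (assume 10 min, 60% chance).
skip = expected_earnings(10, 0.60)

print(f"persist on hard tasks: ${persist:.2f}")
print(f"skip to easier tasks:  ${skip:.2f}")
```

Under these assumed numbers, skipping yields several times the expected payout of persisting, so a rational tester's effort on exactly the hard tasks, the ones driving the 2-out-of-9 figure, is systematically undercut.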