Why is the human baseline so low? This is more tentative, but I’m thinking in terms of two basic possibilities. When only 2-5 out of 9 humans find the intended solution to a task, it could be because the task is well-posed but difficult, or because it has 2-3 equally valid solutions and landing on the intended “correct” answer requires pure luck.
Starting with the latter possibility, a non-trivial proportion of these tasks may be ill-posed problems with multiple legitimate answers. Even an ideal solver would still only average 50% (or 33% or lower) on such tasks. If this were the case for all eval tasks, ~50% would not only be the human baseline but would also mean saturation of the benchmark.
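To make that arithmetic concrete, here is a minimal back-of-envelope sketch (my own illustration, not anything from the ARC team), assuming a solver effectively gets one guess per task and an ambiguous task has k equally plausible answers; pass@2 would raise the ceiling for k = 2 if the solver recognizes the ambiguity and submits both candidates:

```python
# Back-of-envelope ceiling on benchmark score (illustrative assumptions: one
# effective guess per task; each ambiguous task has k equally valid answers).
def score_ceiling(ambiguous_frac: float, k: int) -> float:
    """Expected score of an ideal solver when `ambiguous_frac` of tasks
    are ambiguous with k equally plausible answers each."""
    return (1 - ambiguous_frac) + ambiguous_frac / k

print(score_ceiling(1.0, 2))   # every task ambiguous, 2 answers -> 0.50
print(score_ceiling(1.0, 3))   # every task ambiguous, 3 answers -> ~0.33
print(score_ceiling(0.25, 2))  # a quarter of tasks ambiguous -> 0.875
```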
Perhaps more likely is that the human baseline is so low primarily due to tricky tasks that a competent human doesn’t automatically “get” with enough thought. She needs a sort of “luck” in terms of having some idiosyncratic experience or idea to get to the solution. The fact that a non-trivial proportion of ARC 2 tasks (over 300, according to Figure 5 in the technical paper) were not even solved by two participants suggests that this is the case for a decent proportion of the tasks they designed. Note that some tasks which only one ninth of potential participants (the idealized “population”) would solve on average will nevertheless happen to be solved by 2 of the 9 sampled participants by chance, leading to some such too-hard tasks being included in ARC-AGI-2 (a quick binomial check is sketched below). By contrast, if they had designed a set of tasks and found that all of them could be solved by at least 2 out of 9 humans, that would be more reassuring: it would be evidence that their task design process reliably produces human-solvable tasks, albeit often tasks that fewer than half of humans can solve. To the extent this is the reason for the low human baseline, AI systems may be able to substantially outdo typical human performance and approach 100% despite a human baseline around 50%.
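Here is the quick binomial check referenced above (again my own rough illustration; the exact 1/9 solve probability and the independence of testers are simplifying assumptions, not anything reported by the ARC team):

```python
from math import comb

# Probability that a task only ~1/9 of the idealized population can solve is
# nevertheless solved by at least 2 of 9 sampled testers (simple binomial model;
# the 1/9 figure and tester independence are illustrative assumptions).
n, p = 9, 1 / 9
p_at_most_1 = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(2))
print(f"P(at least 2 of 9 solve it) = {1 - p_at_most_1:.0%}")  # roughly 26%
```

In other words, on this rough model, roughly a quarter of such too-hard tasks would slip past a “solved by at least two testers” filter.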
I thought I was just speculating about the potential for multiple valid solutions, but now I see that the ARC 2 launch post not only acknowledges the possibility but says ambiguity is sometimes even there by design!
Like ARC-AGI-1, ARC-AGI-2 uses a pass@2 measurement system to account for the fact that certain tasks have explicit ambiguity and require two guesses to disambiguate. As well as to catch any unintentional ambiguity or mistakes in the dataset. Given controlled human testing with ARC-AGI-2, we are more confident in the task quality compared to ARC-AGI-1. [emphasis added]
I continue to question whether 2-out-of-9 solve rates among their human testers should have given them such confidence. My guess is that they expected higher human performance, especially with an incentive of $5 per solve. (MTurkers had achieved better performance on ARC 1 despite their lower pay being largely unconditional on solving: $10 for attempting five tasks plus “a bonus of $1 if they succeeded at a randomly selected task...”)
One aspect of the human testing design likely reduced the intended solve incentive: participants were given a fixed 90 minutes rather than a fixed number of tasks. Under that setup, the reward-maximizing strategy is to move on quickly from relatively hard problems rather than give each one your best effort.