I think the benchmark methodology is pretty bad/misleading.
“For each level that is counted, compare the AI agent’s action count to a human baseline, which we define as the second-best human action action[sic]. Ex: If the second-best human completed a level in only 10 actions, but the AI agent took 100 to complete it, then the AI agent scores (10/100)² for that level, which gets reported as 1%. Note that level scoring is calculated using the square of efficiency.” (See full here)
Defining the human baseline as the second-best human performance for each level (out of 486 participants drawn from the San Francisco general public) doesn’t seem helpful, and the method of scoring the AI results is also unintuitive to me (not to mention it simply defines the human score to be 100%).
If the second-best human takes 10 moves to solve a level and the AI takes 13, its score for that level will be (10/13)², or about 60%, which seems unreasonable to me, especially since nothing in the AI’s prompt indicated that it was better to finish a level in the fewest possible moves.
They also set a cutoff for the AI at 5x the human baseline, beyond which the AI scores 0 (which means their quoted example of how scoring works couldn’t actually happen, since the AI would have been cut off after 50 steps).
They cap the AI’s score at 100% per level even if it finds a way to complete one in fewer moves than the human baseline, so it’s completely impossible for it to score better than humans; at best it can tie the human score, and only if it matches the second-best human performance on every single level.
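Putting the pieces together, the per-level scoring rule as I understand it from the quoted description looks roughly like this (my own reconstruction and naming, not the benchmark’s actual code):

```python
def level_score(human_actions: int, ai_actions: int) -> float:
    """AI's score for one level, in [0, 1], per the described rules:
    square of efficiency, zeroed past a 5x cutoff, capped at 100%."""
    if ai_actions > 5 * human_actions:
        # 5x cutoff: the run is terminated and scores 0
        return 0.0
    efficiency = human_actions / ai_actions
    # Cap at 100% even if the AI beats the second-best human
    return min(efficiency, 1.0) ** 2

print(level_score(10, 13))   # ~0.59, the "about 60%" example above
print(level_score(10, 100))  # 0.0 -- past the 5x cutoff, so the quoted 1% example can't occur
print(level_score(10, 8))    # 1.0 -- capped even though the AI outperformed the baseline
```

Note how the cutoff and the cap interact: the score can only ever fall in [0, 1], and the quoted (10/100)² example is unreachable because the 5x cutoff fires first.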
It’s actually a little worse than I thought: apparently some of the levels include a “fog-of-war” mechanic, so whether you pick a good path is essentially up to chance. That wouldn’t be so bad on its own, but combined with “take the second-best human performance for each level” it’s definitely not a fair evaluation.
Per one of the comments in that tweet, this is a fair critique, but the AI scores are currently low enough that it doesn’t meaningfully change anything. Though it definitely makes you worry how many other blatantly bad decisions are not yet publicly known.