Note that my notion of outperform here is exactly the same notion one would use when comparing different ML systems. They obviously need to be performing exactly the same task: in general you can't claim ML system X is better than ML system Y when X is wired directly to the output and Y has to perform a more complex motor-visual task that includes X's task as a subtask. That would be ridiculous, frankly.
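To make that concrete, here is a minimal toy sketch (everything in it is invented for illustration, not taken from the actual comparison under discussion): the same predictor is scored twice, once wired directly to the output and once behind an extra noisy stage standing in for the more complex motor-visual task. The identical system looks much worse in the second setup, so treating the two scores as comparable is exactly the error described above.

```python
# Toy sketch (hypothetical, for illustration only): the same "system" scored
# on a bare task vs. a wrapped task that adds an extra noisy stage.
import random

random.seed(0)

def system(x: int) -> int:
    """A stand-in model: predicts the parity of an integer (always correct)."""
    return x % 2

def bare_task_accuracy(trials: int = 10_000) -> float:
    """System X: wired directly to the output."""
    xs = [random.randrange(100) for _ in range(trials)]
    return sum(system(x) == x % 2 for x in xs) / trials

def wrapped_task_accuracy(noise: float = 0.2, trials: int = 10_000) -> float:
    """System Y: the *same* predictor, but its input first passes through a
    noisy channel -- a crude stand-in for the extra motor-visual demands."""
    correct = 0
    for _ in range(trials):
        x = random.randrange(100)
        observed = x + 1 if random.random() < noise else x  # corrupted input
        correct += system(observed) == x % 2
    return correct / trials

print(f"X, bare task:    {bare_task_accuracy():.3f}")    # 1.000
print(f"Y, wrapped task: {wrapped_task_accuracy():.3f}")  # ~0.800
```

Concluding from these two numbers that X is the better system would be wrong by construction: X and Y are the same system, and only the tasks differ.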
If AI can reliably be used to produce outputs that are in some way better (faster, more accurate, etc.) than humans', it doesn't matter whether the contest is fair; the AI will begin replacing humans at this task anyway.
The issue is that nobody cares about this specific task directly. I'm pretty sure you don't even care about this specific task directly. The only interest in this task is as some sort of proxy for performance on actually important downstream tasks (reading, writing, math, etc.). And that's why it's so misleading to draw conclusions from ML system performance on proxy task A and human performance on (more complex) proxy task B.
(I agree with you that next-token prediction is not itself commercially valuable, whereas something like proofreading would be)
I think you would agree that humans are enormously better at proofreading than implied by this weird comparison (on two different proxy tasks). Obvious counter-example proof: there are humans who haven't played this specific game yet and would score terribly, but who are good at proofreading.