I can see your point, especially with regard to difficulty and to usefulness, but I do think that things like “percent correct questions on a test that humans consider a meaningful assessment of other humans’ skill level” are a reasonably natural measurement of capability. At the very least, it’s what we use for humans, and it’s at least passable at sorting humans into jobs such that productivity is higher than it’d otherwise be.
I think, as far as quantitative analysis goes, “this model does better than X percent of humans and worse than 100-X percent of humans on Y task that we care about” is the best we can reasonably do, and should be useful enough. You can argue that there’s a higher-order ‘real intelligence’ that doesn’t track linearly with this, but it tracks with how fast LLMs are approaching super-human capabilities at the tasks outlined, which seems like it’d be the primary concern as long as you’re looking at a useful set of tasks.