Just from seeing narrow benchmarks saturate, one could argue that LLMs are simply picking up whatever narrow capabilities are in-focus enough to be trained into them. (I emphatically do not think this is what’s happening in 2025, but narrow benchmark scores alone aren’t enough to show that.) A well-designed intelligence benchmark, by contrast, would be impossible to score well into the human range on without the ability to do novel (and thereby general) problem-solving, and impossible to saturate without the ability to do so at an above-genius level.
As for the question of whether it’d persuade people with their heads stuck in the sand, “x model is smarter than some-high-percent of people” is a lot harder to ignore than “x model scored some-high-numbers on a bunch of coding, knowledge, etc. benchmarks”. Beyond being more useful, giving model scores relative to people (or, in some situations, subject-matter experts) is also more confronting. That said, I don’t doubt that there are many people who wouldn’t be persuaded by even that.
Agreed on all counts. I really, genuinely do hope to see your attempt at such a benchmark succeed, and I believe that such a benchmark is possible.