This would be great to have, for sure, and I wish you luck in working on it!
I wonder if, for the specific types of discussions you point to in the first paragraph, it’s necessary or even likely to help? Even if all the benchmarks today are ‘bad’ as described, they measure something, and there’s a clear pattern of rapid saturation as new benchmarks are created. METR and many others have discussed this a lot. There have been papers on it. It seems like the meta-level approach of mapping out saturation timelines should be sufficient to convince people that for any given capability they can define, if they make a benchmark for it, AI will acquire that capability at the level the benchmark can measure. In practice, what follows is usually some combination of pretending it didn’t happen, or else denying the result means anything and moving the goalposts. For a lot of people I end up in those kinds of discussions with, I don’t think much would help beyond literally seeing AI put them and millions of others permanently out of work, and even then I’m not sure.
Just from seeing narrow benchmarks saturate, one could argue that what's happening is LLMs are picking up whatever narrow capabilities are in-focus enough to train into them. (I emphatically do not think this is what's happening in 2025, but narrow benchmark scores alone aren't enough to show that.) A well-designed intelligence benchmark, by contrast, would be impossible to score well into the human range on without the ability to do novel (and thereby general) problem-solving, and impossible to saturate without the ability to do so at above-genius level.
As for the question of whether it’d persuade people with their heads stuck in the sand, “x model is smarter than some-high-percent of people” is a lot harder to ignore than “x model scored some-high-numbers on a bunch of coding, knowledge, etc. benchmarks”. Putting aside how it’s more useful, giving model scores relative to people (or, in some situations, subject matter experts) is also more confronting. That said, I don’t doubt that there are many people who wouldn’t be persuaded by even that.
Agreed on all counts. I really, genuinely do hope to see your attempt at such a benchmark succeed, and believe that it's possible.