A couple of things happened that made terms like AGI/ASI less useful:

1. We didn't realize how correlated benchmark progress and general AI progress are, and benchmark tasks systematically sample from disproportionately easy-to-automate areas of work:

Not a novel point, but: one reason that progress against benchmarks feels disconnected from real-world deployment is that good benchmarks and AI progress are correlated endeavors. Both benchmarks and ML fundamentally require verifiable outcomes, so within any task distribution, the benchmarks we create systematically sample from the automatable end. Importantly, there's no reason to believe this will stop, so we should expect benchmarks to keep looking rosy compared to real-world deployment.
Thane Ruthenis also has an explanation of why benchmarks tend to overestimate progress here.
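To make the selection effect concrete, here's a toy simulation (my own sketch with made-up numbers, not something from either linked post): tasks that are easy to verify are assumed to correlate with tasks that are easy for current ML to do well on, benchmarks get built only from the most verifiable slice, and the measured benchmark score ends up well above the deployment-wide average for the exact same model.

```python
# Toy selection-effect simulation. Assumptions (mine, not from the post):
#  - each task has a latent "verifiability" score,
#  - automatability correlates with verifiability (corr ~0.6 here, arbitrary),
#  - benchmarks are built only from the most verifiable 10% of tasks.
import math
import random
import statistics

random.seed(0)
N_TASKS = 100_000

tasks = []
for _ in range(N_TASKS):
    verifiability = random.gauss(0.0, 1.0)
    # Easy-to-verify tasks are assumed to also be easier for ML to optimize
    # against, since benchmarking and training both need checkable outcomes.
    automatability = 0.6 * verifiability + 0.8 * random.gauss(0.0, 1.0)
    success_rate = 1.0 / (1.0 + math.exp(-automatability))  # model success in [0, 1]
    tasks.append((verifiability, success_rate))

# Benchmark construction: keep only the top decile by verifiability.
cutoff = sorted(v for v, _ in tasks)[int(0.9 * N_TASKS)]
benchmark_scores = [s for v, s in tasks if v >= cutoff]
deployment_scores = [s for _, s in tasks]

print(f"mean success on benchmark tasks:   {statistics.mean(benchmark_scores):.2f}")
print(f"mean success across all real work: {statistics.mean(deployment_scores):.2f}")
# Same 'model' in both cases; the gap is pure selection on verifiability.
```

The point of the sketch is just that nothing has to go wrong with the benchmarks themselves for the gap to appear; selecting on verifiability is enough.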
2. We assumed intelligence/IQ was much lower-dimensional than it really is. Now, I don't totally blame people for thinking there was a chance that AI capabilities would turn out to be low-dimensional, but way too many expected an IQ analogue for AI to work. This is partly an artifact of current architectures being limited on some key dimensions like continual learning and long-term memory, but I wouldn't put anywhere close to all of the jaggedness on AI deficits; instead, LWers forgot that reality is surprisingly detailed.
Remember that even in humans, IQ only explains something like 30-40% of the variance in performance. That's more than a lot of people want to admit, but nerd communities like LessWrong have the opposite failure mode: believing that intelligence/IQ/capability is very low-dimensional, with a single number dominating how performant you are.
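As a rough illustration (again my own toy sketch, with the 0.35 factor loading chosen only to match the 30-40% figure above): even when a single general factor explains about a third of the variance in task performance, agents a full standard deviation apart on that factor still trade wins on a sizable fraction of individual tasks, which is exactly the jaggedness in question.

```python
# Toy sketch of a single general factor explaining ~35% of performance variance
# (the 0.35 below is an assumption picked to match the 30-40% figure above).
import math
import random
import statistics

random.seed(0)
N_AGENTS, N_TASKS = 2_000, 50
G_SHARE = 0.35  # assumed share of variance carried by the single "g-like" number

g = [random.gauss(0.0, 1.0) for _ in range(N_AGENTS)]
perf = [
    [math.sqrt(G_SHARE) * g[i] + math.sqrt(1.0 - G_SHARE) * random.gauss(0.0, 1.0)
     for _ in range(N_TASKS)]
    for i in range(N_AGENTS)
]

# Variance in task performance explained by the single factor (squared correlation).
pooled_g = [g[i] for i in range(N_AGENTS) for _ in range(N_TASKS)]
pooled_perf = [p for row in perf for p in row]
r = statistics.correlation(pooled_g, pooled_perf)
print(f"variance explained by the single factor: {r * r:.0%}")

# Jaggedness: an agent ~1 standard deviation higher on the single factor still
# loses on a sizable fraction of individual tasks.
order = sorted(range(N_AGENTS), key=lambda i: g[i])
median_agent, plus_one_sd_agent = order[N_AGENTS // 2], order[int(0.84 * N_AGENTS)]
upsets = sum(perf[median_agent][t] > perf[plus_one_sd_agent][t] for t in range(N_TASKS))
print(f"tasks where the median agent beats the +1 SD agent: {upsets}/{N_TASKS}")
```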
To be frank, this is a real-life version of an ontological crisis: certain assumptions about AI, especially on LW, turned out to be entirely wrong, which means some of the old goalposts/risks at best require conceptual fragmentation and at worst turn out to be incoherent.