I’ve generally found it much harder over time to find “examples where LLMs fail in surprising ways”. If you test o3 (released the day after that post!) for the examples they chose, it does much better than previous models. And I’ve just tried it on your “269 words” task, which it nailed.
To be clear, I’m not claiming that the “write a text with precisely X words” task is super-duper-mega-hard, and I wouldn’t be surprised if a new frontier model was much better at it than Gemini. I have a very similar opinion to the author of this post: I’m saying that given what the models currently can do, it’s surprising that they also currently can’t (reliably) do a lot of things. I’m saying that there are very sharp edges in models’ capabilities, much sharper than I expected. And the existence of very sharp edges makes it very difficult to compare AI to humans on a one-dimensional intelligence scale, because instead of “AI is 10 times worse than humans at everything”, it’s “AI is roughly as good as expert humans at X and useless at Y”.
>instead of “AI is 10 times worse than humans at everything”, it’s “AI is roughly as good as expert humans at X and useless at Y”.
How long before it’s “AI is out of sight of expert humans at X and merely far above them at Y”?
Well, if we extrapolate from the current progress, soon AI will be superhumanly good at complex analysis and group theory while only being moderately good at ordering pizza.
That’s why I think that comparing AI to humans on a one-dimensional scale doesn’t work well.
If you extrapolate further, do you think the one-dimensional scale works well to describe the high-level trend (surpassing human abilities broadly)?
I’m trying to determine whether the disagreement here is “AI probably won’t surpass human abilities broadly in a short time” or “even if it does, the one-dimensional scale wasn’t a good way to describe the trend”.
The latter.
I agree that AI capabilities are spiky and developed in an unusual order. And I agree that because of this, the single-variable representation of intelligence is not very useful for understanding the range of abilities of current frontier models.
At the same time, I expect the jump from “Worse than humans at almost everything” to “Better than humans at almost everything” to take less than 5 years, which would make the single-variable representation work reasonably well for the purposes of the graph.
I think these “examples of silly mistakes” have not held up well at all. The improvements were often blamed on “training around the limitations”; however, in the case of the linked post, we got a model the very next day that performed much better.
And almost every benchmark and measurable set of capabilities has rapidly improved (in some cases beyond human experts).
“We too often give wrong answers to questions ourselves to be justified in being very pleased at such evidence of fallibility on the part of the machines. Further, our superiority can only be felt on such an occasion in relation to the one machine over which we have scored our petty triumph.”
Alan Turing, Computing Machinery and Intelligence, 1950