Note that I’m not arguing that LLMs aren’t getting better along some dimensions.[1] The part you quoted was mostly about immediate visceral impressions that a newly released model is much smarter based on how it talks, not about more careful investigation.
In particular, “better conceptual resolution” is precisely the sort of improvement I would have expected (though I guess I didn’t pre-register that). This is what e. g. GPQA probably measures, and I’ve personally also noticed improvements in how accurately LLMs understand research papers loaded into their context. (Sonnet 4 makes mistakes Opus 4 or o3 don’t.)
What would concern me is if LLMs started rapidly scaling at autonomous production/modification of new ideas. From what you’re describing, that isn’t quite what’s happening? The outputs are only reliable when they’re reporting things already in your notes, and the value of the new ideas they generate is usually as inspiration rather than as direct contributions. There’s an occasional wholesale good idea, but the rate of those is low, and even when they do happen, it’s usually because you’ve already done 90% of the work setting up the context. If so, that matches my experience as well.
Now, granted, in the limit of infinitely precise conceptual resolution, LLMs would develop the abilities to autonomously act and innovate. But what seems to be happening is that performance on some tasks (“chat with a PDF”) scales with conceptual resolution much better than the performance on other tasks (“prove this theorem”, “pick the next research direction”), and conceptual resolution isn’t improving as fast as it once was (as e. g. in the GPT-3 to GPT-4 jump). So although ever more people are finding LLMs useful for processing/working with their ideas, the rate of absolute improvement there is low, and it doesn’t look to me like it’s on a transcendental trajectory.
Though it’s also possible that my model of idea complexity is incorrect, that LLMs’ scaling is still exponential in this domain, and that we are on track for them to become superhumanly good at idea processing and generation by e. g. 2030. Definitely something to keep in mind.
I’d be interested in your reports on how LLMs’ contributions to your work change with subsequent generations of LLMs.
In the context of the OP, my main point here is that I don’t think any of this can be explained as “personality improvements”.
Do you think some of the effect might just be from your getting better at using LLMs, knowing what to expect and what not to expect from them?
I don’t want to say that pretraining will “plateau” as such; I do expect continued progress. But the dimensions along which that progress happens are going to decouple from the intuitive “getting generally smarter” metric, and will face steep diminishing returns.
Yep, you did explicitly state that you expect LLMs to keep getting better along some dimensions; however, the quote I was responding to seemed too extreme in isolation. I agree that the vibe-bias is a thing (I can be manipulated by “sounding smart” too); I guess part of what I wanted to get across is that it really depends on how you’re testing these things. If you have a serious use-case involving real cognitive labor & you keep going back to it whenever a new model is released, you’ll be much harder to fool with vibes.
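For concreteness, here’s roughly the shape of what I mean, as a minimal sketch only: a fixed set of prompts taken from real work, re-run against each new model and graded by hand. (The `ask_model` wrapper is a hypothetical stand-in for whatever API client you actually use; nothing here is tied to a particular provider.)

```python
# Minimal sketch of a "personal eval": re-run a fixed set of real-work prompts
# against each new model release and save the transcripts for manual comparison.
# ask_model() is a hypothetical placeholder, not a real library call.

import json
from pathlib import Path

PROMPTS_FILE = Path("personal_eval_prompts.json")  # list of {"id": ..., "prompt": ...}

def ask_model(model_name: str, prompt: str) -> str:
    """Hypothetical wrapper around your preferred LLM API client."""
    raise NotImplementedError("plug in your actual API call here")

def run_eval(model_name: str, out_dir: Path = Path("eval_transcripts")) -> None:
    """Run every saved prompt against `model_name` and store transcripts for grading."""
    prompts = json.loads(PROMPTS_FILE.read_text())
    out_dir.mkdir(exist_ok=True)
    for item in prompts:
        answer = ask_model(model_name, item["prompt"])
        out_path = out_dir / f"{model_name}_{item['id']}.txt"
        out_path.write_text(item["prompt"] + "\n\n---\n\n" + answer)

# Usage: run_eval("new-model-release"), then read the transcripts side by side
# with the previous model's and judge them on the work you actually care about.
```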
Now, granted, in the limit of infinitely precise conceptual resolution, LLMs would develop the abilities to autonomously act and innovate. But what seems to be happening is that performance on some tasks (“chat with a PDF”) scales with conceptual resolution much better than the performance on other tasks (“prove this theorem”, “pick the next research direction”), and conceptual resolution isn’t improving as fast as it once was (as e. g. in the GPT-3 to GPT-4 jump).
& notably, improvement is extremely slow on some tasks (eg multiplication of long numbers, even though schoolbook multiplication is only quadratic in the number of digits/tokens & transformers are likewise quadratic in the number of tokens).
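To make the quadratic comparison concrete, here’s a toy sketch of my own (nothing model-specific) counting the single-digit multiplications in schoolbook long multiplication; the count grows as n² in the number of digits, which is roughly the number of tokens involved.

```python
# Count single-digit multiplications in schoolbook long multiplication:
# two n-digit numbers take exactly n * n digit-level multiplications.

def schoolbook_multiply(a: str, b: str) -> tuple[int, int]:
    """Multiply two non-negative integers given as digit strings.

    Returns (product, number_of_single_digit_multiplications).
    """
    digit_ops = 0
    product = 0
    for i, da in enumerate(reversed(a)):
        for j, db in enumerate(reversed(b)):
            product += int(da) * int(db) * 10 ** (i + j)
            digit_ops += 1
    return product, digit_ops

for n in (4, 8, 16, 32):
    x = "9" * n
    result, ops = schoolbook_multiply(x, x)
    assert result == int(x) * int(x)
    print(f"{n}-digit inputs -> {ops} digit multiplications")  # ops == n * n
```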
I somewhat think that conceptual resolution is still increasing about as quickly; it’s just that there are rapidly diminishing returns to conceptual resolution, because the distribution of tasks spans many orders of magnitude in conceptual-resolution-space. LLMs have adequate conceptual resolution for a lot of tasks, now, so even if conceptual resolution doubles, this just doesn’t “pop” like it did before.
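Here’s a toy numeric sketch of that “doesn’t pop” intuition; the log-uniform spread of task difficulty over ten orders of magnitude is a made-up assumption purely for illustration, not a claim about the real task distribution.

```python
# Toy model: if required "conceptual resolution" is log-uniform over 10 orders
# of magnitude, then doubling the model's resolution only conquers
# log10(2) / 10 ~= 3% more of the task distribution, wherever you start.

import random

random.seed(0)
ORDERS_OF_MAGNITUDE = 10
tasks = [10 ** random.uniform(0, ORDERS_OF_MAGNITUDE) for _ in range(100_000)]

def fraction_solvable(resolution: float) -> float:
    """Fraction of tasks whose required resolution the model meets."""
    return sum(t <= resolution for t in tasks) / len(tasks)

for resolution in (1e2, 1e4, 1e6):
    before = fraction_solvable(resolution)
    after = fraction_solvable(2 * resolution)
    print(f"resolution {resolution:g}: {before:.1%} -> {after:.1%} (+{after - before:.1%})")
```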
(Meanwhile, my personal task-distribution has a conceptual resolution closer to the current frontier, so I am feeling very rapid improvement at the moment.)
Humans have almost-arbitrary conceptual resolution when needed (EG we can accurately multiply very long numbers if we need to), so many of the remaining tasks not conquered by current LLMs (EG professional-level math research) probably involve much higher conceptual resolution.
That’s useful information, thanks for sharing!