Where is the hard evidence that LLMs are useful?
Has anyone seen convincing evidence of AI driving developer productivity or economic growth?
It seems I am only reading negative results from studies of applications.
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
https://www.lesswrong.com/posts/25JGNnT9Kg4aN5N5s/metr-research-update-algorithmic-vs-holistic-evaluation
And in terms of startup growth:
https://www.lesswrong.com/posts/hxYiwSqmvxzCXuqty/generative-ai-is-not-causing-ycombinator-companies-to-grow
Apparently, broader economic measurements are inconclusive?
Also, agency still seems very bad, about what I would have expected from decent scaffolding on top of GPT-3:
https://www.lesswrong.com/posts/89qhQH8eHsrZxveHp/claude-plays-whatever-it-wants
(Plus ongoing poor results on Pokémon: modern LLMs can still only win with elaborate task-specific scaffolding.)
Though performance on the IMO seems impressive, the very few examples of mathematical discoveries by LLMs don’t seem (to me) to be increasing much in either frequency or quality, and so far they are mostly of the type “get a better lower bound by combinatorially trying stuff”, which seems to advantage computers with or without AI. And even that type of example is rare: the vast majority of such attempts have probably failed, and we only hear about the few successful ones, none of which seem to have been significant for any reason other than coming from an LLM.
I increasingly suspect that a lot of the recent progress in LLMs has been illusory: overfitting to benchmarks, which may even leak into the training set (am I right about this?), while merely appearing useful. And METR is sufficiently good at their job that, if so, this will become apparent in task-length measurements before the 8-hour mark.
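To make the 8-hour mark concrete, here is a toy sketch (all numbers invented; not METR’s actual data) of how one would extrapolate the task-length doubling trend and see when it crosses 8 hours. If progress is partly benchmark overfitting, the fitted doubling time should stretch out before that crossing.

```python
# Toy extrapolation of METR-style task-length data.
# The (year, minutes) points below are INVENTED for illustration.
import numpy as np

years = np.array([0.0, 0.5, 1.0, 1.5, 2.0])   # years since some baseline
minutes = np.array([8, 15, 28, 50, 90])       # task length at 50% success

# Fit log2(minutes) = a + b * years; the doubling time is 1/b years.
b, a = np.polyfit(years, np.log2(minutes), 1)

doubling_months = 12 / b
crossing = (np.log2(8 * 60) - a) / b          # when the fit reaches 8 hours

print(f"doubling time ~{doubling_months:.1f} months")
print(f"fit crosses 8-hour tasks ~{crossing:.1f} years after baseline")
```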
I’m trying to make belief in rapid LLM progress pay rent, and at some point benchmarks are not the right currency. Maybe that point is “not yet” and we will see useful applications only right before superintelligence, etc., but I am skeptical of that narrative; at the least, it does little to justify short timelines, because it leaves the point of usefulness to guesswork.
Are you looking for utility in all the wrong places?
Recent news has quite a few mentions of: AI tanking the job prospects of fresh grads across multiple fields and, at the same time, AI causing a job-market bloodbath in the usual outsourcing capitals of the world.
That sure lines up with known AI capabilities.
AI isn’t at the point of “radical transformation of everything” yet, clearly. You can’t replace a badass crew of 10x developers who can build the next big startup with AIs today. AI doesn’t unlock all that many “things that were impossible before” either; some are here already, but not enough to upend everything. What it does instead is take the cheapest, most replaceable labor on the market and make it cheaper and more replaceable. That’s the ongoing impact.
idk if these are good search results, but I asked Claude to look things up and check whether the citations seem to justify the claim. If we care about the results, someone should read the articles for real.
Yep, that’s what I’ve seen.
The “entry-level jobs” study looked alright at a glance. I did not look into the claims of outsourcing job losses in any more detail; I only noted that they were made multiple times.
Citation needed
I’m not saying it’s a bad take, but I asked for strong evidence. I want at least some kind of source.
There’s this recent paper; see Zvi’s summary/discussion here. I have not looked into it deeply, and it looks a bit weird to me. Overall, the very fact that there’s so much confusion around whether LLMs are or are not useful is itself extremely weird.
(Disclaimer: off-the-cuff speculation, no idea if that is how anything works.)
I’m not sure how much I buy this narrative, to be honest. The kind of archetypal “useless junior dev” who can be outright replaced by an LLM probably… wasn’t being hired to do the job anyway, but as a human-capital investment: to be transformed into a middle/senior dev, whose job an LLM can’t yet do. So LLMs achieving short-term capability parity with juniors shouldn’t hurt juniors’ job prospects, because they weren’t hired for their existing capabilities in the first place.
Hmm, perhaps it’s not quite like this. Suppose companies weren’t “consciously” hiring junior developers as a future investment; suppose they “thought”[1] junior devs were actually useful, in the sense that if they had “known” juniors were just a future investment, they wouldn’t have hired them. The appearance of LLMs as capable as junior devs would then remove the pretense that junior devs provide counterfactual immediate value. So their hiring would stop, because middle/senior managers would be unable to keep justifying it, despite the quiet fact that juniors were effectively never hired for their immediate skills anyway. And so the career pipeline would get clogged.
Maybe that’s what’s happening?
(Again, no idea if that’s how anything there works, I have very limited experience in that sphere.)
In a semi-metaphorical sense: as an emergent property of the social dynamics between middle managers, who report on juniors’ performance, and senior managers, who set company priorities based in part on what would look good and justifiable to the shareholders, or something along those lines.
This is the hardest evidence anyone has brought up in this thread (?), but I’m inclined to buy your rebuttal that the trend really started in 2022, which is hard to attribute to LLMs.
I don’t think it’s reasonable to expect such evidence to appear after so short a period of time. There was no hard evidence that electricity was useful, in the sense you are talking about, until the 1920s. Current LLMs are clearly not AGIs in the sense of being able to integrate into the economy the way migrant labor does; therefore, productivity gains from LLMs are bottlenecked on users.
I find this reply broadly reasonable, but I’d like to see some systematic investigation of the analogy between the gradual adoption and rising utility of electricity and the gradual adoption and rising utility of LLMs (as well as of other “truly novel technologies”).
That’s interesting, but adoption of LLMs has been quite fast.
There is a difference between adoption as in “people are using it” and adoption as in “people are using it in an economically productive way”. I think the supermajority of productivity gains from LLMs is realized as pure consumer surplus right now.
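To make the consumer-surplus point concrete, a toy calculation (all numbers invented):

```python
# Toy numbers, purely illustrative. What a user pays shows up in
# economic statistics; the extra value they derive does not.
subscription_price = 20.0   # $/month actually paid (measured)
value_to_user = 150.0       # $/month of value derived (assumed, unmeasured)

consumer_surplus = value_to_user - subscription_price
surplus_share = consumer_surplus / value_to_user

print(f"consumer surplus: ${consumer_surplus:.0f}/month "
      f"({surplus_share:.0%} of total value)")
```

On numbers like these, most of the value never shows up in productivity statistics, which would explain measurements looking flat while usage is high.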
I understand your theory.
However, I am asking in this post for hard evidence.
If there is no hard evidence, that doesn’t prove a negative, but it does mean a lot of LW is engaging in a heavy amount of speculation.
My impression is that, so far, the kinds of people whose work could be automated aren’t the kind to navigate the complexities of building bespoke harnesses to have LLMs do useful work. So we have the much slower process of people manually automating others.
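To be concrete about what “bespoke harness” means here, a minimal sketch; `call_llm` is a hypothetical stand-in for whatever model endpoint one uses, and the invoice task is just an invented example:

```python
# Minimal sketch of a bespoke LLM harness: the prompt, the output
# format, the validation, and the retry policy are all hand-written
# for one specific job. `call_llm` is a hypothetical placeholder.
import json

def call_llm(prompt: str) -> str:
    """Stand-in for whatever model endpoint is actually used."""
    raise NotImplementedError

def extract_invoice_total(invoice_text: str, max_retries: int = 3) -> float:
    # Task-specific prompt: the part someone has to design by hand.
    prompt = (
        "Extract the total amount due from this invoice. "
        'Reply with JSON like {"total": 123.45} and nothing else.\n\n'
        + invoice_text
    )
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            total = float(json.loads(raw)["total"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue  # model didn't follow the format; try again
        if total >= 0:  # task-specific sanity check, also hand-written
            return total
    raise RuntimeError("LLM never produced a usable answer")
```

Every piece of that (prompt, format, validation) is specific to one job, which is exactly the kind of glue the people being automated are unlikely to write themselves.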
The part where you have to build bespoke harnesses seems suspicious to me.
What if, you know, something about how the job needs to be done changes?