Where is the hard evidence that LLMs are useful?
Has anyone seen convincing evidence of AI driving developer productivity or economic growth?
It seems I am only reading negative results from studies of applications.
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
https://www.lesswrong.com/posts/25JGNnT9Kg4aN5N5s/metr-research-update-algorithmic-vs-holistic-evaluation
And in terms of startup growth:
https://www.lesswrong.com/posts/hxYiwSqmvxzCXuqty/generative-ai-is-not-causing-ycombinator-companies-to-grow
Apparently, broader economic measurements are inconclusive?
Also, agency still seems very bad, about what I would have expected from decent scaffolding on top of GPT-3:
https://www.lesswrong.com/posts/89qhQH8eHsrZxveHp/claude-plays-whatever-it-wants
(Plus ongoing poor results on Pokémon: modern LLMs can still only win with elaborate task-specific scaffolding.)
Though performance on the IMO seems impressive, the very few examples of mathematical discoveries by LLMs don’t seem (to me) to be increasing much in either frequency or quality, and so far they are mostly of the type “get a better lower bound by combinatorially trying stuff”, which seems to advantage computers with or without AI. And even that type of example is rare: the vast majority of such attempts have probably failed, and we only hear about the few successful ones, none of which seem to have been significant for any reason other than coming from an LLM.
I increasingly suspect that a lot of the recent progress in LLMs has been illusory: overfitting to benchmarks, which may even leak into the training set (am I right about this?), while merely appearing useful. And METR is sufficiently good at their job that, if so, this will become apparent in task-length measurements before the 8-hour mark.
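To make the 8-hour mark concrete, here is a toy sketch (all numbers invented; not METR’s actual data) of how one would extrapolate the task-length doubling trend and see when it crosses 8 hours. If progress is partly benchmark overfitting, the fitted doubling time should stretch out before that crossing.

```python
# Toy extrapolation of METR-style task-length data.
# The (year, minutes) points below are INVENTED for illustration.
import numpy as np

years = np.array([0.0, 0.5, 1.0, 1.5, 2.0])   # years since some baseline
minutes = np.array([8, 15, 28, 50, 90])       # task length at 50% success

# Fit log2(minutes) = a + b * years; the doubling time is 1/b years.
b, a = np.polyfit(years, np.log2(minutes), 1)

doubling_months = 12 / b
crossing = (np.log2(8 * 60) - a) / b          # when the fit reaches 8 hours

print(f"doubling time ~{doubling_months:.1f} months")
print(f"fit crosses 8-hour tasks ~{crossing:.1f} years after baseline")
```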
I’m trying to make belief in rapid LLM progress pay rent, and at some point benchmarks are not the right currency. Maybe that point is “not yet” and we will see useful applications only right before superintelligence, etc., but I am skeptical of that narrative; at the least, it does little to justify short timelines, because it leaves the point of usefulness to guesswork.
Are you looking for utility in all the wrong places?
Recent news has quite a few mentions of: AI tanking the job prospects of fresh grads across multiple fields and, at the same time, AI causing a job-market bloodbath in the usual outsourcing capitals of the world.
That sure lines up with known AI capabilities.
AI isn’t at the point of “radical transformation of everything” yet, clearly. You can’t replace a badass crew of 10x developers who can build the next big startup with AIs today. AI doesn’t unlock all that many “things that were impossible before” either; some are here already, but not enough to upend everything. What it does instead is take the cheapest, most replaceable labor on the market and make it cheaper and more replaceable. That’s the ongoing impact.
idk if these are good search results, but I asked Claude to look things up and check whether the citations seem to justify the claim. If we care about the results, someone should read the articles for real.
Yep, that’s what I’ve seen.
The “entry-level jobs” study looked alright at a glance. I did not look into the claims of outsourcing job losses in any more detail; I only noted that they were made multiple times.
Citation needed
I’m not saying it’s a bad take, but I asked for strong evidence. I want at least some kind of source.
There’s this recent paper; see Zvi’s summary/discussion here. I have not looked into it deeply, and it looks a bit weird to me. Overall, the very fact that there’s so much confusion around whether LLMs are or are not useful is itself extremely weird.
(Disclaimer: off-the-cuff speculation, no idea if that is how anything works.)
I’m not sure how much I buy this narrative, to be honest. The kind of archetypal “useless junior dev” who can be outright replaced by an LLM probably… wasn’t being hired to do the job anyway, but as a human-capital investment: to be transformed into a middle/senior dev, whose job an LLM can’t yet do. So LLMs achieving short-term capability parity with juniors shouldn’t hurt juniors’ job prospects, because they weren’t hired for their existing capabilities in the first place.
Hmm, perhaps it’s not quite like this. Suppose companies weren’t “consciously” hiring junior developers as a future investment; suppose they “thought”[1] junior devs were actually useful, in the sense that if they had “known” juniors were just a future investment, they wouldn’t have hired them. The appearance of LLMs as capable as junior devs would then remove the pretense that junior devs provide counterfactual immediate value. So their hiring would stop, because middle/senior managers would be unable to keep justifying it, despite the quiet fact that juniors were effectively never hired for their immediate skills anyway. And so the career pipeline would get clogged.
Maybe that’s what’s happening?
(Again, no idea if that’s how anything there works, I have very limited experience in that sphere.)
In a semi-metaphorical sense: as an emergent property of the social dynamics between middle managers, who report on juniors’ performance, and senior managers, who set company priorities based in part on what would look good and justifiable to the shareholders, or something along those lines.
This is the hardest evidence anyone has brought up in this thread (?), but I’m inclined to buy your rebuttal that the trend really started in 2022, which is hard to attribute to LLMs.
I don’t think it’s reasonable to expect such evidence to appear after so short a period of time. There was no hard evidence that electricity was useful, in the sense you are talking about, until the 1920s. Current LLMs are clearly not AGIs in the sense of being able to integrate into the economy the way migrant labor does; therefore, productivity gains from LLMs are bottlenecked on users.
I find this reply broadly reasonable, but I’d like to see some systematic investigation of the analogy between the gradual adoption and rising utility of electricity and the gradual adoption and rising utility of LLMs (as well as of other “truly novel technologies”).
That’s interesting, but adoption of LLMs has been quite fast.
There is a difference between adoption as in “people are using it” and adoption as in “people are using it in an economically productive way”. I think the supermajority of productivity gains from LLMs is realized as pure consumer surplus right now.
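To make the consumer-surplus point concrete, a toy calculation (all numbers invented):

```python
# Toy numbers, purely illustrative. What a user pays shows up in
# economic statistics; the extra value they derive does not.
subscription_price = 20.0   # $/month actually paid (measured)
value_to_user = 150.0       # $/month of value derived (assumed, unmeasured)

consumer_surplus = value_to_user - subscription_price
surplus_share = consumer_surplus / value_to_user

print(f"consumer surplus: ${consumer_surplus:.0f}/month "
      f"({surplus_share:.0%} of total value)")
```

On numbers like these, most of the value never shows up in productivity statistics, which would explain measurements looking flat while usage is high.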
I understand your theory.
However, I am asking in this post for hard evidence.
If there is no hard evidence, that doesn’t prove a negative, but it does mean a lot of LW is engaging in a heavy amount of speculation.
My impression is that, so far, the kinds of people whose work could be automated aren’t the kind to navigate the complexities of building bespoke harnesses to have LLMs do useful work. So we have the much slower process of people manually automating others.
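To be concrete about what “bespoke harness” means here, a minimal sketch; `call_llm` is a hypothetical stand-in for whatever model endpoint one uses, and the invoice task is just an invented example:

```python
# Minimal sketch of a bespoke LLM harness: the prompt, the output
# format, the validation, and the retry policy are all hand-written
# for one specific job. `call_llm` is a hypothetical placeholder.
import json

def call_llm(prompt: str) -> str:
    """Stand-in for whatever model endpoint is actually used."""
    raise NotImplementedError

def extract_invoice_total(invoice_text: str, max_retries: int = 3) -> float:
    # Task-specific prompt: the part someone has to design by hand.
    prompt = (
        "Extract the total amount due from this invoice. "
        'Reply with JSON like {"total": 123.45} and nothing else.\n\n'
        + invoice_text
    )
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            total = float(json.loads(raw)["total"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue  # model didn't follow the format; try again
        if total >= 0:  # task-specific sanity check, also hand-written
            return total
    raise RuntimeError("LLM never produced a usable answer")
```

Every piece of that (prompt, format, validation) is specific to one job, which is exactly the kind of glue the people being automated are unlikely to write themselves.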
The part where you have to build bespoke harnesses seems suspicious to me.
What if, you know, something about how the job needs to be done changes?