A key missing ingredient holding back LLM economic impact is that they’re just not robust enough.
I disagree with this in this particular context. We are looking at AI companies trying to automate AI R&D via AIs. Most tasks in AI R&D don’t require much reliability. I don’t know the distribution of outcomes in ML experiments, but I reckon a lot of them are basically failures or null results, while the distribution of the impact of such experiments has a long tail[1]. ML experiments also don’t have many irreversible parts; AI R&D researchers aren’t like surgeons, where mistakes have huge costs: any ML experiment can be sandboxed, given a bounded amount of resources, and shut down when it uses up too much. You need high reliability when the cost of failure is necessarily very high, but that isn’t the case when running ML experiments.
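To make the “bounded and reversible” point concrete, here is a minimal sketch (the training script, config name, and limits are all hypothetical): an experiment launched by an unreliable agent can be given a hard wall-clock and memory budget and simply killed if it misbehaves, so a failed run costs compute and nothing else.

```python
import resource
import subprocess

def run_bounded_experiment(cmd, max_seconds=3600, max_bytes=8 * 2**30):
    """Run an ML experiment as a child process with hard resource caps."""
    def cap_memory():
        # Cap the child's address space (POSIX only).
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    try:
        return subprocess.run(cmd, timeout=max_seconds,
                              preexec_fn=cap_memory, capture_output=True)
    except subprocess.TimeoutExpired:
        return None  # treat as a cheap null result and move on

# Hypothetical usage: fan out agent-written configs and keep whatever survives.
# run_bounded_experiment(["python", "train.py", "--config", "cfg_17.yaml"])
```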
Edit: Claude 4.5 Sonnet gave feedback on my text above; it says that the search strategy matters if we’re looking at ML engineering. If the search is breadth-first and innovations don’t require going down a deep tree, then low reliability is fine. But if we need to combine ≥4 innovations in a depth-first search, then reliability matters more.
I don’t think this is a crux for me, but learning that it’s a thin-tailed distribution would make me at least think about this problem a bit more. Claude claims hyperparameter tuning runs have lognormal returns (shifted so that the mean is slightly below baseline).
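A quick Monte Carlo of that claim, with made-up parameters: if each tuning run’s return is a shifted lognormal whose mean sits slightly below baseline, the average run loses a little, but the best of many runs still lands well above baseline, which is the long-tail shape my argument leans on.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shifted lognormal return per tuning run, relative to baseline (0).
# The shift puts the mean slightly below baseline.
n_runs, n_trials = 50, 10_000
shift = -1.02 * np.exp(0.5 + 1.0**2 / 2)  # 1.02 x the lognormal mean
returns = rng.lognormal(mean=0.5, sigma=1.0, size=(n_trials, n_runs)) + shift

print(f"mean per-run return: {returns.mean():+.3f}")                       # slightly below 0
print(f"mean best-of-{n_runs} return: {returns.max(axis=1).mean():+.3f}")  # well above 0
```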
Claude’s rebuttal is exactly my claim. If major AI research breakthroughs could be done in 5 hours, then imo robustness wouldn’t matter as much: you could run a bunch of models in parallel and see what happens (this is part of why models are so good at olympiads). But an implicit part of my argument/crux is that AI research is necessarily deep, meaning you need to string some number of successfully completed tasks together to get an interesting final result. If the model messes up one part, your chain breaks. Not only does this give you weird results, it also breaks your chain of causality[1], which is essential for AI research.
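To put rough numbers on the chain-breaking point (the per-task success rate and depths here are assumptions, and I’m pretending failures are independent): at an 80% per-task completion rate, the chance of an unbroken chain decays geometrically with depth.

```python
# P(unbroken chain) = p**k for k tasks at per-task success rate p,
# assuming independent failures.
p = 0.8
for k in (1, 4, 10, 20):
    print(f"depth {k:2d}: {p**k:.3f}")
# depth  1: 0.800
# depth  4: 0.410
# depth 10: 0.107
# depth 20: 0.012
```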
I’ve also tried doing “vibe AI research” (no human in the loop) with current models, and I find it just fails right away. If robustness doesn’t matter, why don’t we see current models consistently making AI research breakthroughs at their current ~80% task completion rate?
A counterargument to this is that if METR’s graph trend keeps up and task length reaches some threshold, call it a week for example, then you don’t really care about P(A)P(B)P(C)...; you can just run the tasks in parallel and see which one works. (However, if my logic holds, I would guess that METR’s task benchmark hits a plateau at some point before full-on research, at least at current model robustness.)
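A rough version of that counterargument in numbers (q, the end-to-end success probability of one week-long attempt, is an assumption): with independent attempts, you need about log(0.05)/log(1-q) parallel runs for a 95% chance that at least one works. Parallelism is cheap as long as q isn’t tiny, but if q is itself p^k for a deep chain, the required fan-out blows up fast.

```python
import math

# Parallel attempts needed for a >=95% chance of at least one end-to-end success,
# assuming independent attempts with end-to-end success probability q.
for q in (0.5, 0.1, 0.01):
    n = math.ceil(math.log(0.05) / math.log(1 - q))
    print(f"q = {q:4.2f}: {n} parallel attempts")
# q = 0.50: 5 parallel attempts
# q = 0.10: 29 parallel attempts
# q = 0.01: 299 parallel attempts
```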
By chain of causality, I mean: I did task A. If I am extremely confident that task A is correct, I can then do a search from task A. Say I stumble on some task B, then C. If I get an interesting result from task C, I can keep searching from there, so long as I am confident in my results. I can also mentally update my causal chain by some kind of ~backprop: “Oh, using a CNN in task A, then setting my learning rate to this value in task B, made me discover this new thing in task C, so now I can draw a generalized intuition to approach task D. OK, this approach to D failed, let me try this other approach.”
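For illustration only, a minimal sketch of that search pattern, where the task objects, confidence scores, and the `run`/`followups` callables are all made up: the search only extends the chain from results it is confident in, so a single unreliable step cuts off everything downstream of it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

CONF_THRESHOLD = 0.95  # only build on results we're very confident in

@dataclass
class Result:
    finding: str
    confidence: float
    interesting: bool

def explore(task: str,
            run: Callable[[str], Result],
            followups: Callable[[Result], List[str]],
            chain: Tuple[str, ...] = (),
            depth: int = 3) -> List[Tuple[str, ...]]:
    """Confidence-gated depth-first search over follow-up experiments.

    `run` ("do the experiment") and `followups` ("ideas its result suggests")
    are hypothetical stand-ins.
    """
    result = run(task)
    chain = chain + (task,)
    if result.confidence < CONF_THRESHOLD:
        return []                      # the chain of causality breaks here
    found = [chain] if result.interesting else []
    if depth > 0:
        for nxt in followups(result):  # e.g. "CNN worked in A -> try this LR in B"
            found += explore(nxt, run, followups, chain, depth - 1)
    return found
```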