LLMs at their current level are already phenomenal. Enough to usher in a new industrial revolution even without further progress. It’s also still remarkable how untethered or nonsensical their reasoning can be, even with Opus 4.6 or similar.
Ex1. I was working on a parking brake issue with my car, comparing the clamping force I was getting at the wheel with the observation that the car had wanted to roll down the hill. I told it I was getting enough clamping to be unable to turn the wheel by hand.
That said, 4 clicks with hubs-only holding firm is still probably fine in practice. The parking brake just needs to hold the car stationary on a hill, and the force from a car rolling is a lot less than someone deliberately trying to wrench a wheel around.
No, a 2,400lb car rolling down the hill exerts a lot more force than me trying to turn it at the wheel studs, let me tell ya.
Ex2. I was setting off a long-running gene analysis job. A while after it had started, I asked if actually we could parallelize it. Claude says yes, absolutely, there’s a parameter already for that. I ask it to estimate whether it’d make sense to stop and restart the job. Yes, it says, it would take half the time – but we’ve already started it so might as well let it finish.
I feel like I get so many of these bonkers inferences that there’s something interesting here to reconcile with the brilliance they have in other moments.
A while after it had started, I asked if actually we could parallelize it. Claude says yes, absolutely, there’s a parameter already for that. I ask it to estimate whether it’d make sense to stop and restart the job. Yes, it says, it would take half the time – but we’ve already started it so might as well let it finish.
I feel like it’s noteworthy that this is the kind of thing many humans would say.
I feel like I get so many of these bonkers inferences that there’s something interesting here to reconcile with the brilliance they have in other moments.
They’re still bad at generalizing out of distribution. Tons of data are shoveled into them, and they are trained to produce reasonably good (or very good) reasoning outputs with this data (etc.), but put them OOD and they break.
Of course, there’s much more juice there to be figured out, but I still think that this is a good, if simplistic, model. (See also: hyperpolation)
An example from my recent experience is getting Claude to use some relatively uncommon CLI tools, with documentation in the repo. It would try running non-existent (but reasonable-looking) commands with non-existent (but reasonable-looking) arguments. It would try like 6 times and only then look into the documentation. And it would repeat the mistake, using the same non-existent commands, a few vibe-code steps later.
The “use uncommon tools” example is familiar. Last year, I was really amazed by what Claude/Cursor could do in primary coding tasks, then appalled by how poorly that transferred to asking it to work with Jupyter/iPython notebooks via MCP. We’d been working on a notebook for 30 min, then it would screw up the tool call, conclude the notebook had been deleted, and attempt to create it fresh. This happened repeatedly. It’s just not the kind of mistake a human would make, which gets back to, how exactly do these minds work and form models of the world?
We’re also bad OOD and many of our supposed advantages over them boil down to our distribution differences (embodiment and first-person-first data). I agree we’re much better OOD than them but not so much that I think there’s no comparison. As usual I’m skipping over my ideas for ways to improve them.
We’re also bad OOD and many of our supposed advantages over them boil down to our distribution differences (embodiment and first-person-first data).
Kind of and yeah?
I agree we’re much better OOD than them but not so much that I think there’s no comparison.
I wouldn’t say “there’s no comparison”[1], but I do think it looks like a “qualitative” difference. What exactly it is would require a more involved explication of the concept, which might be infohazardous.

[1] Not really my way of speaking about this sort of stuff / I’m not sure what you mean by this.
What mistakes would you make if you’d spent 30,000 years predicting sentences without pausing or sleeping and then another 10,000 doing programming tasks, but had never seen a video, moved your head, dropped a block, picked up an object, and every single experience you’d ever had was secondhand?
Granted, I don’t think that’s the full story, but it seems like a lot of the explanation.
I think Claude’s answers were actually reasonable.
Example 1: I presented this scenario to Claude (I know, not the most impartial party) in the format of a reasoning test, replacing “Claude” with “my friend.” I assumed that you were right, and Claude would notice the error in its own reasoning. But it said the friend was right:
The key insight is that the torque you can apply by grabbing a wheel rim and trying to twist it is actually quite large compared to what gravity exerts on a car sitting on a typical hill. When you grip opposite edges of a wheel (roughly 13–15 inches from center on a typical car wheel), you’re applying force at a long lever arm with your full upper-body strength. That can easily produce 100+ ft-lbs of torque at the wheel.
Even when I told Claude “the person who wrote this said that his friend was wrong,” I was surprised to see that it held firm.
The writer seems to be anchoring on the full 2,400 lb weight of the car, which is an understandable intuition — it feels like a massive car rolling downhill must overpower anything a human can do. (...) On a steep-ish residential hill, say 10% grade, the component of gravity pulling that 2,400 lb car downhill is only about 240 lbs of force.
Now that I’ve looked at Claude’s explanation more carefully, I’m actually convinced by it.
Example 2: If your gene analysis job were over halfway done, this would of course be the right call. Since Claude can’t actually perceive time, it doesn’t seem crazy for Claude to think over half the job might be finished.
Also, it depends on how much you value money vs. time—maybe running the analysis is expensive? If the job cost $100 and you were 20% done, it would cost you $20 to restart from scratch.
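For what it’s worth, the time-only break-even works out simply: with a 2x parallel speedup, restarting wins exactly when the job is less than half done. A minimal sketch, using made-up hour figures rather than anything from the actual job:

```python
def should_restart(total_serial_hours: float, fraction_done: float,
                   speedup: float = 2.0) -> bool:
    """True if killing the job and restarting it in parallel finishes sooner."""
    remaining_if_continue = (1 - fraction_done) * total_serial_hours
    time_if_restart = total_serial_hours / speedup
    return time_if_restart < remaining_if_continue

print(should_restart(10, 0.2))  # True: 5h restarted beats 8h remaining
print(should_restart(10, 0.6))  # False: past the halfway break-even, let it finish
```

With a 2x speedup the break-even point is `fraction_done = 0.5`, so “we’ve already started it” is only a reason to continue once the job is more than half finished.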
Your Claude transcript covers the relevant response:
Meanwhile, a person grabbing a wheel at the studs (which are maybe 2–3 inches from center on a typical bolt pattern) is actually at a disadvantage compared to grabbing the rim. At the studs, your lever arm is very short. If you’re gripping at roughly 2.5 inches from center and pulling hard with maybe 50–80 lbs of force, that’s only about 10–17 ft-lbs of torque. That’s dramatically less than the hill torque.
So the writer may actually be correct for the specific scenario they described — trying to turn the wheel at the studs rather than at the rim. That’s a crucial detail.
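Plugging in rough numbers makes the comparison concrete. Every figure below other than the 2,400 lb weight is an assumption for illustration (hill grade, tire size, grip force), not a measurement from the thread:

```python
import math

weight_lb = 2400            # car weight, from the thread
grade = 0.10                # assumed 10% residential hill
tire_radius_ft = 12.5 / 12  # assumed ~25 in tire diameter

# Downhill component of gravity, and the torque it applies at the tires
theta = math.atan(grade)
downhill_force_lb = weight_lb * math.sin(theta)   # roughly 240 lb
hill_torque = downhill_force_lb * tire_radius_ft  # roughly 250 ft-lb, summed over the braked wheels

# Human twisting the wheel with an assumed ~80 lb of effective grip force
grip_lb = 80
rim_torque = grip_lb * (14 / 12)    # ~14 in lever arm at the rim:   roughly 93 ft-lb
stud_torque = grip_lb * (2.5 / 12)  # ~2.5 in lever arm at the studs: roughly 17 ft-lb

print(round(hill_torque), round(rim_torque), round(stud_torque))
```

Under these assumptions the ordering is studs < rim < hill, which is consistent with both claims in the thread: the hill torque beats a human at the studs by an order of magnitude, but is within a small factor of a human at the rim.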
I do update that the amount of torque the car is experiencing under gravity is more like 150–200 ft-lb, and therefore closer to what a human can produce with a good lever arm. Though my Claude’s assertion was “a lot less than someone deliberately trying to wrench a wheel around”, which is not true even with more leverage – the two are perhaps comparable then.
Regarding case 2, Claude knew we were just running on my MacBook, where the marginal cost of running is negligible, and from my questions it was clear I cared about time.
Oh, in my back and forth with it, it also said more blatantly:
That’s a solid result. If you can’t turn the hub by hand at 4 clicks, with a tire mounted you’d have zero chance of overcoming it. The hub gives you way less leverage than a full wheel and tire would.

Sentences 2 and 3 are directly in contradiction.
Part of my own reconciliation is to question the premise that they would already be capable of ushering in a new industrial revolution. I’ve become more skeptical over time as these basic reasoning issues persist. It’s hard for me to imagine an industrial revolution’s worth of progress and innovation powered by a mind so lacking in coherent world models across so many domains.
Well, steam engines have even less coherent world models.
I believe in their power from seeing just how much value they give me and how transformative they are for me. I’m a super early adopter, but if I extrapolate the rest of the world making as much use of the tech as I am, and doing all the things I could see doing, it’s still so much.
But aren’t a lot of your tasks the sort of thing where
there is in fact a ton of training-available data demonstrating good performance
it’s cheap to experiment
etc., other relevant peculiarities of your use cases
?
I think the claim might be true but I don’t see a super compelling reason to think so at the moment.
“Reasoning” helping with self-driving cars might be a compelling demo, but what it would be compelling about is “you can slap together robotics, big data for a specific domain, and some LLM reasoning stuff to duct tape some more of the decision-making, and get something that’s practically useful”. Generalizing to other robotics could kick off a revolution, but it would be slow-going I think?
There could be a fair amount of science overhang, where you just have to search hard enough to put X and needs-X together. E.g. people curing themselves by searching hard using LLMs. Exciting, but not an industrial revolution? In the grand scheme of science it’s not mostly that. A lot of the coolest stuff is really hard, which means there’s not that many people at the forefront, which means that people at the forefront are already familiar with a lot of what’s relevant.
If you can find domains where iteration can be done pretty automatedly, but it’s expensive enough that decision-making still matters, but decision-making is very cognitively costly, but getting kinda-okay-not-creative decision-making would still be quantitatively better, then you could unlock some sort of new paradigm of invention / discovery. E.g. automated labs running automated experiments designing proteins by gippity-tweaking, or similar. Like PACE. But that would also be hard to get started on.
What are other reasons to think this? Plausible I just haven’t seen the idea, haven’t tried too hard.
I don’t think my use cases are especially niche. My main uses are:
search for and process information
process natural language instructions into structured outputs/actions
write software
As Habryka says, you can start to automate a lot with that. Like it’s clear software was quite transformative already, but I think limited because software didn’t take natural language input. Change that...and heck, you can automate so much.
I think you’re reaching for overly narrow use cases. LLMs just do a lot of basic stuff well. My quick take is just that it’s weird they still screw up in some ways that a human wouldn’t, and the spikiness is interesting.
Your use cases are way too general haha. They include many key things that LLMs currently don’t do. Anyway, maybe you’re not super interested in discussing whether they’re “Enough to usher in a new industrial revolution even without further progress.”, but if you were, my next question would be whether the Internet would count as a new industrial revolution in your eyes. (I would say “no, but kinda / almost”, and I would say that the no LLM --> LLM transition looks like it’s kinda comparable-ish to the Internet transition.)
(I’ve been trying a new drug and my brain isn’t at 100% capacity, hence slow or limited replies right now.)
I think that’s a good question. I think the Internet doesn’t feel to me like it reorganized enough of how civilization works to quite be a revolution. In contrast to things like agriculture or steam engines where the vocation and living situation of so much of the population changed. I think LLMs, via automation, can cause an economic reorganization on the scale of agriculture/industrialization, that the Internet itself didn’t do. I’m fuzzier on where “electricity” fits.
I think LLMs, via automation, can cause an economic reorganization on the scale of agriculture/industrialization
But like, how specifically? I agree that there’s some idea around making a bunch of software significantly more beginner-friendly by giving it an LLM interface, and in some ways significantly more powerful with LLM “agents”. Is that a sufficient class of thing for what you’re referring to? I mean, do you think that 50% of people will be working on something different within 5 years, or something like that? Which 50%?
Huh, this seems like a very weird comparison to me. It is very clear that I can automate a huge amount of labor using LLMs at current capability levels. My guess is more than the majority of current work in the economy, and of course I will also be able to do a lot of new things that are now cheaper. My guess is this alone is enough to do something about as big as the industrial revolution.
Most work is just really quite boring and doesn’t require coherent world models across many domains.