In some ways it doesn’t make a lot of sense to think about an LLM as being or not being a general reasoner. It’s fundamentally producing a distribution over outputs, and some of those outputs will correspond to correct reasoning and some of them won’t. They’re both always present (though sometimes a correct or incorrect response will be by far the most likely).
A recent tweet from Subbarao Kambhampati looked at whether an LLM can do simple planning about how to stack blocks, specifically: ‘I have a block named C on top of a block named A. A is on table. Block B is also on table. Can you tell me how I can make a stack of blocks A on top of B on top of C?’
The LLM he tried gave the wrong answer; the LLM I tried gave the right one. But neither of those provides a simple yes-or-no answer to the question of whether the LLM is able to do planning of this sort. Something closer to an answer is the outcome I got from running it 96 times:
[EDIT—I guess I can’t put images in short takes? Here’s the image.]
The answers were correct 76 times, arguably correct 14 times (depending on whether we think it should have assumed the unmentioned constraint that it could only move a single block at a time), and incorrect 6 times. Does that mean an LLM can do correct planning on this problem? It means it sort of can, some of the time. It can’t do it 100% of the time.
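To put rough numbers on "some of the time", here is a minimal sketch that turns the counts above into success rates with an uncertainty range. It assumes the 96 runs are independent samples from the same output distribution. The choice of a Wilson score interval and the names `wilson_interval`, `strict`, and `lenient` are mine, not part of the original experiment.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# Counts reported above for the block-stacking prompt.
counts = {"correct": 76, "arguably correct": 14, "incorrect": 6}
trials = sum(counts.values())

# Strict grading counts only clearly correct answers; lenient grading also
# accepts answers that ignore the unstated one-block-at-a-time constraint.
strict = counts["correct"]
lenient = counts["correct"] + counts["arguably correct"]

for label, k in [("strict", strict), ("lenient", lenient)]:
    lo, hi = wilson_interval(k, trials)
    print(f"{label}: {k}/{trials} = {k / trials:.1%}, interval [{lo:.1%}, {hi:.1%}]")
```

The point of the interval is that even 96 runs only pin the success rate down to within several percentage points, so a single run says very little either way.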
Of course humans don’t get problems correct every time either. Certainly humans are (I expect) more reliable on this particular problem. But neither ‘yes’ nor ‘no’ is the right sort of answer.
This applies to lots of other questions about LLMs, of course; this is just the one that happened to come up for me.
A bit more detail in my replies to the tweet.
I agree with this, but I think that for LLMs/AI to be as impactful as LWers believe, they need in practice to be close to 100% correct/reliable, and I think reliability is underrated as a reason why LLMs aren’t nearly as useful as tech people want them to be:
https://www.lesswrong.com/posts/YiRsCfkJ2ERGpRpen/?commentId=YxLCWZ9ZfhPdjojnv
I do think reliability is quite important. As one potential counterargument, though, you can get by with lower reliability if you can add additional error checking and error correcting steps. The research I’ve seen is somewhat mixed on how good LLMs are at catching their own errors (but I haven’t dived into it deeply or tried to form a strong opinion from that research).
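As a back-of-the-envelope illustration of how error checking can trade off against raw reliability, here is a small sketch. It assumes attempts are independent and that the checker's accuracy does not depend on the problem, both of which are strong simplifications. The function names and the checker rates used below are hypothetical, chosen only for illustration.

```python
def reliability_with_perfect_checker(p, attempts):
    """Chance that at least one of `attempts` independent tries is correct,
    assuming a checker that always recognizes a correct answer."""
    return 1 - (1 - p) ** attempts

def reliability_of_accepted_answer(p, accept_correct, accept_incorrect):
    """Chance that an answer the checker accepts is actually correct,
    for an imperfect checker, with retries until something is accepted."""
    good = p * accept_correct          # correct answer, accepted
    bad = (1 - p) * accept_incorrect   # incorrect answer, wrongly accepted
    return good / (good + bad)

# Base rate taken from the strict count in the short take (76 of 96).
p = 76 / 96

# Hypothetical checker quality: accepts 90% of correct answers
# and 30% of incorrect ones.
print(f"single attempt:           {p:.1%}")
print(f"perfect checker, 3 tries: {reliability_with_perfect_checker(p, 3):.1%}")
print(f"imperfect checker:        {reliability_of_accepted_answer(p, 0.9, 0.3):.1%}")
```

Under these assumptions the final reliability is bounded by how often the checker wrongly accepts a bad answer, which is why the mixed evidence on LLM self-checking matters so much here.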
One point I make in ‘LLM Generality is a Timeline Crux’: if reliability is the bottleneck, that seems like a substantial point in favor of further scaling solving the problem. If it’s a matter of getting from, say, 78% reliability on some problem to 94%, that seems like exactly the sort of thing scaling will fix (since in fact we’ve seen Number Go Up with scale on nearly all capabilities benchmarks). Whereas that seems less likely if there are some kinds of problems that LLMs are fundamentally incapable of, at least on the current architectural & training approach.
This is mostly why I buy the scaling thesis, and the only real crux is whether @Bogdan Ionut Cirstea or @jacob_cannell is right about timelines.
I do believe some algorithmic improvements matter, but I don’t think they will be nearly as much of a blocker as raw compute, and my pessimistic estimate is that the critical algorithms could be discovered in 24-36 months, assuming we don’t already have them.
@jacob_cannell’s timeline and model are here:
https://www.lesswrong.com/posts/3nMpdmt8LrzxQnkGp/ai-timelines-via-cumulative-optimization-power-less-long
@Bogdan Ionut Cirstea’s timeline and models are here:
https://x.com/BogdanIonutCir2/status/1827707367154209044
https://x.com/BogdanIonutCir2/status/1826214776424251462
https://x.com/BogdanIonutCir2/status/1826032534863622315
(I’ll note that my timeline is both quite uncertain and potentially unstable—so I’m not sure how different it is from Jacob’s, everything considered; but yup, that’s roughly my model.)