Why the Architecture of LLMs Makes Them Bad at Deep Thinking: They’re Too Wide
(1) They can do reasoning, i.e. use their Chain of Thought to make intellectual progress that they can’t make within a single forward pass. This seems probably sufficient to me, given sufficient training to make good use of it. If not though, well, new architectures with recurrence/neuralese/etc. are being worked on by various groups and might start being competitive in the next few years. And if you are correct that this is the bottleneck to deep thinking, then soon the companies will realize this and invest a lot more in scaling up giant recurrent models or whatever. All this feels like it’ll be happening in the next decade to me, whereas you feel like it’s more than a decade away?
To be clear, I think basically any architecture is technically sufficient if you scale it up enough. Take ChatGPT, make it enormous, throw oceans of data at it, and let it store gigabytes of linguistic information, and eventually you have a recipe for superintelligent AI. My intuition so far, though, is that we have made basically no progress on deep thinking: as soon as LLMs deviate from learned patterns/heuristics, they hit a wall and become about as smart as a shrimp. So I estimate the scale required to actually get anywhere with the current architecture is just too high.
I think new architectures are needed. I don’t think it will be as straightforward as “just use recurrence/neuralese”, though moving past the limitations of LLMs will be a necessary step. I’m planning to write a follow-up blog post clarifying some of the limitations of the current architecture and why I think the problem is really hard, not just a straightforward matter of scaling up. It will look something like this:
Each deep problem is its own exponential space, and exploring exponential spaces is very computationally expensive. We don’t do that when running an LLM for a single forward pass, and we barely do it when running with chain of thought. We only really do it during training, and training is computationally very expensive precisely because exploring exponential spaces is very computationally expensive. We should expect that an AI which can generically solve deep problems will be very expensive to run, let alone train. There isn’t a cheap, general-purpose strategy for solving exponential problems, so you can’t necessarily reuse progress on one to help with another. An AI that solves a new exponential problem will have to do the same kind of deep thinking AlphaGo Zero did in training, when it played many games against itself and learned patterns and heuristics in the process. And that was a best case, because games of Go can be simulated; most problems we want to solve are not simulatable, so the exponential space has to be explored in a much slower, more expensive way.
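To make the cost concrete, here is a minimal back-of-the-envelope sketch in Python. The branching factor of ~250 is only a rough, Go-like figure chosen for illustration, not a model of any particular system; the point is that brute-force exploration of a search tree visits on the order of b^d states, which is why raw enumeration is hopeless and why AlphaGo Zero needed a learned policy/value network plus enormous amounts of self-play to prune the tree.

```python
# Back-of-the-envelope sketch: the number of states a brute-force search
# must visit in a tree with branching factor b and depth d grows as b**d.
# The numbers below are illustrative only.

def brute_force_states(branching_factor: int, depth: int) -> int:
    """Total states in a full search tree of the given branching factor and depth."""
    return sum(branching_factor ** level for level in range(depth + 1))

if __name__ == "__main__":
    for depth in (2, 4, 8):
        # ~250 is a rough Go-like branching factor; even depth 8 is ~1.5e19 states.
        print(f"depth {depth}: {brute_force_states(250, depth):.3e} states")
```

Learned heuristics only shrink the effective branching factor; the cost is still exponential in depth, which is the sense in which each new deep problem demands its own expensive exploration.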
And, by the way, I think LLMs currently mostly leverage insights/heuristics already present in their training data. I don’t think they’re bringing much insight of their own right now, even during training. But that’s just my gut feeling.
I think we can eventually make the necessary breakthroughs and reach the necessary scale for this to work, but I don’t see it happening within five years.
I’ve since gone into more detail about why I think this is more than ten years away in a follow-up blog post:
https://www.lesswrong.com/posts/F7Cdzn5mLrJvKkq3L/shallow-vs-deep-thinking-why-llms-fall-short