I’m glad you liked the post! And, given that you are an avowed “enthusiast,” I’m pleasantly surprised that we agree about as many things as we do.
The second [source of discontinuous performance scaling] is that many tasks happen over multiple inferential steps where small improvements in single step accuracy translate into large changes in multistep capabilities.
Thanks for pointing out this argument—I hadn’t thought about it before. A few thoughts:
Ordinary text generation is also a multi-step process. (The token length generally isn’t fixed in advance, but it could be: e.g., we could define the task “write convincingly for N tokens.”) So, why does generation quality scale so smoothly?
Part of the answer is that single-token success is not fully binary: there are choices that are suboptimal / “weird” without constituting instant failure. Due to the “delusion” phenomenon, weird choices can pile on themselves and lead to failure, but “weirdness” is a continuous variable so this effect can scale more gradually.
But also, part of the answer must be that generation is relatively easy, with single-token success probabilities very close to 1 even for small models.
(Why is generation easy, when it potentially includes every other task as a subtask? Well, it samples other tasks in proportion to their frequency in natural text, which ≈ their relative volume in pre-training data, which ≈ how easy they are for the model.)
This shows how the relevance of the argument depends on the success probabilities living in the right “transitional regime,” like your 90% vs 99% vs 99.9%. More precisely, the argument is relevant at the point where, for a given task and set of model scales, the scaling moves us across this range. I suppose by continuity this has to happen somewhere for any multi-step task, which makes me wonder whether we could “induce” discontinuous scaling for any task by forcing it to be done in a multi-step way.
Last thought: this might explain why one-step arithmetic scales discontinuously. Suppose it can only be done by some sequential multi-step algorithm (and that this is not true of most tasks). Presumably the model implements the steps along the “time axis” of successive layers. The model has some failure probability at each step, and the argument goes through.
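The arithmetic behind this argument can be made concrete with a toy sketch (the numbers are purely illustrative, not measurements of any real model): if a task needs n independent sequential steps, each succeeding with probability p, overall success is p^n, and small changes in p near the transitional regime swing p^n from near-certain failure to near-certain success.

```python
# Toy model of the multi-step argument (illustrative numbers only, not
# measurements of any real model): a task needing n independent
# sequential steps, each succeeding with probability p, succeeds
# overall with probability p**n.

def multistep_success(p: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independence."""
    return p ** n_steps

# Small single-step gains near the transitional regime swing the
# multi-step success rate dramatically.
for p in (0.90, 0.99, 0.999):
    print(f"p = {p}: 50-step success = {multistep_success(p, 50):.3f}")
```

For 50 steps, 90% per-step accuracy gives well under 1% task success, while 99.9% gives over 95%: a modest-looking single-step improvement looks discontinuous at the task level.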
I wonder if your thoughts [on abstract reasoning] have changed since Codex was released after you originally drafted this post.
I didn’t update much on Codex. Part of that was because I’d already seen this paper, which strikes me as a comparably impressive feat of abstraction in the code generation domain.
Also, the Codex model available in the API feels very much like GPT in the way it “reasons,” and is roughly what I’d expect from a GPT extended to code. It has that same quality where it frequently but not predictably does the right thing, where I often see it doing many separate things right but I can’t rely on it doing any one of them stably across all contexts. As with GPT, I get the best results when I stop asking “does it know X or not?” and instead ask “can I express X in a form likely to be common in the training data?”
I’m interested in knowing more about your reasons for thinking that little will come of scaled LLMs’ abstract reasoning capabilities.
[...] While I agree that language models are very prone to spit out text that looks superficially more like legitimate abstract reasoning than it is [...], why does this imply that they cannot also learn the “real” patterns? What exactly are the “real” patterns?
This is going to get speculative and hand-wavey. I don’t know what abstract reasoning really is, any more than anyone does. But I have some ideas :)
First, something I have noticed since I started working with these models is that my own mind contains a module much like GPT, and this module plays a role in my reasoning process.
When I reflect on my own thought processes, they often look like a game played between a GPT-like “babbler” and an evaluating “critic.”
The babbler produces an interior monologue that sounds like my own voice, but (unlike when I’m speaking out loud) is only lightly conditioned at best on things like “concepts I want to express.” Instead, it just . . . says words that sound like me, making some argument with the confidence I’d have if I actually believed it, but it’s not trying to express an idea I already have—it’s just generating text that sounds like me.
I let the babbler run for a while, and then I step back and assess the monologue, asking “does this make sense? is this really a new idea? does this prove too much? can I think of counterexamples?” Like generating code and then checking if it compiles. Most babbler-monologues are rejected by the critic, at which point the babbler tries again, conditioned (in some way I don’t understand) on the critic’s rejection.
Most of my actually-believed-ideas originated in this game, I think. Also, I often do a short-range, purely linguistic variant of this when I’m writing: I ask the babbler for the next word or phrase, and there are several rounds of “no that doesn’t work” before I pick one. Even my mathematical reasoning is often like this, though it also involves other babbler-like modules that eg generate mental imagery which can be interpreted (by the critic) as expressing a mathematical argument.
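The babble-and-prune game above could be caricatured as a rejection loop. Everything here (`babble`, `critic`, the stock candidates) is a made-up stand-in for illustration, not a claim about how the real modules work:

```python
from typing import Optional

# A made-up caricature of the babbler/critic game: `babble` and
# `critic` are toy stand-ins, not claims about how the real modules
# work. The babbler proposes monologues; the critic vetoes whole
# candidates until one survives.

def babble(attempt: int) -> str:
    """Propose a candidate monologue (here: just cycle through stock ones)."""
    candidates = [
        "claim that proves too much",
        "restated premise dressed up as new",
        "idea that has an obvious counterexample",
        "coherent new idea",
    ]
    return candidates[attempt % len(candidates)]

def critic(candidate: str) -> bool:
    """Does this make sense? Is it really new? Any counterexamples?"""
    return "coherent" in candidate  # toy acceptance test

def think(max_rounds: int = 10) -> Optional[str]:
    """Outer loop: keep babbling until the critic accepts or we give up."""
    for attempt in range(max_rounds):
        candidate = babble(attempt)   # inner loop hidden inside babble()
        if critic(candidate):
            return candidate          # the idea is actually selected
    return None                       # every candidate was vetoed

print(think())  # the critic rejects three candidates before accepting one
```

The structural point is that rejection operates on *whole* candidate sequences, which a left-to-right sampler can’t do on its own.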
Now, I highly doubt this is the only way that one can do abstract reasoning. (I don’t even think that all humans do it like this.) However, this is the source of my intuitions about the components involved in “true abstract reasoning” and how it differs from what LMs tend to do.
When I do “true abstract reasoning” as described above, there is a distinction between timesteps of candidate generation (inner loop), timesteps of candidate evaluation (outer loop), and timesteps of actually selecting the next idea (increments on some passes of the outer loop but not others). This seems important for avoiding “delusive” effects.
I have to run the babbler for a while to even get a coherent idea that’s possible to assess. By that point, the babbler is already conditioning on its earlier output in a self-deluding way. Unlike in GPT, though, these earlier outputs are not irrevocably written in stone at the moment we receive the later outputs; the critic is free to reject the entire sequence. With GPT, by the time it would be possible to notice “hey, I’m making a bad argument,” it’s already … making a bad argument, and there’s no going back.
(I think there’s an analogy here to AlphaZero/MuZero’s value head vs. its MCTS rollouts, where GPT is like the value head / “intuitive hunches,” lacking the slower search wrapper.)
Of course, in principle, you could imagine bundling this entire procedure inside an LM. Indeed, any sufficiently good LM would eventually have to solve the problems this procedure is designed to solve. Why don’t I expect transformer LMs to develop this structure internally?
One reason: the existence of my babbler seems like (weak) evidence that it’s better to use an LM inside a bigger non-LM algorithm.
My babbler itself feels very much like a likelihood-trained causal generative model, with the same virtuosity at surface mimicry, and the same lack of conditioning latents besides its own output. I suspect that making these kinds of models comes naturally to the cerebral cortex, and that if the brain could just implement reasoning end-to-end with such a model, it would have done it that way.
A second reason is … okay, this is a whole separate point and the comment’s already long. I’ll try to make this brief.
I think transformer LMs do a lot of what they do through a kind of “compressed memorization” of very large amounts of data. Early on, they learn many different ways that text is regular; some of this may look like “truly learning (eg syntactic) rules.” This low-level knowledge allows them to store training sequences in a vastly compressed form. Then, a lot of what they do in training is actual memorization of the data, in a compressed and noisy/interleaved form. Inference looks like mapping the input to the compressed space, and then doing a shallow-ish ensemble in that space over a massive number of texts the input is “reminiscent of” along various dimensions. The huge model capacity allows for a huge ensemble, so many superficial patterns cancel out in the ensemble, while deeper patterns stack.
This perspective is inspired by the way logit lens looks in later layers, by this paper which is similar to logit lens, and also by work like this showing you can extract exact strings from trained models that were only seen a few times in training.
The key point here is that you can compress things you can’t yet abstractively understand, using easier things you do understand. I can’t use abstractive summarization to compress (say) Grothendieck’s EGA, since I don’t understand it . . . but I can still run gzip on it, and that goes a long way! Hence, the frontier of the model’s apparent abstractive capability will outrun its actual abstractive capability: this frontier consists of texts the model can’t compress via facility with their content, but can simply memorize in bulk using easier compression.
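The gzip point is easy to demonstrate: a generic compressor exploits surface redundancy with no model of the content. A minimal sketch (the repeated sentence is just stand-in scheme-theoretic boilerplate):

```python
import zlib

# Toy illustration: a generic compressor exploits surface redundancy
# with no model of what the text means. Repetitive scheme-theoretic
# boilerplate (which I can't summarize abstractively) still compresses
# dramatically.
text = ("Soit X un schéma et f : X -> Y un morphisme de schémas. " * 50).encode("utf-8")
compressed = zlib.compress(text, level=9)
print(f"{len(text)} bytes -> {len(compressed)} bytes "
      f"(ratio {len(compressed) / len(text):.3f})")
```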
In something like your list sorting example, I suspect the model doesn’t “have” an internal list sorter that looks anything like an algorithm. Instead, it has heavily compressed memories of many actual programming tutorials that included short example lists in unsorted and sorted form, and taking an average over these will usually “sort” a short list of small numbers—with help from low-level abstract operations like “greater than over small numbers,” but without any idea that a list can be arbitrary length / can contain any ordered type.
(EDIT to clarify: the context-dependence and flakiness of the capability is how we can tell it’s coming from the compressed ensemble. Contrast with the reliability of something like English syntax, which I believe is part of the compressor itself. This is my distinction between abstraction that’s “real” and “fake.”)
Anyway, I think transformers are very good at this kind of compressive memorization—but not nearly as good at doing other kinds of computational work, like search or (obviously?) recursion. Like, whenever I think about how to “program” some routine using attn+FFs, I tend to despair. Even simple things often need to be spread across >1 layer/“step” or >1 head, and the number of heads/layers in huge models feels tiny relative to the diversity of abstraction we expect out of them. (See this paper for some actual transformer “programs.”)
This is hand-wavey, but my intuition is that the “right abstractions for abstraction” are hard to fit in a transformer or similar modern NN, while memorizing instances of abstraction-use is far cheaper. And yes, eventually, at large enough scale, the models will have to do the right thing or they’ll stop progressing. But there is so much more left to smoothly learn with memorization that I think this architectural deficit will be invisible for a long time, over which LMs will continue to (unreliably) imitate abstraction better and better.
One reason we agree on many object-level facts but have different takeaways is that we have different desiderata for what GPT is supposed to do in the limit. I agree that many of the problems you discuss are fundamental to the way GPT is trained and how it works, but I generally feel these problems don’t need to be solved directly in order to use GPT to build AGI. I see GPT as the _seed_ for a future AGI system built off of or around it.
I see the big crux as how much “compressed memorization” will extrapolate to general intelligence vs. begin to show cracks as we ask it for more and more advanced and general one-step deductions. It would be worth coming up with some specific claims about how we expect future systems to act (including at the level of internals) to differentiate our two perspectives. It probably makes sense to start on my end, because I have higher expectations for performance. Unfortunately I’m very averse to talking about _how_ I would amplify GPT by extending it or wrapping it in a larger system, and I see steps like that as key to unlocking its capabilities.
Your idea about multi-step deduction happening over multiple layers makes a lot of sense. You brought up an experiment in the Eleuther discord that I think would be great to try: train several models to see whether tasks that require a sequence of discrete steps are unusually sensitive to network depth, rather than scaling with parameter count alone.
I agree with your insights about abstract reasoning as babble and prune, although this definitely isn’t the only way I reason abstractly. I babble and prune especially when I am writing (at the word/sentence/paragraph level), and as part of the search process when I am trying to come up with a plan or navigate a math proof. But when I am talking I am able to fluidly reason toward my goal with little to no plan ahead of time. I work collaboratively, so much of my abstract thinking is out loud. If babble/prune is going on when I talk, it is happening at a level below my awareness.
These rollouts are not always complete, as I often need to attack problems from multiple angles before I’ve fully understood them. But the individual rollouts look like abstract reasoning to me, just as they do (or can) in GPT-3. I look at individual rollouts and think: that’s general intelligence. If something could reason as well as or more powerfully than I can in an individual rollout, it is the seed of an AGI.
I also often have moments of great insight where I seem to understand a full chain of thought almost instantly. The delay comes from my inability to communicate/record it quickly. I can also use abstract reasoning in visual space (e.g. figuring out a geometric proof). In these cases I often seem to have access to a causal model that I can examine and conclude things from directly.