This is the best post about language models I've read in a long time. It's clear how much you have used LMs and grokked the peculiar way they operate. You've touched on many important points which I've wanted to write about, or have written about with less eloquence. Also, I'm glad you liked my blog :) (generative.ink)
I definitely belong to your “enthusiasts” camp, and I agree your fourth point (loss scaling makes models “smarter” fast enough to matter) is a crux. I won’t fully defend that here, but I’ll do my own brain dump and share some of the thoughts that came up when reading your post.
Discontinuous jumps in capabilities
One of the reasons for my optimism/concern about scaling is that I do expect discontinuous jumps in capabilities, but not in the way you are arguing against here. I don’t think discontinuous jumps will necessarily come from discontinuous improvements of the model’s single step inference accuracy (though it may), but from the tasks we need it to do.
I see two big sources of discontinuity in tasks and many tasks contain both. The first is that many tasks are somewhat binary in nature. If you can’t do it well enough, you basically can’t do it at all. The second is that many tasks happen over multiple inferential steps where small improvements in single step accuracy translate into large changes in multistep capabilities.
The most important binary task is whether or not a model can be amplified under some given amplification strategy. As a particular example, at one OOM the model will not be able to support a given amplification technique because it is too unreliable, even with techniques to make it more robust. Then at the next OOM it suddenly will. We can observe it getting closer to this threshold, but it can be difficult to say how close we are without getting deep into the gears of the amplification technique.
As an example of multistep inferential tasks, in some experiments collaborators and I found that larger models are dramatically better at solving math problems in multiple steps (“factored cognition”), while accuracy at solving the problem in a single step increases more continuously. Whether this counts as a fundamentally new capability depends on your definition, but the pragmatic result is discontinuous competence. (A few of our results were eventually posted here.)
We should expect to see this with various multi-token tasks which can only be executed if the model chains together many “correct” inferences. It's still a probabilistic matter, as you say: a small model would succeed with some small probability, and the large model will fail with a small probability. However, when the task requires multiple steps to all be executed correctly, the probability of the small model succeeding at the task dwindles exponentially, magnifying the difference. The gap is even more pronounced when you add feature engineering, because irregular errors can often be accounted for while frequent errors cannot.
Say the task is about 100 tokens long and for each token GPT-3 outputs an acceptable (non-fatal) prediction 90% of the time. The probability of it successfully completing the task is 0.9^100 ≈ 0.0000266: near 0. A model whose mistake rate is only 1% would complete the task with probability 0.99^100 ≈ 0.366 – more than one time in three. This can be the difference between total impracticality and a task that can be automated with high accuracy by adding a few extra tricks. A model with 99.9% single-token accuracy succeeds most of the time (0.999^100 ≈ 0.90). This is of course a simplification of the dynamics, but you get the point.
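For concreteness, here's that arithmetic in a few lines of Python (the per-token accuracies are the same illustrative numbers as above, not measured values):

```python
# Probability of completing a ~100-token task where every token must be
# acceptable, for a few illustrative per-token accuracies (same numbers as above).
task_length = 100

for per_token_accuracy in (0.90, 0.99, 0.999):
    p_task_success = per_token_accuracy ** task_length
    print(f"per-token accuracy {per_token_accuracy:.3f} -> task success {p_task_success:.5f}")

# per-token accuracy 0.900 -> task success 0.00003
# per-token accuracy 0.990 -> task success 0.36603
# per-token accuracy 0.999 -> task success 0.90479
```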
Mistakes
Mistakes during generation are particularly fatal for GPTs because there’s no way to go back on them (unless the prompt introduces a mechanism for doing so). GPT updates on its own mistakes and elevates them to a sort of delusive “certainty” after being appended to the prompt. One way of looking at it is that the “delusions” of GPT simulacra are not the model’s fault, but the fault of the autoregressive sampling process which spuriously elevates the model’s mere guesses to canonical reality.
As you point out, “mistakes” can be of various types, including ones which aren’t really failures of capability, and which we won’t expect to go away if models scale. However, I think those problems (GPT isn’t trying its best, the prompt is ambiguous, etc.) are difficult but tractable to address and will become more tractable as models scale. More powerful models are amenable to more precise control by many methods, even simple prompt programming and fine tuning. OpenAI’s instruct models, for instance, are quite reliable at interpreting single-line imperative instructions “correctly” (that is, attempting to execute the instruction), whereas the base models would react to most single-line context-free instructions chaotically.
I also agree that evaluating GPTs with prompts is actually evaluating the GPT+human system, but I’m optimistic/concerned that given time we will automate the effects of this process (automated prompt programming, filtering, fine tuning in clever ways, embedding in larger systems, etc.), even if somehow we don’t find ways to make pretrained LMs themselves more intentionally goal-directed.
Prompt noise and shattered cognition
This is excellently put:
I see no single, stable trove of skills being leveraged here and there as needed. I just see stretches of success and failure at imitating ten thousand different kinds of people, all nearly independent of one another, the products of barely-coupled subsystems.
Here's some simple experimental evidence to support this observation. I found that GPT-3's accuracy at sorting a list of 5 integers was 28% with a 0-shot natural language description of the task, 50% with a 10-shot prompt, and 76% (!) with a 0-shot prompt in the style of python documentation/code.
This case cannot be explained by ‘meta-learning’ because the more effective prompt contains no additional information about how to solve the task. I think simply claiming GPT-3 has only learned “shallow patterns” is also insufficient, because it clearly has learned the deep pattern needed to sort lists of integers like this; it just fails to access this ability under different circumstances. Do the pure natural language description and the few-shot prompt invoke a different and inferior strategy, or an imperfect/corrupted version of the same list-sorting subsystem? (I'd love to know.)
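To give a sense of what I mean by the different prompt styles, here's a hypothetical reconstruction of the code-style 0-shot prompt (not the exact prompt from my experiments; the natural-language variant simply stated the task in plain English):

```python
# Hypothetical reconstruction of a "0-shot in the style of python documentation/code"
# prompt, not the exact prompt used in the experiment. The model is asked to
# continue the text after "# Output:" with the sorted list.
code_style_prompt = '''\
def sort_integers(lst):
    """Return the list of integers sorted in ascending order."""
    return sorted(lst)

print(sort_integers([17, 3, 42, 8, 25]))
# Output:
'''
```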
In either case, as you say, GPT does not act like it has a centralized repertoire of skills which determines how well it’s able to perform tasks across prompts. This is an important intuition. Everything suggests to me that there is no core, no unified self, whether in terms of agency or capability or even knowledge. Gwern has said that he thinks of GPT-3 as an agent which wants to roleplay accurately; I disagree because I don’t perceive anything as coherent or centralized as even a “puppetmaster” or “shapeshifter” that controls or roleplays simulacra. The inability of some simulacra to access knowledge and capabilities that would unambiguously make them better imitations, and which different simulacra can somehow access, contributes to my impression of GPT’s subsystem disunity. However, I think there is good reason to expect this to change as these models scale.
Meta-learning
Despite my blog post, I do think GPT-3 is capable of “meta-learning” – just that this perspective is often misleading, especially for some tasks like translation. I haven't played with small models enough to say how discontinuous it is, but “meta-learning” seems necessary if any size of GPT should be able to coherently continue most long prompts. The same way GPT-3 “updates” from the task demonstrations, it clearly updates on information in a story prompt, such as the demonstrated personality of the characters, information which reveals (constrains) things about the premise, etc. The few-shot “meta-learning” capability is a special case of its general ability to continue text in the style of its training data; lists of examples are a common feature which constrains the future in systematic ways.
Learning curves
The point about LMs’ learning curves looking different than those of humans is very important. The probabilistic competencies exhibited by GPT are quite different from what we see from humans.
One note: Contributing to the apparent discontinuity of human learning is that most humans are much less willing to pronounce on topics they're unsure about than GPTs (autoregressively sampled) are. We usually say/think we don't know even when it would be possible to make a probabilistic guess. That said, I do think the way GPTs learn is fundamentally different than the way humans do, and this causes us to both over- and underestimate their capabilities.
You’ve explained well the differences which result from GPT’s incentive to imitate a broad range of disparate patterns. Another (related) difference is that whereas humans tend to build up their understanding of a world by learning “fundamentals” like object permanence first, LMs approach competence through a route which masters “superficial”, “stylistic” patterns first, learning to write in the style of famous authors before mastering object permanence. In your words from another post, it learns to run before learning to walk.
This causes some people to conclude that GPTs learn only shallow patterns. I don't think this is true; I think they just approach the same “deep” patterns from a different trajectory. A “fake it til you make it” approach – but that doesn't mean they won't eventually “make it”. Looking at GPT-2, I could imagine thinking that however impressive the ability of large language models to write in beautiful and difficult (for a human) styles, basic object permanence will always be a problem. GPT-3 doesn't struggle much with it.
Abstract reasoning
I’m interested in knowing more about your reasons for thinking that little will come of scaled LLMs’ abstract reasoning capabilities. None of the above suggests this to me. I wonder if your thoughts have changed since Codex was released after you originally drafted this post.
You said that large language models will be better at abstract reasoning in that it will be easier to get them to spit out text that sounds like it’s a product of abstract reasoning (implying, perhaps, that it is in some sense not real abstract reasoning). While I agree that language models are very prone to spit out text that looks superficially more like legitimate abstract reasoning than it is, as they’re particularly good at imitating surface patterns of competence, why does this imply that they cannot also learn the “real” patterns? What exactly are the “real” patterns?
Many people dismiss the legitimacy of LMs’ reasoning because they just parrot probabilities from the training data. But I know you have seen its capacity for generalization. Given a good prompt as a seed, it often is able to reproduce chains of reasoning and conclusions regarding a completely unprecedented state of affairs exactly as they occurred to me. I considered these thoughts to be abstract reasoning when they happened in my mind. So what is it when GPT-3 can reliably reproduce these thoughts?
How do we apply this to Codex writing code that compiles, providing the instrumental fruits of what, if coming from a human, we would not hesitate to call abstract reasoning?
Human evaluation
I agree with your concerns about human evaluation for reasons of unreliability, underperformance, risk of bias, etc., but I think you overstate the uselessness of the approach. Despite these very real problems, I have found almost universally that people who have spent considerable time using GPT-3 hands-on understand its capabilities and flaws significantly better than researchers who have only read benchmark and ecological evaluation papers. I will even argue that you cannot understand GPT-3 without using it.
Non-ecological benchmarks (almost all of them) are really, really bad, and most are actively misleading. Ecological evaluations, though you say they exist, are woefully inadequate for probing general intelligence for its capabilities and limits, especially in their current form. I second your call to improve them.
I’m glad you liked the post! And, given that you are an avowed “enthusiast,” I’m pleasantly surprised that we agree about as many things as we do.
The second [source of discontinuous performance scaling] is that many tasks happen over multiple inferential steps where small improvements in single step accuracy translate into large changes in multistep capabilities.
Thanks for pointing out this argument—I hadn’t thought about it before. A few thoughts:
Ordinary text generation is also a multi-step process. (The token length generally isn’t fixed in advance, but could be, i.e. we could define a task “write convincingly for N tokens.”) So, why does generation quality scale so smoothly?
Part of the answer is that single-token success is not fully binary: there are choices that are suboptimal / “weird” without constituting instant failure. Due to the “delusion” phenomenon, weird choices can pile on themselves and lead to failure, but “weirdness” is a continuous variable so this effect can scale more gradually.
But also, part of the answer must be that generation is relatively easy, with single-token success probabilities very close to 1 even for small models.
(Why is generation easy, when it potentially includes every other task as a subtask? Well, it samples other tasks in proportion to their frequency in natural text, which ≈ their relative volume in pre-training data, which ≈ how easy they are for the model.)
This shows how the relevance of the argument depends on the success probabilities living in the right “transitional regime,” like your 90% vs 99% vs 99.9%. More precisely, the argument is relevant at the point where, for a given task and set of model scales, the scaling moves us across this range. I suppose by continuity this has to happen somewhere for any multi-step task, which makes me wonder whether we could “induce” discontinuous scaling for any task by forcing it to be done in a multi-step way.
Last thought: this might explain why one-step arithmetic scales discontinuously. Suppose it can only be done by some sequential multi-step algorithm (and that this is not true of most tasks). Presumably the model implements the steps along the “time axis” of successive layers. The model has some failure probability at each step, and the argument goes through.
I wonder if your thoughts [on abstract reasoning] have changed since Codex was released after you originally drafted this post.
I didn’t update much on Codex. Part of that was because I’d already seen this paper, which strikes me as a comparably impressive feat of abstraction in the code generation domain.
Also, the Codex model available in the API feels very much like GPT in the way it “reasons,” and is roughly what I'd expect from a GPT extended to code. It has that same quality where it frequently but not predictably does the right thing, where I often see it doing many separate things right but I can't rely on it doing any one of them stably across all contexts. As with GPT, I get the best results when I stop asking “does it know X or not?” and instead ask “can I express X in a form likely to be common in the training data?”
I’m interested in knowing more about your reasons for thinking that little will come of scaled LLMs’ abstract reasoning capabilities.
[...] While I agree that language models are very prone to spit out text that looks superficially more like legitimate abstract reasoning than it is [...], why does this imply that they cannot also learn the “real” patterns? What exactly are the “real” patterns?
This is going to get speculative and hand-wavey. I don’t know what abstract reasoning really is, any more than anyone does. But I have some ideas :)
First, something I have noticed since I started working with these models is that my own mind contains a module much like GPT, and this module plays a role in my reasoning process.
When I reflect on my own thought processes, they often look like a game played between a GPT-like “babbler” and an evaluating “critic.”
The babbler produces an interior monologue that sounds like my own voice, but (unlike when I’m speaking out loud) is only lightly conditioned at best on things like “concepts I want to express.” Instead, it just . . . says words that sound like me, making some argument with the confidence I’d have if I actually believed it, but it’s not trying to express an idea I already have—it’s just generating text that sounds like me.
I let the babbler run for a while, and then I step back and assess the monologue, asking “does this make sense? is this really a new idea? does this prove too much? can I think of counterexamples?” Like generating code and then checking if it compiles. Most babbler-monologues are rejected by the critic, at which point the babbler tries again, conditioned (in some way I don’t understand) on the critic’s rejection.
Most of my actually-believed-ideas originated in this game, I think. Also, I often do a short-range, purely linguistic variant of this when I’m writing: I ask the babbler for the next word or phrase, and there are several rounds of “no that doesn’t work” before I pick one. Even my mathematical reasoning is often like this, though it also involves other babbler-like modules that eg generate mental imagery which can be interpreted (by the critic) as expressing a mathematical argument.
Now, I highly doubt this is the only way that one can do abstract reasoning. (I don’t even think that all humans do it like this.) However, this is the source of my intuitions about the components involved in “true abstract reasoning” and how it differs from what LMs tend to do.
When I do “true abstract reasoning” as described above, there is a distinction between timesteps of candidate generation (inner loop), timesteps of candidate evaluation (outer loop), and timesteps of actually selecting the next idea (increments on some passes of the outer loop but not others). This seems important for avoiding “delusive” effects.
I have to run the babbler for a while to even get a coherent idea that’s possible to assess. By that point, the babbler is already conditioning on its earlier output in a self-deluding way. Unlike in GPT, though, these earlier outputs are not irrevocably written in stone at the moment we receive the later outputs; the critic is free to reject the entire sequence. With GPT, by the time it would be possible to notice “hey, I’m making a bad argument,” it’s already … making a bad argument, and there’s no going back.
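A minimal sketch of the loop structure I'm describing (the babbler and critic here are stand-ins for whatever those mental modules actually are, not real components or APIs):

```python
def babble_and_prune(babbler, critic, context, max_attempts=10):
    """Toy sketch of the generate/evaluate game described above.

    babbler(context, feedback) returns a candidate stretch of monologue
    (the inner loop, many generation timesteps); critic(candidate) returns
    (accepted, feedback) after evaluating the whole stretch (the outer loop).
    """
    feedback = None
    for _ in range(max_attempts):               # outer loop: candidate evaluation
        candidate = babbler(context, feedback)  # inner loop: candidate generation
        accepted, feedback = critic(candidate)
        if accepted:
            # the "current idea" only advances on passes the critic accepts;
            # rejected candidates are discarded rather than written in stone
            return context + candidate
    return context  # nothing survived the critic; the chain of ideas stays put
```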
(I think there’s an analogy here to AlphaZero/MuZero’s value head vs. its MCTS rollouts, where GPT is like the value head / “intuitive hunches,” lacking the slower search wrapper.)
Of course, in principle, you could imagine bundling this entire procedure inside an LM. Indeed, any sufficiently good LM would eventually have to solve the problems this procedure is designed to solve. Why don’t I expect transformer LMs to develop this structure internally?
One reason: the existence of my babbler seems like (weak) evidence that it’s better to use an LM inside a bigger non-LM algorithm.
My babbler itself feels very much like a likelihood-trained causal generative model, with the same virtuosity at surface mimicry, and the same lack of conditioning latents besides its own output. I suspect that making these kinds of models comes naturally to the cerebral cortex, and that if the brain could just implement reasoning end-to-end with such a model, it would have done it that way.
A second reason is … okay, this is a whole separate point and the comment’s already long. I’ll try to make this brief.
I think transformer LMs do a lot of what they do through a kind of “compressed memorization” of very large amounts of data. Early on, they learn many different ways that text is regular; some of this may look like “truly learning (eg syntactic) rules.” This low-level knowledge allows them to store training sequences in a vastly compressed form. Then, a lot of what they do in training is actual memorization of the data, in a compressed and noisy/interleaved form. Inference looks like mapping the input to the compressed space, and then doing a shallow-ish ensemble in that space over a massive number of texts the input is “reminiscent of” along various dimensions. The huge model capacity allows for a huge ensemble, so many superficial patterns cancel out in the ensemble, while deeper patterns stack.
This perspective is inspired by the way logit lens looks in later layers, by this paper which is similar to logit lens, and also by work like this showing you can extract exact strings from trained models that were only seen a few times in training.
The key point here is that you can compress things you can’t yet abstractively understand, using easier things you do understand. I can’t use abstractive summarization to compress (say) Grothendieck’s EGA, since I don’t understand it . . . but I can still run gzip on it, and that goes a long way! Hence, the frontier of the model’s apparent abstractive capability will outrun its actual abstractive capability: this frontier consists of texts the model can’t compress via facility with their content, but can simply memorize in bulk using easier compression.
In something like your list sorting example, I suspect the model doesn’t “have” an internal list sorter that looks anything like an algorithm. Instead, it has heavily compressed memories of many actual programming tutorials that included short example lists in unsorted and sorted form, and taking an average over these will usually “sort” a short list of small numbers—with help from low-level abstract operations like “greater than over small numbers,” but without any idea that a list can be arbitrary length / can contain any ordered type.
(EDIT to clarify: the context-dependence and flakiness of the capability is how we can tell it's coming from the compressed ensemble. Contrast with the reliability of something like English syntax, which I believe is part of the compressor itself. This is my distinction between abstraction that's “real” and “fake.”)
Anyway, I think transformers are very good at this kind of compressive memorization—but not nearly as good at doing other kinds of computational work, like search or (obviously?) recursion. Like, whenever I think about how to “program” some routine using attn+FFs, I tend to despair. Even simple things often need to be spread across >1 layer/”step” or >1 head, and the number of heads/layers in huge models feels tiny relative to the diversity of abstraction we expect out of them. (See this paper for some actual transformer “programs.”)
This is hand-wavey, but my intuition is that the “right abstractions for abstraction” are hard to fit in a transformer or similar modern NN, while memorizing instances of abstraction-use is far cheaper. And yes, eventually, at large enough scale, the models will have to do the right thing or they’ll stop progressing. But there is so much more left to smoothly learn with memorization that I think this architectural deficit will be invisible for a long time, over which LMs will continue to (unreliably) imitate abstraction better and better.
One reason we agree on many object-level facts but have different takeaways is that we have different desiderata for what GPT is supposed to do in the limit. I agree that many of the problems you discuss are fundamental to the way GPT is trained and how it works, but I generally feel these problems don’t need to be solved directly in order to use GPT to build AGI. I see GPT as the _seed_ for a future AGI system built off of or around it.
I see the big crux as how much “compressed memorization” will extrapolate to general intelligence vs. begin to show cracks as we ask it for more and more advanced and general one-step deductions. It would be worth coming up with some specific claims about how we expect future systems to act (including at the level of internals) to differentiate our two perspectives. Probably it's most useful to start on my end because I have higher expectations for performance. Unfortunately I'm very averse to talking about _how_ I would amplify GPT by extending it or wrapping it in a larger system, and I see steps like that as key to unlocking its capabilities.
Your idea about multi-step deduction happening over multiple layers makes a lot of sense. You brought up an experiment in the Eleuther discord that I think would be a great idea to try: we could train several models to see if tasks that require a sequence of discrete steps are unusually sensitive to network depth rather than scaling with parameter count alone.
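A minimal sketch of what that comparison might look like, with made-up sizes (real configs would need much more careful matching of data, tokenizer, and training setup):

```python
# Hypothetical parameter-matched configs for a depth-vs-width comparison.
# Non-embedding transformer parameters scale roughly as 12 * n_layers * d_model^2,
# so both of these land near ~0.6B parameters despite a 4x difference in depth.
configs = [
    {"name": "shallow-wide", "n_layers": 12, "d_model": 2048},  # 12*12*2048^2 ≈ 0.6B
    {"name": "deep-narrow",  "n_layers": 48, "d_model": 1024},  # 12*48*1024^2 ≈ 0.6B
]
# Train both on the same data, then compare accuracy on tasks requiring a sequence
# of discrete internal steps (e.g. multi-digit arithmetic) against single-step tasks.
```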
I agree with your insights about abstract reasoning as babble and prune, although this definitely isn't the only way I reason abstractly. I babble and prune especially when I am writing (on the word/sentence/paragraph level), and I babble and prune as part of the search process when I am trying to come up with a plan or navigate through a math proof. But when I am talking I am able to fluidly reason towards my goal with little to no plan ahead of time. I work collaboratively, so much of my abstract thinking is out loud. If babble/prune is going on when I talk, it is happening at a level below my awareness.
These rollouts are not always complete, as I often need to attack problems from multiple angles before I've fully understood them. But the individual rollouts look like abstract reasoning to me, just as they do (can) in GPT-3. I look at individual rollouts and think: that's general intelligence. If something could reason as well as or more powerfully than I can in an individual rollout, it is the seed of an AGI.
I also often have moments of great insight where I seem to understand a full chain of thought almost instantly. The delay comes from my inability to communicate/record it quickly. I can also use abstract reasoning in visual space (e.g. figuring out a geometric proof). In these cases I often seem to have access to a causal model that I can examine and conclude things from directly.