Developmental Stages of GPTs

Epistemic Status: I only know as much as anyone else in my reference class (I build ML models, I can grok the GPT papers, and I don’t work for OpenAI or a similar lab). But I think my thesis is original.

Related: Gwern on GPT-3

For the last several years, I’ve gone around saying that I’m worried about transformative AI, an AI capable of making an Industrial Revolution sized impact (the concept is agnostic on whether it has to be AGI or self-improving), because I think we might be one or two cognitive breakthroughs away from building one.

GPT-3 has made me move up my timelines, because it makes me think we might need zero more cognitive breakthroughs, just more refinement / efficiency / computing power: basically, GPT-6 or GPT-7 might do it. My reason for thinking this is comparing GPT-3 to GPT-2, and reflecting on what the differences say about the “missing pieces” for transformative AI.

My Thesis:

The difference between GPT-2 and GPT-3 has made me suspect that there’s a legitimate comparison to be made between the scale of a network architecture like the GPTs, and some analogue of “developmental stages” of the resulting network. Furthermore, it’s plausible to me that the functions needed to be a transformative AI are covered by a moderate number of such developmental stages, without requiring additional structure. Thus GPT-N would be a transformative AI, for some not-too-large N, and we need to redouble our efforts on ways to align such AIs.

The thesis doesn’t strongly imply that we’ll reach transformative AI via GPT-N especially soon; I have wide uncertainty, even given the thesis, about how large we should expect N to be, and whether the scaling of training and of computation slows down progress before then. But it’s also plausible to me now that the timeline is only a few years, and that no fundamentally different approach will succeed before then. And that scares me.

Architecture and Scaling

GPT, GPT-2, and GPT-3 use nearly the same architecture; each paper says as much, with a sentence or two about minor improvements to the individual transformers. Model size (and the amount of training computation) is really the only difference.

GPT took 1 petaflop/s-day to train 117M parameters, GPT-2 took 10 petaflop/s-days to train 1.5B parameters, and the largest version of GPT-3 took 3,000 petaflop/s-days to train 175B parameters. By contrast, AlphaStar seems to have taken about 30,000 petaflop/s-days of training in mid-2019, and at the pace at which AI research compute has been growing, about 10x that should be feasible today. The upshot is that OpenAI may not be able to afford it, but if Google really wanted to make GPT-4 this year, they could afford to do so.
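
To make the scaling arithmetic concrete, here is a minimal Python sketch that uses only the figures quoted above. The function names are my own, and the 12-month gap between mid-2019 and "today" is an assumption for illustration rather than a measured value.

```python
import math

# Figures as quoted in the text: (training compute in petaflop/s-days, parameters).
models = {
    "GPT": (1, 117e6),
    "GPT-2": (10, 1.5e9),
    "GPT-3": (3_000, 175e9),
}

def growth_factors(models):
    """Return (compute ratio, parameter ratio) for each consecutive pair of models."""
    names = list(models)
    factors = {}
    for prev, curr in zip(names, names[1:]):
        compute_prev, params_prev = models[prev]
        compute_curr, params_curr = models[curr]
        factors[f"{prev}->{curr}"] = (
            compute_curr / compute_prev,
            params_curr / params_prev,
        )
    return factors

def implied_doubling_months(growth, months_elapsed):
    """Doubling time implied by a given growth factor over a given period."""
    return months_elapsed / math.log2(growth)

factors = growth_factors(models)

# AlphaStar's quoted budget, scaled by the 10x projection from the text.
alphastar_pfs_days = 30_000
projected_budget = 10 * alphastar_pfs_days
budget_vs_gpt3 = projected_budget / models["GPT-3"][0]

# ASSUMPTION: roughly 12 months between mid-2019 and "today".
doubling_months = implied_doubling_months(10, 12)
```

With these inputs, the GPT-2 to GPT-3 step is a 300x increase in training compute for roughly a 117x increase in parameters, and the projected budget is about 100x what GPT-3 used. Under the 12-month assumption, a 10x projection corresponds to compute doubling about every 3.6 months.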

Analogues to Developmental Stages

There are all sorts of (more or less well-defined) developmental stages for human beings: image tracking, object permanence, vocabulary and grammar, theory of mind, size and volume, emotional awareness, executive functioning, et cetera.

I was first reminded of developmental stages a few years ago, when I saw the layers of abstraction generated in this feature visualization tool for GoogLeNet.

We don’t have feature visualization for language models, but we do have generative outputs. And as you scale up an architecture like GPT, you see higher levels of abstraction. Grammar gets mastered, then content (removing absurd but grammatical responses), then tone (first rough genre, then spookily accurate authorial voice). Topic coherence is mastered first on the phrase level, then the sentence level, then the paragraph level. So too with narrative flow.

Gwern’s poetry experiments (GPT-2, GPT-3) are good examples. GPT-2 could more or less continue the meter of a poem and use words that fit the existing theme, but even its best efforts can get stuck in topic loops:

…Methinks I see her in her blissful dreams:
—Or, fancy-like, in some mirage she lies,
Majestic yet majestic, and of seems
The image of the unconquerable skies.
Methinks I see her in her blissful dreams:
—Or, fancy-like, in some majestic cell,
Where lordly seraphs strew their balmy dreams
On the still night, or in their golden shell.


In the dark the sun doth gleam,
And in the dark the moon doth seem
But now the evening is begun—
Gone is the sun upon the earth!
The silver moon doth like a cup
Of blood-red wine, and as that cup
Is drained of life, doth quench no drop.
What man will drink such wine?

GPT-3, though, has the sense of going somewhere, at least on the topic level. (Prompts in bold.)

O time thy pyramids, thy palaces forgot,
Pensive through nations wander with no more state
Than in thy days, when wandering was thy trade:
Mourn for Patraos, for the glory fled
Of him and thee, and in their stead left, left
To other eyes a length of rolling sand,
And such memorials as the traveller reads,
Half guessed in shade, of sculptured walls half round,
And long-defaced metopes, and friezes old.

[EDIT: Previously I also included its completion of a famous Emily Dickinson poem here, but as benkuhn pointed out, GPT-3 had simply memorized the poem and recited it. I’m really embarrassed, and also kind of shocked that I looked at the actual text of “Because I could not stop for Death” and thought, “yup, that looks like something GPT-3 could produce”.]

(One last shocking bit is that, while GPT-2 had to be fine-tuned by taking the general model and training it some more on a poetry-only dataset, you’re seeing what GPT-3’s model does with no fine-tuning, with just a prompt that sounds poetic!)

Similarly, GPT-3’s ability to write fiction is impressive: unlike GPT-2, it doesn’t lose track of the plot, and it has sensible things happen. It just can’t plan its way to a satisfying resolution.

I’d be somewhat surprised if GPT-4 shared that last problem.

What’s Next?

How could one of the GPTs become a transformative AI, even if it becomes a better and better imitator of human prose style? Sure, we can imagine it being used maliciously to auto-generate targeted misinformation or things of that sort, but that’s not the real risk I’m worrying about here.

My real worry is that causal inference and planning are starting to look more and more like plausible developmental stages that GPT-3 is moving towards, and that these were exactly the things I previously thought were the obvious obstacles between current AI paradigms and transformative AI.

Learning causal inference from observations doesn’t seem qualitatively different from learning arithmetic or coding from examples (and not only is GPT-3 accurate at adding three-digit numbers, but apparently at writing JSX code to spec), only more complex in degree.

One might claim that causal inference is harder to glean from language-only data than from direct observation of the physical world, but that’s a moot point, as OpenAI are using the same architecture to learn how to infer the rest of an image from one part.

Planning is more complex to assess. We’ve seen GPTs ascend from coherence of the next few words, to the sentence or line, to the paragraph or stanza, and we’ve even seen them write working code. But this can be done without planning; GPT-3 may simply have a good enough distribution over next words to prune out those that would lead to dead ends. (On the other hand, how sure are we that that’s not the same as planning, if planning is just pruning on a high enough level of abstraction?)

The bigger point about planning, though, is that the GPTs are getting feedback on one word at a time in isolation. It’s hard for them to learn not to paint themselves into a corner. It would make training more finicky and expensive if we expanded the time horizon of the loss function, of course. But that’s a straightforward way to get the seeds of planning, and surely there are other ways.
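
To illustrate what "feedback on one word at a time" means, here is a toy sketch of a next-token loss in plain Python. It is not OpenAI's training code; the function name and the tiny two-word vocabulary are made up for illustration. The point is that the total loss is just an average of per-position terms, each scored against the true preceding text, so nothing in this objective directly penalizes a choice that only causes trouble many words later.

```python
import math

def next_token_loss(step_probs, targets):
    """Average negative log-likelihood of the true next token at each position.

    step_probs[i] is a dict mapping candidate tokens to the probability the
    model assigned at position i, given the true text before that position.
    targets[i] is the token that actually came next.
    """
    per_token = [-math.log(probs[t]) for probs, t in zip(step_probs, targets)]
    return sum(per_token) / len(per_token), per_token

# Hypothetical toy example: two positions, two-word vocabulary.
step_probs = [
    {"sun": 0.5, "moon": 0.5},
    {"sun": 0.25, "moon": 0.75},
]
targets = ["sun", "sun"]

mean_loss, per_token = next_token_loss(step_probs, targets)
```

Each entry of `per_token` depends only on the probabilities at its own position. A longer-horizon objective would have to score whole continuations instead, which is where the extra expense mentioned above would come from.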

With causal modeling and planning, you have the capability of manipulation without external malicious use. And the really worrisome capability comes when it models its own interactions with the world, and makes plans with that taken into account.

Could GPT-N turn out aligned, or at least harmless?

GPT-3 is trained simply to predict continuations of text. So what would it actually optimize for, if it had a pretty good model of the world including itself and the ability to make plans in that world?

One might hope that, because it’s learning to imitate humans in an unsupervised way, it would end up fairly human, or at least act that way. I very much doubt this, for the following reason:

  • Two humans are fairly similar to each other, because they have very similar architectures and are learning to succeed in the same environment.

  • Two convergently evolved species will be similar in some ways but not others, because they have different architectures but the same environmental pressures.

  • A mimic species will be similar in some ways but not others to the species it mimics, because even if they share recent ancestry, the environmental pressures on the poisonous one are different from the environmental pressures on the mimic.

What we have with the GPTs is the first deep learning architecture we’ve found that scales this well in the domain (so, probably not that much like our particular architecture), learning to mimic humans rather than growing in an environment with similar pressures. Why should we expect it to be anything but very alien under the hood, or to continue acting human once its actions take us outside of the training distribution?

Moreover, there may be much more going on under the hood than we realize; it may take much more general cognitive power to learn and imitate the patterns of humans than it takes us to execute those patterns.

Next, we might imagine GPT-N to just be an Oracle AI, which we would have better hopes of using well. But I don’t expect that an approximate Oracle AI could be used safely with anything like the precautions that might work for a genuine Oracle AI. I don’t know what internal optimizers GPT-N ends up building along the way, but I’m not going to count on there being none of them.

I don’t expect that GPT-N will be aligned or harmless by default. And if N isn’t that large before it gets transformative capacity, that’s simply terrifying.

What Can We Do?

While the short timeline suggested by the thesis is very bad news from an AI safety readiness perspective (less time to come up with better theoretical approaches), there is one silver lining: it at least reduces the chance of a hardware overhang. A project or coalition can feasibly wait and take a better-aligned approach that uses 10x the time and expense of an unaligned approach, as long as they have that amount of resource advantage over any competitor.

Unfortunately, the thesis also makes it less likely that a fundamentally different architecture will reach transformative status before something like GPT does.

I don’t want to take away from MIRI’s work (I still support them, and I think that if the GPTs peter out, we’ll be glad they’ve been continuing their work), but I think it’s an essential time to support projects that can work for a GPT-style near-term AGI, for instance by incorporating specific alignment pressures during training. Intuitively, it seems as if Cooperative Inverse Reinforcement Learning or AI Safety via Debate or Iterated Amplification are in this class.

We may also want to do a lot of work on how better to mold a GPT-in-training into the shape of an Oracle AI.

It would also be very useful to build some GPT feature “visualization” tools ASAP.

In the meantime, uh, enjoy AI Dungeon, I guess?