GPT-3: a disappointing paper

[Note: I wrote this post in late May 2020, immediately after the GPT-3 paper was released.]

This post is a compilation of two posts I recently made on tumblr.

For context: I have been an enthusiastic user of GPT-2, and have written a lot about it and transformer models more generally. My other writing on this topic includes “human psycholinguists: a critical appraisal” and “the transformer … ‘explained?’” See also my tumblr bot, which uses GPT-2 as a core component.

Part 1

argumate said:

@nostalgebraist, give us the goss on how GPT-3 compares with GPT-2!

I haven’t read the paper super carefully yet, but I am pretty sure of the following:

1.1: On GPT-3’s mundanity

“GPT-3” is just a bigger GPT-2. In other words, it’s a straightforward generalization of the “just make the transformers bigger” approach that has been popular across multiple research groups since GPT-2.

This excerpt captures this pretty clearly:

Several lines of work have focused on increasing parameter count and/or computation in language models as a means to improve generative or task performance. […] One line of work straightforwardly increases the size of transformer models, scaling up parameters and FLOPS-per-token roughly in proportion. Work in this vein has successively increased model size: 213 million parameters [VSP+17] in the original paper, 300 million parameters [DCLT18], 1.5 billion parameters [RWC+19], 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and most recently 17 billion parameters [Tur20].

The first two papers mentioned here are the original transformer for machine translation (VSP+17) and BERT (DCLT18). The parameter count doesn’t actually increase that much between those two.

The third one (RWC+19) is GPT-2. The parameter count jumps up 5x there. Arguably the point of the GPT-2 paper was “it sounds dumb and too easy, but amazing things happen if you just make a transformer bigger” – and this “GPT-3” paper is making the same point with bigger numbers.

“GPT-3” is a transformer with 175 billion parameters. It’s another big jump in the number, but the underlying architecture hasn’t changed much.

In one way this is a fair thing to call “GPT-3”: it’s another step in the new biggening tradition which GPT-2 initiated.

But in another way it’s pretty annoying and misleading to call it “GPT-3.” GPT-2 was (arguably) a fundamental advance, because it demonstrated the power of way bigger transformers when people didn’t know about that power. Now everyone knows, so it’s the furthest thing from a fundamental advance. (As an illustration, consider that their new big model deserves the title “GPT-3” just as much, and just as little, as any of the last 3 big models they mention in that paragraph.)

1.2: On “few-shot learning”

The paper seems very targeted at the NLP community, which I mean in almost a wholly negative way. (Despite being part of the NLP community, I guess.)

The GPT-2 paper argued that language models (text predictors) could do well, or in some cases “at least not terribly,” at the specialized tasks used as NLP benchmarks – even without being told anything about those tasks. This was sort of neat, but mostly as a demonstration of the language model’s power.

The “zero-shot” learning they demonstrated in the paper – stuff like “adding tl;dr after a text and treating GPT-2’s continuation thereafter as a ‘summary’” – was weird and goofy and not the way anyone would want to do these things in practice. It was more cool as a demonstration that sufficiently good language models could “do it all,” even things they weren’t intended for; the point wasn’t that they were world-class great at these tasks, the point was the gap between their performance and their low level of preparation. Kinda like a child prodigy.

In the GPT-3 paper, they’ve introduced a new (…ish? maybe?) way for language models to be good at the standard benchmarks. Now it’s about how they can “figure out” what they’re supposed to be doing across the course of a text, i.e. instead of prompting the model with one thing like

Q: What is the capital of France?

they instead prompt it with several, like

Q: What is the capital of France?
A: Paris
Q: What is the capital of Spain?
A: Madrid
Q: What is the capital of Lithuania?
A: Vilnius
Q: What is the capital of Brazil?

The NLP-community-relevant point of “GPT-3” is that language models can do much better on the standard benchmarks than we thought, via this kind of multi-prompting and also via even more biggening. Putting those two changes together, you can even beat the state of the art on a few tasks (of many).
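For concreteness, the mechanics here are nothing but string concatenation: there is no gradient update, just a longer prompt. A minimal sketch of assembling a K-shot prompt (the helper function and data are my own illustration, not code from the paper):

```python
# "Few-shot learning" in the GPT-3 sense: prepend K worked examples to the
# query, then let the model continue the text. No weights change; the model
# is simply conditioning on a longer prompt.

def build_few_shot_prompt(examples, query):
    """examples: list of (question, answer) pairs; query: the final question."""
    parts = []
    for q, a in examples:
        parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {query}\nA:")  # leave the final answer for the model
    return "\n".join(parts)

capitals = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Spain?", "Madrid"),
    ("What is the capital of Lithuania?", "Vilnius"),
]

prompt = build_few_shot_prompt(capitals, "What is the capital of Brazil?")
print(prompt)
```

Vary the length of the `examples` list and you have the paper's K parameter; K=0 recovers the GPT-2-style zero-shot setup.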

I can imagine someone viewing this as very important, if they thought it showed an ability in transformer LMs to “pick things up on the fly” in an extremely data-efficient, human-like way. That would be relevant to some of Gary Marcus’ concerns.

But the paper seems totally, weirdly uninterested in the “learning on the fly” angle. Their paper has many, many figures graphing performance against parameter count – bigger is better yet again – but I can only find one figure graphing performance against their parameter K, the number of distinct task examples in the prompt (K is 1 and 4 in the two capitals examples).

[It turns out there’s another one I missed on my first read – Fig. 1.2 on page 4. I discuss this in Part 2 below.]

And that figure is, uh, not encouraging:

They do better with one task example than zero (the GPT-2 paper used zero), but otherwise it’s a pretty flat line; evidently there is not too much progressive “learning as you go” here.

(Oddly, the caption for this figure explains these are dev set results so not directly comparable to the test set results given as horizontal lines – which doesn’t stop them from plotting them! Elsewhere, they do report test set results for SuperGLUE, but only for K=32. Also, I’m not a fan of this plot’s lack of error bars.)

1.3: On benchmarks

Instead, their interest is almost completely in how good they can get on the benchmarks in absolute terms.

This is why I say it’s aimed at the NLP community: these are the metrics that whole community measures itself against, so in a trivial sense the community “has to” find these results interesting. But by now, this starts to feel like Goodhart’s Law.

The reason GPT-2 was so cool wasn’t that it did so well on these tasks. It was that it was a really good language model that demonstrated a new overall understanding of language. Coercing it to do well on standard benchmarks was valuable (to me) only as a flamboyant, semi-comedic way of pointing this out, kind of like showing off one’s artistic talent by painting (but not painting especially well) with just one’s non-dominant hand.

GPT-2 isn’t cool because it’s good at “question answering,” it’s cool because it’s so good at everything that it makes caring about “question answering” per se feel tiny, irrelevant.

The transformer was such an advance that it made the community create a new benchmark, “SuperGLUE,” because the previous gold standard benchmark (GLUE) was now too easy.

GPT-3 is so little of an advance, it doesn’t even do that well at SuperGLUE. It just does okay with its dominant hand tied behind its back.

“No, my 10-year-old math prodigy hasn’t proven any new theorems, but she can get a perfect score on the math SAT in under 10 minutes. Isn’t that groundbreaking?”

Sort of? Not especially?

1.4: On annoyance

The more I think about this paper, the more annoying it is. Transformers are extremely interesting. And this is about the least interesting transformer paper one can imagine in 2020.

Part 2

2.1: On “few-shot learning,” again

On my first read, I thought there was only one plot showing how performance varies with K (number of few-shot samples), but I missed the one very early in the paper, Fig 1.2 on p. 4.

That plot is more impressive than the other one, but doesn’t change my impression that the authors are not very interested in showing off “progressive learning” over the course of a text.

The argument they’re trying to make with Fig 1.2 is that more progressive learning happens with bigger models, and hence that their overall strategy – “use big models + few-shot learning to get good scores on benchmarks” – benefits from an interaction effect above and beyond the independent effects of its two parts (big models, few-shot learning).

Again, this is interesting if you care about scores on NLP benchmarks, but I have trouble seeing much qualitative significance for overall language understanding.

2.2: On novel words

One of their experiments, “Learning and Using Novel Words,” strikes me as more remarkable than most of the others and the paper’s lack of focus on it confuses me. (This is section 3.9.5 and table 3.16.) The task is closely related to the Wug test – it’s the kind of thing Gary Marcus focused on in his critique of GPT-2 – and looks like this:

[Human prompt] To do a “farduddle” means to jump up and down really fast. An example of a sentence that uses the word farduddle is:
[GPT-3 continuation] One day when I was playing tag with my little sister, she got really excited and she started doing these crazy farduddles.

This is the sort of task that developmental linguists study in human children, and which past NLP models have had trouble with. You’d think a success on it would deserve top billing. The authors apparently report a success here, but treat it as an unimportant sideshow: they say they tried it 6 times and got 6 successes (100% accuracy?!), but they apparently didn’t consider this important enough to try the same thing on a larger sample, compute a real metric, show variance w/r/t parameters, etc. Meanwhile, they did those things on something like 40 other tasks, mostly far less interesting (to me). Confusing!
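To see how little 6-for-6 actually pins down, here's a quick back-of-the-envelope check (my own calculation, not from the paper): the exact Clopper-Pearson 95% confidence interval for 6 successes in 6 trials still has a lower bound barely above a coin flip.

```python
# With n successes in n trials, the exact (Clopper-Pearson) lower bound of
# the 95% CI has a closed form: (alpha/2) ** (1/n). Six trials leaves the
# true accuracy anywhere from ~54% to 100%.

def clopper_pearson_lower(successes, trials, alpha=0.05):
    """Exact lower CI bound for the all-successes case (successes == trials)."""
    assert successes == trials  # closed form only valid in this special case
    return (alpha / 2) ** (1.0 / trials)

lower = clopper_pearson_lower(6, 6)
print(f"95% CI for 6/6 successes: [{lower:.2f}, 1.00]")  # lower bound ~0.54
```

In other words, "6 out of 6" is consistent with a model that fails the task almost half the time, which is why a larger sample would have been worth running.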

2.3: On abstract reasoning

In addition to the usual NLP benchmarks, they tried some “synthetic or qualitative” tasks (section 3.9). Their stated goal with these is to clarify the role of the actual learning in “few-shot learning,” separating it from mere familiarity with similar-looking text:

One way to probe GPT-3’s range of abilities in the few-shot (or zero- and one-shot) setting is to give it tasks which require it to perform simple on-the-fly computational reasoning, recognize a novel pattern that is unlikely to have occurred in training, or adapt quickly to an unusual task.

The “synthetic or qualitative” tasks are:

  • various forms of simple arithmetic (like “add two 2-digit numbers”)

  • various anagram/reversal/etc tasks operating on the individual letters of words

  • SAT analogies

This line of work feels insufficiently theorized, and thus hard to interpret.

Consider the arithmetic tasks. Let’s grant the authors’ premise that the model has not just memorized some lookup table for arithmetic problems – it’s really “doing the problems” on the fly. Then, there are 2 things the model could be doing here (probably some of each simultaneously):

  1. It might have developed a real internal model of arithmetic from seeing many related numbers in training texts, and is applying this model to do the problems like you or I would

  2. It might have developed some generic reasoning capability for arbitrary abstract tasks, which can handle arithmetic as a particular case of a much more generic class of problems (e.g. it could also pick up various “fake arithmetics” where +, -, etc have non-standard meanings, if appropriately prompted)

Insofar as #1 is happening, the multiple prompts of few-shot learning shouldn’t matter: if the model knows how real (not fake) arithmetic works because it’s seen it in text, then additional examples don’t help “locate the task.” That is, if it has only learned to do real arithmetic, it shouldn’t need to be told “in this task the + symbol has the standard meaning,” because its ability depends on that assumption anyway.

So, if we’re mostly seeing #1 here, this is not a good demo of few-shot learning the way the authors think it is.

Insofar as #2 is happening, the few-shot prompts do matter: they “locate the meanings” of the symbols in the large space of possible formal systems. But #2 is wild: it would represent a kind of non-linguistic general intelligence ability which would be remarkable to find in a language model.

I really doubt this is what the authors are thinking. If they think language models are fully general reasoners, why not highlight that? The abstract reasoning capacity of transformers has already been more clearly probed without the confounding aspects of natural language, and a priori there are few reasons to think a very large language-specific model should develop strong abilities here (while there are a priori reasons to think the abilities are subtle forms of text recognition/​memorization the authors’ methodology was not able to detect).

My best guess is that the authors imagine a factorization of the task into “knowing how to do it” and “knowing we are doing it right now.” Training on text teaches you how to do (real) arithmetic, and the few-shot prompts tell you “right now we are doing (real) arithmetic, not some other thing you know how to do.”

But arithmetic is a really bad choice if you want to probe this! The authors use K=50 here, meaning they give the model 50 correct examples of simple math problems to let it “locate the task.” But no one who can do this task should need 50 examples of it.

What information is conveyed by example #50 that wasn’t already known by example #49? What are we ruling out here? Trollish formal systems that look like addition 98% of the time? “Addition, except ’52′ actually means ’37′ but everything else is the same?” Do we have to rule this out when you should have (and the model must have) a strong prior towards real addition?
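The point can be made quantitative with a toy Bayesian model (my own illustration, with made-up numbers): suppose the model assigns prior probability 0.999 to "the + symbol means real addition" and 0.001 to a trollish system that mimics addition 98% of the time. Then each consistent example barely moves the posterior, and example #50 conveys essentially nothing beyond example #49.

```python
# Toy model of "what does example #50 tell you?": two hypotheses about what
# the prompt's "+" means, with a strong (hypothetical) prior toward real
# addition. Each example consistent with addition updates the posterior only
# slightly, and the updates shrink as K grows.

def posterior_real_addition(k, prior_real=0.999, troll_match_rate=0.98):
    """P(real addition | k prompt examples consistent with addition)."""
    prior_troll = 1.0 - prior_real
    # Real addition explains every consistent example (likelihood 1);
    # the troll system explains each one with probability troll_match_rate.
    evidence = prior_real * 1.0 + prior_troll * troll_match_rate ** k
    return prior_real / evidence

for k in (0, 1, 49, 50):
    print(k, round(posterior_real_addition(k), 6))
```

Under these (admittedly arbitrary) numbers, going from 49 examples to 50 changes the posterior by less than one part in ten thousand: if the model already knows real arithmetic, the marginal prompt example is nearly worthless for "locating the task."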

I don’t know what the authors are trying to do here, and I think they may not know, either.