GPT-3: a disappointing paper

This post is a com­pila­tion of two posts I re­cently made on tum­blr.

For context: I have been an enthusiastic user of GPT-2, and have written a lot about it and transformer models more generally. My other writing on this topic includes “human psycholinguists: a critical appraisal” and “the transformer … ‘explained?’”. See also my tumblr bot, which uses GPT-2 as a core component.

Part 1

ar­gu­mate said:

@nos­talge­braist, give us the goss on how GPT-3 com­pares with GPT-2!

I haven’t read the pa­per su­per care­fully yet, but I am pretty sure of the fol­low­ing:

1.1: On GPT-3’s mundanity

“GPT-3” is just a bigger GPT-2. In other words, it’s a straightforward generalization of the “just make the transformers bigger” approach that has been popular across multiple research groups since GPT-2.

This ex­cerpt cap­tures this pretty clearly:

Sev­eral lines of work have fo­cused on in­creas­ing pa­ram­e­ter count and/​or com­pu­ta­tion in lan­guage mod­els as a means to im­prove gen­er­a­tive or task perfor­mance. […] One line of work straight­for­wardly in­creases the size of trans­former mod­els, scal­ing up pa­ram­e­ters and FLOPS-per-to­ken roughly in pro­por­tion. Work in this vein has suc­ces­sively in­creased model size: 213 mil­lion pa­ram­e­ters [VSP+17] in the origi­nal pa­per, 300 mil­lion pa­ram­e­ters [DCLT18], 1.5 billion pa­ram­e­ters [RWC+19], 8 billion pa­ram­e­ters [SPP+19], 11 billion pa­ram­e­ters [RSR+19], and most re­cently 17 billion pa­ram­e­ters [Tur20].

The first two pa­pers men­tioned here are the origi­nal trans­former for ma­chine trans­la­tion (VSP+17) and BERT (DCLT18). The pa­ram­e­ter count doesn’t ac­tu­ally in­crease that much be­tween those two.

The third one (RWC+19) is GPT-2. The parameter count jumps up 5x there. Arguably the point of the GPT-2 paper was “it sounds dumb and too easy, but amazing things happen if you just make a transformer bigger” – and this “GPT-3” paper is making the same point with bigger numbers.

“GPT-3” is a trans­former with 175 billion pa­ram­e­ters. It’s an­other big jump in the num­ber, but the un­der­ly­ing ar­chi­tec­ture hasn’t changed much.
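To put rough numbers on these jumps, here is a small sketch that computes the ratio between each successive pair of parameter counts. The counts are the ones quoted in the excerpt above plus the 175 billion figure for GPT-3; the labels are just the citation keys, and the helper name `successive_ratios` is made up for illustration.

```python
# Parameter counts in millions, as quoted in the excerpt, plus GPT-3.
PARAM_COUNTS_M = [
    ("VSP+17", 213),
    ("DCLT18", 300),
    ("RWC+19", 1_500),
    ("SPP+19", 8_000),
    ("RSR+19", 11_000),
    ("Tur20", 17_000),
    ("GPT-3", 175_000),
]


def successive_ratios(counts):
    """Return (previous_label, label, ratio) for each consecutive pair."""
    return [
        (prev_name, name, size / prev_size)
        for (prev_name, prev_size), (name, size) in zip(counts, counts[1:])
    ]


if __name__ == "__main__":
    for prev_name, name, ratio in successive_ratios(PARAM_COUNTS_M):
        print(f"{prev_name} -> {name}: {ratio:.1f}x")
```

The BERT-to-GPT-2 step is the 5x jump mentioned above, and the last step (17 billion to 175 billion) is a bit over 10x.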

In one way this is a fair thing to call “GPT-3”: it’s another step in the new biggening tradition which GPT-2 initiated.

But in another way it’s pretty annoying and misleading to call it “GPT-3.” GPT-2 was (arguably) a fundamental advance, because it demonstrated the power of way bigger transformers when people didn’t know about that power. Now everyone knows, so it’s the furthest thing from a fundamental advance. (As an illustration, consider that their new big model deserves the title “GPT-3” just as much, and just as little, as any of the last 3 big models they mention in that paragraph.)

1.2: On “few-shot learn­ing”

The pa­per seems very tar­geted at the NLP com­mu­nity, which I mean in al­most a wholly nega­tive way. (De­spite be­ing part of the NLP com­mu­nity, I guess.)

The GPT-2 pa­per ar­gued that lan­guage mod­els (text pre­dic­tors) could do well, or in some cases “at least not ter­ribly,” at the spe­cial­ized tasks used as NLP bench­marks – even with­out be­ing told any­thing about those tasks. This was sort of neat, but mostly as a demon­stra­tion of the lan­guage model’s power.

The “zero-shot” learning setups they demonstrated in the paper – stuff like “adding tl;dr after a text and treating GPT-2’s continuation thereafter as a ‘summary’” – were weird and goofy and not the way anyone would want to do these things in practice. It was more cool as a demonstration that sufficiently good language models could “do it all,” even things they weren’t intended for; the point wasn’t that they were world-class great at these tasks, the point was the gap between their performance and their low level of preparation. Kinda like a child prodigy.

In the GPT-3 pa­per, they’ve in­tro­duced a new (…ish? maybe?) way for lan­guage mod­els to be good at the stan­dard bench­marks. Now it’s about how they can “figure out” what they’re sup­posed to be do­ing across the course of a text, i.e. in­stead of prompt­ing the model with one thing like

Q: What is the cap­i­tal of France?

they in­stead prompt it with sev­eral, like

Q: What is the cap­i­tal of France?
A: Paris
Q: What is the cap­i­tal of Spain?
A: Madrid
Q: What is the cap­i­tal of Lithua­nia?
A: Vilnius
Q: What is the cap­i­tal of Brazil?
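Mechanically, a prompt like this is just string concatenation; nothing about the model changes between the two cases. A minimal sketch, assuming the Q:/A: layout shown above (which mirrors this post's example, not necessarily the paper's exact formatting) and a made-up helper name `build_few_shot_prompt`:

```python
def build_few_shot_prompt(solved, query):
    """Format solved (question, answer) pairs followed by one open question."""
    lines = []
    for question, answer in solved:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    # The final question is left unanswered for the model to continue.
    lines.append(f"Q: {query}")
    return "\n".join(lines)


solved_examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Spain?", "Madrid"),
    ("What is the capital of Lithuania?", "Vilnius"),
]

prompt = build_few_shot_prompt(solved_examples, "What is the capital of Brazil?")
print(prompt)
```

Passing an empty list for `solved` gives back the single-question prompt from the first example.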

The NLP-community-relevant point of “GPT-3” is that language models can do much better on the standard benchmarks than we thought, via this kind of multi-prompting and also via even more biggening. Putting those two changes together, you can even beat the state of the art on a few tasks (of many).

I can imag­ine some­one view­ing this as very im­por­tant, if they thought it showed an abil­ity in trans­former LMs to “pick things up on the fly” in an ex­tremely data-effi­cient, hu­man-like way. That would be rele­vant to some of Gary Mar­cus’ con­cerns.

But the paper seems totally, weirdly uninterested in the “learning on the fly” angle. Their paper has many, many figures graphing performance against parameter count – bigger is better yet again – but I can only find one figure graphing performance against their parameter K, the number of distinct task examples in the prompt (K is 1 and 4 in the two capitals examples).

[It turns out there’s an­other one I missed on my first read – Fig. 1.2 on page 4. I dis­cuss this in Part 2 be­low.]

And that figure is, uh, not en­courag­ing:

They do bet­ter with one task ex­am­ple than zero (the GPT-2 pa­per used zero), but oth­er­wise it’s a pretty flat line; ev­i­dently there is not too much pro­gres­sive “learn­ing as you go” here.

(Oddly, the cap­tion for this figure ex­plains these are dev set re­sults so not di­rectly com­pa­rable to the test set re­sults given as hori­zon­tal lines – which doesn’t stop them from plot­ting them! Else­where, they do re­port test set re­sults for Su­perGLUE, but only for K=32. Also, I’m not a fan of this plot’s lack of er­ror bars.)

1.3: On benchmarks

In­stead, their in­ter­est is al­most com­pletely in how good they can get on the bench­marks in ab­solute terms.

This is why I say it’s aimed at the NLP com­mu­nity: these are the met­rics that whole com­mu­nity mea­sures it­self against, so in a triv­ial sense the com­mu­nity “has to” find these re­sults in­ter­est­ing. But by now, this starts to feel like Good­hart’s Law.

The rea­son GPT-2 was so cool wasn’t that it did so well on these tasks. It was that it was a re­ally good lan­guage model that demon­strated a new over­all un­der­stand­ing of lan­guage. Co­erc­ing it to do well on stan­dard bench­marks was valuable (to me) only as a flam­boy­ant, semi-comedic way of point­ing this out, kind of like show­ing off one’s artis­tic tal­ent by paint­ing (but not paint­ing es­pe­cially well) with just one’s non-dom­i­nant hand.

GPT-2 isn’t cool be­cause it’s good at “ques­tion an­swer­ing,” it’s cool be­cause it’s so good at ev­ery­thing that it makes car­ing about “ques­tion an­swer­ing” per se feel tiny, ir­rele­vant.

The trans­former was such an ad­vance that it made the com­mu­nity cre­ate a new bench­mark, “Su­perGLUE,” be­cause the pre­vi­ous gold stan­dard bench­mark (GLUE) was now too easy.

GPT-3 is so lit­tle of an ad­vance, it doesn’t even do that well at Su­perGLUE. It just does okay with its dom­i­nant hand tied be­hind its back.

“No, my 10-year-old math prodigy hasn’t proven any new the­o­rems, but she can get a perfect score on the math SAT in un­der 10 min­utes. Isn’t that ground­break­ing?”

Sort of? Not es­pe­cially?

1.4: On annoyance

The more I think about this pa­per, the more an­noy­ing it is. Trans­form­ers are ex­tremely in­ter­est­ing. And this is about the least in­ter­est­ing trans­former pa­per one can imag­ine in 2020.

Part 2

2.1: On “few-shot learn­ing,” again

On my first read, I thought there was only one plot show­ing how perfor­mance varies with K (num­ber of few-shot sam­ples), but I missed the one very early in the pa­per, Fig 1.2 on p. 4.

That plot is more im­pres­sive than the other one, but doesn’t change my im­pres­sion that the au­thors are not very in­ter­ested in show­ing off “pro­gres­sive learn­ing” over the course of a text.

The ar­gu­ment they’re try­ing to make with Fig 1.2 is that more pro­gres­sive learn­ing hap­pens with big­ger mod­els, and hence that their over­all strat­egy – “use big mod­els + few-shot learn­ing to get good scores on bench­marks” – benefits from an in­ter­ac­tion effect above and be­yond the in­de­pen­dent effects of its two parts (big mod­els, few-shot learn­ing).

Again, this is in­ter­est­ing if you care about scores on NLP bench­marks, but I have trou­ble see­ing much qual­i­ta­tive sig­nifi­cance for over­all lan­guage un­der­stand­ing.

2.2: On novel words

One of their experiments, “Learning and Using Novel Words,” strikes me as more remarkable than most of the others, and the paper’s lack of focus on it confuses me. (This is section 3.9.5 and table 3.16.) The task is closely related to the Wug test – it’s the kind of thing Gary Marcus focused on in his critique of GPT-2 – and looks like this:

[Hu­man prompt] To do a “far­dud­dle” means to jump up and down re­ally fast. An ex­am­ple of a sen­tence that uses the word far­dud­dle is:
[GPT-3 con­tinu­a­tion] One day when I was play­ing tag with my lit­tle sister, she got re­ally ex­cited and she started do­ing these crazy far­dud­dles.

This is the sort of task that de­vel­op­men­tal lin­guists study in hu­man chil­dren, and which past NLP mod­els have had trou­ble with. You’d think a suc­cess on it would de­serve top billing. The au­thors ap­par­ently re­port a suc­cess here, but treat it as an unim­por­tant sideshow: they say they tried it 6 times and got 6 suc­cesses (100% ac­cu­racy?!), but they ap­par­ently didn’t con­sider this im­por­tant enough to try the same thing on a larger sam­ple, com­pute a real met­ric, show var­i­ance w/​r/​t pa­ram­e­ters, etc. Mean­while, they did those things on some­thing like 40 other tasks, mostly far less in­ter­est­ing (to me). Con­fus­ing!

2.3: On ab­stract reasoning

In addition to the usual NLP benchmarks, they tried some “synthetic or qualitative” tasks (section 3.9). Their stated goal with these is to clarify the role of the actual learning in “few-shot learning,” separating it from mere familiarity with similar-looking text:

One way to probe GPT-3’s range of abil­ities in the few-shot (or zero- and one-shot) set­ting is to give it tasks which re­quire it to perform sim­ple on-the-fly com­pu­ta­tional rea­son­ing, rec­og­nize a novel pat­tern that is un­likely to have oc­curred in train­ing, or adapt quickly to an un­usual task.

The “syn­thetic or qual­i­ta­tive” tasks are:

  • var­i­ous forms of sim­ple ar­ith­metic (like “add two 2-digit num­bers”)

  • var­i­ous ana­gram/​re­ver­sal/​etc tasks op­er­at­ing on the in­di­vi­d­ual let­ters of words

  • SAT analogies

This line of work feels in­suffi­ciently the­o­rized, and thus hard to in­ter­pret.

Con­sider the ar­ith­metic tasks. Let’s grant the au­thors’ premise that the model has not just mem­o­rized some lookup table for ar­ith­metic prob­lems – it’s re­ally “do­ing the prob­lems” on the fly. Then, there are 2 things the model could be do­ing here (prob­a­bly some of each si­mul­ta­neously):

  1. It might have de­vel­oped a real in­ter­nal model of ar­ith­metic from see­ing many re­lated num­bers in train­ing texts, and is ap­ply­ing this model to do the prob­lems like you or I would

  2. It might have developed some generic reasoning capability for arbitrary abstract tasks, which can handle arithmetic as a particular case of a much more generic class of problems (e.g. it could also pick up various “fake arithmetics” where +, -, etc have non-standard meanings, if appropriately prompted)

In­so­far as #1 is hap­pen­ing, the mul­ti­ple prompts of few-shot learn­ing shouldn’t mat­ter: if the model knows how real (not fake) ar­ith­metic works be­cause it’s seen it in text, then ad­di­tional ex­am­ples don’t help “lo­cate the task.” That is, if it has only learned to do real ar­ith­metic, it shouldn’t need to be told “in this task the + sym­bol has the stan­dard mean­ing,” be­cause its abil­ity de­pends on that as­sump­tion any­way.

So, if we’re mostly see­ing #1 here, this is not a good demo of few-shot learn­ing the way the au­thors think it is.

In­so­far as #2 is hap­pen­ing, the few-shot prompts do mat­ter: they “lo­cate the mean­ings” of the sym­bols in the large space of pos­si­ble for­mal sys­tems. But #2 is wild: it would rep­re­sent a kind of non-lin­guis­tic gen­eral in­tel­li­gence abil­ity which would be re­mark­able to find in a lan­guage model.
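To make the “fake arithmetic” idea concrete, here is a small sketch of what such prompts could look like. Everything in it is hypothetical: `fake_plus` is one arbitrary non-standard meaning for +, and the Q:/A: wording is an assumption rather than the paper's exact format.

```python
import random


def real_plus(a, b):
    """Standard addition."""
    return a + b


def fake_plus(a, b):
    """Hypothetical non-standard meaning of '+': add, then add one more."""
    return a + b + 1


def arithmetic_prompt(op, n_solved, query, seed=0):
    """Build n_solved worked problems under `op`, then one open problem."""
    rng = random.Random(seed)  # fixed seed: same problems for every `op`
    lines = []
    for _ in range(n_solved):
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        lines.append(f"Q: What is {a} + {b}?")
        lines.append(f"A: {op(a, b)}")
    lines.append(f"Q: What is {query[0]} + {query[1]}?")
    lines.append("A:")
    return "\n".join(lines)


real_prompt = arithmetic_prompt(real_plus, 3, (48, 76))
fake_prompt = arithmetic_prompt(fake_plus, 3, (48, 76))
print(real_prompt)
print()
print(fake_prompt)
```

With zero solved problems the two prompts come out identical, so only the worked answers can signal which meaning of + is in play. That is the sense in which the few-shot examples would have to “locate the meanings.”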

I re­ally doubt this is what the au­thors are think­ing. If they think lan­guage mod­els are fully gen­eral rea­son­ers, why not high­light that? The ab­stract rea­son­ing ca­pac­ity of trans­form­ers has already been more clearly probed with­out the con­found­ing as­pects of nat­u­ral lan­guage, and a pri­ori there are few rea­sons to think a very large lan­guage-spe­cific model should de­velop strong abil­ities here (while there are a pri­ori rea­sons to think the abil­ities are sub­tle forms of text recog­ni­tion/​mem­o­riza­tion the au­thors’ method­ol­ogy was not able to de­tect).

My best guess is that the au­thors imag­ine a fac­tor­iza­tion of the task into “know­ing how to do it” and “know­ing we are do­ing it right now.” Train­ing on text teaches you how to do (real) ar­ith­metic, and the few-shot prompts tell you “right now we are do­ing (real) ar­ith­metic, not some other thing you know how to do.”

But ar­ith­metic is a re­ally bad choice if you want to probe this! The au­thors use K=50 here, mean­ing they give the model 50 cor­rect ex­am­ples of sim­ple math prob­lems to let it “lo­cate the task.” But no one who can do this task should need 50 ex­am­ples of it.

What information is conveyed by example #50 that wasn’t already known by example #49? What are we ruling out here? Trollish formal systems that look like addition 98% of the time? “Addition, except ‘52’ actually means ‘37’ but everything else is the same?” Do we have to rule this out when you should have (and the model must have) a strong prior towards real addition?
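A back-of-the-envelope sketch of how little the examples themselves can rule out. The assumptions here are mine, not the paper's: operands are drawn independently and uniformly from the 90 two-digit numbers, and the trollish system differs from real addition only when one specific number shows up as an operand.

```python
def prob_never_distinguished(n_examples, n_operand_values=90, operands_per_example=2):
    """Chance that one specific operand value never appears in n_examples problems.

    Assumes operands are drawn independently and uniformly from
    n_operand_values possibilities (e.g. the two-digit numbers 10..99).
    """
    p_miss_one_operand = 1 - 1 / n_operand_values
    return p_miss_one_operand ** (operands_per_example * n_examples)


if __name__ == "__main__":
    for k in (1, 10, 49, 50):
        print(k, round(prob_never_distinguished(k), 3))
```

Under those assumptions, even 50 examples leave about a one-in-three chance that the special number never appears at all, so the examples cannot be what rules out a system like that. Whatever rules it out has to be the prior.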

I don’t know what the au­thors are try­ing to do here, and I think they may not know, ei­ther.