Predictions for GPT-N

Regarding GPT-3, there is some discussion about whether growing the model would transform it into an Oracle AI. I looked into the actual benchmark results (Appendix H in the paper) to see whether we can predict something useful from the actual measurements.

Method: The OpenAI team ran a suite of 63 different benchmarks (including sub-types), each in the zero-, one-, and few-shot settings. In each scenario, there are 8 model sizes. I looked at how the results scale with model size. With only 8 measurements, predictions carry a large uncertainty. Formally, one would test the trend function using Bayesian model selection, e.g., between a linear and a polynomial trend. I did this for a few benchmarks and eyeballed the rest. So, please take the following as an indication only.
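To make the procedure concrete, here is a minimal sketch of the kind of trend comparison I mean. The accuracy numbers are invented for illustration (the real ones are in Appendix H), and a simple AIC comparison with scipy stands in crudely for full Bayesian model selection.

```python
# A minimal sketch of the trend comparison, with invented accuracy numbers
# (the real values are in Appendix H of the GPT-3 paper).
import numpy as np
from scipy.optimize import curve_fit

# The 8 GPT-3 model sizes from the paper; the accuracies below are made up.
params = np.array([1.25e8, 3.5e8, 7.6e8, 1.3e9, 2.7e9, 6.7e9, 1.3e10, 1.75e11])
acc = np.array([30.0, 34.0, 37.0, 40.0, 42.0, 43.5, 44.5, 46.0])
x = np.log10(params)  # benchmark plots use log model size on the x-axis

def linear(x, a, b):
    return a * x + b

def asymptotic(x, top, k, x0):
    # A saturating (logistic) curve that levels off at `top`.
    return top / (1.0 + np.exp(-k * (x - x0)))

aics = {}
for name, f, p0 in [("linear", linear, (5.0, 0.0)),
                    ("asymptotic", asymptotic, (50.0, 1.0, 9.0))]:
    popt, _ = curve_fit(f, x, acc, p0=p0, maxfev=10000)
    rss = np.sum((acc - f(x, *popt)) ** 2)
    n, k = len(x), len(popt)
    aics[name] = n * np.log(rss / n) + 2 * k  # crude stand-in for Bayesian model selection

print(aics)  # lower AIC = preferred trend shape
```

With only 8 points, either shape can win depending on noise, which is why I treat the classifications below as indications rather than firm results.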

Disclaimer: The smallest GPT-3 model has 125 million parameters, the largest 175 billion. That's a span of about 3 orders of magnitude. Scaling this out to many more orders of magnitude is dangerous. Thus, take these numbers only as an indication.

Results: For the following tests, I find an asymptotic trend. Scaling up the model will apparently not yield fantastic results for:

  • HellaSwag, LAMBADA, PIQA, CoQA, OpenBookQA, Quac, RACE, CB, ReCoRD, WiC

  • Translations (but the level description is unclear).

In the following tests, it is unclear whether the trend is asymptotic or better than that:

  • SAT: Could be linear, could be asymptotic. If linear, extrapolation gives a parameter count at which it would reach 100% (see the sketch after this list).

  • StoryCloze, Winograd, Winogrande, SQuADv2, DROP, COPA.
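To illustrate what such a "reaches 100% at N parameters" extrapolation looks like, here is a small sketch. The slope and intercept are invented, not the actual SAT fit; the point is only how one solves a linear-in-log10(parameters) trend for the 100% crossing.

```python
# Sketch: where does a linear trend in log10(parameters) cross 100%?
# The slope and intercept below are invented, not the actual SAT fit.
slope, intercept = 8.0, -35.0                 # accuracy = slope * log10(N) + intercept
log10_n_at_100 = (100.0 - intercept) / slope  # solve slope * log10(N) + intercept = 100
print(f"100% reached at roughly 10^{log10_n_at_100:.1f} parameters")
# With these made-up numbers: about 10^16.9 parameters, far beyond GPT-3's 1.75e11.
```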

These tests show linear scaling; for each, one can extrapolate the parameter count required to reach 100%:

  • TriviaQA

  • BoolQ

  • MultiRC

  • ARC

  • SuperGLUE

  • WSC

  • WebQs

  • Cycled letters

Some tests scale neither linearly nor asymptotically:

  • Symbol: Near exponential

  • Arithmetic: Exponential; the one-digit composite task may reach 100% with further scaling (see the sketch after this list)

  • Reversed: Near exponential

  • Anagrams: Polynomial

  • ANLI: Stepped, unclear

  • RTE: Stepped, unclear
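For the (near-)exponential cases, the extrapolation works the same way, just with log accuracy linear in log10(parameters). Again, the accuracy values below are invented for illustration, not the Appendix H numbers.

```python
# Sketch: extrapolating a near-exponential trend (e.g., the arithmetic tasks).
# The accuracy values are invented, not the actual Appendix H numbers.
import numpy as np

x = np.log10([1.25e8, 3.5e8, 7.6e8, 1.3e9, 2.7e9, 6.7e9, 1.3e10, 1.75e11])
acc = np.array([0.5, 0.8, 1.5, 2.5, 4.0, 8.0, 15.0, 60.0])   # invented, in percent

# "Exponential" here means accuracy grows exponentially in log10(parameters),
# i.e., log(accuracy) is linear in log10(parameters); fit that line and extrapolate.
b, log_a = np.polyfit(x, np.log(acc), 1)
x_100 = (np.log(100.0) - log_a) / b          # solve exp(log_a + b * x) = 100
print(f"trend would reach 100% near 10^{x_100:.1f} parameters")
```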


Summary: About half of the tested skills will likely not scale much with larger models. The other half will (e.g., TriviaQA, SuperGLUE, arithmetic, anagrams). Going to, say, several orders of magnitude more parameters: would that make an Oracle AI? Probably it's not sufficient, but I'm interested in hearing your opinion!