Predictions for GPT-N

There is some discussion about whether scaling up GPT-3 would turn it into an Oracle AI. I looked at the benchmark results (Appendix H in the paper) to see whether the measurements themselves let us predict anything useful.

Method: The OpenAI team ran a suite of 63 different benchmarks (including sub-types), each in zero-, one-, and few-shot settings. Each scenario covers 8 model sizes. I looked at how the results scale with model size. With only 8 measurements, any prediction carries a large uncertainty. Formally, one would choose the trend function with Bayesian model selection between, e.g., a linear and a polynomial model. I did this for a few benchmarks and then eyeballed the rest. So, please take the following as an indication only.
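To make the model-selection step concrete, here is a minimal sketch of how one might compare a linear and a quadratic trend. It is not the exact procedure I used: it swaps full Bayesian model selection for the BIC (a common large-sample approximation), assumes Gaussian noise, and assumes the trend is fitted against log10 of the parameter count. The function names are made up for illustration, and you have to supply the scores yourself from Appendix H.

```python
import numpy as np

# Parameter counts of the eight GPT-3 model sizes.
PARAMS = np.array([1.25e8, 3.5e8, 7.6e8, 1.3e9, 2.7e9, 6.7e9, 1.3e10, 1.75e11])


def bic_for_degree(x, y, degree):
    """BIC of a least-squares polynomial fit, assuming Gaussian noise."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    n = len(y)
    k = degree + 1  # number of fitted coefficients
    rss = float(np.sum(residuals ** 2))
    # Up to an additive constant: n * ln(RSS / n) + k * ln(n).
    return n * np.log(rss / n) + k * np.log(n)


def preferred_degree(params, scores, degrees=(1, 2)):
    """Return the polynomial degree with the lowest BIC, plus all BIC values.

    Assumption: the trend is modelled in x = log10(parameter count).
    """
    x = np.log10(np.asarray(params, dtype=float))
    y = np.asarray(scores, dtype=float)
    bics = {d: bic_for_degree(x, y, d) for d in degrees}
    return min(bics, key=bics.get), bics
```

Calling `preferred_degree(PARAMS, scores)` with the 8 scores of one benchmark returns the favoured degree. With only 8 points, a BIC difference of a few units should not be read as strong evidence either way.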

Disclaimer: The smallest GPT-3 model has 1.25×10^8 parameters, the largest 1.75×10^11. That’s a span of about 3 orders of magnitude. Extrapolating over many more orders of magnitude is risky. Thus, take these numbers only as an indication.

Results: For the following tests, I find an asymptotic trend, i.e., the scores appear to level off. Scaling the model will apparently not yield fantastic results for:

  • HellaSwag, LAMBADA, PIQA, CoQA, OpenBookQA, Quac, RACE, CB, ReCoRD, WiC

  • Translations (though it is unclear how the score levels should be interpreted).

In the following tests, it is unclear if the trend is asymptotic or better than that:

  • SAT: Could be linear, could be asymptotic. If linear, it will achieve 100% at parameters.

  • StoryCloze, Winograd, Winogrande, SQuADv2, DROP, Copa.

These tests show a linear scaling:

  • TriviaQA ( parameter estimate to achieve 100%)

  • BoolQ ()

  • MultiRC ()

  • ARC ()

  • SuperGLUE ()

  • WSC ()

  • WebQs ()

  • Cycled ()
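For the linear cases, the "parameters needed for 100%" estimate is just a straight-line extrapolation. The sketch below shows the arithmetic under the assumption that "linear" means linear in log10 of the parameter count; `params_for_target` is a hypothetical helper, not something from the paper, and the scores have to come from Appendix H.

```python
import numpy as np


def params_for_target(params, scores, target=100.0):
    """Extrapolate a straight-line fit to the model size reaching `target`.

    Assumption: score = intercept + slope * log10(parameter count).
    """
    x = np.log10(np.asarray(params, dtype=float))
    y = np.asarray(scores, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    if slope <= 0:
        raise ValueError("scores do not increase with model size")
    # Solve intercept + slope * log10(N) = target for N.
    return 10 ** ((target - intercept) / slope)
```

Because the result sits in the exponent, a small error in the fitted slope moves the estimate by orders of magnitude, which is another reason to treat these numbers as rough indications.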

Some tests scale neither linearly nor asymptotically:

  • Symbol: Near exponential ()

  • Arithmetic: Exponential; one-digit composite may achieve 100% at

  • Reversed: Near exponential ()

  • Anagrams: Polynomial ()

  • ANLI: stepped, unclear

  • RTE: stepped, unclear


Summary: About half of the tested skills will likely not scale much with larger models. The other half will (e.g., TriviaQA, SuperGLUE, arithmetic, anagrams). Going to e.g., parameters—would that make an Oracle AI? Probably it’s not sufficient, but I’m interested in hearing your opinion!