The Colliding Exponentials of AI

Epistemic status: I have made many predictions for quantitative AI development this decade. These predictions are based on what I think is solid reasoning and on extrapolations from prior data.

If people do not intuitively understand the timescales of exponential functions, then multiple converging exponential functions will be even more misunderstood.

Currently there are three exponential trends acting on AI performance: Algorithmic Improvements, Increasing Budgets and Hardware Improvements. Below I give an overview of these trends and extrapolate lower and upper bounds for their increases out to 2030. These extrapolated increases are then combined to get the total multiplier of equivalent compute that frontier 2030 models may have over their 2020 counterparts.

Firstly...

Algorithmic Improvements

Algorithmic improvements for AI are much better known and quantified than they were a few years ago, in large part thanks to OpenAI’s paper and blog post AI and Efficiency (2020).

OpenAI showed that the efficiency of image classification algorithms has been doubling every 16 months since 2012, resulting in a 44x decrease in the compute required to reach AlexNet-level performance after 7 years, as Figure 1 shows.

Figure 1, Compute to reach AlexNet-level performance, OpenAI
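Those two figures are consistent: a 44x reduction over seven years corresponds to a halving time of roughly 15.4 months, which rounds to the reported 16. A quick back-of-the-envelope check (my arithmetic, not from the paper):

```python
import math

halving_time_months = 16   # OpenAI's reported efficiency halving time
years = 7                  # 2012 (AlexNet) to 2019

# Compounding a 16-month halving time over 7 years
halvings = years * 12 / halving_time_months          # ≈ 5.25 halvings of required compute
efficiency_gain = 2 ** halvings                       # ≈ 38x

# Conversely, the reported 44x gain over 7 years implies a ~15.4-month halving time
implied_halving = years * 12 / math.log2(44)

print(f"~{efficiency_gain:.0f}x gain from a 16-month halving time over {years} years")
print(f"~{implied_halving:.1f}-month halving time implied by a 44x gain over {years} years")
```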

OpenAI also showed algorithmic improvements in the following areas:

Transformers surpassed seq2seq performance on English-to-French translation on WMT’14 with 61x less training compute, three years later.

AlphaZero took 8x less compute to reach AlphaGo Zero-level performance, one year later.

OpenAI Five Rerun required 5x less training compute to surpass OpenAI Five, three months later.

Additionally, Hippke’s LessWrong post Measuring Hardware Overhang detailed algorithmic improvements in chess, finding that Deep Blue-level performance could have been reached with the compute of a 1994 desktop PC, rather than the 1997 supercomputer actually used, had modern algorithms been available.

Algorithmic improvements come not just from architectural developments, but also from the libraries that optimize their training. An example is Microsoft’s DeepSpeed (2020). DeepSpeed claims to train models 2-7x faster on regular clusters, to train 10x bigger models on a single GPU, and to power 10x longer sequences and 6x faster execution with a 5x reduction in communication volume, with models of up to 13 billion parameters trainable on a single Nvidia V100 GPU.
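For a sense of what using such a library looks like in practice, below is a minimal sketch of wrapping a PyTorch model in DeepSpeed’s training engine with ZeRO optimizer sharding. The model, batch size, and config values are placeholders of my own choosing rather than anything DeepSpeed recommends, the script is meant to be run via the `deepspeed` launcher, and the exact config options should be checked against the DeepSpeed documentation for your version.

```python
# Minimal sketch: wrapping a PyTorch model with DeepSpeed (illustrative values only).
# Intended to be launched with the `deepspeed` launcher rather than plain `python`.
import torch
import deepspeed

model = torch.nn.Transformer(d_model=512, nhead=8)  # stand-in for a much larger model

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},               # mixed-precision training
    "zero_optimization": {"stage": 2},       # shard optimizer state and gradients
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize returns an engine that handles distribution, ZeRO and fp16;
# in the training loop it replaces the usual loss.backward()/optimizer.step() with
# model_engine.backward(loss) and model_engine.step().
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```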

So, across a wide range of machine learning areas, major algorithmic improvements have been occurring regularly. Additionally, while this is harder to quantify due to limited precedent, the introduction of new architectures can cause sudden and/or discontinuous leaps in performance in a domain, as Transformers did for NLP. As a result, extrapolating past trendlines may not capture such future developments.

If the algorithmic efficiency of machine learning in general had a halving time like image classification’s 16 months, we would expect to see roughly 180x greater efficiency by the end of the decade. So I think an estimate of 100-1000x general algorithmic improvement by 2030 seems reasonable.
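The arithmetic behind that figure is just compounding the 16-month halving time over the decade (my own check):

```python
halving_time_months = 16
months = 10 * 12                              # 2020 to 2030

halvings = months / halving_time_months       # 7.5 halvings of required compute
efficiency_gain = 2 ** halvings
print(f"~{efficiency_gain:.0f}x efficiency gain")   # ≈ 181x, i.e. roughly 180x
```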

Edit: I feel less confident and bullish about algorithmic progress now.

Increasing Budgets

The modern era of AI began in 2012, the year the compute used to train the largest models began to increase rapidly, with a doubling time of 3.4 months (~10x yearly), per OpenAI’s blog post AI and Compute (2018); see Figure 2 below. While the graph stops in 2018, the trend held steady, with the predicted thousands of petaflop/s-days range being reached in 2020 by GPT-3, the largest (non-sparse) model ever trained, which had an estimated training cost of $4.6 million based on the price of a Tesla V100 cloud instance.

Figure 2, Modern AI era vs first era, OpenAI
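As a quick check of that rate (my arithmetic), a 3.4-month doubling time compounds to roughly an order of magnitude per year:

```python
doubling_time_months = 3.4
doublings_per_year = 12 / doubling_time_months     # ≈ 3.5 doublings per year
yearly_multiplier = 2 ** doublings_per_year
print(f"~{yearly_multiplier:.1f}x compute growth per year")   # ≈ 11.6x, i.e. ~10x yearly
```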

Before 2012, compute for the largest systems had a 2-year doubling time, essentially just following Moore’s law with mostly stable budgets. In 2012, however, a new exponential trend began: increasing budgets.

This 3.4-month doubling time cannot be reliably extrapolated, because the increasing-budget trend is not sustainable: it would result in the following approximate training costs (without hardware improvements):

2021 | $10-100M

2022 | $100M-1B

2023 | $1-10B

2024 | $10-100B

2025 | $100B-1T

Clearly, without a radical shift in the field, this trend could only continue for a limited time. Astronomical as these figures appear, the cost of the necessary supercomputers would be even higher.
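Those approximations follow from compounding GPT-3’s estimated $4.6 million training cost at the 3.4-month doubling rate; a short sketch of that calculation (mine, using only the figures above):

```python
base_cost = 4.6e6             # estimated GPT-3 training cost in 2020 (USD)
doubling_time_months = 3.4

for year in range(2021, 2026):
    months = (year - 2020) * 12
    cost = base_cost * 2 ** (months / doubling_time_months)
    print(f"{year} | ${cost / 1e6:,.0f}M")
# 2021 ≈ $53M, 2022 ≈ $614M, 2023 ≈ $7,100M, 2024 ≈ $82,000M, 2025 ≈ $946,000M
```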

Costs have moved beyond mere academic budgets and into the domain of large corporations, and the extrapolated figures will soon exceed even their limits.

The annual research and development expenditure of Google’s parent company, Alphabet, was $26 billion in 2019; I have extrapolated their published R&D budgets to 2030 in Figure 3.

Figure 3, R&D budget for Alphabet to 2030
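Figure 3 is based on Alphabet’s published R&D figures. The sketch below is only an approximate reconstruction: it assumes a constant annual growth rate of about 8% and an inflation rate of about 2.3% (both my illustrative assumptions, not numbers from Alphabet), which roughly reproduces the headline 2030 figures:

```python
rd_2019 = 26e9          # Alphabet R&D spend in 2019 (USD)
growth_rate = 0.078     # assumed constant annual growth rate (illustrative)
inflation = 0.023       # assumed annual inflation for the 2020-dollar conversion

rd_2030_nominal = rd_2019 * (1 + growth_rate) ** (2030 - 2019)
rd_2030_real = rd_2030_nominal / (1 + inflation) ** (2030 - 2020)

print(f"2030 nominal: ${rd_2030_nominal / 1e9:.1f}B")        # ≈ $59B, just below $60B
print(f"2030 in 2020 dollars: ${rd_2030_real / 1e9:.1f}B")   # ≈ $47B
```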

By 2030, Alphabet’s R&D budget should be just below $60 billion (approximately $46.8 billion in 2020 dollars). So how much would Google, or a competitor, be willing to spend training a giant model?

Well, to put those figures into perspective: the international translation services market is currently worth $43 billion, and judging from the success of GPT-3 in NLP, its successors may be capable of absorbing a good chunk of that. So that domain alone could seemingly justify $1B+ training runs. And what about other domains within NLP, like programming assistants?

Investors are already willing to put up massive amounts of capital for speculative AI tech; the self-driving car domain had disclosed investments of $80 billion from 2014-2017, per a report from Brookings. With those kinds of figures, even a $10 billion training run doesn’t seem unrealistic if the resulting model were powerful enough to justify it.

My estimate is that by 2030 the training run cost for the largest models will be in the $1-10 billion range (with total system costs higher still). Compared to the single-digit-millions training cost of frontier 2020 systems, that estimate represents 1,000-10,000x larger training runs.

Hardware Improvements

Moore’s Law had a historic 2-year doubling time that has since slowed. While it originally referred only to increases in transistor count, it is now commonly used to refer to performance increases. Some have predicted its stagnation as early as the midpoint of this decade (including Gordon Moore himself), but that is contested. More exotic paths forward, such as non-silicon materials and 3D stacking, have yet to be explored at scale, but research continues.

The microprocessor engineer Jim Keller stated in February 2020 that he doesn’t think Moore’s law is dead: current transistors, which measure roughly 1000x1000x1000 atoms, can be reduced to 10x10x10 atoms before quantum effects (which occur at 2-10 atoms) stop any further shrinking, an effective 1,000,000x reduction in size. Keller expects 10-20 more years of shrinking, with performance increases also coming from other areas of chip design. Finally, Keller says that transistor count growth has recently slowed to a ‘shrink factor’ of 0.6 every two years, rather than the traditional 0.5. If that trend holds, it will result in a 12.8x increase in performance over 10 years.
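The 12.8x figure follows from compounding that shrink factor (my arithmetic): a 0.6 shrink every two years means transistor count, and with it the assumed performance, grows by a factor of 1/0.6 per two-year period.

```python
shrink_factor = 0.6        # Keller's recent per-2-year shrink factor
periods = 10 / 2           # five 2-year periods in a decade

density_gain = (1 / shrink_factor) ** periods
print(f"~{density_gain:.2f}x transistor count (≈ performance) in 10 years")  # ≈ 12.86x
```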

But hardware improvements for AI need not come just from Moore’s law. Other sources of improvement, such as neuromorphic chips designed specifically for running neural nets, or specialised giant chips, could deliver greater performance for AI.

By the end of the decade, I estimate we should see an 8-13x improvement in hardware performance.

Conclusions and Comparisons

If we put my estimates for algorithmic improvements, increased budgets and hardware improvements together, we see what equivalent-compute multiplier we might expect a frontier 2030 system to have over a frontier 2020 system.

Estimates for 2030:

Algorithmic Improvements: 100-1000x

Budget Increases: 1000-10,000x

Hardware Improvements: 8-13x

That results in an 800,000x to 130,000,000x multiplier in equivalent compute.

Between EIGHT HUNDRED THOUSAND and ONE HUNDRED and THIRTY MILLION.
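The bounds are simply the products of the three estimate ranges, multiplying the low ends together and the high ends together:

```python
algorithmic = (100, 1_000)      # estimated algorithmic improvement by 2030
budget      = (1_000, 10_000)   # estimated increase in training-run budgets
hardware    = (8, 13)           # estimated hardware performance improvement

low  = algorithmic[0] * budget[0] * hardware[0]
high = algorithmic[1] * budget[1] * hardware[1]
print(f"{low:,}x to {high:,}x equivalent compute")  # 800,000x to 130,000,000x
```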

To put those equivalent-compute multipliers into perspective, in terms of what capability they represent, there is only one architecture that seems worth extrapolating them out on: Transformers, specifically GPT-3.

Firstly, let’s relate them to Gwern’s estimates for human vs GPT-3-level perplexity from his blog post On GPT-3. Remember that perplexity is a measure of how well a probability distribution or probability model predicts a sample; lower is better. This is a useful comparison to make because it has been speculated both that human-level prediction on text would represent human-level NLP, and that NLP is an AI-complete problem requiring human-equivalent general faculties.
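To make the metric concrete, here is a minimal example of computing perplexity from a model’s per-token probabilities; the probabilities are made up for illustration and have nothing to do with GPT-3’s actual numbers:

```python
import math

# Probabilities a hypothetical language model assigned to each observed token.
token_probs = [0.25, 0.10, 0.60, 0.05, 0.30]

# Perplexity is the exponential of the average negative log-likelihood per token;
# a lower value means the model predicts the sample better.
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(f"perplexity ≈ {perplexity:.2f}")
```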

Gwern states that his estimate is very rough and relies on unsourced claims from OpenAI about human-level perplexity on benchmarks, and that the absolute prediction performance of GPT-3 is, at a “best guess”, double that of a human. With some “irresponsible” extrapolations of GPT-3 performance curves, he finds that a 2,200,000x increase in compute would bring GPT-3 down to human perplexity. Interestingly, that is not far above the lower bound of the 800,000-130,000,000x equivalent-compute estimate range.

It’s worth stressing: 2030 AI systems could have human-level prediction capabilities if scaling continues.

Another way to put the equivalent-compute multiplier estimates into perspective is to extrapolate the GPT-3 paper’s graph of aggregate performance across benchmarks (done in Figure 4 below). OpenAI stated that the aggregate performance “should not be seen as a rigorous or meaningful benchmark in itself”, so don’t assume it represents human-level capabilities. It is still something worth considering, especially given GPT-3’s current capabilities with ‘only’ 175 billion parameters.

Figure 4, GPT-3 cross-benchmark performance extrapolated

For the 800,000x estimate, few-shot learning accuracy approaches 100%, one-shot is above 90%, and zero-shot is below 80%. For the 130,000,000x estimate, both few- and one-shot learning approach 100% and zero-shot reaches 90%. This extrapolation maps greater equivalent compute cleanly onto more parameters, which may not be reliable, but it gives an idea of what we are dealing with.
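As a toy illustration of what that mapping implies (my own simplification: it treats the equivalent-compute multiplier as a direct multiplier on parameter count, which is exactly the assumption flagged as unreliable above):

```python
gpt3_params = 175e9                     # GPT-3 parameter count
multipliers = (800_000, 130_000_000)    # equivalent-compute bounds from above

for m in multipliers:
    implied_params = gpt3_params * m
    print(f"{m:,}x  ->  {implied_params:.1e} parameters")
# 800,000x -> 1.4e+17 parameters; 130,000,000x -> 2.3e+19 parameters
```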

Ultimately, the point of these extrapolations isn’t necessarily the specific figures or dates, but the clear general trend: not only are much more powerful AI systems coming, they are coming soon.