Are we in an AI overhang?

Over on Developmental Stages of GPTs, orthonormal mentions:

it at least reduces the chance of a hardware overhang.

An overhang is when you have had the ability to build transformative AI for quite some time, but you haven’t because no-one’s realised it’s possible. Then someone does, and surprise! It’s a lot more capable than everyone expected.

I am worried we’re in an overhang right now. I think we currently have the ability to build a system orders of magnitude more powerful than the ones we already have, and I think GPT-3 is the trigger for 100x-larger projects at Google, Facebook and the like, with timelines measured in months.

Investment Bounds

GPT-3 is the first AI system that has obvious, immediate, transformative economic value. While much hay has been made about how much more expensive it is than a typical AI research project, in the wider context of megacorp investment its costs are insignificant.

GPT-3 has been estimated to cost $5m in compute to train, and, looking at the author list and OpenAI’s overall size, maybe another $10m in labour.

Google, Amazon and Microsoft each spend about $20bn/year on R&D and another $20bn each on capital expenditure. Very roughly, that totals to around $100bn/year. Against this budget, dropping $1bn or more on scaling GPT up by another factor of 100x is entirely plausible right now. All that’s necessary is that tech executives stop thinking of natural language processing as cutesy blue-sky research and start thinking in terms of quarters-till-profitability.
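
As a back-of-the-envelope check on that claim, the arithmetic is sketched below; the dollar figures are the rough approximations quoted above, not audited financials.

```python
# Back-of-the-envelope check on the investment bound, using the rough
# figures quoted above (approximations, not audited financials).

companies = 3                    # Google, Amazon, Microsoft
rnd_each = 20e9                  # ~$20bn/year on R&D
capex_each = 20e9                # ~$20bn/year on capital expenditure

combined_budget = companies * (rnd_each + capex_each)
print(f"Combined annual budget: ${combined_budget / 1e9:.0f}bn")   # ~$120bn, i.e. ~$100bn very roughly

scaled_up_training_run = 1e9     # a $1bn, 100x-scale GPT project
print(f"Share of one year's budget: {scaled_up_training_run / combined_budget:.1%}")   # ~0.8%
```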

A concrete example is Waymo, which is raising $2bn investment rounds, and that’s for a technology with a much longer road to market.

Compute Cost

The other side of the equation is compute cost. The $5m GPT-3 training cost estimate comes from using V100s at $10k/unit and 30 TFLOPS, which is the chip’s performance without tensor cores. Amortized over a year, this gives you about $1000/PFLOPS-day.

However, this cost is driven up an order of magnitude by NVIDIA’s monopolistic cloud contracts, while performance will be higher once tensor cores are taken into account. The current hardware floor is nearer to the RTX 2080 Ti’s $1k/unit for 125 tensor-core TFLOPS, which gives you $25/PFLOPS-day. This roughly aligns with AI Impacts’ current estimates, and offers another >10x speedup to our model.
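
For reference, here is a quick sketch of where those two price points come from, using the unit prices and throughput figures quoted above.

```python
# Where the two $/PFLOPS-day figures come from: amortize each card over a
# year of continuous use and divide by its sustained throughput. Unit
# prices and TFLOPS are the approximate figures quoted in the text.

def dollars_per_pflops_day(unit_price, tflops, amortization_days=365):
    pflops = tflops / 1000                       # sustained PFLOPS per card
    return (unit_price / amortization_days) / pflops

# V100 at cloud-contract economics: $10k/unit, ~30 TFLOPS without tensor cores
print(dollars_per_pflops_day(10_000, 30))        # ~$913, i.e. roughly $1000/PFLOPS-day

# RTX 2080 Ti as a hardware floor: $1k/unit, ~125 tensor-core TFLOPS
print(dollars_per_pflops_day(1_000, 125))        # ~$22, i.e. roughly $25/PFLOPS-day
```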

I strongly suspect other bottlenecks stop you from hitting that kind of efficiency, or GPT-3 would have happened much sooner, but I still think $25/PFLOPS-day is a useful lower bound.

Other Constraints

I’ve focused on money so far because most of the current 3.5-month doubling times come from increasing investment. But money aside, there are a couple of other things that could prove to be the binding constraint.

  • Scaling law breakdown. The GPT series’ scaling is expected to break down around 10k PFLOPS-days (§6.3), which is a long way short of the amount of cash on the table.

    • This could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further. More likely I’m misunderstanding something.

  • Sequence length. GPT-3 uses 2048 tokens at a time, and that’s with an efficient encoding that cripples it on many tasks. With the naive architecture, increasing the sequence length is quadratically expensive, and getting up to novel-length sequences is not very likely.

  • Data availability. From the same paper as the previous point, dataset size rises with the square root of compute; a 1000x larger GPT-3 would want 10 trillion tokens of training data (a quick check of this and the other bullet-point numbers is sketched after this list).

    • It’s hard to find a good estimate of total words ever written, but our library of 130m books alone would exceed 10tn words. Considering books are a small fraction of our textual output nowadays, it shouldn’t be difficult to gather sufficient data in one place once you’ve decided it’s a useful thing to do. So I’d be surprised if this was binding.

  • Bandwidth and latency. Networking 500 V100s together is one challenge, but networking 500k V100s is another entirely.

    • I don’t know enough about distributed training to say whether this is a very sensible constraint or a very dumb one. I think it has a chance of being a serious problem, but it’s also the kind of thing you can design algorithms around. Validating such algorithms might take more than a few months, however.

  • Hardware availability. From the estimates above there are about 500 GPU-years in GPT-3, or, based on a one-year training window, $5m worth of V100s at $10k apiece. This is about 1% of NVIDIA’s quarterly datacenter sales. A 100x scale-up by multiple companies could saturate this supply.

    • This constraint can obviously be loosened by increasing production, but that would be hard to do on a timescale of months.

  • Commoditization. If many companies go for huge NLP models, the profit each company can extract is driven towards zero. Unlike with other capex-heavy research, such as pharma, there’s no IP protection for trained models. If you expect profit to be marginal, you’re less likely to drop $1bn on your own training program.

    • I am skeptical of this being an important factor while there are lots of legacy, human-driven systems to replace. Replacing those systems should be more than enough incentive to fund many companies’ research programs. Longer term, the effects of commoditization might become more important.

  • Inference costs. The GPT-3 paper (§6.3) gives 0.4 kWh per 100 pages of output, which works out to about 500 pages per dollar, eyeballing hardware cost as 5x electricity. Scale up 1000x and you’re at $2/page, which is cheap compared to humans but no longer quite as easy to experiment with.

    • I’m skeptical of this being a binding constraint. $2/page is still very cheap.
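
A few of the bullet-point numbers above can be sanity-checked with the same kind of arithmetic. The sketch below assumes GPT-3’s training set was roughly 300bn tokens and that electricity costs roughly $0.10/kWh; neither figure is given above, so treat them as assumptions.

```python
# Rough sanity checks on the data, hardware and inference bullets above.
# Two inputs are assumptions rather than figures from the text: GPT-3's
# ~300bn training tokens, and electricity at ~$0.10/kWh.

# Data availability: dataset size grows with the square root of compute,
# so a 1000x-compute model wants sqrt(1000) ~ 32x more tokens.
gpt3_tokens = 300e9                                   # assumed GPT-3 training set size
print(f"Tokens wanted at 1000x: {gpt3_tokens * 1000 ** 0.5:.1e}")   # ~9.5e12, i.e. ~10 trillion

# Hardware availability: $5m of compute at $10k per V100 over a one-year run.
print(f"GPU-years in GPT-3: {5e6 / 10e3:.0f}")                      # ~500

# Inference: 0.4 kWh per 100 pages, with total cost eyeballed at ~5x the
# electricity alone (one reading of the "hardware cost as 5x electricity" figure).
electricity_per_page = 0.4 / 100 * 0.10               # dollars of electricity per page
cost_per_page = 5 * electricity_per_page
print(f"Pages per dollar today: {1 / cost_per_page:.0f}")           # ~500
print(f"Cost per page at 1000x: ${1000 * cost_per_page:.2f}")       # ~$2
```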

Beyond 1000x

Here we go from just pointing at big numbers to straight-up theorycrafting.

In all, tech investment as it is today plausibly supports another 100x-1000x scale-up in the very near term. If we get to 1000x (roughly 1 ZFLOPS-day per model, $1bn per model), then there are a few paths open.

I think the key question is whether, by 1000x, a GPT successor is obviously superior to humans over a wide range of economic activities. If it is, and I think it’s plausible that it will be, then further investment will arrive through the usual market mechanisms, until the largest models are being allocated a substantial fraction of global GDP.

On paper that leaves room for another 1000x scale-up as per-model spending reaches towards $1tn, though current market mechanisms aren’t really capable of that scale of investment. Left to the market as-is, I think commoditization would kick in as the binding constraint.
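
Putting the earlier figures together, the ladder of per-model costs looks roughly like this; the ~5x discount for cheaper-than-cloud compute is an assumption on my part, sitting between today’s cloud prices and the >10x hardware floor discussed above.

```python
# Rough ladder of per-model training costs implied by the figures above.
# The ~5x discount for cheaper compute (tensor cores, better-than-cloud
# pricing) is an assumed value between the quoted bounds, not a stated figure.

gpt3_compute_cost = 5e6            # ~$5m at today's cloud prices
cheaper_compute = 5                # assumed achievable discount on those prices

cost_at_1000x = gpt3_compute_cost * 1000 / cheaper_compute
print(f"1000x model: ~${cost_at_1000x / 1e9:.0f}bn")            # ~$1bn per model

cost_at_next_1000x = cost_at_1000x * 1000
print(f"A further 1000x: ~${cost_at_next_1000x / 1e12:.0f}tn")  # ~$1tn per model
```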

That’s from the perspective of today’s market, though. Transformative AI might enable $100tn-market-cap companies, or nation-states could pick up the torch. The Apollo Program consumed a share of GDP equivalent to about $1tn today, so this degree of public investment is possible in principle.

The even more extreme path is if by 1000x you’ve got something that can design better algorithms and better hardware. Then I think we’re in the hands of Christiano’s slow takeoff, with its four-year GDP doubling.

That’s all assuming performance continues to improve, though. If by 1000x the model is not obviously a challenger to human supremacy, then things will hopefully slow down to ye olde fashioned 2010s-Moore’s-Law rates of progress, and we can rest safe in the arms of something that’s merely HyperGoogle.