Reasons compute may not drive AI capabilities growth

How long it will be before humanity is capable of creating general AI is an important factor in discussions of the importance of doing AI alignment research, as well as discussions of which research avenues have the best chance of success. One frequently discussed model for estimating AI timelines is that AI capabilities progress is essentially driven by growing compute capabilities. For example, the OpenAI article on AI and Compute presents a compelling narrative, showing a trend of well-known results in machine learning using exponentially more compute over time. This is an interesting model because, if valid, we can do some quantitative forecasting, due to somewhat smooth trends in compute metrics which can be extrapolated. However, I think there are a number of reasons to suspect that AI progress is driven more by engineer and researcher effort than by compute.

I think there's a spectrum of models between:

  • We have an abundance of ideas that aren't worth the investment to try out yet. Advances in compute capability unlock progress by making it economically feasible to research more expensive techniques. We'll be able to create general AI soon after we have enough compute to do it.

  • Research proceeds at its own pace and makes use of as much compute as is convenient, to save researcher time on optimization and achieve flashy results. We'll be able to create general AI once we come up with all the right ideas behind it, and either:

    • We'll already have enough compute to do it

    • We won't have enough compute and we'll start optimizing, invest more in compute, and possibly start truly being bottlenecked on compute progress.

My research hasn't pointed too solidly in either direction, but below I discuss a number of the reasons I've thought of that might point towards compute not being a significant driver of progress right now.

There are many ways to train more efficiently that aren't widely used

Starting in October 2017, the Stanford DAWNBench contest challenged teams to come up with the fastest and cheapest ways to train neural nets to solve certain tasks.

The most interesting was the ImageNet training time contest. The baseline entry took 10 days and cost $1112; less than one year later the best entries (all by the fast.ai team) were down to 18 minutes for $35, 19 minutes for $18, or 30 minutes for $14[^1]. This is ~800x faster and ~80x cheaper than the baseline.

Some of this was just using more and better hardware: the winning team used 128 V100 GPUs for 18 minutes and 64 for 19 minutes, versus eight K80 GPUs for the baseline. However, substantial improvements were made even on the same hardware. The training time on a p3.16xlarge AWS instance with eight V100 GPUs went down from 15 hours to 3 hours in 4 months. The training time on a single Google Cloud TPU went down from 12 hours to 3 hours as the Google Brain team tuned their training and incorporated ideas from the fast.ai team. An even larger improvement was seen recently on the CIFAR10 contest, with times on a p3.2xlarge improving by 60x; the accompanying blog series still mentions multiple improvements left on the table due to effort constraints, and its author speculates that many of the optimizations would also improve the ImageNet times.

The main techniques used for fast training were all known techniques: progressive resizing, mixed precision training, removing weight decay from batchnorms, scaling up batch size in the middle of training, and gradually warming up the learning rate. They just required engineering effort to implement and weren't already implemented in the library defaults.
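
To make one of these concrete, here's a minimal PyTorch sketch (my own illustration, not the DAWNBench code) of removing weight decay from batchnorm and bias parameters by putting them in a separate parameter group. The model and hyperparameter values are placeholders:

```python
import torch
import torch.nn as nn

# Placeholder model, just to have some batchnorm and conv parameters to sort.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

decay, no_decay = [], []
for p in model.parameters():
    # Batchnorm scales/shifts and biases are 1-D tensors; give them no weight decay.
    (no_decay if p.ndim == 1 else decay).append(p)

optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 5e-4},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.1, momentum=0.9)
```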

Similarly, the improvement due to scaling from eight K80s to many machines with V100s was partially hardware but also required lots of engineering effort to implement: using mixed precision fp16 training (required to take advantage of the V100 Tensor Cores), efficiently using the network to transfer data, implementing the techniques required for large batch sizes, and writing software for supervising clusters of AWS spot instances.
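
For illustration, here's roughly what mixed precision training looks like using PyTorch's current automatic mixed precision utilities. This is a sketch of the general technique, not the code those teams actually wrote (they had to manage fp32 master weights and loss scaling by hand):

```python
import torch

# Loss scaler that keeps fp16 gradients from underflowing.
scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, loss_fn, x, y):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in fp16 where it's safe
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()     # backprop the scaled loss
    scaler.step(optimizer)            # unscale gradients and update fp32 master weights
    scaler.update()                   # adjust the loss scale for the next step
```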

These results seem to show that it's possible to train much faster and cheaper by applying knowledge and sufficient engineering effort. Interestingly, not even a team at Google Brain working to show off TPUs initially had all the code and knowledge required to get the best available performance, and had to gradually work for it.

I would suspect that in a world where we were bottlenecked hard on training times, these techniques would be more widely known and applied, with implementations of them readily available in every major machine learning library. Interestingly, in postscripts to both of his articles on how fast.ai managed to achieve such fast times, Jeremy Howard notes that he doesn't believe large amounts of compute are required for important ML research, and that many foundational discoveries were made with little compute.

[^1]: Using spot/preemptible instance pricing instead of the on-demand pricing the benchmark page lists, due to the much lower prices and the lack of need for on-demand instances given the short training times. The authors of the winning solution wrote software to effectively use spot instances and actually used them for their tests. It may seem unfair to use spot prices for the winning solution but not for the baseline, but a lot of the improvement in the contest came from actually using all the available techniques for faster/cheaper training despite the inconvenience: they had to write software to easily use spot instances, and their training times were short enough that spot instances were viable without fancy software to automatically transfer training to new machines.

Hyperparameter grid searches are inefficient

I've heard hyperparameter grid searches mentioned as a reason why ML research needs way more compute than it would appear based on the training time of the models used. However, I can also see the use of grid searches as evidence of an abundance of compute rather than a scarcity.

As far as I can tell it's possible to find hyperparameters much more efficiently than with a grid search, it just takes more human time and engineering implementation effort. There's a large literature of more efficient hyperparameter search methods, but as far as I can tell they aren't very popular (I've never heard of anyone using one in practice, and all the open source implementations of these kinds of things I can find have few GitHub stars).
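
As a toy illustration of why grid search is a weak baseline: with the same budget of 16 trials, a grid only ever tries 4 distinct values of each hyperparameter, while a random search tries 16 distinct values of each. The search ranges and budget below are made up:

```python
import random

# Grid search: 4 learning rates x 4 weight decays = 16 trials,
# but only 4 distinct values of each hyperparameter get tested.
grid_trials = [(lr, wd)
               for lr in (1e-4, 1e-3, 1e-2, 1e-1)
               for wd in (1e-5, 1e-4, 1e-3, 1e-2)]

# Random search: the same 16 trials, but 16 distinct values of each
# hyperparameter, sampled log-uniformly over the same ranges.
random_trials = [(10 ** random.uniform(-4, -1),    # learning rate
                  10 ** random.uniform(-5, -2))    # weight decay
                 for _ in range(16)]
```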

Researcher Leslie Smith also has a number of papers with little-used ideas on principled approaches to choosing and searching for optimal hyperparameters with much less effort, including a fast automatic procedure for finding optimal learning rates. This suggests that it's possible to trade hyperparameter search time for more engineering, human decision-making and research effort.
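
Here's a rough sketch of the idea behind a learning-rate range test: run a short training pass while exponentially increasing the learning rate, record the loss, and pick a rate somewhat below where the loss blows up. This is my paraphrase of the general procedure, not Smith's exact algorithm:

```python
import math
import torch

def lr_range_test(model, optimizer, loss_fn, data_iter,
                  lr_min=1e-7, lr_max=1.0, steps=100):
    # In practice you'd run this on a throwaway copy of the model,
    # since the weights get modified along the way.
    lrs, losses = [], []
    for step in range(steps):
        # Exponentially sweep the learning rate from lr_min to lr_max.
        lr = lr_min * (lr_max / lr_min) ** (step / (steps - 1))
        for group in optimizer.param_groups:
            group["lr"] = lr
        x, y = next(data_iter)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(loss.item())
        if math.isnan(losses[-1]) or losses[-1] > 4 * min(losses):
            break  # stop once the loss diverges
    return lrs, losses
```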

There's also likely room for improvement in how we factor the hyperparameters we use, so that they're more amenable to separate optimization. For example, L2 regularization is usually used in place of weight decay because the two are theoretically equivalent, but this paper points out that with Adam they are not: using true weight decay causes Adam to surpass the more popular SGD with momentum in practice, and weight decay is also the better-behaved hyperparameter, since the optimal weight decay is more independent of the learning rate than the optimal L2 regularization strength is.
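
As a sketch of the distinction, with a toy model and made-up hyperparameter values: adding an L2 penalty to the loss lets Adam's adaptive per-parameter scaling distort the effective regularization, while decoupled weight decay shrinks the weights directly in the update step (recent PyTorch versions ship this as AdamW):

```python
import torch
import torch.nn as nn

# Toy model, just to have parameters to regularize.
model = nn.Linear(10, 1)

# Option 1: L2 regularization folded into the loss. This is effectively what
# the `weight_decay` argument of plain Adam does through the gradients, and
# Adam's adaptive scaling then changes the effective strength per parameter.
def loss_with_l2(loss, params, l2=1e-4):
    return loss + l2 * sum((p ** 2).sum() for p in params)

# Option 2: decoupled weight decay, applied directly to the weights.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```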

All of this suggests that most researchers might be operating with an abundance of cheap compute relative to their problems, which leads them not to invest the effort required to optimize their hyperparameters more efficiently, and instead to do so haphazardly or with grid searches.

The types of compute we need may not improve very quickly

Improvements in computing hardware are not uniform, and there are many different hardware attributes that can be bottlenecks for different things. AI progress may rely on one or more of these that don't end up improving quickly, becoming bottlenecked on the slowest one rather than experiencing exponential growth.

Machine learning accelerators

Modern machine learning is largely composed of large operations that are either directly matrix multiplies or can be decomposed into them. It's also possible to train using much lower precision than full 32-bit floating point using some tricks. This allows the creation of specialized training hardware like Google's TPUs and Nvidia Tensor Cores. A number of other companies have also announced they're working on custom accelerators.

The first generation of specialized hardware delivered a large one-time improvement, but we can also expect continuing innovation in accelerator architecture. There will likely be sustained innovations in training with different number formats and architectural optimizations for faster and cheaper training. I expect this is the area where our compute capability will grow the most, but it may flatten like CPUs have once we figure out enough of the easily discoverable improvements.

CPUs

Reinforcement learning simulations like the OpenAI Five DOTA bot, and various physics playgrounds, often use CPU-heavy serial simulations. OpenAI Five uses 128,000 CPU cores and only 256 GPUs. At current Google Cloud preemptible prices the CPUs cost 5-10x more than the GPUs in total. Improvements in machine learning training ability will still leave the large cost of the CPUs. If the use of expensive simulations that run best on CPUs becomes an important part of training advanced agents, progress may become bottlenecked on CPU cost.
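
A back-of-the-envelope version of that comparison, using preemptible prices that are my rough assumptions rather than OpenAI's published numbers:

```python
# Illustrative preemptible prices; the exact figures are assumptions.
cpu_cores = 128_000
gpus = 256
price_per_core_hour = 0.01   # assumed preemptible price per vCPU-hour, USD
price_per_gpu_hour = 0.74    # assumed preemptible price per V100-hour, USD

cpu_cost = cpu_cores * price_per_core_hour   # ~$1,280 per hour
gpu_cost = gpus * price_per_gpu_hour         # ~$190 per hour
print(cpu_cost / gpu_cost)                   # roughly 7x, within the 5-10x range cited
```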

Additionally, improvement in CPU compute costs may be slowing. Cloud CPU costs only decreased 45% from 2012 to 2017, and performance per dollar for buying the hardware only improved 2x. Google Cloud Compute prices have only dropped 25% from 2014-2018, although the introduction of preemptible prices at 30% of full price in 2016 was a big improvement, and that decreased to 20% of full price in 2017.

GPU/accelerator memory

Another scarce resource is memory on the GPU/accelerator used for training. The memory must be large enough to store all the model parameters, the input, the gradients, and other optimization parameters.
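
As a rough sketch of why parameter count alone understates the requirement, here's a simplified estimate for a 100M-parameter model trained with Adam in fp32 (activation memory, which often dominates and depends heavily on architecture and batch size, is ignored here):

```python
# Simplified accounting of per-device training memory for a 100M-parameter model.
params = 100e6
bytes_per_float = 4

weights    = params * bytes_per_float       # the model itself
gradients  = params * bytes_per_float       # one gradient per parameter
adam_state = 2 * params * bytes_per_float   # Adam keeps two moment estimates per parameter

fixed_cost_gb = (weights + gradients + adam_state) / 1e9
print(fixed_cost_gb)  # ~1.6 GB before counting activations and inputs
```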

This is one of the most frequent limits I see referenced in machine learning papers nowadays. For example, the new large BERT language model can only be trained properly on TPUs with their 64GB of RAM. The Glow paper needs to use gradient checkpointing and an alternative to batchnorm so that they can use gradient accumulation, because only a single sample of gradients fits on a GPU.

However, there are ways to address this limitation that aren't frequently used. Glow already uses the two best ones, gradient checkpointing and gradient accumulation, but did not implement an optimization they mention that would make the amount of memory the model takes constant in the number of layers instead of linear, likely because it would be difficult to engineer into existing ML frameworks. The BERT implementation uses none of the techniques because they just use a TPU with enough memory; in fact, a reimplementation of BERT implemented 3 such techniques and got it to fit on a GPU. Thus it still seems that in a world with less RAM these results might still have happened, just with more difficulty or with smaller demonstration models.
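
For reference, gradient accumulation itself is simple; the engineering cost is mostly in fitting it cleanly into an existing training loop and framework. A minimal sketch, not Glow's or BERT's actual code:

```python
import torch

# Run several small "micro-batches" and only step the optimizer after their
# gradients have been summed, trading wall time for memory.
accumulation_steps = 8

def train_epoch(model, optimizer, loss_fn, loader):
    optimizer.zero_grad()
    for i, (x, y) in enumerate(loader):
        loss = loss_fn(model(x), y) / accumulation_steps  # average over the virtual batch
        loss.backward()                                   # gradients accumulate in .grad
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```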

Interestingly, the maximum available RAM per device barely changed from 2014 through 2017 with the NVIDIA K80's 24GB, but then shot up in 2018 to 48GB with the RTX 8000, as well as the 64GB TPU v2 and 128GB TPU v3. This is probably due both to demand for larger device memories for machine learning training and to the availability of high-capacity HBM memory. It's unclear to me if this rapid rise will continue or if it was mostly a one-time change reflecting new demands for the largest possible memories reaching the market.

It's also possible that per-device memory will cease to be a constraint on model size due to faster hardware interconnects that allow sharing a model across the memory of multiple devices, as Intel's Nervana and Tensorflow Mesh plan to do. It also seems likely that techniques for splitting models across devices to fit in memory, like the original AlexNet did, will become more popular. The fact that we don't split models across devices like AlexNet anymore may itself be evidence that we're not very constrained by RAM, but I'm not sure.
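
A minimal sketch of one simple way to split a model across two devices, layer-wise rather than the channel-wise split the original AlexNet used. The device names and layer sizes are placeholders, and it assumes two CUDA devices are available:

```python
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Each half of the model lives in a different device's memory.
        self.first = nn.Linear(1024, 4096).to("cuda:0")
        self.second = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = torch.relu(self.first(x.to("cuda:0")))
        return self.second(x.to("cuda:1"))  # only activations cross between devices
```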

Limited ability to exploit parallelism

As discussed extensively in a new paper from Google Brain, there seems to be a limit on how much data parallelism, in the form of larger batch sizes, we can currently extract out of a given model. If this constraint isn't worked around, the wall time to train models could stall even if compute power continues to grow.

However, the paper mentions that various things like model architecture and regularization affect this limit, and I think it's pretty likely that techniques to increase the limit will continue to be discovered, so that it isn't a bottleneck. A newer paper by OpenAI finds that more difficult problems also tolerate larger batch sizes. Even if the limit remains, increasing compute would allow training more different models in parallel, potentially just meaning that more parameter search and evolution gets layered on top of the training. I also suspect that just using ever-larger models may allow use of more compute without increasing batch sizes.

At the moment, it seems that we know how to train effectively with batch sizes large enough to saturate large clusters, for example this paper about training ImageNet in 7 minutes with a 64k batch size. But this requires extra tuning and implementing some tricks, even just to train on mid-size clusters, so as far as I know only a small fraction of all machine learning researchers regularly train on large clusters (anecdotally; I'm uncertain about this).
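
One of the standard tricks such papers rely on is the linear scaling rule with warmup: scale the learning rate in proportion to the batch size and ramp up to it gradually. A sketch with placeholder values, not the recipe from that specific paper:

```python
# Linear scaling rule: scale the learning rate with the batch size,
# and warm up to the scaled rate over the first few epochs.
base_lr = 0.1
base_batch_size = 256
batch_size = 65536          # the "64k" batch size
warmup_epochs = 5

scaled_lr = base_lr * batch_size / base_batch_size   # 25.6 here

def lr_at(epoch):
    # Linearly warm up, then hold; a real schedule would also decay later.
    if epoch < warmup_epochs:
        return scaled_lr * (epoch + 1) / warmup_epochs
    return scaled_lr
```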

Conclusion

These all seem to point towards compute being abundant and ideas being the bottleneck, but not solidly. For the points about training efficiency and grid searches, this could just be an inefficiency in ML research, and all the major AGI progress will be made by a few well-funded teams at the boundaries of modern compute that have solved these problems internally.

Vaniver commented on a draft of this post that it's interesting to consider the case where training time is the bottleneck rather than ideas, but massive engineering effort is highly effective at reducing training time. In this case an increase in investment in AI research that leads to hiring more engineers to apply techniques to speed up training could lead to rapid progress. This world might also lead to more sizable differences in capabilities between organizations, if large, somewhat serial software engineering investments are required to make use of the most powerful techniques, rather than a well-funded newcomer being able to just read papers and buy all the necessary hardware.

The course of various compute hardware attributes seems uncertain, both in terms of how fast they'll progress and whether or not we'll need to rely on anything other than special-purpose accelerator speed. Since the problem is complex with many unknowns, I'm still highly uncertain, but all of these points did move me to varying degrees in the direction of continuing compute growth not being a driver of dramatic progress.

Thanks to Vaniver and Buck Shlegeris for discussions that led to some of the thoughts in this post.