Measuring hardware overhang

How can we measure a potential AI or hardware overhang? For the problem of chess, modern algorithms gained two orders of magnitude in compute (or ten years in time) compared to older versions. While it took the supercomputer "Deep Blue" to defeat world champion Garry Kasparov in 1997, today's Stockfish program achieves the same Elo level on a 486-DX4-100 MHz from 1994. In contrast, neural network chess algorithms scale worse to slower hardware (and are more difficult to implement there) than classical algorithms. Similarly, future algorithms will likely be able to leverage today's hardware better by 2-3 orders of magnitude. I would be interested in extending this scaling relation to AI problems other than chess to check its universality.


Hardware overhang is a situation where sufficient compute is available, but the algorithms are suboptimal. It is relevant if we build an AGI with large initial (training) costs but cheap run costs. Once built, the AGI might run on many comparatively slow machines. That is a hardware overhang with a risk of exponential speed-up. This asymmetry exists for current neural networks: creating them requires orders of magnitude more compute than running them. On the other hand, The Bitter Lesson by Rich Sutton argues that the increase in computation matters much more (orders of magnitude) than clever algorithms (a factor of two or less). In the following, I will examine the current state of the algorithmic art using chess as an example.

The example of chess

One of the most well-researched AI topics is chess. It has a long history of algorithms, going back to a program on the 1956 MANIAC. It is comparatively easy to measure the quality of a player by their Elo score. As an instructive example, we examine the most symbolic event in computer chess: in 1997, the IBM supercomputer "Deep Blue" defeated the reigning world chess champion under tournament conditions. The win was taken as a sign that artificial intelligence was catching up to human intelligence.
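For readers unfamiliar with the rating system: an Elo difference translates directly into an expected game score. A minimal sketch of the standard formula (not specific to any engine):

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win = 1, draw = 0.5, loss = 0) of player A
    against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Equal ratings give an expected score of 0.5; a 150-point edge
# (e.g. 2850 vs. 2700) gives roughly 0.70 points per game.
```

This is why large match samples (like the 10,000 games below) are needed: small Elo gaps shift the expected score only slightly, so many games are required to resolve them.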

By today's standards, Deep Blue used simple algorithms. Its strength came from computing power: it was an RS/6000-based system with 30 nodes, each with a 120 MHz CPU, plus 480 special-purpose VLSI chess chips. For comparison, a common computer at the time was the Intel Pentium II at 300 MHz.

Method: An experiment using a 2020 chess engine

We may wonder: how do modern (better) chess algorithms perform on slower hardware? I tested this with Stockfish version 8 (SF8), one of the strongest classical chess engines. Using cutechess-cli, I simulated 10,000 matches of SF8 against slower versions of itself, and against a series of older engines for calibration. In these benchmarks, I varied the total number of nodes to be searched during each game and kept the RAM constant (this may be unrealistic for very old machines, see below). By assuming a fixed thinking time per game, the experiments scale out to slower machines. By cross-correlating various old benchmarks of Stockfish and other engines on older machines, I matched these ratings to units of MIPS, and finally matched MIPS approximately to the calendar year. Depending on the actual release dates of the processors, the year axis has a jitter of up to 2 years. I estimate the error of the compute estimates to be perhaps 20%, and certainly less than 50%. As we will see, the results measure in orders of magnitude, so these errors are small in comparison (<10%).
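The core idea of the setup (capping total search nodes in proportion to the target machine's speed, rather than running on real old hardware) can be sketched as follows. The nodes-per-second-per-MIPS calibration constant and the time budget here are hypothetical values, chosen only to illustrate the scaling, not the ones used in the experiment:

```python
# Sketch: emulate a slower machine by capping the total nodes an
# engine may search per game, proportional to that machine's MIPS.
NODES_PER_MIPS = 10_000      # hypothetical calibration: nodes/sec per MIPS
GAME_SECONDS = 40 * 3 * 60   # rough tournament budget: 40 moves at 3 min each

def node_budget(mips: float) -> int:
    """Total search nodes per game for a machine of the given MIPS."""
    return int(mips * NODES_PER_MIPS * GAME_SECONDS)

# A ~70 MIPS 486-DX4-100 gets roughly 3 orders of magnitude fewer
# nodes per game than a 10^5 MIPS modern CPU.
ratio = node_budget(1e5) / node_budget(70)
```

Because the cap is on nodes rather than wall-clock time, the same benchmark runs quickly on modern hardware while emulating the search depth a slower machine could reach.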


SF8 achieves Kasparov's 2850 Elo running on a 486-100 MHz introduced in 1994, three years before the Kasparov-Deep Blue match. These Elo ratings refer to tournament conditions as in the 1997 IBM games. In other words, with today's algorithms, computers would have beaten the world chess champion as early as 1994, on a contemporary desktop computer (not a supercomputer).

The full scaling relation is shown in the figure. The gray line shows the Elo rating of Kasparov and Carlsen over time, hovering around 2800. The blue symbols indicate the common top engines at their times. The plot is linear in time and logarithmic in compute; consequently, Elo scales approximately with the logarithm of compute. Finally, the red line shows the Elo of SF8 as a function of compute. Starting from the 2019 rating of ~3400 points, it falls below 3000 when reducing the compute from 10^5 MIPS to a few times 10^3 MIPS, a decrease of 2-3 orders of magnitude. It falls below 2850 Elo, Kasparov's level, at 68 MIPS; for comparison, the 486 achieves 70 MIPS at 100 MHz. At its maximum, the hardware overhang amounts to slightly more than 10 years in time, or 2-3 orders of magnitude in compute. Going back very far (to the era of 386 PCs), the gap shrinks. This is understandable: on very slow hardware, you cannot do very much, no matter what algorithm you have. The orange line shows the scaling relation of a neural network-based chess engine, Leela Chess Zero (LC0), as discussed below.
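The SF8 Elo-versus-compute trend can be roughly interpolated from the two anchor points quoted above (~3400 Elo at 10^5 MIPS, 2850 Elo at 68 MIPS), under the assumption that Elo is linear in the logarithm of compute over this range:

```python
import math

# Anchor points (MIPS, Elo) taken from the measurements above.
anchors = [(1e5, 3400.0), (68.0, 2850.0)]

# Fit a line Elo = intercept + slope * log10(MIPS) through the anchors.
(x1, y1), (x2, y2) = [(math.log10(m), e) for m, e in anchors]
slope = (y1 - y2) / (x1 - x2)   # Elo gained per 10x increase in compute
intercept = y1 - slope * x1

def sf8_elo(mips: float) -> float:
    """Interpolated SF8 Elo on a machine of the given MIPS (rough sketch)."""
    return intercept + slope * math.log10(mips)

# Yields roughly 170-175 Elo per order of magnitude of compute.
```

This two-point interpolation is only a sketch; the measured curve flattens at the low end (the 386 era), so a straight line overestimates Elo on very slow machines.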


I originally ran these tests in 2019. Now (August 2020), SF8 has been superseded by SF11, with another ~150 Elo increase (at today's speed). It remains unclear how much improvement is left for future algorithms when scaled down to a 486-100. I strongly suspect, however, that we are running into diminishing returns here. There is only so much you can do on a "very slow" machine; improvements will never go to infinity. My guess is that the scaling will remain below three orders of magnitude.

Neural network-based algorithms such as AlphaZero or Leela Chess Zero can outperform classical chess engines. However, they are less suited for this comparison: I find that their scaling is considerably worse, especially when not using GPUs. In other words, they do not perform well on the CPUs of slower machines. Depending on the size of the neural net, older machines may even be incapable of executing it at all. In principle, it would be very interesting to make this work: train a network on today's machines, and execute (run) it on a very old (slow) machine. But with current algorithms, the scaling is worse than for SF8. As a reference point, LC0 achieves ~3000 Elo on a Pentium 200 under tournament conditions; SF8 reaches the same level with about half the compute.

Conclusion and future research proposals

Scaling other NN algorithms to slower hardware (with less RAM, etc.) should yield similarly interesting insights. While x86 CPUs have in principle been backwards-compatible since the 1980s, there are several breaking changes that make comparisons difficult. For example, the introduction of modern GPUs produces a speed gap when executing optimized algorithms on CPUs. Also, older 32-bit CPUs are capped at 4 GB of RAM, making the execution of larger models impossible. Looking into the future, it appears likely that similar breaking changes will occur. One recent example is the introduction of TPUs and/or GPUs with large amounts of RAM; without these, it may be impossible to execute certain algorithms. If AGI relies on similar (yet unknown) technologies, the hardware overhang is reduced until more of those units are produced. In that case, the vast amount of old (existing) compute cannot be used.

I would be interested in researching this scaling relation for problems other than chess, such as voice and image recognition. Most problems are harder to measure and benchmark than chess. Will the scalings show a similar 2-3 orders of magnitude hardware overhang? Most certainly, many problems will show similar diminishing returns (or a cap) due to RAM restrictions and wait time. For example, you just cannot run a self-driving car on an Atari, no matter how good the algorithms are. I would be interested in researching this scaling for other AI and ML fields, possibly leading to an academic paper.