Are we in an AI overhang?
Over on Developmental Stages of GPTs, orthonormal mentions:
"...it at least reduces the chance of a hardware overhang."
An overhang is when you have had the ability to build transformative AI for quite some time, but you haven’t because no-one’s realised it’s possible. Then someone does and surprise! It’s a lot more capable than everyone expected.
I am worried we’re in an overhang right now. I think we currently have the ability to build a system orders of magnitude more powerful than anything we already have, and I think GPT-3 is the trigger for 100x larger projects at Google, Facebook and the like, with timelines measured in months.
GPT-3 is the first AI system that has obvious, immediate, transformative economic value. While much hay has been made about how much more expensive it is than a typical AI research project, in the wider context of megacorp investment, its costs are insignificant.
Google, Amazon and Microsoft each spend about $20bn/year on R&D and another $20bn each on capital expenditure. Very roughly, that totals $100bn/year. Against this budget, dropping $1bn or more on scaling GPT up by another factor of 100 is entirely plausible right now. All that’s necessary is that tech executives stop thinking of natural language processing as cutesy blue-sky research and start thinking in terms of quarters-till-profitability.
A concrete example is Waymo, which is raising $2bn investment rounds—and that’s for a technology with a much longer road to market.
The other side of the equation is compute cost. The $5m GPT-3 training cost estimate comes from using V100s at $10k/unit and 30 TFLOPS, which is their performance without tensor cores. Amortized over a year, this gives you about $1000/PFLOPS-day.
However, this cost is driven up an order of magnitude by NVIDIA’s monopolistic cloud contracts, while performance will be higher when taking tensor cores into account. The current hardware floor is nearer to the RTX 2080 Ti’s $1k/unit for 125 tensor-core TFLOPS, and that gives you $25/PFLOPS-day. This roughly aligns with AI Impacts’ current estimates, and makes compute in our cost model more than 10x cheaper.
I strongly suspect other bottlenecks stop you from hitting that kind of efficiency or GPT-3 would’ve happened much sooner, but I still think $25/PFLOPS-day is a useful lower bound.
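To make the arithmetic behind those two figures explicit, here is a minimal sketch. It assumes the only cost is the purchase price spread evenly over one year, with the card running at its quoted throughput the whole time; the function name is my own, and power, hosting and utilization are ignored.

```python
def cost_per_pflops_day(unit_price_usd, tflops, amortization_days=365):
    """Hardware cost per PFLOPS-day, spreading the purchase price over the window.

    Assumes full utilization and no power or hosting costs.
    """
    pflops = tflops / 1000.0
    daily_cost = unit_price_usd / amortization_days
    return daily_cost / pflops


# Figures from the text: V100 at $10k and 30 TFLOPS, RTX 2080 Ti at $1k and 125 TFLOPS.
v100 = cost_per_pflops_day(10_000, 30)
rtx_2080_ti = cost_per_pflops_day(1_000, 125)

print(f"V100:        ${v100:,.0f}/PFLOPS-day")
print(f"RTX 2080 Ti: ${rtx_2080_ti:,.0f}/PFLOPS-day")
print(f"ratio:       {v100 / rtx_2080_ti:.0f}x")
```

The results land a little under the rounded $1000 and $25 used above, which is close enough for an estimate this coarse.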
I’ve focused on money so far because most of the current 3.5-month doubling times come from increasing investment. But money aside, there are a couple of other things that could prove to be the binding constraint.
Scaling law breakdown. The GPT series’ scaling is expected to break down around 10k PFLOPS-days (§6.3), which is a long way short of the amount of cash on the table.
This could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further. More likely I’m misunderstanding something.
Sequence length. GPT-3 uses 2048 tokens at a time, and that’s with an efficient encoding that cripples it on many tasks. With the naive architecture, increasing the sequence length is quadratically expensive, and getting up to novel-length sequences is not very likely.
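A rough sketch of what "quadratically expensive" means in practice. It models only the part of the cost that grows with the square of sequence length and ignores everything that grows linearly, so treat it as an upper-end illustration. The 100k-token figure for a novel is my own assumption, not a number from the GPT-3 paper.

```python
def relative_attention_cost(seq_len, baseline_len=2048):
    """Cost of the quadratic term relative to a baseline sequence length."""
    return (seq_len / baseline_len) ** 2


NOVEL_TOKENS = 100_000  # assumption: a rough token count for a novel

for n in (2048, 4096, 32_768, NOVEL_TOKENS):
    print(f"{n:>7} tokens: {relative_attention_cost(n):,.0f}x the 2048-token cost")
```

Under these assumptions a novel-length context costs a few thousand times as much as GPT-3's context, which is comparable to the whole 1000x scale-up budget discussed here.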
Data availability. From the same paper as the previous point, dataset size rises with the square-root of compute; a 1000x larger GPT-3 would want 10 trillion tokens of training data.
It’s hard to find a good estimate on total-words-ever-written, but our library of 130m books alone would exceed 10tn words. Considering books are a small fraction of our textual output nowadays, it shouldn’t be difficult to gather sufficient data into one spot once you’ve decided it’s a useful thing. So I’d be surprised if this was binding.
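Here is the square-root rule worked through as a sketch. Two inputs are assumptions rather than numbers from this post: the roughly 300bn training tokens reported for GPT-3, and an average of 80k words per book. Words and tokens are also not the same unit, so this is an order-of-magnitude comparison only.

```python
import math

GPT3_TOKENS = 300e9      # assumption: approximate GPT-3 training token count
BOOKS = 130e6            # from the text: library of 130m books
WORDS_PER_BOOK = 80_000  # assumption: rough average book length


def tokens_needed(compute_multiplier, baseline_tokens=GPT3_TOKENS):
    """Dataset size if it grows with the square root of compute."""
    return baseline_tokens * math.sqrt(compute_multiplier)


needed = tokens_needed(1000)
book_words = BOOKS * WORDS_PER_BOOK

print(f"tokens wanted at 1000x: {needed / 1e12:.1f}tn")
print(f"words in books:         {book_words / 1e12:.1f}tn")
```

Both numbers come out around 10tn, which matches the claim that books alone would roughly cover a 1000x model.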
Bandwidth and latency. Networking 500 V100s together is one challenge, but networking 500k V100s is another entirely.
I don’t know enough about distributed training to say whether this is a very sensible constraint or a very dumb one. I think it has a chance of being a serious problem, but I think it’s also the kind of thing you can design algorithms around. Validating such algorithms might take longer than a few months, however.
Hardware availability. From the estimates above there are about 500 GPU-years in GPT-3, or, based on a one-year training window, $5m worth of V100s at $10k apiece. This is about 1% of NVIDIA’s quarterly datacenter sales. A 100x scale-up by multiple companies could saturate this supply.
This constraint can obviously be loosened by increasing production, but it’d be hard to do on a timescale of months.
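The saturation claim follows directly from the 1% figure, as this sketch shows. The quarterly sales number is not looked up anywhere; it is simply what the text's "$5m is about 1%" implies, and the function name and the three-company scenario are hypothetical.

```python
V100_PRICE_USD = 10_000          # from the text
GPT3_HARDWARE_USD = 5_000_000    # from the text
SHARE_OF_QUARTERLY_SALES = 0.01  # from the text: about 1%

gpus = GPT3_HARDWARE_USD / V100_PRICE_USD
implied_quarterly_sales = GPT3_HARDWARE_USD / SHARE_OF_QUARTERLY_SALES


def quarters_of_supply(scale, n_companies=1):
    """Quarters of NVIDIA datacenter sales consumed by n companies each scaling up."""
    return scale * n_companies * GPT3_HARDWARE_USD / implied_quarterly_sales


print(f"GPUs for GPT-3 over one year: {gpus:.0f}")
print(f"implied quarterly sales:      ${implied_quarterly_sales / 1e6:.0f}m")
print(f"one company at 100x:          {quarters_of_supply(100):.1f} quarters")
print(f"three companies at 100x:      {quarters_of_supply(100, 3):.1f} quarters")
```

So a single 100x project would want about a full quarter of datacenter sales, and a handful of competing projects would want most of a year's.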
Commoditization. If many companies go for huge NLP models, the profit each company can extract is driven towards zero. Unlike with other capex-heavy research—like pharma—there’s no IP protection for trained models. If you expect profit to be marginal, you’re less likely to drop $1bn on your own training program.
I am skeptical of this being an important factor while there are lots of legacy, human-driven systems to replace. Replacing those systems should be more than enough incentive to fund many companies’ research programs. Longer term, the effects of commoditization might become more important.
Inference costs. The GPT-3 paper (§6.3) gives 0.4 kWh per 100 pages of output, which works out to 500 pages/dollar if you eyeball hardware cost as 5x electricity. Scale up 1000x and you’re at $2/page, which is cheap compared to humans but no longer quite as easy to experiment with.
I’m skeptical of this being a binding constraint. $2/page is still very cheap.
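For anyone who wants to check the per-page numbers, here is a minimal sketch. The electricity price of $0.10/kWh is my assumption, chosen because it reproduces the 500 pages/dollar figure; I also read "5x electricity" as the total cost being five times the electricity bill, and I assume cost scales linearly with model size.

```python
KWH_PER_100_PAGES = 0.4         # from the text (GPT-3 paper figure)
ELECTRICITY_USD_PER_KWH = 0.10  # assumption: a typical electricity price
HARDWARE_MULTIPLIER = 5         # from the text: total cost eyeballed as 5x electricity


def cost_per_page(scale=1):
    """Dollar cost per page, assuming cost grows linearly with model scale."""
    electricity = KWH_PER_100_PAGES / 100 * ELECTRICITY_USD_PER_KWH
    return electricity * HARDWARE_MULTIPLIER * scale


print(f"GPT-3:  {1 / cost_per_page():.0f} pages per dollar")
print(f"1000x:  ${cost_per_page(1000):.2f} per page")
```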
Here we go from just pointing at big numbers to straight-up theorycrafting.
In all, tech investment as it is today plausibly supports another 100x-1000x scale up in the very-near-term. If we get to 1000x (1 ZFLOPS-day per model, $1bn per model), then there are a few paths open.
I think the key question is if by 1000x, a GPT successor is obviously superior to humans over a wide range of economic activities. If it is—and I think it’s plausible that it will be—then further investment will arrive through the usual market mechanisms, until the largest models are being allocated a substantial fraction of global GDP.
On paper that leaves room for another 1000x scale-up as it reaches up to $1tn, though current market mechanisms aren’t really capable of that scale of investment. Left to the market as-is, I think commoditization would kick in as the binding constraint.
That’s from the perspective of the market today though. Transformative AI might enable $100tn-market-cap companies, or nation-states could pick up the torch. The Apollo Program made for a $1tn-today share of GDP, so this degree of public investment is possible in principle.
The even more extreme path is if by 1000x you’ve got something that can design better algorithms and better hardware. Then I think we’re in the hands of Christiano’s slow takeoff four-year-GDP-doubling.
That’s all assuming performance continues to improve, though. If by 1000x the model is not obviously a challenger to human supremacy, then things will hopefully slow down to ye olde fashioned 2010s-Moore’s-Law rates of progress and we can rest safe in the arms of something that’s merely HyperGoogle.