Moore’s Law, AI, and the pace of progress

It seems to be a minority view nowadays to believe in Moore’s Law, the routine doubling of transistor density roughly every couple of years, or even the much gentler claim that There’s Plenty [more] Room at the Bottom. There’s even a quip for it: the number of people predicting the death of Moore’s Law doubles every two years. This is not merely a populist view held by the uninformed. Jensen Huang, CEO of the GPU company NVIDIA, has talked about Moore’s Law failing.

“Moore’s Law used to grow at 10x every five years [and] 100x every 10 years,” Huang said during a Q&A panel with a small group of reporters and analysts at CES 2019. “Right now Moore’s Law is growing a few percent every year. Every 10 years maybe only 2s. … So Moore’s Law has finished.”

More academically, the International Roadmap for Devices and Systems, IRDS, warns that logic scaling is nearing certain fundamental limits.

After the “1.5nm” logic node goes into production in 2028, logic dimensions will stop shrinking and improved densities will be achieved by increasing the number of devices vertically. DRAM will continue to shrink CDs [Critical Dimensions] after that, but the minimum lines and spaces will only shrink modestly and should be reachable by improved EUV and EUV double patterning. The large number of masking levels and the many steps for 3D stacking of devices will make yield and cost high priorities.

This claim is not based on tooling limits, but on a projected minimum useful size of transistors.

Lines and spaces are the flagship pattern of lithography. [...] Note that the logic node names are the commonly used names for each node but are not the same as the minimum half pitches of those nodes. Resolution improves to 12 nm half-pitch in 2022. This corresponds to the logic “3 nm” node. The IRDS expects that this resolution will be achieved through EUV double patterning. Then there is a further decrease in line and space resolution of 2 nm per node until 2028, when minimum line and space resolution is expected to reach 8 nm half-pitch. The 8 nm half pitch could be achieved with EUV double patterning, but there is time to develop other methods also, such as high-NA EUV lithography. After that, no further improvement in required resolution is projected, although this is due to projected device requirements, not expected limitations in patterning capability.

Computers are made of stacks of wires in a dense 3D network, and line-and-space pitch is a measure of how closely parallel lines can be packed.

Besides the physical inevitability, improvements to transistor density are also taking an economic toll. Building the fabs that manufacture transistors is becoming very expensive, as much as $20 billion each, and TSMC expects to spend $100 billion over just three years to expand capacity. This cost increases with each cutting-edge node.

This bleak industry view contrasts with the massively increasing demands for scale from AI, which has become a center of attention in large part due to OpenAI’s focus on the question and their successful results with their various GPT-derived models. There, too, the economic factor exacerbates the divide; models around GPT-3’s size are the domain of only a few eager companies, and whereas before there was an opportunity to reap quick advances by scaling single- or few-machine models to datacenter scale, now all compute advances require new hardware of some kind, whether better computer architectures or bigger (pricier) data centers.

The natural implication is that device scaling has already stalled and will soon hit a wall, that scaling out much further is uneconomical, and therefore that AI progress cannot be driven much further through scaling, certainly not soon, and possibly not ever.

I disagree with this view. My argument is structured into a few key points.

  1. The data shows much stronger present-day device scaling trends than I had expected before I looked at it.

  2. Claimed physical limits to device scaling often greatly undersell the amount of scaling that could be available in theory, both in terms of device size and packing density.

  3. Even if scaling down runs out, there are plausible paths to significant economic scaling, or if not, the capital and the motivation exist to scale anyway.

  4. The potential size of AI systems is effectively unbounded by physical limits.

To put this article in context, there are a few key points I do not touch on.

  • What it means for parameter counts to approach human synapse counts.

  • The usefulness of current ML methods as, or on a path to, AGI.

  • Whether scaling neural networks is something you should pay attention to.

1. What the data shows

This section cribs from my Reddit post, The pace of progress: CPUs, GPUs, Surveys, Nanometres, and Graphs, with a greater focus on relevance to AI and with more commentary to that effect.

The overall impressions I expect to be taken from this section are that,

  1. Transistor scaling seems surprisingly robust historically.

  2. Compute performance on AI workloads should increase with transistor scaling.

  3. Related scaling trends are mostly also following transistor density.

  4. DRAM is expensive and no longer scaling.

  5. When trends stop, they seem to do so suddenly, and because of physical constraints.

Transistor density improvements over time

Improvements in semiconductors today are primarily driven by Moore’s Law. This law was first discussed in the 1965 paper, Cramming more components onto integrated circuits. Gordon Moore’s observation was that the integration and miniaturization of semiconductor components was vital to reducing the price per component, and he said,

For simple circuits, the cost per component is nearly inversely proportional to the number of components, the result of the equivalent piece of semiconductor in the equivalent package containing more components. But as components are added, decreased yields more than compensate for the increased complexity, tending to raise the cost per component. Thus there is a minimum cost at any given time in the evolution of the technology. At present, it is reached when 50 components are used per circuit. But the minimum is rising rapidly while the entire cost curve is falling (see graph below).

With a total of four data points, Moore defined his law, observing that the “complexity for minimum component costs has increased at a rate of roughly a factor of two per year,” and that “there is no reason to believe it will not remain nearly constant for at least 10 years.” That’s a brave way to make a prediction!

Today, semiconductors are manufactured at enormous scale, and wafers are divided into a great breadth of configurations. Even among the newest nodes, phones require comparatively small chips (the A14 in the newest iPhone is 88mm²), whereas a top end GPU might be ten times as large (the A100 is 826mm²), and it is possible, if uncommon, to build fully-integrated systems measuring 50 times that (Cerebras’ CS-1 is 46,225mm²). As the choice of die size is a market issue rather than a fundamental technical limit, and the underlying economics are dominated by compute density, this motivates looking at density trends on the leading node as a close proxy for Moore’s Law. Wikipedia provides the raw data.

Click for interactive view (CPU, GPU)

The graph spans 50 years and total density improvements by a factor of over 30,000,000. Including Gordon Moore’s original four data points would add almost another decade to the left. The trend, a doubling of density every 2.5 years, follows the line with shockingly little deviation, despite large changes in the underlying design of integrated devices, various discontinuous scaling challenges (e.g. EUV machines arriving many years late), very long research lead times (I’ve heard ~15 years from R&D to production), and ramping economic costs.

The graph contradicts common wisdom, which claims that Moore’s Law is not only due to fail in the future, but that it has already been slowing down. It is as close to a perfect trend as empirical laws over long time spans can be asked to give.

These points demonstrate the trend’s predictive strength. While the God of Straight Lines does on occasion falter, it should set at least a default expectation. We have seen claims of impending doom before; read this excerpt from the turn of the century.

The End of Moore’s Law?

May 1 2000, MIT Technology Review

The end of Moore’s Law has been predicted so many times that rumors of its demise have become an industry joke. The current alarms, though, may be different. Squeezing more and more devices onto a chip means fabricating features that are smaller and smaller. The industry’s newest chips have “pitches” as small as 180 nanometers (billionths of a meter). To accommodate Moore’s Law, according to the biennial “road map” prepared last year for the Semiconductor Industry Association, the pitches need to shrink to 150 nanometers by 2001 and to 100 nanometers by 2005. Alas, the road map admitted, to get there the industry will have to beat fundamental problems to which there are “no known solutions.” If solutions are not discovered quickly, Paul A. Packan, a respected researcher at Intel, argued last September in the journal Science, Moore’s Law will “be in serious danger.”

This quote is over 20 years old, and even then it was ‘an industry joke’. Transistor density has since improved by a factor of around 300. The problems the article highlighted were real, and they did require new innovations and even impacted performance, but in terms of raw component density the trend remained completely steady.

I want to emphasize here that these laws set a baseline expectation for future progress. A history of false alarms should give you some caution when you hear another alarm without qualitatively better justification. This does not mean Moore’s Law will not end; it will. This does not even mean it won’t end soon, or suddenly; it very well might.

An idealistic view of semiconductor scaling becomes more turbid when looking at the holistic performance of integrated circuits. Because the performance of AI hardware scales very differently from how, say, CPUs scale, and because the recent improvements in AI hardware architectures result in large part from a one-time transition from general-purpose to special-purpose hardware, the details of how precisely any given architecture has scaled historically are not of direct, 1:1 relevance. However, I think there is still value in discussing the different trends.

CPUs execute code serially, one instruction logically after the other. This makes them one of the harder computing devices to scale the performance of, as there is no simple way to convert a greater number of parallel transistors into more serial bandwidth. The ways we have figured out are hard-earned and scale performance sublinearly. Nowadays, we compromise by allocating some of the extra transistors provided by Moore’s Law towards more CPU cores, rather than fully investing in the performance of each individual core. The resulting performance improvements (note the linear y-axis) are therefore erratic and vendor-specific, and scaling the number of cores has been too influenced by market dynamics to capture any coherent exponential trend.

This was not always the case; in the 80s and 90s, as transistors shrank, they got faster according to Dennard scaling. The physics is not too relevant, but the trends are.

Transistors got exponentially faster until the moment they didn’t.

If there is any key thing to learn from the failure of Dennard scaling, it is that exponential trends based on physical scaling can end abruptly. As a result, transistors now only get marginally faster with each process node.

GPUs are massively parallel devices, executing many threads with similar workloads. You would expect these devices to scale fairly well with transistor count. I do not have a chart of FLOPS, which would show the underlying scaling, but I do have some performance graphs measured on video games. Performance has scaled at a clean exponential pace for both NVIDIA and AMD GPUs since the start of my graphs. The same is true, in a rougher sense, for performance per inflation-adjusted dollar.

Gaming performance might not be a great analogy for AI workloads, because AI is more regular, whereas games are complicated programs with a myriad of places for bottlenecks to occur, including memory bandwidth. However, this only means we would expect Moore’s Law to drive AI performance at least as reliably as it does GPUs. An RTX 3090 has ~9.4x the transistors and ~5.4x the performance on games of a GTX 590 from 2011. This implies the growth in gaming performance is roughly capturing ¾ of the growth in transistor counts on a log plot. I want to emphasize not to rely too much on the specifics of that number, because of the mentioned but unaddressed complexities.
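
The ¾ figure is just the ratio of logarithms of the two growth factors quoted above; a minimal check, using only those two ratios:

```python
import math

# Ratios quoted above: RTX 3090 vs. GTX 590 (2011).
transistor_ratio = 9.4   # ~9.4x the transistors
performance_ratio = 5.4  # ~5.4x the gaming performance

# On a log plot, the fraction of transistor growth captured by performance
# growth is the ratio of the logarithms.
print(math.log(performance_ratio) / math.log(transistor_ratio))  # ~0.75, i.e. roughly 3/4
```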

AI Impacts has an analysis, 2019 recent trends in GPU price per FLOPS. Unfortunately, while $/FLOPS is a coherent metric for similar architectures over long timespans, it tends to be dominated by circumstantial factors over short timespans. For example, TechPowerUp claims a GTX 285 has 183% the performance of an HD 4770, yet only 74% of its theoretical FP32 FLOPS throughput. The GTX commanded a much higher launch price, $359 vs. $109, so when prices are divided through, this disparity between FLOPS and performance is exaggerated. As a more recent example, NVIDIA’s 3000 series doubled FP32 throughput in a way that only gave a marginal performance increase.

In the Turing generation, each of the four SM processing blocks (also called partitions) had two primary datapaths, but only one of the two could process FP32 operations. The other datapath was limited to integer operations. GA10X includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations.

An RTX 3080 has about 165% the performance in games of an RTX 2080, but 296% the FP32 FLOPS. In the long run these factor-2 performance differences wash out, but in the short run they account for a good fraction of your measurement.
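
To make the distortion concrete, here is a rough sketch using only the GTX 285 and HD 4770 figures quoted above; under the two value-per-dollar metrics the comparison differs by more than a factor of two:

```python
# Figures quoted above: GTX 285 relative to the HD 4770.
perf_ratio  = 1.83   # 183% of the measured game performance
flops_ratio = 0.74   # 74% of the theoretical FP32 FLOPS
price_gtx285, price_hd4770 = 359, 109  # launch prices, USD

# How the GTX 285 compares per dollar under each metric (HD 4770 = 1.0).
perf_per_dollar  = (perf_ratio  / price_gtx285) * price_hd4770
flops_per_dollar = (flops_ratio / price_gtx285) * price_hd4770
print(f"performance per dollar: {perf_per_dollar:.2f}x the HD 4770")   # ~0.56x
print(f"FLOPS per dollar:       {flops_per_dollar:.2f}x the HD 4770")  # ~0.22x
# $/FLOPS makes the GTX 285 look ~2.5x worse than a game benchmark would suggest.
```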

I did try to analyze FLOPS per transistor, a measure of efficiency, using their data, and while I don’t have good quality visual data to share, it did seem to me like the trend was neutral when looking at high end cards, which suggests that GPUs do not generally need more transistors per floating point operation per second. The trend seemed positive for low end cards, but those cards often have large numbers of unused transistors, for market segmentation purposes.

Most GPUs are around 500-1000 FLOPS per transistor, which very roughly implies it takes one or two million transistors to provide 1 FP32 FLOP per cycle. In the long run this supports the claim that Moore’s Law, to the extent that it continues, will suffice to drive downstream performance.
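
The conversion between those two phrasings is just a division by the clock rate; a minimal sketch, assuming a typical GPU clock of roughly 1.5 GHz (my assumption, not a figure from the data):

```python
# Rough conversion from FLOPS per transistor to transistors per FLOP per cycle.
flops_per_transistor = 750   # midpoint of the ~500-1000 range above
gpu_clock_hz = 1.5e9         # assumed typical GPU clock, ~1.5 GHz

transistors_per_flop_per_cycle = gpu_clock_hz / flops_per_transistor
print(f"~{transistors_per_flop_per_cycle:.0e} transistors per FLOP/cycle")  # ~2e6
```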

Memory is theoretically a separate scaling regime. It is simultaneously one of the more fragile aspects of Moore’s Law in recent years, and also one of the largest opportunities for discontinuous technology jumps.

“Memory” typically refers to DRAM, a type of memory that stores data in capacitors gated by transistors, but many prospective technologies can fill its role, and historically several others have. DRAM is built in a similar way to other circuits, but it is built on specialized and cost-optimized nodes that support some of the unique requirements of DRAM.

AI Impacts compares two sources for DRAM spot prices.

DRAM follows a clear exponential trend until around 2010, when prices and capacities stagnate. As with Dennard scaling, I don’t expect this issue to resolve itself. The physical limit in this case is the use of capacitors to hold data. A capacitor is made of two close but separated surfaces holding charge. The capacitance is linearly proportional to the area of these surfaces, and capacitance must be preserved in order to reliably retain data. This has forced extruding the capacitor into the third dimension, with very high aspect ratios projected to reach around 100:1 relatively soon.

Future of DRAM as main memory & Semiconductor Memory Technology Scaling Challenges

Any scaling regime that requires exponential increases along a physical dimension is quite counterproductive for long-term miniaturization trends.
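
As a rough illustration of why preserving capacitance forces such extreme geometries, here is a parallel-plate style estimate for a cylindrical cell capacitor; the capacitance target, dielectric constant, film thickness, and footprint are all my own ballpark assumptions, not roadmap figures:

```python
import math

# How tall must a cylindrical DRAM cell capacitor be to keep a fixed
# capacitance as its footprint shrinks? All inputs are ballpark assumptions.
eps0 = 8.85e-12        # vacuum permittivity, F/m
k_dielectric = 30      # assumed high-k dielectric constant
t_dielectric = 5e-9    # assumed dielectric thickness, m
c_target = 10e-15      # assumed required cell capacitance, ~10 fF
diameter = 20e-9       # assumed capacitor footprint diameter, m

# C = eps0 * k * A / t, with the area dominated by the cylinder sidewall, pi * d * h.
area_needed = c_target * t_dielectric / (eps0 * k_dielectric)
height = area_needed / (math.pi * diameter)
print(f"height ~{height * 1e9:.0f} nm, aspect ratio ~{height / diameter:.0f}:1")  # ~150:1
```

Under these assumptions the required aspect ratio lands in the same ballpark as the ~100:1 projection mentioned above.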

Surprisingly to me, the DRAM included with GPUs has still increased by a factor of about 10 over the last 10 years, about the same rate as transistor density has improved. At 2000-unit retail prices for GDDR6, the 16GB of DRAM in an RX 6800 XT would total ~$210. The RX 6800 XT has an MSRP of $649, so even though AMD are likely to get their DRAM at a significant discount, DRAM is already a meaningful fraction of total unit costs.

These facts together suggest that DRAM growth is more likely to be a short-term impediment to continued scaling than compute transistor density is.

The counterpoint is that there exist a significant number of technologies that can partially or completely replace DRAM and that have better scaling laws. There are NRAM and IGZO 2T0C DRAM, and various slower memories like 3D XPoint and Sony’s ReRAM. There are also pathways to stack DRAM, which might allow for density scaling without relying on further miniaturization, an approach that worked well for NAND flash. This is by no means exhaustive; you can for instance imagine a great variety of memories made of tiny physical switches, which are termed NEMS.

Interconnect speed is an especially important aspect to consider when building computer systems that consist of a large number of integrated computing devices. This means GPUs or AI accelerators made of multiple chips, individual servers that contain multiple such GPUs or accelerators, and datacenters that contain a great many communicating servers.

I don’t know of any good long-term holistic analysis of these trends, nor a good pre-aggregated source of data to easily do one myself. However, I am aware of a number of individual small trend lines that all suggest sustained exponential growth. PCIe is one of them.

PCIe connects major components on a motherboard.

NVIDIA’s server GPUs, the P100, V100, then A100, also support NVIDIA’s NVLink versions 1 through 3, with bandwidth roughly doubling each generation. NVLink is primarily focused on connecting local GPUs together within a server node.

For bandwidth between nodes across a supercomputer, you can look for instance at InfiniBand’s roadmap. Again we see an exponential trend, that roughly keeps pace with transistor scaling.

InfiniBand connects multiple nodes within a supercomputer.

There has also been a recent trend towards ‘chiplet’ architectures, whereby multiple dies are connected together with short, dense, and efficient connections. This includes both 2D stacking, where the chips are placed side-by-side with short and dense local traces connecting them, and 3D stacking, where the chips are placed on top of each other. 3D stacking allows for extremely high bandwidth connections, because the connections are so short and so numerous, but it currently needs to be done carefully to avoid heat concentration. This is an emerging technology, so again rather than showing any single trendline in capability scaling, I will list a few relevant data points.

Intel’s upcoming Ponte Vecchio supercomputer GPU connects 41 dies, some compute and some memory, using ‘embedded bridges’, which are small silicon connections between dies.

AMD’s already-sampling MI200 server GPU also integrates two compute dies plus some memory dies in a similar fashion. Their Milan-X server CPUs will stack memory on top of the CPU dies to expand their local cache memory, and those dies are then connected to other CPU dies over an older, lower-performance interconnect.

Cerebras have a ‘wafer-scale engine’, which is a circuit printed on a wafer that is then used as a single huge computing device, rather than cut into individual devices.

Tesla have announced the Dojo AI supercomputer, which puts 25 dies onto a wafer in a 5x5 grid, and then connects those wafers to other wafers in another higher-level grid. Each die is connected directly only to its four nearest neighbors, and each wafer only to its four nearest neighbors.

2. There’s Plenty [more] Room at the Bottom

Richard Feynman gave a lecture in 1959, There’s Plenty of Room at the Bottom. It is a very good lecture, and I suggest you read it. It is the kind of dense but straightforward foresight I think rationalists should aspire to. He asks, what sort of things does physics allow us to do, and what should the techniques that get us there look like?

Feynman mentions DNA as an example of a highly compact dynamic storage mechanism that uses only a small number of atoms per bit.

This fact – that enormous amounts of information can be carried in an exceedingly small space – is, of course, well known to the biologists, and resolves the mystery which existed before we understood all this clearly, of how it could be that, in the tiniest cell, all of the information for the organization of a complex creature such as ourselves can be stored. All this information – whether we have brown eyes, or whether we think at all, or that in the embryo the jawbone should first develop with a little hole in the side so that later a nerve can grow through it – all this information is contained in a very tiny fraction of the cell in the form of long-chain DNA molecules in which approximately 50 atoms are used for one bit of information about the cell.

To ask for computers to reach 50 atoms per transistor, or per bit of storage, is a big ask. It’s possible, as DNA synthesis for storage is a demonstrated technology, and perhaps even useful, but for compute-constrained AI applications we are interested in high-throughput, dynamic memories, presumably electronic in nature. Even if it might be possible to build useful and applicable systems with DNA or other molecular devices of that nature, we do not need to assume so for this argument.

The overall impressions I expect to be taken from this section are that,

  1. IRDS roadmaps already predict enough scaling for significant short-term growth.

  2. 3D stacking can unlock orders of magnitude of further effective scaling.

  3. Memory has a large potential for growth.

  4. Integrated systems for training can get very large.

Section note: Many of the numbers in this section are Fermi estimates, even when given to higher precision. Do not take them as precise.

How small could we go?

The IRDS roadmap mentioned at the start of this post suggests Moore’s Law device scaling should continue until around 2028, after which it predicts 3D integration will take over. That suggests a planar density of around 10⁹ transistors/mm². This planar density is already much greater than today’s: NVIDIA’s most recent Ampere generation of GPUs has a density around 5×10⁷, varying a little depending on whether they use TSMC 7nm or Samsung 8nm. This means that a dumb extrapolation still predicts about a factor of 20 improvement in transistor density for GPUs.

Continuing to ignore scale-out, the industry is looking towards 3D integration of transistors. Let’s assume a stacked die has a minimal thickness of 40µm per layer. A 30×30×4 mm die built with 100 stacked logic layers would therefore support 100 trillion transistors. This is about 50 times greater than for a Cerebras CS-2, a wafer-scale AI accelerator. Having 100 logic layers could seem like a stretch, but Intel is already selling 144 layer NAND flash, so skyscraper-tall logic is far from provably intractable. AI workloads are extremely regular, and many require a lot of space dedicated to local memory, so variants of existing vertical scaling techniques might well be economical if tweaked appropriately.
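
The arithmetic behind the factor of 20 and the 100 trillion figure is short enough to lay out explicitly; a minimal sketch using the numbers above:

```python
# Planar density extrapolation and the stacked-die Fermi estimate from above.
irds_2028_density = 1e9   # transistors per mm^2, projected planar density
ampere_density    = 5e7   # transistors per mm^2, current NVIDIA Ampere GPUs
print(f"planar headroom: ~{irds_2028_density / ampere_density:.0f}x")  # ~20x

die_area_mm2 = 30 * 30    # 30x30 mm die
logic_layers = 100        # stacked logic layers at ~40 um each, ~4 mm tall
total_transistors = irds_2028_density * die_area_mm2 * logic_layers
print(f"stacked transistors: ~{total_transistors:.0e}")  # ~9e13, about 100 trillion
```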

This answer, while promising much room for future device scaling, is still not physically optimistic. A device of that size contains 2×10²³ silicon atoms, so it has a transistor density of around one transistor per 2×10⁹ atoms. Using transistors for dynamic storage (SRAM) would increase that inefficiency by another factor of ~5, since individual transistors are transient, so this hypothetical device is still about a factor of 10⁸ less atomically efficient than DNA for storage.

At a density of 10⁹ transistors/mm², if perfectly square, our assumed 2028 transistor occupies a footprint about 120×120 atoms across. If you could implement a transistor in a box of that dimension on all sides, with only a factor of ~10 in overheads for wiring and power on average, then each transistor would require only 2×10⁷ atoms, a factor of 100 improvement over the previous number. It is unclear what specific technologies would be used to realize a device like this, if it is practically reachable, but biology proves at least that small physical and chemical switches are possible, and we have only assumed exceeding our 2028 transistor along one dimension.
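
The atom counting in the last two paragraphs is reproduced below; silicon’s atomic density (~5×10²² atoms/cm³) and a ~0.27 nm interatomic spacing are standard values, and the rest comes from the estimates above:

```python
# Atom counting for the stacked device above, then for a hypothetical "boxed" transistor.
si_atoms_per_cm3 = 5e22                 # atomic density of silicon
device_volume_cm3 = 3.0 * 3.0 * 0.4     # the 30x30x4 mm die, in cm
device_atoms = si_atoms_per_cm3 * device_volume_cm3
transistors = 9e13                      # from the previous estimate
print(f"atoms per transistor: ~{device_atoms / transistors:.0e}")  # ~2e9

# Footprint at 1e9 transistors/mm^2: sqrt(1e3 nm^2) ~ 32 nm on a side.
footprint_nm = (1e12 / 1e9) ** 0.5      # 1 mm^2 = 1e12 nm^2
atom_spacing_nm = 0.27                  # rough silicon interatomic spacing
atoms_across = footprint_nm / atom_spacing_nm
print(f"footprint: ~{atoms_across:.0f} atoms across")  # ~120

# A cube of that size, with ~10x overhead for wiring and power:
print(f"boxed transistor: ~{atoms_across ** 3 * 10:.0e} atoms")  # ~2e7
```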

Although this device is stacked only modestly relative to the brain, power density does at some point become an issue beyond the capabilities of current methods. Heat density is easily handled with integrated cooling channels, provided enough cool liquid, which is a demonstrated technology. Total rack power output might have some fundamental limits somewhere eventually, but the ocean makes a good heatsink. So I don’t believe that cooling represents a physical barrier.

How much can we improve on DRAM?

As discussed earlier in this writeup, DRAM scaling has hit a bottleneck. Not every AI accelerator uses DRAM as its primary storage; some rely on faster, more local SRAM, which is made directly from transistors arranged in an active two-state circuit.

As of today, and for a long time prior, DRAM has been an optimal balance of speed and density for large but dynamically accessed memory. DRAM is fast because it is made of transistor-gated electric charges, and it is more space efficient than SRAM by virtue of its simplicity.

What Every Programmer Should Know About Memory, really

The complexity of an SRAM cell is a consequence of transistors being volatile, in that they don’t retain state if their inputs subside. You therefore need to build a circuit that feeds the state of the SRAM memory back into its own inputs, while also allowing that state to be overridden. What is important to note is that this is a statement about CMOS transistors, not a statement about all switches in general. Any device that can hold two or more states that can be read and changed electrically holds promise as a memory store.
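
To illustrate just the feedback idea, not a real 6T SRAM circuit, here is a toy model of two cross-coupled inverters: the loop holds a bit for as long as it keeps being driven, and a write simply overrides it.

```python
# Toy model of the feedback loop behind an SRAM cell: two cross-coupled
# inverters hold a bit indefinitely while powered, and a write overrides them.
# This illustrates the principle only; it is not a real 6T SRAM circuit.
class CrossCoupledLatch:
    def __init__(self, bit: int = 0):
        self.q = bit           # output of inverter A
        self.q_bar = 1 - bit   # output of inverter B

    def settle(self):
        # Each inverter's input is the other's output, so the state reinforces itself.
        self.q, self.q_bar = 1 - self.q_bar, 1 - self.q

    def write(self, bit: int):
        # Driving both nodes from outside overrides the feedback loop.
        self.q, self.q_bar = bit, 1 - bit

    def read(self) -> int:
        return self.q

cell = CrossCoupledLatch(bit=1)
for _ in range(1000):          # the state persists through arbitrarily many cycles
    cell.settle()
print(cell.read())             # 1
cell.write(0)
print(cell.read())             # 0
```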

Memory has more relaxed requirements than logic transistors with regards to speed and switching energy, because typically only a little of the memory is accessed at a time. This is especially true for large-scale network training, as each neural network weight can be reused in multiple calculations without multiple reads from bulk memory.

The problem with predictions of the future in a space like this is not that there are no clear right answers, as much as that there are so many prospective candidates with slightly different trade-offs, and correctly evaluating each one requires an immense understanding of its complicated relationship to the most complicated manufacturing processes on the planet. Therefore I will illustrate my point by choosing an example prospective technology that I think is neat, not by claiming that this particular technology will pan out, or that it is the best example I could have used. The space of technologies is so vast, the need is so great, and the history of memory technologies so demonstrably flexible, that it is all but inevitable that some technology will replace DRAM. The relevant questions for us concern the limiting factors for memory technologies of this sort in general.

NRAM is my simple-to-illustrate example. An NRAM cell contains a slurry of carbon nanotubes. Those carbon nanotubes can be electrically forced together, closing the switch, or apart, opening it.

Will Carbon Nanotube Memory Replace DRAM? (Yes, that’s quite aggressive PR.)

Nantero claim they expect to reach a density of 640 megabits/​mm² per layer on a 7nm process, with the ability to scale past the 5nm process. They also claim to support cost-effective 3D scaling, illustrating up to 8 process layers and 16 die stacks (for 128 total layers). This compares to 315 megabits/​mm² for Micron’s upcoming 1α DRAM, or to ~1000 megatransistors/​mm² for our projected 2028 logic node.

NRAM is a bulk process, in that many carbon nanotubes are placed down stochastically. This makes placement easy, but means we are still far from talking about physical limits. This is fine, though. The 128-layer device mentioned above would already have a bit density of 10 GB/mm². If you were to stack one die of 8 layers on top of a Cerebras CS-2, it would provide around 30 terabytes of memory. This compares favourably to the CS-2’s 40 gigabytes of SRAM.
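
A quick check of those densities against Nantero’s quoted per-layer figure:

```python
# Densities implied by the Nantero figures quoted above (claims, not measurements).
bits_per_mm2_per_layer = 640e6          # 640 megabits/mm^2 per layer
full_stack_layers = 8 * 16              # 8 process layers x 16 die stacks
gb_per_mm2 = bits_per_mm2_per_layer * full_stack_layers / 8 / 1e9
print(f"128-layer stack: ~{gb_per_mm2:.0f} GB/mm^2")          # ~10 GB/mm^2

cs2_area_mm2 = 46_225                   # Cerebras wafer-scale die area
one_die_layers = 8
tb_on_cs2 = bits_per_mm2_per_layer * one_die_layers / 8 * cs2_area_mm2 / 1e12
print(f"one 8-layer die over a CS-2: ~{tb_on_cs2:.0f} TB")    # ~30 TB
```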

Again, this is not to say that this particular technology or device will happen. Most prospective technologies fail, even the ones I think are cool. I am saying that physics allows you to do things like this, and the industry is trying very many paths that point this way.

How big could we go?

When I initially envisioned writing this section, I had to justify the feasibility of large nearest-neighbor grids of compute, extrapolating from other trends and referencing interconnect speeds. Tesla made things easy for me by announcing a supercomputer that did just that.

Tesla, like several other AI accelerator designers, starts with a small compute unit that they replicate in a grid across the die, forming what they call their D1 chip. D1 is 645 mm² and contains 354 such units. They claim it delivers 362 TFLOPS BF16/CFP8, which compares reasonably against the 312 TFLOPS BF16 from NVIDIA’s A100’s neural accelerator. (The A100 is a bigger 826 mm² die, but most of that space is dedicated to other GPU functionality.)

The D1 die is surrounded by densely packed, single-purpose IO, with a bandwidth of 4 TB/s in each cardinal direction, or 16 TB/s overall. This is a lot, considering an A100 has only 0.6 TB/s total bandwidth over NVLink, and 1.6 TB/s bandwidth to memory. For this bandwidth to be achieved, these chips are placed on a wafer backplane, called Integrated Fan Out System on Wafer, or InFO_SoW. They place a 5x5 grid, so 16,125 mm² of wafer in total, about a third the area of Cerebras’ monolithic wafer-scale accelerator, and they call this a ‘tile’.

Whichever approach up to that point is superior, Tesla’s tile or Cerebras’ waffle, the key scale difference happens when you connect many of these together. Tesla’s tiles have 9 TB/s of off-tile bandwidth in each cardinal direction, or 36 TB/s total bandwidth. This allows connecting an almost arbitrary quantity of them together, each communicating with its nearest neighbors. They connect 120 of these tiles together.

The topology of those 120 tiles is unclear, but for matters of theory we can assume what we want. If the arrangement is a uniform 12x10 grid, then a bisection along the thinnest axis would have a total bandwidth of 90 TB/​s. That is quite fast!

Although bandwidth is high, you might start to be concerned about latency. However, consider pipeline parallelism, splitting different layers of the graph across the nodes. GPT-3 has 96 attention layers, so at that scale each layer corresponds to ~1 tile. Information only needs to rapidly pass from one tile to its neighbor. Latency is unlikely to be a concern at that scale.

Now consider a huge computer with, say, 100 times the number of tiles, each tile being significantly larger according to some growth estimates, running a model 1000 times as large as GPT-3. This model might have only 10 times the number of layers, so you might need ten tiles to compute a single layer. Still, a model partition does not seem bound by fundamental latency limits; 10 tiles is still spatially small, perhaps a 3x3 grid, or perhaps even a 3D arrangement like 2x2x3.
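
A minimal sketch of the tile arithmetic in the last few paragraphs, using the same assumed 12x10 arrangement:

```python
# Tile-level arithmetic for the assumed 12x10 arrangement above.
tiles = 120
per_direction_tb_s = 9                    # off-tile bandwidth per cardinal direction

# Bisection across the 10-tile-wide axis of a 12x10 grid.
print(f"bisection bandwidth: {10 * per_direction_tb_s} TB/s")   # 90 TB/s

# Pipeline parallelism at GPT-3 scale: ~96 layers spread over 120 tiles.
print(f"tiles per layer (GPT-3 scale): {tiles / 96:.2f}")        # ~1.25

# The hypothetical 100x machine running a 1000x model with ~10x the layers.
big_tiles, big_layers = tiles * 100, 96 * 10
print(f"tiles per layer (scaled up):   {big_tiles / big_layers:.1f}")  # 12.5, on the order of the ten tiles mentioned above
```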

If these tiles have excess memory, as the NRAM example in the previous subsection showed is physically realizable, you can make the problem even simpler by replicating weights across the local tiles.

Ultimately, the sort of AI training we do now is very conducive to this sort of locality. Cerebras already has to grapple with compiling to this architecture, just on their one wafer-scale chip.

Cerebras used to have a page on what they’re doing here, but I think they took it down.

Even if more point-to-point data movement is needed, that is far from infeasible. Optical interconnects can carry extremely high bandwidths over long distances, with latency limited to the speed of light in fibre plus endpoint overheads. Ayar Labs offers TerraPHY, a chiplet (a small add-on chip) that supports 2 Tb/s per chiplet and a maximum length of 2km. Even the longest version would purportedly have a latency of just 10 µs, dominated by the speed of light. If every layer in a 1000 layer network had a 10 µs communication latency added to it that wasn’t pipelined or hidden by any other work, the total latency added to the network would be 10 ms. Again, physics doesn’t seem to be the limiting factor.
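
A quick check of those latency figures, taking the speed of light in fibre as roughly 2×10⁸ m/s:

```python
# Rough check of the fibre latency figures above.
c_fibre_m_s = 2e8         # speed of light in optical fibre, ~2/3 of c in vacuum
link_length_m = 2000      # the 2 km maximum length quoted above

latency_s = link_length_m / c_fibre_m_s
print(f"per-link latency: ~{latency_s * 1e6:.0f} us")           # ~10 us

layers = 1000
print(f"unhidden total:   ~{layers * latency_s * 1e3:.0f} ms")  # ~10 ms
```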

3. How much die could a rich man buy?

One of the many insights Feynman got right in There’s Plenty of Room at the Bottom is that shrinking the size of things would make them proportionally more mass manufacturable, and similarly, proportionally cheaper. However, in much of this essay I have talked about scaling upwards: more layers, more devices, bigger systems, bigger prices. It is natural to wonder how much of this scaling up can be done economically.

In this section I want to argue for expecting the potential for significant economic scaling beyond Moore’s Law, both in terms of lower prices and in terms of higher spending. I do not put a timeline on these expectations.

The overall impressions I expect to be taken from this section are that,

  1. There exist plausible prospective technologies for making fabrication cheaper.

  2. Funding could scale, and that scale could buy a lot more compute than we are used to.

You can make things pretty cheap, in theory

Semiconductors are the most intrinsically complex things people manufacture, and it’s hard to think of a runner-up. The production of a single chip takes 20+ weeks start to finish, and a lot of that work is atomically precise. Just the lightbulbs used to illuminate wafers for photolithography steps are immensely complex, bus-sized devices that cost upwards of $100m each. They work by shooting out tiny droplets of tin and precisely hitting them with a laser to generate exactly the right frequency of light, then cascading that light through a near atomically exact configuration of optics to maximize uniformity. Actually, the droplets of tin are hit twice: the first pulse creates a plume that more efficiently converts the energy of the second pulse into the requisite light. And actually, some of the mirrors involved have root mean square deviations that are sub-atomic.

Semiconductor manufacturing is hard, and this makes it expensive. It is, honestly, fairly miraculous that economies of scale have made devices as cheap as they are.

On the other hand, atomic manufacturing, even atomically precise manufacturing, is normally practically free. Biology is almost nothing but great quantities of nanomachines making nanoscale structures on such a scale that sometimes they produce giant macroscopic objects. It is not physics that is telling us to make things in expensive ways.

For all that the cutting edge of semiconductor manufacturing is pricey, some of the less exacting stuff is pretty affordable per square millimetre. TV screens can be massive, yet they are covered in detailed circuitry. Typically this discrepancy is down to a simpler method of construction. Often inkjet printing is used, literally a printer that deposits droplets of the desired substance onto the flat backplane, printing out the wanted circuitry.

These methods have limitations. Inkjet printers are not very precise by photolithography standards, and can be rate-limited for complex designs. Semiconductor manufacturing tends to involve several slower steps, like atomic layer deposition, which places down layers one atom thick at a time, and etching steps for more complex 3D constructions. Sometimes layers are ground flat, to facilitate further build-up of material on top. These steps make the difference between the price per square millimetre of a CPU and the price per square millimetre of a TV. If you could use the latter production techniques to build high end CPUs, we’d be doing it already.

Biology does still inspire us to ask what the practically achievable improvements to manufacturing speed and affordability are. There are a couple of innovative techniques I know of that do scale to promising resolutions, and are under research. Both are stamping methods.

Nanoimprint lithography works by stamping an inverse of the wanted pattern into a soft solid, or a curable liquid, to form patterns.

Nanoscale offset printing uses, in effect, an inked stamp of the pattern to transfer, copying it from a master wafer to the target.

Both techniques allow bulk copies of complex designs in much shorter periods of time, with orders of magnitude less capital investment. Nanoimprint lithography is harder to scale to high throughput, but has comparable resolution to the best photolithography tools. Nanoscale offset printing is quick to scale, but likely has some fundamental resolution limits just shy of the best photolithography techniques.

I don’t want to go too much into the promise of these and other techniques, because unlike prospective memory technologies, there isn’t an effective infinity of choices here, and these ideas may very well not pan out. My goal in this section is to raise the legitimate possibility that these economic advances do eventually happen, that they are physically plausible, if not promised, and to get people to ponder what the economic limits to scale would be if, say, semiconductors fell to around the price per unit area of TVs.

You can spend a lot more money, in theory

Governments don’t have the best foresight, but they do like spending money on things. The Space Launch System, NASA’s new space rocket, is projected to cost >$4B per launch in running costs, and between the launch vehicle, the capsule, and the ground equipment, well over $40B has been spent on it to date. The government could bankroll huge AI projects.

Several particularly rich people have more foresight (or just more chutzpah) than the government, while also having a better ability to spend their large quantities of money efficiently. Elon Musk has a huge amount of money, around $300B, an uncommon belief in AI progress, and the willingness to spend many billions on his passion projects. Elon Musk could bankroll huge AI projects.

Investments of this scale are not outside of traditional industry, if revenue sources exist to justify them. TSMC is investing $100 billion over three years in expanding semiconductor manufacturing capacity. NVIDIA’s meteoric stock rise and the AI focus of Softbank’s $100B Vision Fund show that industry is betting on AI to have large returns on investment. I don’t know where I predict things to land in the end, but it does not seem wise to assume investments of this sort cannot flow down into models, should they show sufficiently impressive capabilities.

So, let’s modestly say $10B was invested in training a model. How much would that buy? A cutting edge semiconductor wafer is around $20,000, excluding other component costs. If $2B of the total was spent just buying wafers, that buys you about 100,000 wafers, or about half a month of capacity from a $12B 5nm fab. The other components are pricey enough to plausibly take up the remainder of the $10B total cost.

100,000 wafers translates to 100,000 Cerebras wafer-scale devices. For context, the Aurora supercomputer is estimated to cost $500m, or ¹⁄₂₀th of that cost, and would have ~50,000 GPUs, each a large device with many integrated chiplets, plus stacked memory and CPUs. The numbers seem close enough to justify running with them. Individual Cerebras machines are much more expensive than our estimate of ~$100k each (of which 20% is the wafer cost), but the overheads there are likely due to low volumes.
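
The per-device arithmetic in this example, spelled out:

```python
# Per-device arithmetic for the $10B example above.
total_budget, wafer_budget = 10e9, 2e9
wafer_cost = 20e3                     # rough cost of a cutting-edge wafer
wafers = wafer_budget / wafer_cost
print(f"{wafers:,.0f} wafer-scale devices")                       # 100,000

per_device_budget = total_budget / wafers
print(f"~${per_device_budget:,.0f} per device, "
      f"{wafer_cost / per_device_budget:.0%} of it wafer cost")   # ~$100,000, 20%
```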

Cerebras talks about the feasibility of training 100 trillion parameter models with factor-10 sparsity on a cluster of 1000 nodes in one year. Our modest example buys a supercomputer 100 times larger. There is also no requirement in this hypothetical to assume that we are purchasing today’s technology, today. Scaling to very large supercomputer sizes seems feasible.

4. And I would think 1,000 miles

Prior to this section, I have tried to walk a fine line between bold claims and ultraconservatism. I want to end instead with a shorter note on something that frames my thinking about scaling in general.

So far I have talked about our artificial neural networks, and their scaling properties. These are not the ultimate ground limits. We know, at minimum, that brains implement AGI, and to the best of my knowledge, here are some other things that seem quite likely.

  • The bulk of signalling happens through chemical potential neuron spikes.

  • Neurons can fire at about 250-1000 Hz when active.

  • On average across the brain, neurons fire at 0.2 Hz.

  • Its synapse density in humans is about 10⁸ synapses/mm³.

  • At ~1 byte/synapse and 100T synapses, the brain has ~100TB of storage.

Contrast silicon,

  • The bulk of signalling happens through switching voltages in wires.

  • Isolated transistor speeds are in excess of 200 GHz.

  • Active (“hot”) transistors usefully switch at around 1-5 GHz on average.

  • Density is around 10⁸ transistors/​mm²—that’s areal density, not volumetric.

  • You can buy an 8TB SSD off Amazon for ~$1000.

If we assume two iPhones floating in space were simulating two connected neurons, with direct laser links between them, then for the two to communicate with latency worse than the ~1/200 of a second that neighboring neurons in our brains manage, either,

  • The two phones would need to be over 1000 miles away from each other, about the radius of the moon.

  • The phones would have to be doing a calculation with a sequential length of 10⁷ clock cycles, if running on the CPU cores, which if I recall correctly can together do something like 30 independent operations per cycle.

Thus, for the silicon advantage to start hitting scale-out limits relative to what we know is biologically necessary, we would need to be building computers about the size of the moon.
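
The two bullets follow from a couple of standard constants; a minimal check, assuming a ~2 GHz phone CPU clock:

```python
# The arithmetic behind the two bullets above.
neuron_latency_s = 1 / 200    # ~5 ms, latency between neighboring neurons
c_m_s = 3e8                   # speed of light in vacuum (laser link)

distance_km = neuron_latency_s * c_m_s / 1e3
print(f"~{distance_km:.0f} km, roughly 1000 miles")                          # ~1500 km

phone_clock_hz = 2e9          # assumed phone CPU clock, ~2 GHz
print(f"~{neuron_latency_s * phone_clock_hz:.0e} sequential clock cycles")   # ~1e7
```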

(I also worry a bit about quantum computers. Some of this is perhaps just that I don’t understand them. I think a lot of it is because they expand the space of algorithms drastically beyond anything we’ve seen in nature, those algorithms seem relevant for search, and Neven’s Law means any new capabilities that quantum computers unlock are likely to come suddenly. I think people should pay more attention to quantum computers, especially now that we are at a transition point seeing regular claims of quantum supremacy. Quantum computers can do computational things that no other known process has ever done.)

This, in my mind, is ultimately why I am so hesitant to believe claims of physical limits impeding progress. We are not that many orders of magnitude away from how small we can build components. We are sometimes starting to fight physical limits on transferring information through electromagnetic signals in limited space. In places we are even hitting practical questions of cost and manufacturing. But we are not building computers the size of the moon. Physics is a long, long, long way away from telling us to pack up, that there’s nothing left to do, that AI systems cannot grow bigger before they stop being suitable for building AI. The limits we are left with are limits of practice and limits of insight.

End note: In retrospect, there are two things I should have addressed that I did not. One is energy efficiency, which ended up being discussed in the comments, and is important to understand. Another was photonic computing, particularly using photonics for matrix multiplication, which I am undecided about. Lightelligence and Lightmatter are two example startups in this space.