# Memory bandwidth constraints imply economies of scale in AI inference

Contemporary GPUs often have very imbalanced memory vs arithmetic operation capabilities. For instance, an H100 can do around 3e15 8-bit FLOP/​s, but the speed at which information can move between the cores and the GPU memory is only 3 TB/​s. As 8 bits = 1 byte, there is a mismatch of three orders of magnitude between the arithmetic operation capabilities of the GPU and its memory bandwidth.

This imbalance ends up substantially lowering the utilization rate of ML hardware when batch sizes are small. For instance, suppose we have a model parametrized by 1.6 trillion 8-bit floating point numbers. To just fit the parameters of the model onto the GPUs, we’ll need at least 20 H100s, as each H100 has a VRAM of 80 GB. Suppose we split our model into 20 layers and use 20-way tensor parallelism: this means that we slice the parameters of the model “vertically”, such that the first GPU holds the first 5% of the parameters in every layer, the second GPU holds the second 5%, et cetera.
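A minimal sketch of this setup, under the numbers above: the GPU count follows from dividing parameter bytes by VRAM, and the "vertical" slicing can be pictured as giving each device a contiguous block of columns of every layer's weight matrix. The function names (`min_gpus`, `shard_columns`, `tensor_parallel_matvec`) and the toy matrix are hypothetical, chosen only for illustration; real frameworks handle sharding and communication very differently.

```python
def min_gpus(param_count, bytes_per_param, vram_bytes):
    """Smallest number of GPUs whose combined VRAM holds the parameters."""
    total_bytes = param_count * bytes_per_param
    return -(-total_bytes // vram_bytes)  # integer ceiling division

def shard_columns(weight, n_devices):
    """Split a weight matrix (a list of rows) into n_devices column blocks."""
    n_cols = len(weight[0])
    assert n_cols % n_devices == 0, "toy example assumes an even split"
    width = n_cols // n_devices
    return [[row[d * width:(d + 1) * width] for row in weight]
            for d in range(n_devices)]

def tensor_parallel_matvec(x, shards):
    """Each device computes its slice of x @ W; the slices are concatenated."""
    out = []
    for shard in shards:  # on real hardware each shard lives on its own GPU
        n_cols = len(shard[0])
        out.extend(sum(x[i] * shard[i][j] for i in range(len(x)))
                   for j in range(n_cols))
    return out

# 1.6 trillion 1-byte parameters, 80 GB of VRAM per H100
print(min_gpus(1_600_000_000_000, 1, 80_000_000_000))  # 20

W = [[1, 2, 3, 4], [5, 6, 7, 8]]  # toy 2x4 "layer"
print(tensor_parallel_matvec([1, 1], shard_columns(W, 2)))  # [6, 8, 10, 12]
```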

This sounds good, but now think of what happens when we try to run this model. Roughly speaking, each parameter comes with one addition and one multiplication, so we do around 3.2 trillion arithmetic operations in one forward pass. As each H100 does 3e15 8-bit FLOP/s and we have 20 of them running tensor parallel, we can do this in a mere ~0.05 milliseconds. However, each parameter also has to be read from GPU memory into the cores, and here our total memory bandwidth is only 60 TB/s, meaning that for a model of size 1.6 TB we must spend (1.6 TB)/(60 TB/s) ~= 27 ms on memory reads alone! This bottlenecks inference, and we end up with an abysmal utilization rate of approximately (0.05 ms)/(27 ms) ~= 0.2%. The situation becomes even worse when we also take inter-GPU communication costs into account, as that bandwidth would be around 1 TB/s if the GPUs are using NVLink.
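The arithmetic above can be reproduced with a short back-of-the-envelope script. This is only a sketch of the simplified model in this post (two operations per parameter, every parameter read once per forward pass, no communication or other overhead); the function name `forward_pass_times` is made up for illustration.

```python
def forward_pass_times(param_count, bytes_per_param, n_gpus,
                       flops_per_gpu, bandwidth_per_gpu):
    """Return (compute_seconds, memory_seconds) for one batch-size-1 pass."""
    flop = 2 * param_count  # one multiply and one add per parameter
    compute_s = flop / (n_gpus * flops_per_gpu)
    memory_s = param_count * bytes_per_param / (n_gpus * bandwidth_per_gpu)
    return compute_s, memory_s

# 1.6e12 one-byte parameters on 20 H100s (3e15 FLOP/s and 3e12 B/s each)
compute_s, memory_s = forward_pass_times(1.6e12, 1, 20, 3e15, 3e12)

# The pass takes as long as the slower of the two, so utilization is the
# fraction of that time the arithmetic units are actually busy.
utilization = compute_s / max(compute_s, memory_s)

print(f"compute time: {compute_s * 1e3:.3f} ms")
print(f"memory time:  {memory_s * 1e3:.1f} ms")
print(f"utilization:  {utilization:.2%}")
```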

Well, this is not very good. Most of our arithmetic operation capability is being wasted because the ALUs spend most of their time idling and waiting for the parameters to be moved to the GPU cores. Can we somehow improve this?

A crucial observation is that if getting the parameters to the GPU cores is the bottleneck, we want to somehow amortize this over many calls to the model. For instance, imagine we could move a batch of parameters to the cores and use them a thousand times before moving on to the next batch. This would do much to remedy the imbalance between memory read and compute times.

If our model is an LLM, then unfortunately we cannot do this for a single user because text is generated serially: even though each token needs its own LLM call and so the user needs to make many calls to the model to generate text, we can’t parallelize these calls because each future token call needs to know all the past tokens. This inherently serial nature of text generation makes it infeasible to improve the memory read and compute time balance if only a single user is being serviced by the model.

However, things are different if we get to batch requests from multiple users together. For instance, suppose that our model is being asked to generate tokens by thousands of users at any given time. Then, we can parallelize these calls: every time we load some parameters onto the GPU cores, we perform the operations associated with those parameters for all user calls at once. This way, we amortize the reading cost of the parameters over many users, greatly improving our situation. Eventually this hits diminishing returns because we must also read the hidden state of each user’s calls into GPU memory, but the hidden states are usually significantly smaller than the whole model, so parallelization still results in huge gains before we enter this regime.
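To make the amortization and its diminishing returns concrete, here is a rough sketch of how much memory traffic is attributable to each request as the batch grows. The 1 GB hidden-state size is purely an assumption picked for illustration (the post gives no figure), and `bytes_read_per_request` is a hypothetical helper.

```python
def bytes_read_per_request(batch_size, param_bytes, state_bytes_per_request):
    """Memory traffic attributable to one request in a batched forward pass.

    The parameters are read once and shared by the whole batch, while each
    request's hidden state has to be read separately.
    """
    return param_bytes / batch_size + state_bytes_per_request

PARAM_BYTES = 1.6e12  # the 1.6-trillion-parameter, 8-bit model from above
STATE_BYTES = 1e9     # ASSUMPTION: 1 GB of hidden state per request

for batch in (1, 10, 100, 1_000, 10_000):
    per_request = bytes_read_per_request(batch, PARAM_BYTES, STATE_BYTES)
    print(f"batch {batch:>6}: {per_request / 1e9:8.2f} GB read per request")
```

Under these assumed numbers the per-request traffic falls almost in proportion to the batch size at first, then flattens out near the hidden-state size once the shared parameter reads stop dominating.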

For instance, if we could batch requests from 100 users together in our above setup, the compute time per forward pass would grow 100-fold while the memory read time would stay roughly the same, so we might be able to achieve a utilization rate of 20%. In a realistic setup this would be much lower due to many sources of overhead the simplistic calculation is ignoring, but morally the calculation still gives the right result.
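The same simplified model can be extended to arbitrary batch sizes. The sketch below ignores hidden-state reads, communication, and every other overhead, so it should be read as an upper bound under the post's assumptions rather than a prediction; `utilization` is a hypothetical helper with the post's numbers as defaults.

```python
def utilization(batch_size, param_count=1.6e12, bytes_per_param=1,
                n_gpus=20, flops_per_gpu=3e15, bandwidth_per_gpu=3e12):
    """Fraction of time the arithmetic units are busy in one batched pass."""
    # Compute grows with the batch; parameter reads are shared by the batch.
    compute_s = 2 * param_count * batch_size / (n_gpus * flops_per_gpu)
    memory_s = param_count * bytes_per_param / (n_gpus * bandwidth_per_gpu)
    return compute_s / max(compute_s, memory_s)

for batch in (1, 100, 500, 1000):
    print(f"batch {batch:>4}: utilization {utilization(batch):.1%}")
```

With two operations per one-byte parameter and a 1000:1 FLOP-to-byte ratio, this toy model stops being memory-bound at a batch size of about 500; real systems would need more because of the overheads left out here.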

The result is massive economies of scale not just in training AI models, but also in running them. If an individual user wanted to run a large model at a reasonable speed, they might have to pay a thousand times what they would pay to a centralized API provider which relies on large GPU clusters to batch requests from many different users.

Some simple math on this: if you need 1000 concurrent users for reasonable utilization rates because of the 1000:1 imbalance between ALU ops and memory bandwidth in GPUs, and each user on average spends 10 minutes per day using your service, then each user is online for only 10 of the 1440 minutes in a day, so you need a total user base of at least (1000 users) × (1440 minutes/day)/(10 minutes/day) = 144K users. If you also want the service to be consistent, i.e. low latency and high throughput 24 hours a day, you probably need to exceed this by some substantial margin, perhaps even approach 1M total users. This is of course much smaller than the scale of a search engine such as Google, but still probably outside the realm where individual hobbyists or enthusiasts can hope to compete with the cost-effectiveness of centralized providers.
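As a quick check of this estimate, the sketch below assumes usage is spread perfectly evenly over the day, which is the most optimistic case; the function name is illustrative only.

```python
MINUTES_PER_DAY = 24 * 60

def required_user_base(concurrent_users, minutes_per_user_per_day):
    """Total users needed to keep `concurrent_users` online at all times,
    assuming usage is spread uniformly over the day."""
    fraction_of_day_online = minutes_per_user_per_day / MINUTES_PER_DAY
    return concurrent_users / fraction_of_day_online

print(round(required_user_base(1000, 10)))  # 144000
```

Any peak-versus-trough variation in demand pushes the required user base above this figure, which is where the margin toward 1M users comes from.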

The contrast with the human brain is instructive. An H100 GPU draws 700 W of power to do 3e15 8-bit FLOP/s, which we think is similar to the computational power of the brain, though with ~30x the power draw. However, an H100 GPU has a mere 80 GB of VRAM, compared to the human brain’s storage of the “parameter values” of around ~100 trillion synapses, which would probably take up ~100 TB of memory. On top of this, the human brain can run a (trivially) human-equivalent intelligence at reasonable latency and throughput at a batch size of one: no parallelization across brains is needed. This suggests the human brain does not suffer from the same memory bandwidth versus arithmetic operation imbalance problem that modern GPUs have.
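The two ratios in this comparison are easy to recompute. The brain's power draw is not stated in the post itself; the 20 W used below is the commonly quoted figure and is an assumption here, as are the constant names.

```python
H100_POWER_W = 700
H100_VRAM_BYTES = 80e9
BRAIN_POWER_W = 20          # ASSUMPTION: commonly quoted figure of about 20 W
BRAIN_PARAM_BYTES = 100e12  # the ~100 TB synapse estimate above

power_ratio = H100_POWER_W / BRAIN_POWER_W
h100s_to_hold_brain = BRAIN_PARAM_BYTES / H100_VRAM_BYTES

print(power_ratio)          # 35.0
print(h100s_to_hold_brain)  # 1250.0
```

So under these assumed figures, one H100 roughly matches the brain on arithmetic at tens of times the power, while over a thousand of them would be needed just to hold the synaptic "parameters".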

Whether this imbalance can possibly be cheaply engineered away or not might determine the extent to which the market for AI deployment (which may or may not become vertically disintegrated from AI R&D and training) is dominated by a few small actors, and seems like an important question about hardware R&D. I don’t have the expertise to judge to what extent engineering away these memory bottlenecks is feasible and would be interested to hear from people who do have expertise in this domain.

• I point this—the VN bottleneck—out now and then.

It’s really just a simple consequence of scaling geometry. Compute scales with device surface area (for 2D chips) or volume (for 3D systems like the brain), while bandwidth/interconnect scales with dimension minus one.
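Written out, with $L$ standing for the linear size of the device and $d$ for its dimension (both symbols are introduced here only for illustration, and constant factors are ignored), the scaling claim is:

```latex
\[
  \text{compute} \propto L^{d},
  \qquad
  \text{bandwidth} \propto L^{d-1},
  \qquad
  \frac{\text{compute}}{\text{bandwidth}} \propto L .
\]
```

On this simple picture the ratio of compute to bandwidth grows linearly with device size, whether $d = 2$ or $d = 3$.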

A few years back, VCs were fooled by a number of well-meaning startups based on the pitch “We can just make a big matmul chip like a GPU but with far more on-chip SRAM and thereby avoid the VN bottleneck!” But Nvidia is in fact pretty smart, and understands exactly why this approach doesn’t actually work (at least not yet with SRAM), and much money was wasted.

I used to be pretty excited about neuromorphic computing around 2010 or so. I still am, but today it still seems to be about a decade away.

• “Whether this imbalance can possibly be cheaply engineered away or not might determine the extent to which the market for AI deployment (which may or may not become vertically disintegrated from AI R&D and training) is dominated by a few small actors, and seems like an important question about hardware R&D. I don’t have the expertise to judge to what extent engineering away these memory bottlenecks is feasible and would be interested to hear from people who do have expertise in this domain.”

You may know this, but “in-memory computing” is the major search term here. (Or compute-in-memory, or compute-near-memory in the near term, or neuromorphic computing as an umbrella over that and other ideas.) Progress is being made, though not cheaply, and my read is that we won’t have a consensus technology for another decade or so. Whatever that ends up being, scaling it up could easily take another decade.

• On a different topic, but responding to the same quote: advancements in model quantization that significantly reduce memory consumption for inference without reducing model performance might also mitigate the imbalance between ALU ops and memory bandwidth. This might only shift the problem a few orders of magnitude away, but still, I think it’s worth mentioning.
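A rough sketch of the effect, reusing the post's 1.6-trillion-parameter model and 60 TB/s of aggregate bandwidth: fewer bits per weight means proportionally fewer bytes to read per forward pass. The function name is hypothetical, and the sketch deliberately ignores any change in arithmetic throughput or model quality at lower precision.

```python
def memory_read_ms(param_count, bits_per_param, total_bandwidth_bytes_per_s):
    """Milliseconds spent reading every parameter once."""
    param_bytes = param_count * bits_per_param / 8
    return param_bytes / total_bandwidth_bytes_per_s * 1e3

for bits in (16, 8, 4):
    ms = memory_read_ms(1.6e12, bits, 60e12)
    print(f"{bits:>2}-bit weights: {ms:.1f} ms per forward pass")
```

In this simple model the gain is linear in the number of bits, so going from 8-bit to 4-bit weights halves the memory read time.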

• I think the human brain has around 2.5 petabytes of memory storage, which is insane compared to only 80 gigabytes of VRAM in the H100, and it does all this on 20 watts. I think this gives a lot of credence to the belief that the near future of AI will be a lot more brain-like than people think.

If the brain is basically at the limits of efficient algorithms, and we don’t get new paradigms for computing, then Jacob Cannell’s scenario for AI takeover would be quite right.

If algorithmic progress does have a larger effect on things, then Steven Byrnes’s take on AI takeover will likely be correct.

• Do you feel like your memory contains 2.5 petabytes of data? I’m not sure such a number passes the smell test.

• While I wouldn’t endorse the 2.5 PB figure itself, I would caution against this line of argument. It’s possible for your brain to contain plenty of information that is not accessible to your memory. Indeed, we know of plenty of such cognitive systems in the brain whose algorithms are both sophisticated and inaccessible to any kind of introspection: locomotion and vision are two obvious examples.

• I do want to ask why don’t you think the 2.5 petabyte figure is right, exactly?

• It might be right, I don’t know. I’m just making a local counterargument without commenting on whether the 2.5 PB figure is right or not, hence the lack of endorsement. I don’t think we know enough about the brain to endorse any specific figure, though 2.5 PB could perhaps fall within some plausible range.

• a gpu contains 2.5 petabytes of data if you oversample its wires enough. if you count every genome in the brain it easily contains that much. my point being, I agree, but I also see how someone could come up with a huge number like that and not be totally locally wrong, just highly misleading.

• To me any big number seems plausible, given that AFAIK people don’t seem to have run into upper limits of how much information the human brain can contain. While you do forget some things that don’t get rehearsed, and learning does slow down in old age, there are plenty of people who continue learning things and having a reasonably sharp memory all the way to old age. If there’s any point when the brain “runs out of hard drive space” and becomes unable to store new information, I’m at least not aware of any study that would suggest this.

• My immediate intuition is that any additional skills or facts about the world picked up later in life, wouldn’t affect data storage requirements enough to be relevant to the argument?

For example, if you already have vision and locomotion machinery and you can play the guitar and that takes X petabytes of data, and you then learn how to play the piano, I’d feel quite surprised if that ended up requiring your brain to contain more than even 2X petabytes total of data!

(I recognise I’m not arguing for it, but posting in case others share this intuition)

• I don’t immediately see the connection in your comment to what I was saying, which implies that I didn’t express my point clearly enough.

To rephrase: I interpreted FeepingCreature’s comment to suggest that 2.5 petabytes feels implausibly large, and that it feels implausible because, based on introspection, it doesn’t feel like one’s memory would contain that much information. My comment was meant to suggest that given that we don’t seem to ever run out of memory storage, we should expect our memory to contain far less information than the brain’s maximum capacity, as there always seems to be more capacity to spare for new information.

• Sure, but surely that’s how it feels from the inside when your mind uses a LRU storage system that progressively discards detail. I’m more interested in how much I can access—and um, there’s no way I can access 2.5 petabytes of data.

I think you just have a hard time imagining how much 2.5 petabytes is. If I literally stored in memory a high-resolution, poorly compressed JPEG image (1 MB) every second for the rest of my life, I would still not reach that storage limit. 2.5 petabytes would allow the brain to remember everything it has ever perceived, with very minimal compression, in full video, easily. We know that the actual memories we retrieve are heavily compressed. If we had 2.5 petabytes of storage, there’d be no reason for the brain to bother!
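The JPEG-per-second estimate can be checked with a few lines; the helper name and the use of a 365.25-day year are choices made here for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_to_fill(capacity_bytes, bytes_per_second):
    """Years of continuous writing needed to fill the given capacity."""
    return capacity_bytes / bytes_per_second / SECONDS_PER_YEAR

# 2.5 PB of capacity, written at one 1 MB image per second
print(f"{years_to_fill(2.5e15, 1e6):.0f} years")
```

This comes out at roughly 79 years of uninterrupted recording, which is about a human lifetime.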

• “If we had 2.5 petabytes of storage, there’d be no reason for the brain to bother!”

I recall reading an anecdote (though don’t remember the source, ironically enough) from someone who said they had an exceptional memory, saying that such a perfect memory gets nightmarish. Everything they saw constantly reminded them of some other thing associated with it. And when they recalled a memory, they didn’t just recall the memory, but they also recalled each time in their life when they had recalled that memory, and also every time they had recalled recalling those memories, and so on.

I also have a friend whose memory isn’t quite that good, but she says that unpleasant events have an extra impact on her because the memory of them never fades or weakens. She can recall embarrassments and humiliations from decades back with the same force and vividness as if they happened yesterday.

Those kinds of anecdotes suggest to me that the issue is not that the brain would in principle have insufficient capacity for storing everything, but that recalling everything would create too much interference and that the median human is more functional if most things are forgotten.

EDIT: Here is one case study reporting this kind of a thing:

We know of no other reported case of someone who recalls personal memories over and over again, who is both the warden and the prisoner of her memories, as AJ reports. We took seriously what she told us about her memory. She is dominated by her constant, uncontrollable remembering, finds her remembering both soothing and burdensome, thinks about the past “all the time,” lives as if she has in her mind “a running movie that never stops” [...]

One way to conceptualize this phenomenon is to see AJ as someone who spends a great deal of time remembering her past and who cannot help but be stimulated by retrieval cues. Normally people do not dwell on their past but they are oriented to the present, the here and now. Yet AJ is bound by recollections of her past. As we have described, recollection of one event from her past links to another and another, with one memory cueing the retrieval of another in a seemingly “unstoppable” manner. [...]

Like us all, AJ has a rich storehouse of memories latent, awaiting the right cues to invigorate them. The memories are there, seemingly dormant, until the right cue brings them to life. But unlike AJ, most of us would not be able to retrieve what we were doing five years ago from this date. Given a date, AJ somehow goes to the day, then what she was doing, then what she was doing next, and left to her own style of recalling, what she was doing next. Give her an opportunity to recall one event and there is a spreading activation of recollection from one island of memory to the next. Her retrieval mode is open, and her recollections are vast and specific.

• That memory would be used for what might be called semantic indexing. So it’s not that I can remember tons of info, it’s that I remember it in exactly the right situation.

I have no idea if that’s an accurate figure. You’ve got the synapse count and a few bits per synapse (or maybe more), but you’ve also got to account for the choices of which cells synapse on which other cells, which are also wired and learned with exquisite specificity, and so constitute information storage of some sort.

• I got that from googling the capacity of the human brain, and I found it in many sources. While this number is surprisingly high, I do think it makes some sense, especially since I remember that one big issue with AI is that it has way less memory than the human brain, even when computation is at a similar level.

• Many of the calculations on the brain capacity are based on wrong assumptions. Is there an original source for that 2.5 PB calculation? This video is very relevant to the topic if you have some time to check it out:

• Reber (2010) was my original source for the claim that the human brain has 2.5 petabytes of memory, but it’s definitely something that got reported a lot by secondary sources like Scientific American.

• Yep, that’s the source I was looking for to find the original source of the claim.

• From what I’ve seen, even the larger synapses store only about 5 bits or so, and the ‘median’ or typical synapse probably stores less than 1 bit in some sense (the typical brain synapse only barely exists in a probabilistic sense. As in a neuromorphic computer, a physical synaptic connection is an obvious but underappreciated prerequisite for a logical synapse, but the former does not necessarily entail the latter; see also quantal synaptic failure).

In my 2022 roadmap I estimated brain capacity at 1e15 bits but that’s probably an overestimate for logical bits.

Also the brain is quite sparse for energy efficiency, but that usually comes at a tradeoff in parameter efficiency. This is well explored in the various tradeoffs for ANNs that model 3D space (NeRFs etc.) but generalizes to other modalities. The most parameter-efficient models will be more dense but less compute/energy efficient for inference as a result. There are always more ways to compress the information stored in an ANN, but those optimization directions are extremely unlikely to align with the optimizations favoring more efficient inference via runtime sparsity (and extreme runtime sparsity probably requires redundancy, aka anti-compression).
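For a sense of scale, the sketch below converts bits-per-synapse figures into total capacity, using the post's count of roughly 100 trillion synapses. The helper name is hypothetical, and this is only unit conversion, not a claim about how the brain actually stores information.

```python
def capacity_bytes(n_synapses, bits_per_synapse):
    """Total storage in bytes at a given number of bits per synapse."""
    return n_synapses * bits_per_synapse / 8

N_SYNAPSES = 100e12  # ~100 trillion synapses, as in the post

for bits in (1, 5, 10):
    tb = capacity_bytes(N_SYNAPSES, bits) / 1e12
    print(f"{bits:>2} bits/synapse: {tb:.1f} TB")

# Bits per synapse that a 2.5 PB total would imply
print(f"2.5 PB implies {2.5e15 * 8 / N_SYNAPSES:.0f} bits per synapse")
```

At 10 bits per synapse this gives the 1e15 bits (125 TB) mentioned above, while 2.5 PB would require about 200 bits per synapse at this synapse count.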

• if the human brain had around 2.5 petabytes of storage, that would decrease my credence in AI being brain-like, because i believe AI is on track to match human intelligence in its current paradigm, so the brain being different just means the brain is different.

• Agreed, and interested in @Noosphere89 elaborating on why you have the opposite intuition.

• Basically, it has to do with the fundamental issue of the Von Neumann bottleneck: there is a massive imbalance between memory and computation. While LLMs and human brains differ a lot in their algorithms, another, non-algorithmic difference is that the human brain has far more memory than pretty much any GPT, or indeed any AI that exists.

Besides, more memory is good anyways.

And that causes issues when you try simulating an entire brain at high speed; in particular, it becomes a large issue when the compute has to wait all the time while data keeps being shuffled to and from memory.

• Look into AMD MI300X. Has 192 GB HBM3 memory. With FP4 weights, might run GPT-4 in a single node of 8 GPUs, still have plenty to spare for KV. Eliminating cross-node communication easily allows 2x batch size.
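A quick capacity check for this suggestion. GPT-4's parameter count is not public, so the sketch substitutes the post's hypothetical 1.6-trillion-parameter model as a stand-in; that substitution and the helper name are assumptions.

```python
def node_headroom_gb(n_gpus, hbm_gb_per_gpu, param_count, bits_per_param):
    """GB of HBM left in a node after the weights are loaded."""
    weights_gb = param_count * bits_per_param / 8 / 1e9
    return n_gpus * hbm_gb_per_gpu - weights_gb

# 8 GPUs with 192 GB each; ASSUMED 1.6e12 parameters stored as 4-bit weights
print(node_headroom_gb(8, 192, 1.6e12, 4))  # 736.0
```

Under that assumption the weights take 800 GB of the node's 1536 GB, leaving a bit under half of the memory for KV cache and activations.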

Fungibility is a good idea, would take avg. KVUtil from 10% to 30% imo.

• Wow, this is a good argument. Especially if assumptions hold.

1. The ALU computes the input much faster than the results can be moved to the next layer.

2. So if the AI only receives a single user’s prompt, the ALUs waste a lot of time waiting for input.

3. But if many users are sending prompts all the time, the ALUs can be sent many more operations at once (assuming the wires are bottlenecked by speed rather than amount of information they can carry).

4. So if your AI is extremely popular (e.g., OpenAI), your ALUs have to spend less time idling, so the GPUs you use are much more cost-effective.

5. Compute is much more expensive for less popular AIs (plausibly >1000x).