Energy Won’t Constrain AI Inference.
The energy for LLM inference follows the formula: Energy = 2 × P × N × (tokens/user) × ε, where P is active parameters, N is concurrent users, and ε is hardware efficiency in Joules/FLOP. The factor of 2 accounts for multiply-accumulate operations in matrix multiplication.
Using NVIDIA’s GB300, we can calculate ε as follows: the GPU has a TDP of 1400W and delivers 14 PFLOPS of dense FP4 performance. Thus ε = 1400 J/s ÷ (14 × 10^15 FLOPS) = 100 femtojoules per FP4 operation. With this efficiency, a 1 trillion active parameter model needs just 0.2 mJ per token (2 × 10^12 × 10^-13 J). This means 10 GW could give every American 167[1] tokens/second continuously.
[1] 300 million users: tokens/second = 10^10 W ÷ (2 × 10^12 × 3 × 10^8 × 10^-13) = 167 tokens/second per person
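For reference, here is that arithmetic spelled out as a quick script, using only the assumptions stated above (1,400 W TDP, 14 PFLOPS dense FP4, 1T active parameters, 300 million users, a 10 GW power budget):

```python
# Reproducing the post's arithmetic with its own assumptions.
chip_watts = 1400                    # GB300 TDP, J/s
chip_flops = 14e15                   # dense FP4, FLOP/s
eps = chip_watts / chip_flops        # J per FLOP, = 1e-13 (100 femtojoules)

flops_per_token = 2 * 1e12           # 2 FLOPs per active parameter, 1T active params
joules_per_token = flops_per_token * eps   # = 0.2 J per token (not 0.2 mJ; see the reply below)

budget_watts = 10e9                  # 10 GW
users = 300e6                        # "every American"
tokens_per_user_per_s = budget_watts / (joules_per_token * users)
print(eps, joules_per_token, tokens_per_user_per_s)   # ~1e-13, 0.2, ~167
```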
Generation is HBM bandwidth bound, not compute bound, so you are estimating power for input tokens. Coding agents (as opposed to chatbots) do their own thing that you don't read, potentially in parallel, and a lot of content gets automatically stuffed into their contexts, so per-user token demand could get very high.
Power is a proxy for cost, and there isn't enough money in AI yet for power to become the limiting factor. A 1 GW datacenter costs about $50bn to build (or $10-12bn per year to use), so 100 GW of datacenters, for example, is more than the current economics of AI can support, even though it's in principle feasible to build within a few years.
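A rough back-of-the-envelope for what 100 GW implies at those quoted costs ($50bn per GW to build, $10-12bn per GW-year to use):

```python
# Implied spend for 100 GW at the quoted per-GW costs.
gw = 100
capex = gw * 50e9                              # $5 trillion up front
opex_low, opex_high = gw * 10e9, gw * 12e9     # $1.0-1.2 trillion per year
print(capex / 1e12, opex_low / 1e12, opex_high / 1e12)   # 5.0, 1.0, 1.2 (in $ trillions)
```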
(That’s 0.2 J per token, not 0.2 mJ per token. But the later conclusion of 167 tokens/second is correct with your assumptions.)
A GB200/GB300 NVL72 rack is about 140 kW, or about 1,950 W per chip (because of all the other hardware in a rack besides the chips), and the datacenter outside the racks adds networking, cooling, and power loss from voltage stepping in transformers (some of this is captured in a metric called power usage effectiveness, or PUE), a factor of about 1.3. So you end up with about 2,500 W per chip, all-in at the level of the whole datacenter. With the Abilene system, for example, we can see that 400K chips need about 1 GW of power.
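As a sketch of that all-in figure, assuming 72 GPUs per NVL72 rack and the ~1.3 datacenter overhead factor above:

```python
# All-in power per chip: rack-level draw plus datacenter overhead
# (networking, cooling, transformer losses / PUE).
rack_watts = 140e3
gpus_per_rack = 72
per_chip_rack = rack_watts / gpus_per_rack        # ~1,950 W
overhead = 1.3                                    # datacenter-level factor
per_chip_all_in = per_chip_rack * overhead        # ~2,500 W

chips = 400e3                                     # e.g. an Abilene-scale system
print(per_chip_all_in, chips * per_chip_all_in / 1e9)   # ~2,530 W per chip, ~1 GW total
```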
For my own estimate for input tokens, I'd assume 60% utilization and 15e15 FP4 FLOP/s, so for a 1T active param model a chip performs 9e15 useful FLOPs in a second while spending 2,500 J. As you need 2e12 FLOPs per token (2 FLOPs per active param), that's 4,500 tokens in that second, or continuous processing of about 2 input tokens per second per watt of available GB300 compute. Thus with 10 GW of datacenters we get 18e9 tokens per second, or 2.2 tokens per second per person worldwide, or 52 tokens per second per American.
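The same estimate as a short script. Assumptions as stated above (15e15 FP4 FLOP/s, 60% utilization, 1T active params, 2,500 W per chip all-in), plus population figures of roughly 8 billion people and 340 million Americans, which are my own round numbers:

```python
# Input-token throughput per chip and per watt, then scaled to a 10 GW fleet.
peak_flops = 15e15
utilization = 0.60
useful_flops = peak_flops * utilization              # 9e15 FLOP/s per chip
flops_per_token = 2 * 1e12                           # 2 FLOPs per active param, 1T params
tokens_per_chip_s = useful_flops / flops_per_token   # 4,500 tokens/s per chip

watts_per_chip = 2500                                # all-in, from the rack estimate above
tokens_per_watt_s = tokens_per_chip_s / watts_per_chip   # ~1.8 tokens/s per W

fleet_watts = 10e9                                   # 10 GW of datacenters
fleet_tokens_per_s = tokens_per_watt_s * fleet_watts     # ~1.8e10 tokens/s
print(fleet_tokens_per_s / 8e9,                      # ~2.3 tokens/s per person worldwide
      fleet_tokens_per_s / 340e6)                    # ~53 tokens/s per American
```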
For output tokens, 5x fewer tokens per second per chip seems to be a rule of thumb (5-15% compute utilization instead of 60%), corresponding to the difference in API prices between input and output tokens. So that's 0.5 tokens per second per person worldwide, or 10 tokens per second per American.
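Applying that 5x rule of thumb to the input-token figures above (a rule of thumb, not a measured number):

```python
# Output-token estimate: ~5x fewer tokens/s per chip than input tokens
# (roughly 12% compute utilization instead of 60%).
input_per_person_world = 2.2      # tokens/s, from the input estimate above
input_per_american = 52           # tokens/s
print(input_per_person_world / 5, input_per_american / 5)   # ~0.4 and ~10
```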
That makes sense, thanks for the corrections!
Why would demand for AI inference be below 167 tokens/second per American? I expect it to be much higher, and for energy to be a constraint.