“spending tens of billions of dollars to build clusters that could train a GPT-6-sized model in 2028”
Traditionally, steps of the GPT series are roughly 100x apart in raw compute (I’m not counting effective compute, since it’s not relevant to the cost of training). GPT-4 is about 2e25 FLOPs, which puts “GPT-6” at 2e29 FLOPs. To train such a model in 2028, you would build an Nvidia Rubin Ultra NVL576 (Kyber) training system in 2027. Each rack holds 576 compute dies at about 3e15 BF16 FLOP/s per die[1], or about 1.6e18 FLOP/s per rack. A Blackwell NVL72 datacenter costs about $4M per rack to build, a non-Ultra Rubin NVL144 datacenter will possibly cost about $5M per rack, and a Rubin Ultra NVL576 datacenter might cost about $12M per rack[2].
To get 2e29 BF16 FLOPs in 4 months at 40% utilization, you’d need about 30K racks, which would cost about $360B all-in (together with the rest of the training system). That is significantly more than “tens of billions of dollars”.
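As a quick check of the arithmetic, here is the same calculation as a Python sketch (all inputs are just the rough estimates above):

```python
# Rough check of the numbers above: a 2e29 FLOP run over 4 months at 40% utilization
# on Rubin Ultra NVL576 racks at ~1.6e18 dense BF16 FLOP/s and ~$12M all-in per rack.
target_flops = 2e29
rack_flop_per_s = 1.6e18
utilization = 0.40
run_seconds = 4 * 30 * 24 * 3600   # ~4 months
cost_per_rack = 12e6               # all-in datacenter cost per rack (estimate)

racks = target_flops / (rack_flop_per_s * utilization * run_seconds)
total_cost = racks * cost_per_rack
print(f"{racks:,.0f} racks, ~${total_cost / 1e9:.0f}B all-in")   # ~30,000 racks, ~$360B
```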
“GPT-8 would require trillions”
“GPT-8” is two steps of 100x in raw compute up from “GPT-6”, at 2e33 FLOPs. You’d need 10,000x more compute than what $360B buys in 2027. Divide that by how much cheaper compute gets within a few years, say 8x. What you get is $450T, which is much more than merely “trillions”, and also technologically impossible to produce at that time without transformative AI.
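The same style of estimate for “GPT-8”, where the 8x price-performance improvement is just the assumption stated above:

```python
# "GPT-8": 10,000x the compute that ~$360B buys in 2027, divided by an assumed
# ~8x improvement in price-performance over the intervening years.
gpt6_cost_2027 = 360e9
raw_compute_ratio = 1e4       # 2e33 / 2e29
price_performance_gain = 8    # assumption

gpt8_cost = gpt6_cost_2027 * raw_compute_ratio / price_performance_gain
print(f"~${gpt8_cost / 1e12:.0f}T")   # ~$450T
```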
Chips in Blackwell GB200 systems are manufactured with a 4nm process and produce about 2.5e15 dense BF16 FLOP/s per chip, with each chip holding 2 almost reticle-sized compute dies. Rubin moves to 3nm, which makes each die about 2x more performant (from more transistors and a higher clock speed, while the die size must remain the same), which predicts about 2.5e15 dense BF16 FLOP/s per die, or 5e15 BF16 FLOP/s per 2-die chip. (Nvidia announced that dense FP8 performance will increase 3.3x, but that’s probably due to giving more transistors to FP8, which can’t be done as much for BF16 since it already needs a lot.)
To separately support this, today Google announced Ironwood, their 7th generation of TPU (which might go into production in late 2026). The announcement includes a video showing that it’s a 2-die chip, same as non-Ultra Rubin, and it was also previously reported to be manufactured with 3nm. In today’s announcement, its performance is quoted as 4.6e15 FLOP/s, which, from the context of the comparison with 459e12 FLOP/s for TPUv5p, is likely dense BF16. This means 2.3e15 dense BF16 FLOP/s per compute die, close to my estimate for a Rubin compute die.
A Kyber rack was announced to need 600 kW (1.04 kW/die within-rack all-in), compared to Blackwell NVL72 at 120-130 kW per rack (0.83-0.90 kW/die within-rack all-in). The earlier non-Ultra Rubin NVL144 is a rack with the same number of chips and compute dies as Blackwell NVL72, so it might use at most slightly higher power per compute die (say 0.90 kW/die within-rack all-in). Thus the clock speed for Rubin Ultra might be up to ~1.15x higher than for non-Ultra Rubin, meaning the performance of Rubin Ultra might reach 2.9e15 dense BF16 FLOP/s per die (12e15 FLOP/s per chip, 1.6e18 FLOP/s per rack).
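Putting this footnote's estimates together as a sketch (the Blackwell per-chip figure, the ~2x per-die gain from the node change, and the use of the power-per-die ratio as an upper bound on the clock-speed increase are all just the assumptions above):

```python
# Per-die performance estimate for Rubin Ultra, using only the assumptions above.
blackwell_chip = 2.5e15                     # dense BF16 FLOP/s per 2-die GB200 chip
rubin_die = (blackwell_chip / 2) * 2        # ~2.5e15 per non-Ultra Rubin die after the 3nm shrink
                                            # (Ironwood cross-check: 4.6e15 / 2 dies = 2.3e15)
kyber_kw_per_die = 600 / 576                # ~1.04 kW/die, Rubin Ultra NVL576 (Kyber)
nvl144_kw_per_die = 130 / 144               # ~0.90 kW/die assumed for non-Ultra Rubin NVL144
clock_ratio = kyber_kw_per_die / nvl144_kw_per_die   # up to ~1.15x

rubin_ultra_die = rubin_die * clock_ratio
print(f"per die:  {rubin_ultra_die:.2e}")        # ~2.9e15
print(f"per chip: {rubin_ultra_die * 4:.2e}")    # 4 dies per chip -> ~12e15
print(f"per rack: {rubin_ultra_die * 576:.2e}")  # 576 dies -> ~1.6e18-1.7e18
```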
In a Rubin Ultra NVL576 rack, chips have 4 compute dies each, compared to only 2 dies per chip in a non-Ultra Rubin NVL144 rack. Since Nvidia sells at a large margin per compute die, and its real product is the whole system rather than the individual compute dies, it can afford to keep cutting the margin per die, while the cost of the rest of the system scales with the number of chips rather than the number of dies. The NVL576 rack has 2x more chips than the ~$5M NVL144 rack, so if the cost per chip only increases slightly, we get about $12M per rack.
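As a sketch of this cost guess, where the 1.2x per-chip increase is only an illustrative stand-in for “increases slightly”:

```python
# Rack cost guess: the NVL576 rack has 2x the chips of the ~$5M NVL144 rack.
nvl144_rack_cost = 5e6
chip_count_ratio = 2          # 144 four-die chips vs 72 two-die chips per rack
per_chip_cost_increase = 1.2  # illustrative stand-in for "increases slightly"

nvl576_rack_cost = nvl144_rack_cost * chip_count_ratio * per_chip_cost_increase
print(f"~${nvl576_rack_cost / 1e6:.0f}M per rack")   # ~$12M
```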
Thanks, it’s useful to have these figures and an independent check on these calculations.
I’ve been estimating it based on a 500x increase in effective FLOP per generation, rather than 100x of regular FLOP.
Rough calculations are here.
At the current trajectory, the GPT-6 training run costs $6bn in 2028, and GPT-7 costs $130bn in 2031.
I think that makes GPT-8 a couple of trillion in 2034.
You’re right that if you wanted to train GPT-8 in 2031 instead, then it would cost roughly 500x more than training GPT-7 that year.
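To make the arithmetic behind these figures explicit, here is an illustrative reconstruction (not the linked calculations) that assumes the cost ratio between successive generations stays roughly constant:

```python
# Each generation is assumed to need 500x the effective FLOP; the quoted costs imply the
# price of effective compute falls about 500 / (130/6) = ~23x per 3-year generation gap.
effective_flop_step = 500
gpt6_cost_2028 = 6e9
gpt7_cost_2031 = 130e9

cost_step = gpt7_cost_2031 / gpt6_cost_2028                  # ~22x per generation
implied_efficiency_gain = effective_flop_step / cost_step    # ~23x per 3 years

gpt8_cost_2034 = gpt7_cost_2031 * cost_step                  # ~$2.8T ("a couple of trillion")
gpt8_cost_2031 = gpt7_cost_2031 * effective_flop_step        # ~$65T if trained 3 years early
print(f"GPT-8 in 2034: ~${gpt8_cost_2034 / 1e12:.1f}T")
print(f"GPT-8 in 2031: ~${gpt8_cost_2031 / 1e12:.0f}T")
```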
I think the idea of effective FLOPs has narrower applicability than what you are running with: many things that count as compute multipliers don’t scale. They often only hold for particular capabilities that stop being worth boosting separately at greater levels of scale, or for particular data that stops being available in sufficient quantity. An example of a scalable compute multiplier is MoE (even as it destroys data efficiency, and so damages some compute multipliers that rely on selection of high quality data). See Figure 4 in the Mamba paper for another example of a scalable compute multiplier (between the GPT-3 transformer and the Llama 2 transformer, Transformer and Transformer++ respectively in the figure). This issue is particularly serious when we extrapolate by many OOMs, and I think only very modest compute multipliers (like 1.5x/year) survive across all that, because most things that locally seem like compute multipliers don’t compound very far.
There are also issues I have with Epoch studies on this topic, mainly extrapolating scaling laws from things that are not scalable much at all and weren’t optimized for compute optimality, with trends being defined by limits of scalability and idiosyncratic choices of hyperparameters driven by other concerns, rather than by a well-defined and consistent notion of compute optimality (which I’d argue wasn’t properly appreciated and optimized for by the field until very recently, with Chinchilla still managing to fix glaring errors as late as 2022). Even now, papers arguing for compute multipliers keep showing in-training loss/perplexity plots that weren’t cooled down before measurement. I think Figure 11 from the recent OLMo 2 paper illustrates this point brilliantly, showing how what looks like a large compute multiplier[1] before the learning rate schedule runs its course can lose all effect once it does.
To be fair, it’s kind of a toy case where the apparent “compute multiplier” couldn’t seriously be expected to be real, but it does illustrate the issue with the methodology of looking at loss/perplexity plots where the learning rate is still high, or might differ significantly between the points being compared for different architecture variants, hopelessly confounding the calculation of a compute multiplier.
Hmm interesting.
I see the case for focusing only on compute (since it’s relatively objective), but it still seems important to try to factor in some amount of algorithmic progress in pretraining (which means that the cost to achieve GPT-6-level performance will be dropping over time).
The points on Epoch are getting outside of my expertise – I see my role as synthesising what experts are saying. It’s good to know these critiques exist, and it would be cool to see them written up and discussed.