I think the idea of effective FLOPs has narrower applicability than what you are running with: many things that count as compute multipliers don’t scale. They often only hold for particular capabilities that stop being worth boosting separately at greater levels of scale, or for particular data that stops being available in sufficient quantity. An example of a scalable compute multiplier is MoE (even though it destroys data efficiency, and so damages other compute multipliers that rely on selection of high-quality data). See Figure 4 in the Mamba paper for another example of a scalable compute multiplier (between the GPT-3 transformer and the Llama 2 transformer, labelled Transformer and Transformer++ respectively in the figure). This issue is particularly serious when we extrapolate by many OOMs: I think only very modest compute multipliers (like 1.5x/year) survive across all that, because most things that locally seem like compute multipliers don’t compound very far.
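As a toy illustration of why the compounding point dominates over many OOMs (a minimal sketch with hypothetical numbers, not figures from any of the sources above): a multiplier that only applies once is quickly dwarfed by even a modest multiplier that genuinely keeps compounding.

```python
# Toy illustration (hypothetical numbers): a one-off compute multiplier gives
# a fixed boost, while a multiplier that survives scaling compounds each year.
def effective_flops(raw_flops: float, annual_multiplier: float, years: int) -> float:
    """Effective FLOPs if an algorithmic multiplier compounds every year."""
    return raw_flops * annual_multiplier ** years

raw = 1e26  # hypothetical raw training FLOPs

# A 3x trick that stops helping at larger scale stays a 3x boost forever.
one_off = 3 * raw

# A modest 1.5x/year multiplier that actually keeps compounding:
for years in (3, 6, 9):
    boost = effective_flops(raw, 1.5, years) / raw
    print(f"{years} years at 1.5x/year -> ~{boost:.0f}x raw FLOPs")
# 3 years -> ~3x, 6 years -> ~11x, 9 years -> ~38x
```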
There are also issues I have with Epoch studies on this topic, mainly extrapolating scaling laws from things that are not scalable much at all and weren’t optimized for compute optimality, so the trends are defined by limits of scalability and idiosyncratic choices of hyperparameters driven by other concerns, rather than by a well-defined and consistent notion of compute optimality (which I’d argue wasn’t properly appreciated and optimized for by the field until very recently, with Chinchilla still managing to fix glaring errors as late as 2022). Even now, papers arguing for compute multipliers keep showing in-training loss/perplexity plots that weren’t cooled down before measurement. I think Figure 11 from the recent OLMo 2 paper illustrates this point brilliantly, showing how what looks like a large compute multiplier[1] before the learning rate schedule runs its course can lose all effect once it does.
To be fair, it’s kind of a toy case where the apparent “compute multiplier” couldn’t seriously be expected to be real, but it does illustrate the issue with the methodology of looking at loss/perplexity plots where the learning rate is still high, or might differ significantly between the points being compared for different architecture variants, hopelessly confounding the calculation of a compute multiplier.
I see the case for focusing only on compute (since it’s relatively objective), but it still seems important to try to factor in some amount of algorithmic progress in pretraining (which means the cost to achieve GPT-6-level performance will be dropping over time).
The points on Epoch are getting outside my expertise – I see my role as synthesising what experts are saying. It’s good to know these critiques exist, and it would be cool to see them written up and discussed.
Thanks, it’s useful to have these figures and independent data on these calculations.
I’ve been estimating it based on a 500x increase in effective FLOP per generation, rather than 100x of regular FLOP.
Rough calculations are here.
On the current trajectory, the GPT-6 training run costs $6bn in 2028, and GPT-7 costs $130bn in 2031.
I think that puts GPT-8 at a couple of trillion dollars in 2034.
You’re right that if you wanted to train GPT-8 in 2031 instead, then it would cost roughly 500x more than training GPT-7 that year.
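To make the arithmetic explicit (a minimal sketch of how I read these figures: the ~22x per-generation cost growth and the resulting ~23x decline in cost per effective FLOP over three years are back-calculated from the $6bn and $130bn anchors, not numbers taken from the linked calculations):

```python
# Rough reconstruction of the cost arithmetic above (my back-calculation, not
# the linked spreadsheet): 500x effective FLOP per generation, one generation
# every ~3 years, anchored on GPT-6 ~$6bn (2028) and GPT-7 ~$130bn (2031).
EFFECTIVE_FLOP_PER_GEN = 500
COST_GPT6_2028 = 6e9
COST_GPT7_2031 = 130e9

# Implied dollar-cost growth per generation; 3 years of hardware
# price-performance and algorithmic progress absorb the rest of the 500x.
cost_growth_per_gen = COST_GPT7_2031 / COST_GPT6_2028                   # ~22x
cost_per_eflop_decline = EFFECTIVE_FLOP_PER_GEN / cost_growth_per_gen   # ~23x per 3 years

cost_gpt8_2034 = COST_GPT7_2031 * cost_growth_per_gen          # ~$2.8tn in 2034
cost_gpt8_2031 = COST_GPT7_2031 * EFFECTIVE_FLOP_PER_GEN       # ~$65tn if trained in 2031

print(f"GPT-8 in 2034: ~${cost_gpt8_2034 / 1e12:.1f}tn")
print(f"GPT-8 in 2031 (no 3 years of cost declines): ~${cost_gpt8_2031 / 1e12:.0f}tn, "
      f"i.e. 500x the cost of GPT-7 that year")
```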