65T tokens doesn’t get you to 1e26 FLOP with 100B active params? You’d need well over 100T tokens: 6 * 100 billion * 65 trillion is 3.9e25 FLOP.
GPT-4.5 being trained on fewer tokens than GPT-4o doesn’t really make sense. GPT-4.5 only having 5x more active params than GPT-4o doesn’t quite make sense either, though I’m not as confident that’s wrong.
1e26 FLOP would have had a significant opportunity cost. Remember that OpenAI was and is very GPU constrained and may have valued GPU hours in a large-scale cluster a lot more than $2/hour. It would be worth it to make your flagship model good, but not worth it if it barely has any effect on your flagship model. I don’t think it’s a good idea to reason backwards from an alleged compute budget that OpenAI might have had at a given date to the training FLOP of a model trained then.
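To spell out the arithmetic in the first quoted point, a minimal sketch using the standard 6·N·D estimate of dense-transformer training FLOP (the 100B-param and 65T-token figures are the numbers under discussion, not confirmed specs):

```python
# Standard dense-transformer estimate: training FLOP ~ 6 * active_params * tokens.
def train_flop(active_params, tokens):
    return 6 * active_params * tokens

print(f"{train_flop(100e9, 65e12):.2e}")    # 3.90e+25, well short of 1e26
print(f"{1e26 / (6 * 100e9) / 1e12:.0f}T")  # ~167T tokens needed at 100B active params
```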
1e26 FLOP would have had a significant opportunity cost.
At the end of 2023 Microsoft had 150K+ H100s, so reserving 30K doesn’t seem like too much (especially as they can use non-H100 and possibly non-Microsoft compute for research experiments). It’s difficult to get a lot of a new chip when it just comes out, or to get a lot in a single training system, or to suddenly get much more if demand surges. But for a frontier training run, there would’ve been months of notice. And the opportunity cost of not doing this is being left with an inferior model (or a less overtrained model that costs more at inference time, and so requires more GPUs to serve).
I don’t think it’s a good idea to reason backwards from an alleged compute budget that OpenAI might have had at a given date to the training FLOP of a model trained then.
The main anchors are 32K H100s in a single training system, and frontier training compute scaling 4x per year. Currently, a year later, 3e26-6e26 FLOP models are getting released (based on the 100K H100s in Colossus and the numbers in the Grok 3 announcement, the 100K H100s at the Goodyear site, 100K-TPUv6e datacenters, and Meta’s 128K H100s). The $3bn figure was just to point out that $140m following from such anchors is not a very large number.
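To make the anchor arithmetic concrete, a minimal sketch; the ~1e15 FLOP/s dense BF16 peak is the H100 spec, while the 40% utilization, $2/GPU-hour price, and ~90-day run length are illustrative assumptions rather than claims about any particular training run:

```python
H100_PEAK_FLOPS = 1e15       # ~dense BF16 peak per GPU, FLOP/s (FP8 roughly doubles this)
MFU = 0.4                    # assumed model FLOP utilization
PRICE_PER_GPU_HOUR = 2.0     # assumed $/H100-hour

def run_days(total_flop, n_gpus):
    """Days needed to accumulate total_flop on n_gpus at the assumed utilization."""
    return total_flop / (n_gpus * H100_PEAK_FLOPS * MFU) / 86400

def gpu_hour_cost(total_flop, n_gpus):
    """GPU-hour cost of accumulating total_flop at the assumed price."""
    return run_days(total_flop, n_gpus) * 24 * n_gpus * PRICE_PER_GPU_HOUR

# 1e26 FLOP on a 32K-H100 system: ~90 days and ~$140m in GPU-hours.
print(f"{run_days(1e26, 32_000):.0f} days, ${gpu_hour_cost(1e26, 32_000) / 1e6:.0f}m")

# A run of similar duration on a 100K-H100 cluster lands in the 3e26-6e26 FLOP range.
print(f"{100_000 * H100_PEAK_FLOPS * MFU * 90 * 86400:.1e}")
```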
65T tokens doesn’t get you to 1e26 FLOP with 100B active params?
Right, 45T-65T is for a compute-optimal 1e26 FLOP model; I did the wrong calculation when editing in this detail. For a 10x overtrained model, it’s 3x more data than that, so for 150T total tokens you’d need 5 epochs of 30T tokens, which is still feasible (with almost no degradation compared to 150T unique tokens of that quality). The aim was to derive this from the compute-optimal 260B and 370B figures reduced about 3x (rather than from 100B).
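Spelled out as a minimal sketch, using the 370B/45T and 260B/65T compute-optimal endpoints above (the square-root shift at fixed compute is the generic Chinchilla-style relation, not a claim about GPT-4o’s actual configuration):

```python
import math

FLOP = 1e26
OVERTRAIN = 10   # 10x more tokens per parameter than compute-optimal

# Compute-optimal endpoints: (active params, tokens) with 6 * N * D ~ 1e26.
for n, d in [(370e9, 45e12), (260e9, 65e12)]:
    assert abs(6 * n * d - FLOP) / FLOP < 0.05   # sanity check
    # At fixed compute, k-x overtraining divides params by sqrt(k) and multiplies tokens by sqrt(k).
    n_ot, d_ot = n / math.sqrt(OVERTRAIN), d * math.sqrt(OVERTRAIN)
    print(f"{n_ot / 1e9:.0f}B active params, {d_ot / 1e12:.0f}T tokens")
# -> ~117B/142T and ~82B/206T: roughly 100B active params and ~150T tokens,
#    which 5 epochs over ~30T unique tokens can supply.
```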
GPT-4.5 being trained on fewer tokens than GPT-4o doesn’t really make sense.
How so? If it uses 3x more compute but isn’t 10x overtrained, that means less data (with multiple epochs, it would probably use exactly the same unique data, repeated a bit less). The video presentation on GPT-4.5 mentioned work on lower precision in pretraining, so it might even be a 6e26 FLOP model (though a priori it would be surprising if the first foray into this scale weren’t taken in the more conservative BF16). And it would still be less data (the square root of 6x is less than 3x). Overtraining has a large effect on both the number of active parameters and the needed number of tokens, at a relatively minor cost in effective compute, so it’s a very salient option for production models.
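A minimal sketch of the token comparison, assuming the hypothesized FLOP figures above (1e26 for GPT-4o, 3e26 or 6e26 for GPT-4.5) and the same compute-optimal tokens-per-param baseline:

```python
import math

# Compute-optimal token range at 1e26 FLOP, from the endpoints above.
BASE_TOKENS = (45e12, 65e12)

def tokens(compute_ratio, overtrain=1):
    """At a fixed tokens-per-param baseline, data scales as sqrt(compute ratio * overtraining factor)."""
    return [d * math.sqrt(compute_ratio * overtrain) for d in BASE_TOKENS]

for label, ds in [("1e26, 10x overtrained (GPT-4o hypothesis)", tokens(1, overtrain=10)),
                  ("3e26, compute-optimal", tokens(3)),
                  ("6e26, compute-optimal", tokens(6))]:
    print(label, [f"{d / 1e12:.0f}T" for d in ds])
# Endpoint for endpoint, sqrt(3) and sqrt(6) are both below sqrt(10) ~ 3.16,
# so the larger compute-optimal run still uses fewer tokens than the overtrained 1e26 one.
```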