65T tokens doesn’t get you to 1e26 FLOP with 100B active params?
Right, 45T-65T tokens is for a compute-optimal 1e26 FLOPs model; I did the wrong calculation when editing in this detail. A 10x overtrained model at the same compute needs about 3x more data than that (and about 3x fewer active parameters, since both factors are sqrt(10) ≈ 3). So for 150T total tokens you'd need 5 epochs of 30T unique tokens, which is still feasible (with almost no degradation compared to 150T unique tokens of that quality). The aim was to calculate this from the 260B and 370B active parameter figures reduced 3x (rather than from 100B).
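As a sanity check, here's a minimal sketch of that arithmetic, assuming the standard C = 6·N·D approximation for training FLOPs and taking "10x overtrained" to mean 10x the compute-optimal tokens-per-parameter ratio (so at fixed compute, active parameters shrink by sqrt(10) ≈ 3 and tokens grow by the same factor):

```python
import math

C = 1e26  # training compute in FLOPs, assuming C = 6 * N * D

# Compute-optimal anchor points from the comment: 260B and 370B active params.
for n_opt in (260e9, 370e9):
    d_opt = C / (6 * n_opt)           # compute-optimal token count
    # 10x overtraining at fixed compute: N shrinks by sqrt(10), D grows by sqrt(10).
    n_ot = n_opt / math.sqrt(10)
    d_ot = d_opt * math.sqrt(10)
    print(f"optimal: N={n_opt / 1e9:.0f}B, D={d_opt / 1e12:.0f}T  ->  "
          f"10x overtrained: N={n_ot / 1e9:.0f}B, D={d_ot / 1e12:.0f}T")

# Prints roughly 45T-64T optimal tokens and ~80B-120B active params with
# ~140T-200T tokens when 10x overtrained; ~150T total / 30T unique ≈ 5 epochs.
```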
GPT-4.5 being trained on fewer tokens than GPT-4o doesn’t really make sense.
How so? If it uses 3x more compute but isn't 10x overtrained, that means less data (with multiple epochs, it would probably use exactly the same unique data, repeated a bit less). The video presentation on GPT-4.5 mentioned work on lower precision in pretraining, so it might even be a 6e26 FLOPs model (though a priori it would be surprising if the first foray into this scale weren't taken at the more conservative BF16). Even then it would still be less data: compute-optimal tokens scale with the square root of compute, and sqrt(6) ≈ 2.45x is less than the ~3x extra data that 10x overtraining gives. Overtraining has a large effect on both the number of active parameters and the number of tokens needed, at a relatively minor cost in effective compute, which makes it very salient for production models.
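To make the comparison concrete, here's a short sketch under the same assumptions as above (Chinchilla-style scaling where compute-optimal tokens grow as sqrt(compute), GPT-4o treated as ~10x overtrained at some baseline compute, GPT-4.5 treated as roughly compute-optimal at 3x or 6x that compute; the specific multiples are the ones discussed in this thread, not confirmed figures):

```python
import math

def relative_tokens(compute_multiple, overtraining=1.0):
    # Tokens relative to the compute-optimal count at 1x compute:
    # D_opt scales as sqrt(C), and k-times overtraining multiplies tokens by sqrt(k).
    return math.sqrt(compute_multiple) * math.sqrt(overtraining)

print(f"GPT-4o-like  (1x compute, 10x overtrained): {relative_tokens(1, 10):.2f}x")
print(f"GPT-4.5-like (3x compute, compute optimal): {relative_tokens(3):.2f}x")
print(f"GPT-4.5-like (6x compute, compute optimal): {relative_tokens(6):.2f}x")
# Prints ~3.16x, ~1.73x, ~2.45x: even at 6x compute, a compute-optimal model
# uses fewer tokens than a 10x overtrained model at 1x compute.
```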