Wonderful to get more numbers on this!
These examples seem to contradict note 2 where D/N falls for larger C. Now I’m not sure what the trend should be.
It feels like you could derive a rule of thumb based on the loss and the entropy of the dataset, e.g. “If my model starts at a loss of 4 bits/token and the asymptote is 2 bits/token, I need X tokens of data to fully specify a model with Y bits stored in the parameters.”
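One crude way to cash this out (a sketch, assuming each training token conveys at most (starting loss - asymptote) bits of information about the parameters, so X ≈ Y / (starting loss - asymptote)):

```python
# Crude sketch of the rule of thumb above. Assumption (mine, not established):
# each training token conveys at most (initial_loss - asymptote) bits about the
# parameters, so pinning down Y bits stored in the weights takes roughly
# Y / (initial_loss - asymptote) tokens.

def tokens_to_specify(initial_loss_bits, asymptote_bits, param_bits):
    """Rough token count needed to specify `param_bits` bits of model."""
    bits_learned_per_token = initial_loss_bits - asymptote_bits
    return param_bits / bits_learned_per_token

# Example numbers from above: loss falls from 4 to 2 bits/token; suppose
# (hypothetically) the model stores ~2 bits per parameter across 1e12 params.
print(f"{tokens_to_specify(4.0, 2.0, param_bits=2.0 * 1e12):.1e} tokens")  # ~1.0e12
```

This ignores that the per-token information gain shrinks as the loss approaches the asymptote, so X is more of a lower bound than an estimate.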
For scaling to larger training systems, the trend in D/N is probably increasing, since larger datasets have lower quality, and soon repetition in training will become necessary, lowering quality per trained-on token. Also, MoE is a large compute multiplier (3x-6x, Figure 11 in the above MoE scaling paper), so it’s not going to be ignored if at all possible. There are other studies that show a decreasing trend, but this probably won’t hold up in practice as we get to 250T and then 750T tokens within a few years, even for a dense model.
For 1:32 MoE at 5e28 FLOPs (the 5 GW, $150bn training systems of 2028), we get maybe 700 tokens/param optimal (counting the effects of sparsity, repetition, and more compute), so that’s 3.5T active and 110T total params trained for 2.5e15 tokens (maybe 80T unique tokens repeated 30 times). Not sure whether this kind of total param count can be made to work.
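For reference, the back-of-the-envelope arithmetic behind those figures, assuming the usual C ≈ 6·N·D FLOP estimate applies to the active params of an MoE:

```python
# Check of the numbers above. Assumptions: C ≈ 6 * n_active * D holds for the
# active params of an MoE, and 700 tokens/active-param is the optimal ratio.

C = 5e28          # training FLOPs
ratio = 700       # assumed optimal tokens per active param
sparsity = 32     # 1:32 MoE, so total params = 32 * active params

n_active = (C / (6 * ratio)) ** 0.5   # from C = 6 * n_active * (ratio * n_active)
n_total = sparsity * n_active
tokens = ratio * n_active

print(f"active params ~ {n_active:.2e}")                # ~3.4e12 (≈3.5T)
print(f"total params  ~ {n_total:.2e}")                 # ~1.1e14 (≈110T)
print(f"tokens        ~ {tokens:.2e}")                  # ~2.4e15 (≈2.5e15)
print(f"unique tokens at 30 epochs ~ {tokens/30:.2e}")  # ~8.0e13 (≈80T)
```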