An MoE transformer can reach the same loss as a compute optimal dense model using 3x-6x less compute, but will need the same amount of data to do it. So compute optimal MoEs don’t improve data efficiency and don’t contribute to mitigating data scarcity.
A new Jan 2025 paper offers straightforward compute multiplier results comparing dense transformers to MoEs at various levels of sparsity, with isoFLOP curves for various tokens/parameter ratios, using experiments of up to 1e21 FLOPs per datapoint. Compute multiplier results are in Figure 11: about a 3x compute multiplier for an 87% (1:8) sparse MoE over dense, and about 6x-7x for a 97% (1:32) sparse MoE (the same sparsity as DeepSeek-V3).
But there’s a catch. Greater sparsity makes it compute optimal to use fewer active parameters, and therefore more data (when training with the same compute). This can be seen on the isoFLOP plots in Figure 12, left. As sparsity goes from 0% (dense) to 95% (1:20), the compute optimal number of active parameters for their 1e21 FLOPs experiments goes from 2.9B to 1.3B. For 97% (1:32) sparsity, interpolating from experiments at the other compute budgets, the reduction in active parameters relative to dense seems to be about 2.5x. Keeping compute unchanged, 2.5x fewer active parameters means 2.5x more data, which together is about a 6x greater tokens/parameter ratio for a compute optimal training run.
Thus a dense model can be replaced with a 97% sparse MoE model trained using 6x less compute that will achieve the same perplexity, but the tokens/parameter ratio of this MoE model will be 6x greater than for the original dense model. If the ratio didn’t change, reducing compute 6x would cut both data and active parameters by about 2.5x (the square root of 6), but since the ratio does change, in actuality only the number of active parameters goes down 6x, while the number of tokens stays the same.
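A quick sketch of that accounting, using the standard C ≈ 6ND approximation and the rounded 6x/2.5x figures from above (so the numbers are illustrative, not taken from the paper’s fits):

```python
# Sketch of the accounting above, using the standard C ~ 6*N*D approximation.
# All numbers are the rounded figures from the text, not values from the paper.

# Dense compute optimal baseline (normalized units).
N_dense, D_dense = 1.0, 1.0        # active params, tokens
C_dense = 6 * N_dense * D_dense    # training compute

# 97% sparse MoE reaching the same loss with ~6x less compute.
C_moe = C_dense / 6

# If the tokens/param ratio stayed fixed, N and D would each shrink by sqrt(6) ~ 2.5x.
shrink_if_ratio_fixed = (C_dense / C_moe) ** 0.5    # ~2.45

# But the compute optimal ratio instead grows ~6x, so the whole reduction lands on
# active params while the token count stays the same.
N_moe = N_dense / 6
D_moe = C_moe / (6 * N_moe)        # equals D_dense

print(shrink_if_ratio_fixed)                    # ~2.45
print(D_moe / D_dense)                          # 1.0: same amount of data
print((D_moe / N_moe) / (D_dense / N_dense))    # 6.0: tokens/param ratio up 6x
```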
Let’s take Llama-3-405B as an example, which is a 405B parameter compute optimal model trained for 15T tokens at 40 tokens/parameter, using 4e25 FLOPs. An equivalent 97% sparse model will have 70B active parameters, 2T total parameters, and will need to be trained for the same 15T tokens to reach the same perplexity/loss at 220 tokens/parameter, using 6e24 FLOPs. (Which is close to DeepSeek-V3’s 4e24-5e24 FLOPs actually, so anchoring to Llama-3-405B might be a good way of framing its compute efficiency.)
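The Llama-3-405B numbers check out under the same C ≈ 6ND approximation (the 1/6 cut in active params and the 1:32 active-to-total ratio are the rounded assumptions from above):

```python
# Worked numbers for the Llama-3-405B example, again with C ~ 6*N*D.
D = 15e12                        # training tokens
N_dense = 405e9                  # dense params
print(6 * N_dense * D)           # ~3.6e25 FLOPs (quoted as ~4e25)
print(D / N_dense)               # ~37 tokens/param (quoted as ~40)

N_moe = N_dense / 6              # ~68B active params (quoted as ~70B)
print(N_moe * 32)                # ~2.2T total params at 1:32 sparsity (quoted as ~2T)
print(6 * N_moe * D)             # ~6e24 FLOPs
print(D / N_moe)                 # ~220 tokens/param
```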
So compute optimal MoEs don’t improve data efficiency and don’t contribute to mitigating data scarcity.
I agree compute optimal MoEs don’t improve data utilization. But naively, you might expect that MoEs can be used to reduce issues with data scarcity at a fixed level of compute by training a much bigger model on a fixed amount of data.
As in, because there are returns to both more data and bigger models, you can use MoE to effectively train a much bigger model at the same compute.
Like, maybe you would have trained Llama-3-405B on 15T tokens. You could instead train an 8 trillion parameter MoE with 400B active params on 15T tokens, and a priori this could perform much better on that same amount of data. (In practice an MoE with X active params is more expensive to train than a dense model with X active params, so you might need to reduce active params somewhat.)
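For concreteness, here’s the rough accounting for that alternative, under the same C ≈ 6ND approximation and ignoring MoE overhead (which is the reason you might shave active params in practice):

```python
# Back-of-envelope for the hypothetical 8T-total / 400B-active MoE on 15T tokens,
# ignoring MoE routing/communication overhead.
D = 15e12                          # tokens, same data budget as Llama-3-405B
N_active = 400e9                   # active params, roughly matching the dense model
N_total = 8e12                     # total params of the hypothetical MoE
print(1 - N_active / N_total)      # 0.95: 95% sparsity (1:20)
print(6 * N_active * D)            # ~3.6e25 FLOPs, about the same as the dense run
```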
Chinchilla scaling shows that the tokens/params ratio for compute optimal models only changes slowly with compute, making it a good anchor to frame other things in terms of. The experiments from this MoE scaling paper show that at a fixed amount of data, varying sparsity while keeping the model compute optimal for that amount of data preserves perplexity. This also seems like a nice principle for framing the way compute optimal models sit in the space of hyperparameters.
With infinite data, isoFLOPs for loss as a function of the number of active params are parabolas with some minimum point. But with finite data, training with fewer active params (at the same compute) means more tokens, so you need to repeat the data, which damages loss. This moves the minima of the isoFLOPs to the right if those minima already required 5x repetition or more. So under data scarcity, compute optimal models have more active params than under infinite data, and the effect gets worse with more compute. This way we maintain the framing of searching for compute optimal hyperparameters, rather than framing it as undertraining.
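A toy sketch of that effect (not the paper’s fit; the Chinchilla-style constants and the diminishing-value-of-repeated-data curve are made up purely to show the direction in which the minima move):

```python
# Toy illustration: a Chinchilla-style loss L(N, D) = A/N^alpha + B/D_eff^beta on a
# fixed isoFLOP, where repeated epochs contribute diminishing "effective" tokens.
# All constants are made up for illustration, not fitted to the paper or to real runs.
import numpy as np

A, B, alpha, beta = 400.0, 400.0, 0.34, 0.28   # made-up Chinchilla-style constants
C = 1e20                                       # compute budget (FLOPs)
D_unique = 2e9                                 # unique tokens available

def effective_tokens(D):
    # Each additional epoch over the same 2B tokens counts for less.
    epochs = D / D_unique
    r = 0.35                                   # assumed decay rate (arbitrary)
    return D_unique * (1 - np.exp(-r * epochs)) / (1 - np.exp(-r))

def loss(N, data_limited):
    D = C / (6 * N)                            # isoFLOP constraint
    D_eff = effective_tokens(D) if data_limited else D
    return A / N**alpha + B / D_eff**beta

Ns = np.logspace(8.2, 9.8, 400)                # sweep active params
best_infinite = Ns[np.argmin([loss(N, False) for N in Ns])]
best_finite = Ns[np.argmin([loss(N, True) for N in Ns])]
print(best_infinite, best_finite)              # finite-data minimum sits at larger N
```

In this toy the finite-data minimum lands at noticeably more active params than the infinite-data one, which is the direction of the shift described above.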
Now consider the 1e20 FLOPs plot in Figure 12, left. If there’s only 2B tokens of training data and no more, all the minima already ask for 12-31 epochs, so the distortion that increases loss will move the minima to the right (and up), and will move the high sparsity minima further than the lower sparsity minima compared to their original (infinite data) locations. The way the isoFLOPs are shaped suggests that 90-95% sparsity might turn out to be optimal here, that is, you can only get worse loss with 98+% sparsity at 1e20 FLOPs, however you vary the number of epochs and active params! This seems counterintuitive, since in the infinite data regime more sparsity only makes things better (if we ignore practical difficulties). But sure, 90% sparsity will still be better than dense, at least until we use even more compute and the sparser minima start asking for even more epochs.
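For reference, the epoch counts follow directly from the isoFLOP constraint; a back-of-envelope with my rough reading of where the Figure 12 minima sit (the active-param range is an assumption, not an exact value from the paper):

```python
# Epochs implied by a 1e20 FLOPs budget over 2B unique tokens, using D ~ C/(6*N_active).
# The active-param range is my rough reading of the Figure 12 minima, not exact values.
C = 1e20
unique_tokens = 2e9
for N_active in (0.3e9, 0.7e9):                  # assumed span of the minima
    D = C / (6 * N_active)
    print(N_active, D / 1e9, D / unique_tokens)  # ~24-56B tokens, roughly 12-28 epochs
```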
The way the isoFLOPs are shaped suggests that 90-95% sparsity might turn out to be optimal here, that is, you can only get worse loss with 98+% sparsity at 1e20 FLOPs, however you vary the number of epochs and active params!
I’m currently skeptical, and more minimally, I don’t understand the argument you’re making. Probably not worth getting into.
I do think there will be a limit to how sparse you want to go, even in the regime of very high compute relative to data, for various reasons (computational if nothing else). I don’t see how these graphs support 90-95% sparsity, but I had a hard time understanding your argument.
Regardless, I don’t think this argues against my claim; I’m not sure if you were trying to argue against the claim I was making or just add context. (Insofar as your argument is true, it does limit the returns from MoE in the regime with little data.)
With 90% sparsity you do get better loss than dense, and this is sufficient to broadly carry your argument. But with 98% sparsity (your Llama-3-405B variant example has 95% sparsity) you might get worse loss than with 90% when data is scarce, though it’ll still be better than dense. The principle about MoE damaging data efficiency (increasing the optimal tokens/param ratio) hints that this might be the case even before looking at the experiments.
Even if it’s the same cost to train, wouldn’t it still be a win if inference is a significant part of your compute budget?