Given the uncertainty surrounding the scaling effects of MoE training, we leave detailed investigation to future work. Where applicable, we approximate MoE efficiency gains as 2×.
There are isoFLOPs for MoE models at different levels of sparsity, including dense transformers trained on the same dataset, in this Jan 2025 paper; see Figure 12, left (in Appendix D.4). Figure 11 estimates compute multipliers given compute optimal amounts of data. My takeaway is that 1:8 sparsity gives a 3x compute multiplier compared to dense, but requires 3x more tokens per active param for compute optimality; similarly, 1:32 sparsity gives a 6x compute multiplier, again at the cost of 6x more tokens per active param than would be compute optimal for a dense transformer.
Given how different the compute optimal amounts of data are between MoE and dense, and the tendency to wildly overtrain small models, literature that doesn't actually study the isoFLOPs (and only trains a dense or MoE counterpart as an afterthought to whatever the paper is mainly about) will underestimate the difference. The gap only appears in full when the dense transformer and the MoE transformer at the same FLOP budget each get their own appropriate (and very different) number of tokens per active param, and correspondingly different numbers of active params.
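As a concrete illustration of why matched-token (or matched-param) comparisons understate the gap, here is a minimal sketch of the arithmetic. It assumes the standard C ≈ 6·N·D cost model, a Chinchilla-style ~20 tokens per param for the dense optimum, and the 3x/6x tokens-per-active-param multipliers from the takeaway above; the specific numbers are illustrative assumptions, not the paper's fits.

```python
# Sketch: compute-optimal allocation at a fixed FLOP budget, dense vs MoE.
# Assumptions (for illustration only, not the paper's fitted constants):
#   - training cost C ~= 6 * N_active * D   (standard approximation)
#   - dense compute-optimal ratio ~20 tokens per param (Chinchilla-style)
#   - MoE at a given sparsity wants k times more tokens per *active* param
#     (k ~= 3 for 1:8 sparsity, k ~= 6 for 1:32, per the takeaway above)

def optimal_allocation(flops, tokens_per_param):
    """Split a FLOP budget into (active params, tokens) with D = ratio * N and C = 6*N*D."""
    n = (flops / (6 * tokens_per_param)) ** 0.5  # active params
    d = tokens_per_param * n                     # training tokens
    return n, d

C = 1e22  # example training budget in FLOPs

for name, ratio in [("dense", 20), ("MoE 1:8", 20 * 3), ("MoE 1:32", 20 * 6)]:
    n, d = optimal_allocation(C, ratio)
    print(f"{name:9s} ~{n/1e9:5.1f}B active params, ~{d/1e12:5.2f}T tokens")

# At the same FLOPs, the MoE wants sqrt(3)x (or sqrt(6)x) more tokens and
# sqrt(3)x (or sqrt(6)x) fewer active params than the dense model, so a
# comparison that fixes the token count (or the active param count) for
# both sits far from the MoE's compute-optimal point and understates it.
```

The 3x/6x compute multipliers themselves come on top of this; the sketch only shows how differently the two models should be provisioned before the quality comparison is even fair.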