Using 100x more compute has shown discontinuous changes so far; on a log scale, 10x is half of such a step and 3x is a quarter. Grok 3 was trained at the scale of 100K H100s, and clusters of 20K H100s have been around since summer 2023, so some current models are likely trained with merely 3x less compute than Grok 3. Also, if Gemini 2.0 Ultra was never in the cards (whether it failed or was never planned), then Pro got the bulk of the 2.0 compute, which is plausibly about 6e26 FLOPs, 2x the Grok 3 compute.
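A back-of-the-envelope sketch of that cluster arithmetic, with every input (peak throughput, utilization, training duration) an assumption rather than a reported figure:

```python
# Back-of-the-envelope training compute estimate (all inputs are assumptions).
def training_flops(num_gpus: int, peak_flops: float, mfu: float, days: float) -> float:
    """Total training FLOPs = GPUs * peak FLOP/s * utilization * seconds."""
    return num_gpus * peak_flops * mfu * days * 86_400

H100_BF16 = 1e15  # ~1e15 FLOP/s dense BF16 peak per H100 (approximate)

# Hypothetical Grok-3-scale run: 100K H100s, ~3 months, ~40% utilization.
grok3_scale = training_flops(100_000, H100_BF16, 0.4, 90)   # ~3.1e26 FLOPs

# Hypothetical 20K-H100 run trained somewhat longer, ~5 months.
smaller_run = training_flops(20_000, H100_BF16, 0.4, 150)   # ~1.0e26 FLOPs

print(f"{grok3_scale:.1e} vs {smaller_run:.1e}: "
      f"{grok3_scale / smaller_run:.1f}x apart")  # ~3x
```

Under these assumed numbers, a 20K-H100 cluster running somewhat longer lands about 3x below the Grok 3 scale, consistent with the estimate above.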
My sense is that a difference of 3x is less significant than the post-training or obscure pretraining compute multipliers that can differ between contemporary models, and only a difference of 10x is usually noticeable (though it can still be overcome with much better methods, especially at smaller scale). I think most compute multipliers from better data mixes and algorithms don’t really work for improving general intelligence (especially those demonstrated in terms of benchmark performance rather than perplexity), or don’t scale to much more compute (and therefore data), so raw compute remains a crucial anchor of capability. A 100x change in raw compute is likely to remain the single most important factor explaining differences in capability.
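To make the multiplier-versus-raw-compute point concrete, here is a minimal sketch with illustrative numbers (the 3x methods multiplier and the FLOP counts are assumptions chosen to match the ratios discussed above, not measurements):

```python
import math

# Effective compute = raw pretraining compute * compute multiplier from methods.
# All numbers below are illustrative assumptions, not measurements.
def effective_compute(raw_flops: float, multiplier: float) -> float:
    return raw_flops * multiplier

frontier   = effective_compute(3e26, 1.0)  # frontier-scale run, baseline methods
behind_3x  = effective_compute(1e26, 3.0)  # 3x less raw compute, 3x better methods
behind_10x = effective_compute(3e25, 3.0)  # 10x less raw compute, same methods edge

print(behind_3x >= frontier)    # True: a 3x raw-compute gap can be erased
print(behind_10x >= frontier)   # False: a 10x raw-compute gap usually survives

# On the log scale of a 100x step, 10x is half a step and 3x is about a quarter.
print(math.log10(10) / math.log10(100))  # 0.5
print(math.log10(3) / math.log10(100))   # ~0.24
```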
MoEs were recently shown to offer a 3x compute multiplier at 1:8 sparsity (as rumored for original GPT-4) compared to dense models (like Llama-3-405B), and a 6x multiplier at 1:32 sparsity (as in DeepSeek-V3). I think these multipliers are real and describe scaling of general intelligence. For example, the raw compute of DeepSeek-V3 is about 4e24 FLOPs, which corresponds to an effective compute of about 2.5e25 FLOPs for a dense model, merely 1.5x less than the 4e25 FLOPs of Llama-3-405B. And the raw compute of original GPT-4 is rumored to be 2e25 FLOPs, which corresponds to 6e25 FLOPs for a dense model, 1.5x more than Llama-3-405B. Across this range, DeepSeek-V3 still manages to win out.
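Making that arithmetic explicit (the raw FLOP figures are the rough and rumored estimates from the paragraph above, and the multipliers are the 3x and 6x just cited, so treat everything as approximate):

```python
# Dense-equivalent ("effective") compute from MoE sparsity multipliers.
# Raw FLOPs are rough/rumored estimates; multipliers are the cited 3x (1:8)
# and 6x (1:32) figures, so all outputs are approximate.
MOE_MULTIPLIER = {"1:8": 3.0, "1:32": 6.0}  # sparsity -> multiplier vs dense

deepseek_v3 = 4e24 * MOE_MULTIPLIER["1:32"]  # ~2.4e25 dense-equivalent FLOPs
gpt4_orig   = 2e25 * MOE_MULTIPLIER["1:8"]   # ~6e25 dense-equivalent FLOPs
llama3_405b = 4e25                           # dense, so raw == effective

print(f"DeepSeek-V3 ~{deepseek_v3:.1e}, Llama-3-405B {llama3_405b:.1e}, "
      f"GPT-4 ~{gpt4_orig:.1e}")
# DeepSeek-V3 lands roughly 1.5x below Llama-3-405B in effective compute,
# original GPT-4 roughly 1.5x above it.
```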