The most important thing about Llama 4 is that the promised 100K H100s training run got canceled: its flagship model Behemoth will be a 5e25 FLOPs compute optimal model[1] rather than the ~3e26 FLOPs model that a 100K H100s training system should be able to produce. That is merely ~35% more compute than Llama-3-405B from last year, while GPT-4.5, Grok 3, and Gemini 2.5 Pro are probably around 3e26 FLOPs or a bit more. Meta even explicitly mentions that Behemoth was trained on 32K GPUs (which must be H100s). Since Behemoth is the flagship model, a bigger model got pushed back to Llama 5, which will only arrive much later, possibly not even this year.
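As a rough sanity check on the ~3e26 figure, here is a back-of-the-envelope estimate of what a 100K H100 cluster can produce. The per-GPU throughput (~1e15 dense BF16 FLOP/s), ~40% MFU, and run durations are my illustrative assumptions, not numbers from the release.

```python
# Back-of-the-envelope training compute for a GPU cluster.
# All inputs are assumptions for illustration, not figures reported for Llama 4.
def cluster_flops(num_gpus, days, peak_flops_per_gpu=1e15, mfu=0.40):
    """Total training FLOPs: GPUs * seconds * peak throughput * utilization."""
    seconds = days * 24 * 3600
    return num_gpus * seconds * peak_flops_per_gpu * mfu

# 100K H100s for ~3 months at ~40% MFU lands around 3e26 FLOPs.
print(f"100K H100s, 90 days: {cluster_flops(100_000, 90):.1e} FLOPs")
# At the same assumed utilization, a 32K-GPU run reaches ~5e25 FLOPs
# in roughly a month and a half.
print(f" 32K H100s, 47 days: {cluster_flops(32_000, 47):.1e} FLOPs")
```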
In contrast, Maverick's capabilities are unsurprising and prompt no updates. It's merely a 2e24 FLOPs, ~7x overtrained model[2], about 2x less compute than DeepSeek-V3 and 100x less than the recent frontier models, and it's not a reasoning model for now. So of course it's not very good. If it were very good with this little compute, that would be a feat on the level of Anthropic or DeepSeek, and a positive update about Meta's model training competence; but this unexpected thing simply didn't happen, so there is nothing to see here, and it's unclear what people are even surprised about (except some benchmarking shenanigans).
To the extent Llamas 1-3 were important open-weights releases that normal people could run locally, Llama 4 does seem disappointing, because there are no small models (in total params), though as Llama 3.2 demonstrated, this might change shortly. Even the smallest model, Scout, still has 109B total params, meaning a 4-bit quantized version might just fit on high-end consumer hardware, but all the rest is only practical on datacenter hardware.
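A quick estimate of why 109B total params sits right at the edge of consumer hardware. The bytes-per-parameter figures are the only inputs; KV cache and activation overhead are ignored, so treat this as a lower bound.

```python
# Rough memory footprint of quantized model weights (illustrative only).
def weight_memory_gb(total_params, bits_per_param):
    """Memory for weights alone, ignoring KV cache and activation overhead."""
    return total_params * bits_per_param / 8 / 1e9

scout_params = 109e9
print(f"Scout at 4-bit:  ~{weight_memory_gb(scout_params, 4):.0f} GB")   # ~55 GB
print(f"Scout at 16-bit: ~{weight_memory_gb(scout_params, 16):.0f} GB")  # ~218 GB
# ~55 GB of weights (plus cache) just about fits in 64-96 GB of unified memory
# or a multi-GPU consumer setup; Maverick's 400B and Behemoth's 2T total params
# do not.
```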
288B active params and 30T training tokens give 5.2e25 FLOPs by 6ND. At 1:8 sparsity (2T total params, maybe ~250B active params within experts), data efficiency is about 3x lower than for a dense model, and for Llama-3-405B the compute optimal amount of data was 40 tokens per param. This means that about 120 tokens per param would be optimal for Behemoth, and in fact it has 104 tokens per active param, so it's not overtrained.
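A minimal check of the Behemoth arithmetic above, taking the 3x data-efficiency penalty at this sparsity and the 40 tokens/param dense optimum as given.

```python
# Behemoth footnote arithmetic (C ~= 6*N*D with N = active params).
active_params = 288e9
tokens = 30e12
print(f"compute:           {6 * active_params * tokens:.1e} FLOPs")        # ~5.2e25
print(f"tokens per param:  {tokens / active_params:.0f}")                  # ~104

dense_optimal = 40      # tokens/param, from Llama-3-405B
sparsity_penalty = 3    # assumed data-efficiency loss at ~1:8 sparsity
print(f"estimated optimum: {dense_optimal * sparsity_penalty} tokens/param")  # 120
# 104 tokens/param is below the ~120 estimate, so Behemoth is not overtrained.
```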
17B active params and 22T tokens give 2.25e24 FLOPs by 6ND, and 1300 tokens per active param. It's a weird mix of dense and MoE, so the degree of its sparsity probably doesn't map cleanly onto measurements for pure MoE, but at ~1:23 sparsity (from 400B total params) it might be ~5x less data efficient than dense, predicting ~200 tokens per param as compute optimal, which means 1300 tokens per param gives ~7x overtraining.
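And the same check for the Maverick footnote; the 5x data-efficiency penalty at ~1:23 sparsity is the rough guess stated above, carried through unchanged.

```python
# Maverick footnote arithmetic (C ~= 6*N*D with N = active params).
active_params = 17e9
tokens = 22e12
print(f"compute:          {6 * active_params * tokens:.2e} FLOPs")   # ~2.2e24
tokens_per_param = tokens / active_params
print(f"tokens per param: {tokens_per_param:.0f}")                   # ~1300

optimal = 40 * 5   # dense optimum * assumed penalty at ~1:23 sparsity = 200
print(f"overtraining:     ~{tokens_per_param / optimal:.1f}x")       # ~6.5x, i.e. ~7x
```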