Your point is one of the clues I mentioned that I don’t see as comparably strong to the May 2023 paper, when it comes to predicting loss/perplexity. The framing of your argument appeals to things other than the low-level metric of loss, so I opened my reply by focusing on that metric rather than the more nebulous things that are actually important in practice. Scaling laws work best with loss (holding across many OOMs of compute), and repeating data 3x rather than 7x (where loss first starts noticeably degrading) gives some margin of error. That is, a theoretical argument along the lines of what you are saying shifts my expectation for 10x-20x repetition (which might degrade faster when working with lower quality data), but not yet for 3x repetition (which I still expect to yield ~unchanged loss).
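For concreteness, here is a rough sketch of the effective-data calculation behind that margin-of-error claim. The saturating-exponential form follows my recollection of the May 2023 data-constrained scaling paper; the decay constant is my own assumption, so treat the exact percentages as illustrative rather than as the paper's fitted values.

```python
import math

def effective_data(unique_tokens, epochs, r_star=15.4):
    """Effective data under repetition, following a saturating-exponential
    form from the data-constrained scaling literature.  r_star (how quickly
    repeated tokens lose value) is an assumed constant, not a quoted fit."""
    repeats = epochs - 1  # repetitions beyond the first pass
    return unique_tokens * (1 + r_star * (1 - math.exp(-repeats / r_star)))

for epochs in (3, 7, 10, 20):
    eff = effective_data(1.0, epochs)
    print(f"{epochs}x repetition: ~{eff:.2f}x unique data, "
          f"i.e. ~{100 * eff / epochs:.0f}% of the value of fully fresh data")
```

Under these assumptions, 3x repetition is nearly indistinguishable from fresh data, while the discount only becomes large in the 10x-20x range.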
Also, check out https://www.reddit.com/r/LocalLLaMA; they are very disappointed by how bad the released models turned out to be (yeah, I know that’s not directly indicative of Behemoth performance).
So far I haven’t even seen anyone there notice that Behemoth means Llama 4 was essentially canceled and that instead we got some sort of Llama 3.5 MoE. That is, the 100K+ H100 training run that was the expected and announced crown jewel of Llama 4 won’t be coming out, probably until at least late 2025 and possibly even 2026. Since Behemoth is the flagship model of Llama 4, a 3e26+ FLOPs model that would’ve been appropriate for a 100K-H100 training system instead got pushed back to Llama 5.
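As a sanity check on the 3e26+ figure, here is the usual back-of-envelope for what a 100K-H100 cluster can do over a few months; the utilization and run-length numbers are my own assumptions, not anything Meta has published.

```python
# Rough compute budget of a 100K-H100 cluster (assumed numbers).
gpus = 100_000
peak_flops = 1e15        # ~1e15 dense BF16 FLOP/s per H100 (approx.)
utilization = 0.4        # assumed model FLOPs utilization
seconds = 90 * 86_400    # assumed ~3-month training run

total = gpus * peak_flops * utilization * seconds
print(f"~{total:.1e} FLOPs")  # ~3.1e26 FLOPs
```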
As Behemoth is only a 5e25 FLOPs model, even once it comes out it won’t be competing in the same weight class as GPT-4.5, Grok 3, and Gemini 2.5 Pro. Maverick is only a 2e24 FLOPs[1] model (2x less than DeepSeek-V3, ~100x less than the recent frontier models), so of course it’s not very good compared to the frontier models. Since Meta hasn’t so far demonstrated competence on the level of DeepSeek or Anthropic, they do need big compute to remain in the game, and Maverick is certainly not big compute.
(LocalLLaMA specifically is annoyed by the absence of models with a small number of total parameters in the current Llama 4 announcement, which means you need high-end consumer hardware to run even Scout locally in 4-bit quantization, and datacenter hardware for the rest.)
[1] It’s a 17B active parameter model trained for 22T tokens.
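For reference, the 2e24 figure follows from the standard ~6*N*D training-compute approximation applied to those numbers; the Behemoth line below extrapolates from the reported ~288B active parameters with an assumed token count, not a published figure.

```python
def train_flops(active_params, tokens):
    """Standard ~6*N*D dense-training compute approximation."""
    return 6 * active_params * tokens

# Maverick: 17B active parameters, 22T tokens (per the footnote above).
print(f"Maverick: ~{train_flops(17e9, 22e12):.1e} FLOPs")   # ~2.2e24

# Behemoth: ~288B active parameters (reported), ~30T tokens (my assumption).
print(f"Behemoth: ~{train_flops(288e9, 30e12):.1e} FLOPs")  # ~5.2e25
```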