The loss goes down; whether that helps in some more legible way that also happens to be impactful is much harder to figure out. The experiments in the May 2023 paper show that training on some dataset and training on a random quarter of that dataset repeated 4 times result in approximately the same loss (Figure 4). Even 15 repetitions remain useful, though at that point somewhat less useful than 15 times more unique data. There is also some sort of double descent where loss starts getting better again after hundreds of repetitions (Figure 9 in Appendix D).
This strongly suggests that repeating merely 3 times will robustly be about as useful as having 3 times more data from the same distribution. I don’t know of comparably strong clues that would change this expectation.
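As a rough illustration of how I read that result, here is a minimal sketch assuming the paper’s effective-data fit D' = U_D + U_D·R_D*·(1 − exp(−R_D/R_D*)) with R_D* ≈ 15 (the constant quoted further down; the paper’s exact fitted value differs slightly), where R_D is the number of repetitions beyond the first pass. The printed numbers are my back-of-the-envelope check, not figures from the paper:

```python
import math

R_D_STAR = 15.0  # approximate "half-life" of repeated-token value, per the fit quoted below

def effective_epochs(total_epochs: float, r_d_star: float = R_D_STAR) -> float:
    """Effective number of unique-data epochs when the same data is seen
    `total_epochs` times: D'/U_D = 1 + R_D* * (1 - exp(-R_D / R_D*)),
    with R_D = total_epochs - 1 repetitions beyond the first pass."""
    repeats = total_epochs - 1
    return 1.0 + r_d_star * (1.0 - math.exp(-repeats / r_d_star))

for epochs in (3, 4, 15):
    eff = effective_epochs(epochs)
    print(f"{epochs:>2} epochs ~ {eff:4.1f} epochs' worth of unique data "
          f"({100 * eff / epochs:.0f}% of nominal)")
```

On this fit, 4 epochs retain ~93% of the value of 4x unique data and 15 epochs ~67%, which is how I read “approximately the same loss” and “somewhat less useful” above (the double descent at hundreds of repetitions is not captured by this simple form).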
I’m afraid you might have missed the core thesis of my comment, so let me reword it. I’m arguing that one should not extrapolate findings from that paper to what Meta is training now.
The Llama 4 model card (https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md) says the herd was trained on “[a] mix of publicly available, licensed data and information from Meta’s products and services. This includes publicly shared posts from Instagram and Facebook and people’s interactions with Meta AI”. To use a term from information theory, these posts probably have much lower factual density than the curated web text in C4. There’s no public information on how fast the loss goes down even on the first epoch of this kind of data, let alone on several.
I generated a slightly more structured write-up of my argument and edited it manually; I hope it will be useful.
Let’s break down the extrapolation challenge:
Scale Difference:
Muennighoff et al.: Studied unique data budgets up to 178 billion tokens and total processed tokens up to 900 billion. Their models were up to 9 billion parameters.
Llama 4 Behemoth: Reportedly trained on >30 trillion tokens (>30,000 billion). The model has 2 trillion total parameters (~288B active).
The Gap: We’re talking about extrapolating findings from a regime with ~170x fewer unique tokens (178B vs. 30T) and models ~30x smaller (by active parameters); see the quick calculation after this breakdown. While scaling laws can be powerful, extrapolating across more than two orders of magnitude in data scale carries inherent risk: new phenomena or different decay rates for repeated data could emerge.
Data Composition and Quality:
Muennighoff et al.: Used C4 (filtered web crawl) and OSCAR (less filtered web crawl), plus Python code. They found filtering was more beneficial for the noisier OSCAR.
Llama 4 Behemoth: The >30T tokens include a vast amount of web data, code, books, etc., but likely also a massive proportion of public Facebook and Instagram data.
The Issue: Social media data has different characteristics: shorter texts, different conversational styles, potentially more repetition/near-duplicates, different types of noise, and potentially lower factual density compared to curated web text or books. How the “value decay” of repeating this specific type of data behaves at the 30T scale is not something the 2023 paper could have directly measured.
Model Architecture:
Muennighoff et al.: Used dense Transformer models (GPT-2 architecture).
Llama 4 Behemoth: Is a Mixture-of-Experts (MoE) model.
The Issue: While MoE models are still Transformers, the way data interacts with specialized experts might differ from dense models when it comes to repetition. Does repeating data lead to faster overfitting within specific experts, or does the routing mechanism mitigate this differently? This interaction wasn’t studied in the 2023 paper.
Conclusion: Directly applying the quantitative findings (e.g., “up to 4 epochs is fine”, R_D* ≈ 15) to the Llama 4 Behemoth scale and potential data mix is highly speculative.
The massive scale difference is a big concern.
The potentially different nature and quality of the data (social media) could significantly alter the decay rate of repeated tokens.
MoE architecture adds another layer of uncertainty.
The “Data Wall” Concern: Even if Meta could have repeated data along the lines of the 2023 paper’s principles, they either chose not to (perhaps because internal experiments showed it wasn’t effective at their scale/data mix), or they are hitting a wall where even 30T unique tokens isn’t enough for the performance leap expected from a 2T-parameter compute-optimal model, and repetition isn’t closing the gap effectively enough.
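To make the scale gap in the first point concrete, here is a quick back-of-the-envelope check using only the figures listed above (nothing here is new data, just the ratios):

```python
import math

# Figures quoted above; the Behemoth numbers are reported, not an official breakdown.
paper_unique_tokens = 178e9     # Muennighoff et al.: largest unique-data budget
paper_params = 9e9              # Muennighoff et al.: largest model studied
behemoth_tokens = 30e12         # reportedly >30T training tokens
behemoth_active_params = 288e9  # ~288B active parameters

token_gap = behemoth_tokens / paper_unique_tokens
param_gap = behemoth_active_params / paper_params
print(f"unique-token gap: ~{token_gap:.0f}x (~{math.log10(token_gap):.1f} orders of magnitude)")
print(f"active-parameter gap: ~{param_gap:.0f}x")
```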
P.S.
Also, check out https://www.reddit.com/r/LocalLLaMA; they are very disappointed by how bad the released models turned out to be (yeah, I know that’s not directly indicative of Behemoth performance).
Your point is one of the clues I mentioned that I don’t see as comparably strong to the May 2023 paper when it comes to predicting loss/perplexity. The framing of your argument appeals to things other than the low-level metric of loss, so I opened my reply by focusing on it rather than the more nebulous things that are actually important in practice. Scaling laws work best with loss (holding across many OOMs of compute), and repeating 3x rather than 7x (where loss first starts noticeably degrading) gives some margin of error. That is, a theoretical argument along the lines of what you are saying shifts my expectation for 10x-20x repetition (which might degrade faster with lower quality data), but not yet for 3x repetition (which I still expect to yield ~unchanged loss).
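To illustrate the margin I mean, here is a toy sensitivity check on the same effective-data fit as above; treating “lower quality data decays faster” as simply a smaller R_D* is my own simplification, and the halved value is arbitrary:

```python
import math

def effective_epochs(total_epochs: float, r_d_star: float) -> float:
    """Effective unique-data epochs under repetition:
    D'/U_D = 1 + R_D* * (1 - exp(-R_D / R_D*)), R_D = total_epochs - 1."""
    repeats = total_epochs - 1
    return 1.0 + r_d_star * (1.0 - math.exp(-repeats / r_d_star))

for r_d_star in (15.0, 7.5):  # quoted fit vs. an arbitrary "faster decay" scenario
    retained_3x = effective_epochs(3, r_d_star) / 3
    retained_15x = effective_epochs(15, r_d_star) / 15
    print(f"R_D* = {r_d_star:4.1f}: 3x repetition retains {retained_3x:.0%}, "
          f"15x retains {retained_15x:.0%} of the unique-data value")
```

Even if the decay were twice as fast, 3x repetition would stay close to the value of unique data, while the 10x-20x regime degrades much more; that asymmetry is why my expectation only shifts for the higher repetition counts.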
> Also, check out https://www.reddit.com/r/LocalLLaMA; they are very disappointed by how bad the released models turned out to be (yeah, I know that’s not directly indicative of Behemoth performance).
So far I haven’t even seen anyone there notice that Behemoth means Llama 4 was essentially canceled and instead we got some sort of Llama 3.5 MoE. That is, a 100K+ H100 training run that was the expected and announced crown jewel of Llama 4 won’t be coming out, probably until at least late 2025 and possibly even 2026. Since Behemoth is the flagship model for Llama 4, a 3e26+ FLOPs model that would’ve been appropriate for a 100K-H100 training system instead got pushed back to Llama 5.
As Behemoth is only a 5e25 FLOPs model, even once it comes out it won’t be competing in the same weight class as GPT-4.5, Grok 3, and Gemini 2.5 Pro. Maverick is only a 2e24 FLOPs[1] model (2x less than DeepSeek-V3, ~100x less than the recent frontier models), so of course it’s not very good compared to them. Since Meta hasn’t so far demonstrated competence on the level of DeepSeek or Anthropic, they do need the big compute to remain in the game, and Maverick is certainly not big compute.
(LocalLLaMA specifically is annoyed by the absence of models with a small number of total parameters in the current Llama 4 announcement, which means you need high-end consumer hardware to run even Scout in 4-bit quantization locally, and datacenter hardware for the rest.)
[1] It’s a 17B active parameter model trained for 22T tokens.
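For reference, the FLOPs figures above are consistent with the standard C ≈ 6·N·D approximation (N = active parameters, D = training tokens); the factor of 6 is the usual rule of thumb, so treat this as an order-of-magnitude sketch only:

```python
def train_flops(active_params: float, tokens: float) -> float:
    """Rough pretraining compute via the common C ~ 6 * N * D approximation."""
    return 6 * active_params * tokens

# Parameter and token counts cited in the comment and footnote above.
print(f"Maverick: ~{train_flops(17e9, 22e12):.1e} FLOPs")   # ~2.2e24, i.e. the 2e24 figure
print(f"Behemoth: ~{train_flops(288e9, 30e12):.1e} FLOPs")  # ~5.2e25, i.e. the 5e25 figure
```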