My point is that a bit of scaling (like 3x) doesn’t matter much, even when at the scale of GPT-4.5 or Grok 3 it requires building a $5bn training system, while a lot of scaling (like 2000x up from the original GPT-4) is still the most important thing predictably affecting capabilities in the near term. And it’s going to arrive a little bit at a time, so it won’t be obviously impactful at any particular step, which does nothing to dispel the rumors that scaling is no longer important. It’s a rising-sea kind of thing (if you have the compute).
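To make the rising-sea framing concrete, here is a toy compounding calculation (the 3x step size and 2000x target are just the round numbers from above, not measurements of any particular training run):

```python
# Toy arithmetic for why a large cumulative scale-up can arrive "a little bit
# at a time": each generation is only a modest multiple of the last, but the
# product compounds. The 3x and 2000x figures are the round numbers from the
# paragraph above, not measured values.
per_step = 3          # each new frontier run is ~3x the previous one
target = 2000         # cumulative scale-up over the original baseline
steps, total = 0, 1
while total < target:
    total *= per_step
    steps += 1
print(f"{steps} successive {per_step}x steps give ~{total}x cumulative scaling")
# ~7 steps of 3x already exceed 2000x, so no single step looks dramatic.
```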
Long reasoning traces were always going to start working at some point, and the s1 paper illustrates that we don’t yet have evidence that R1-like training creates rather than elicits nontrivial capabilities (capabilities that couldn’t be transferred in a mere 1,000 traces). Amodei suggests that RL training can be scaled to billions of dollars, but it’s unclear whether this assumes that AIs will automate the creation of verifiable tasks. If constructing such tasks (or very good reward models) is the bottleneck, this direction of scaling can’t quickly get very far outside specialized domains like chess, where a single verifiable task (winning a game) generates endless data.
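As a rough illustration of why chess-like domains are special, here is a minimal self-play sketch (assuming the python-chess package; the random policy stands in for whatever model is being trained) where the single verifiable task of winning labels every trajectory automatically, with no human-written tasks or learned reward model:

```python
# Minimal sketch of a verifiable-reward domain: each self-play game is fresh
# training data whose reward is checked by the game rules themselves.
import random
import chess

def self_play_episode(policy):
    """Roll out one game; return the move list and a verifiable reward."""
    board = chess.Board()
    moves = []
    while not board.is_game_over():
        move = policy(board)          # any policy, e.g. the current RL model
        moves.append(move)
        board.push(move)
    result = board.result()           # "1-0", "0-1", or "1/2-1/2"
    reward = {"1-0": 1.0, "0-1": -1.0, "1/2-1/2": 0.0}[result]  # from White's side
    return moves, reward

def random_policy(board):
    # Placeholder policy; a real setup would sample from the model being trained.
    return random.choice(list(board.legal_moves))

if __name__ == "__main__":
    trajectory, reward = self_play_episode(random_policy)
    print(len(trajectory), "moves, reward for White:", reward)
```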
The quality data wall and flatlining benchmarks (with base model scaling) are about compute multipliers that depend on good data but don’t scale very far, as opposed to scalable multipliers like high-sparsity MoE. So I think these recent 4x-a-year compute multipliers mostly won’t keep working above 1e27-1e28 FLOPs, which superficially looks bad for scaling of pretraining, but won’t impact the less legible aspects of scaling token prediction (measured as perplexity on non-benchmark data) that are more important for general intelligence. There’s also the hard data wall of literally running out of text data, but being less stringent about data quality and training for multiple epochs (giving up the ephemeral compute multipliers from data quality) should keep it at bay for now.
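For a sense of scale on that hard data wall, here is a back-of-the-envelope sketch using the Chinchilla rule of thumb C ≈ 6·N·D with compute-optimal D ≈ 20·N; the 15T-token "high-quality corpus" size is my own placeholder assumption, not a figure from the text:

```python
# Back-of-the-envelope: compute-optimal token counts at 1e26-1e28 FLOPs under
# the Chinchilla heuristic C = 6*N*D with D = 20*N, compared against an
# assumed ~15T tokens of high-quality text. Rough assumptions, not measurements.
import math

def chinchilla_optimal(compute_flops):
    """Return (params, tokens) at the compute-optimal point for C = 6*N*D, D = 20*N."""
    params = math.sqrt(compute_flops / 120.0)   # since C = 6*N*(20*N) = 120*N^2
    tokens = 20.0 * params
    return params, tokens

ASSUMED_CORPUS_TOKENS = 15e12  # hypothetical high-quality text budget

for c in (1e26, 1e27, 1e28):
    n, d = chinchilla_optimal(c)
    print(f"C = {c:.0e} FLOPs -> ~{n:.1e} params, ~{d:.1e} tokens "
          f"(~{d / ASSUMED_CORPUS_TOKENS:.0f}x the assumed corpus)")
```

Under these assumptions, 1e27-1e28 FLOPs wants several-fold to roughly tenfold more tokens than the assumed high-quality corpus, which is why relaxing quality filters and repeating epochs buys time.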