Hmm, let me think step by step. First, the pretraining slowdown isn’t about GPT-4.5 in particular. It’s about the various rumors that the data wall is already being run up against. It’s possible those rumors are unfounded, but I’m currently guessing the situation is “Indeed, scaling up pretraining is going to be hard, due to lack of data; scaling up RL (and synthetic data more generally) is the future.” Also, separately, it seems that in terms of usefulness on downstream tasks, GPT-4.5 may not be that much better than smaller models… well, it’s too early to say, I guess, since they haven’t yet done all the reasoning/agency posttraining on GPT-4.5.
Idk. Maybe you are right and I should be updating based on the above. I still think the benchmarks+gaps argument works, and also, it’s taking slightly longer to get economically useful agents than I expected (though this could say more about the difficulties of building products and less about the underlying intelligence of the models; after all, RE-Bench and similar have been progressing faster than I expected).
My point is that a bit of scaling (like 3x) doesn’t matter, even though at the scale of GPT-4.5 or Grok 3 it requires building a $5bn training system, but a lot of scaling (like 2000x up from the original GPT-4) is still the most important predictable driver of capabilities in the near term. And it’s going to arrive a little bit at a time, so it won’t be obviously impactful at any particular step, doing nothing to dispel the rumors that scaling no longer matters. It’s a rising sea kind of thing (if you have the compute).
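To make the rising-sea arithmetic concrete, here’s a toy sketch (my own illustration: it assumes progress tracks orders of magnitude of training compute, and the 3x and 2000x figures are just the ones from the paragraph above, not measurements of anything):

```python
import math

# Toy model: treat "scaling progress" as orders of magnitude (OOMs) of
# training compute, since loss improvements are roughly logarithmic in compute.
small_step = 3      # a single GPT-4.5 / Grok 3 sized step over its predecessor
big_jump = 2000     # the kind of total scale-up from the original GPT-4 I have in mind

ooms_small = math.log10(small_step)   # ~0.48 OOMs
ooms_big = math.log10(big_jump)       # ~3.3 OOMs

# How many 3x steps does it take to accumulate a 2000x scale-up?
steps_needed = math.log(big_jump) / math.log(small_step)

print(f"3x step:    {ooms_small:.2f} OOMs of compute")
print(f"2000x jump: {ooms_big:.2f} OOMs of compute")
print(f"A 2000x scale-up is ~{steps_needed:.1f} consecutive 3x steps, "
      "each individually unimpressive")
```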
Long reasoning traces were always necessary to start working at some point, and the s1 paper illustrates that we don’t really have evidence yet that R1-like training creates rather than elicits nontrivial capabilities (things that wouldn’t be possible to transfer in a mere 1000 traces). Amodei is suggesting that RL training can be scaled to billions of dollars, but it’s unclear whether this assumes that AIs will automate creation of verifiable tasks. If constructing such tasks (or very good reward models) is the bottleneck, this direction of scaling can’t quickly get very far outside specialized domains like chess, where a single verifiable task (winning a game) generates endless data.
The quality data wall and flatlining benchmarks (with base model scaling) are about compute multipliers that depend on good data but don’t scale very far, as opposed to scalable multipliers like high-sparsity MoE. So I think these recent 4x-a-year compute multipliers mostly won’t work above 1e27-1e28 FLOPs, which superficially looks bad for scaling of pretraining, but won’t impact the less legible aspects of scaling token prediction (measured in perplexity on non-benchmark data) that are more important for general intelligence. There’s also the hard data wall of literally running out of text data, but being less stringent on data quality and training for multiple epochs (giving up the ephemeral compute multipliers from data quality) should keep it at bay for now.
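A toy way to put that distinction (entirely illustrative: the multiplier sizes and the 1e27 cutoff below are placeholders standing in for “data-quality multipliers stop compounding somewhere around 1e27-1e28 FLOPs”, not estimates):

```python
# Toy model: "effective compute" = raw FLOPs x algorithmic multipliers.
# Assumption (mine, for illustration): multipliers that depend on curated
# high-quality data stop applying above ~1e27 raw FLOPs, while multipliers
# like high-sparsity MoE keep applying at any scale.

QUALITY_MULT = 8.0    # made-up size of the data-quality-dependent multipliers
SCALABLE_MULT = 4.0   # made-up size of the scale-robust multipliers (e.g. sparse MoE)
QUALITY_CAP = 1e27    # roughly where I'm guessing quality-data multipliers stop working

def effective_compute(raw_flops: float) -> float:
    """Crude step-function toy, not a fit to anything."""
    mult = SCALABLE_MULT
    if raw_flops <= QUALITY_CAP:
        mult *= QUALITY_MULT
    return raw_flops * mult

for raw in (1e25, 1e26, 1e27, 1e28):
    print(f"raw {raw:.0e} FLOPs -> effective {effective_compute(raw):.1e} FLOPs")
# Past the cap, a 10x raw scale-up (1e27 -> 1e28) only buys ~1.25x more
# effective compute in this toy, which is what makes pretraining scaling
# superficially look like it is flatlining on benchmarks.
```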
The high cost and slow speed of GPT-4.5 seem like a sign that OpenAI is facing data constraints, though we don’t actually know the parameters, and OpenAI might be charging a bigger margin than usual (it’s a “research preview”, not a flagship commercial product). If data were more abundant, wouldn’t GPT-4.5 be more overtrained and have fewer parameters?
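The arithmetic behind that question, as a rough Chinchilla-style sketch (using C ≈ 6·N·D and the ~20 tokens-per-parameter rule of thumb; the compute budget and data cap below are placeholders I made up, not estimates of GPT-4.5):

```python
# Rough Chinchilla-style arithmetic: C ~= 6 * N * D, with compute-optimal
# training at roughly D ~= 20 * N. If usable high-quality data D is capped
# below the optimal amount, the same compute budget forces a larger N,
# i.e. a bigger, slower, more expensive-to-serve model.

C = 1e26           # placeholder training compute in FLOPs (not a GPT-4.5 estimate)
DATA_CAP = 1.0e13  # placeholder cap on usable high-quality tokens

# Compute-optimal allocation with unlimited data: C = 6 * N * (20 * N) = 120 * N^2
n_opt = (C / 120) ** 0.5
d_opt = 20 * n_opt

# Data-constrained allocation: spend the same compute on more parameters instead
n_constrained = C / (6 * DATA_CAP)

print(f"compute-optimal:  N ~ {n_opt:.1e} params, D ~ {d_opt:.1e} tokens")
print(f"data-constrained: N ~ {n_constrained:.1e} params at D = {DATA_CAP:.1e} tokens")
# With abundant data you'd instead overtrain: push D past d_opt and shrink N,
# getting a cheaper, faster-to-serve model out of the same training compute.
```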
edit: FWIW Artificial Analysis measures GPT-4.5 at a not-that-bad 50 tokens per second, whereas I’ve been experiencing a painfully slow 10-20 tokens/second in the chat app. So it may just be growing pains until they get more inference GPUs online. But OpenAI does call it a “chonky” model, implying significant parameter scaling.