Yep, this is probably true for pretraining, but it seems less and less relevant these days. For example, according to the Grok 4 presentation, the model used as much compute on RL as on pretraining. I'd expect this trend to continue.