Thanks for the detailed response. If we assume that continued progress requires spending much more compute on RL than on pretraining, this could favor model sizes much smaller than Chinchilla-optimal. It’s possible that future models will be trained with 99% RL and will employ some sort of continual-learning architecture.
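As rough arithmetic (a sketch with placeholder numbers, not a claim about any actual training run): under Chinchilla-style scaling the compute-optimal parameter count grows roughly as the square root of the pretraining budget, so reserving 99% of a fixed total budget for RL shrinks the pretraining-optimal model by about 10x. The budget C_total below is an arbitrary assumption.

```python
# Rough sketch of how shifting compute from pretraining to RL shrinks the
# Chinchilla-optimal model size. Uses the standard approximations
# C ≈ 6 * N * D and D ≈ 20 * N; the total budget is a placeholder.

def chinchilla_optimal(pretrain_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that are roughly compute-optimal for a dense model."""
    n_opt = (pretrain_flops / (6 * 20)) ** 0.5
    return n_opt, 20 * n_opt

C_total = 5e26  # assumed total training budget in FLOPs
for rl_fraction in (0.0, 0.5, 0.99):
    n, d = chinchilla_optimal(C_total * (1 - rl_fraction))
    print(f"RL share {rl_fraction:4.0%}: ~{n / 1e9:6.0f}B params, ~{d / 1e12:5.1f}T tokens")
```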
I believe the number of active parameters will be determined by balancing the load on compute and memory, which in turn depends on how efficient attention is and how much context is needed to hit agent-capability targets. I don’t know how sparse very large (>20T) models could be, but they could probably be extremely sparse (maybe 0.1% active or less, with most of the parameter volume sitting in rarely-activated experts behind the router).
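To make the “similar load on compute and memory” point concrete, here’s a back-of-the-envelope sketch (hardware numbers are placeholder assumptions, and it ignores KV-cache and activation traffic): at small batch, decoding a token costs roughly 2 FLOPs per active parameter and streams roughly 2 bytes per active parameter of bf16 weights, so the batch needed to keep an accelerator busy is set by its FLOPs-to-bandwidth ratio.

```python
# Back-of-the-envelope check of when a decode step is compute-bound vs
# memory-bound. Hardware numbers are assumptions for illustration; this
# treats weights as read once per step, which is optimistic for MoE routing,
# and ignores KV-cache traffic entirely.

ACTIVE_PARAMS   = 50e9   # active parameters per token (e.g. a 1T/50B MoE)
BYTES_PER_PARAM = 2      # bf16 weights
PEAK_FLOPS      = 1e15   # assumed ~1 PFLOP/s dense bf16 per accelerator
HBM_BANDWIDTH   = 4e12   # assumed ~4 TB/s of HBM bandwidth

flops_per_token = 2 * ACTIVE_PARAMS                # forward pass only
bytes_per_step  = ACTIVE_PARAMS * BYTES_PER_PARAM  # weights streamed once per step

# Batch size at which compute time matches weight-streaming time.
balanced_batch = (bytes_per_step / HBM_BANDWIDTH) / (flops_per_token / PEAK_FLOPS)
print(f"compute/memory balance at ~{balanced_batch:.0f} tokens per decode step")
```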
I think it’s possible that current and planned AI compute designs are not even close to optimal for training methods that may be developed within a year or so. Assuming something like DeepSeek’s DSA (sparse attention) is relatively close to maximum efficiency, and RL trajectories produced by agents stretch to many millions of tokens, it may be the case that HBM will finally be too slow and using SRAM will be the only way forward.
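A quick way to see the pressure (again, purely illustrative assumptions for the per-token KV footprint and bandwidth): even with a compressed, MLA-like KV cache, fully attending over a 10M-token context means streaming the whole cache from HBM for every generated token, which by itself blows the decode-latency budget and is what pushes toward sparse attention and SRAM-heavy designs.

```python
# Time spent just streaming a long-context KV cache from HBM per decoded
# token, ignoring all compute. Numbers are illustrative assumptions.

KV_BYTES_PER_TOKEN = 70e3  # assumed compressed (MLA-like) KV footprint per token
CONTEXT_TOKENS     = 10e6  # a 10M-token agent trajectory
HBM_BANDWIDTH      = 4e12  # assumed ~4 TB/s

cache_bytes       = KV_BYTES_PER_TOKEN * CONTEXT_TOKENS
seconds_per_token = cache_bytes / HBM_BANDWIDTH
print(f"KV cache: {cache_bytes / 1e12:.2f} TB, "
      f"~{seconds_per_token * 1e3:.0f} ms per token if fully attended")
```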
There’s also a completely different consideration. For example, inclusionAI/Ring-1T has 1T total and 50B active parameters, and its creators claim it can reach IMO silver using a multi-agent framework, with plans to train it further toward gold. And that’s with only 50B active parameters.
As you said, we’re probably entering compute ranges that enable training models with many trillions of active parameters (assuming RL will be in roughly one-to-one proportion to pretraining). The question is whether the pattern density in our environment is currently large enough to make use of it.
Compute optimality makes sense for RL as much as for pretraining. A model that’s too large won’t see much data, and a model that’s too small won’t be able to learn well even from the correspondingly larger amount of data. So it’s a quantitative question of where compute optimality for RLVR happens to sit compared to pretraining. The papers so far are still hashing out stable, scalable RL training, rather than the tradeoffs of RL training under a fixed total compute budget as model size specifically is varied.
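To show the shape of the sweep those papers haven’t run yet, here’s a sketch that reuses the Hoffmann et al. parametric pretraining loss purely as a stand-in; the open question is precisely what the RLVR analogue of these fitted coefficients looks like, and the compute budget C below is an arbitrary assumption.

```python
# Fixed-compute sweep over model size, using the Chinchilla parametric loss
# L(N, D) = E + A / N**alpha + B / D**beta (Hoffmann et al.) as a stand-in.
# The point is the shape of the tradeoff (too big or too small both lose),
# not the numbers; an RLVR version would need its own fitted coefficients.

E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28  # Hoffmann et al. fit

def loss(n_params: float, tokens: float) -> float:
    return E + A / n_params**alpha + B / tokens**beta

C = 1e25                     # assumed fixed total training FLOPs
for n in (1e10, 3e10, 1e11, 3e11, 1e12):
    d = C / (6 * n)          # tokens affordable at this model size
    print(f"N={n / 1e9:6.0f}B  D={d / 1e12:6.2f}T  L={loss(n, d):.3f}")
```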