Scaling laws for verifiable-task RL training, for which there are still no public clues.
Why is that the case? While checking how it scales past the o3/GPT-4 level isn't possible, I'd expect people to have run experiments at the lower end of the scale (using smaller models or less RL training), fitted a line to the results, and assumed it extrapolates. Is there reason to think that wouldn't work?
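To make the procedure I'm imagining concrete, here is a minimal sketch (the compute values, scores, and power-law form are all made up for illustration): fit a line to small-scale RL results in log-log space and extrapolate to a larger budget.

```python
import numpy as np

# Hypothetical results from small-scale RL runs:
# RL compute (FLOPs) vs. benchmark score gain over the base model.
rl_compute = np.array([1e20, 3e20, 1e21, 3e21, 1e22])
score_gain = np.array([2.1, 3.4, 5.2, 7.9, 11.8])  # invented numbers

# Fit a power law score_gain = a * compute**b, i.e. a line in log-log space.
b, log_a = np.polyfit(np.log(rl_compute), np.log(score_gain), 1)

# Extrapolate two orders of magnitude past the largest run --
# the step whose validity is exactly what's in question here.
predicted = np.exp(log_a) * (1e24 ** b)
print(f"exponent b = {b:.3f}, predicted gain at 1e24 FLOPs = {predicted:.1f}")
```

The worry in the reply below is that such a fit can look clean at small scale and still break once the supply of verifiable tasks, rather than compute, becomes the binding constraint.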
It needs verifiable tasks that might have to be crafted manually. It's unknown what happens if you train too long with only a feasible number of tasks, even if they are progressively more difficult. When data can't be generated well, RL needs early stopping and has a bound on how far it can go; in this case that bound could depend on the scale of the base model, or on the number and quality of the verifiable tasks.
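One way to picture that bound is a toy simulation (every dynamic below is an assumption, not a known training recipe): RL improvement saturates at a ceiling set by the task pool and the base model, and early stopping fires once held-out progress stalls.

```python
import math, random

# Toy model: RL gains saturate at a ceiling that (by assumption) grows
# with the number of verifiable tasks and the scale of the base model.
def simulate_rl(num_tasks, base_model_scale, epochs=50, patience=3):
    ceiling = base_model_scale * math.log(1 + num_tasks)  # assumed bound
    score, best, stale = 0.0, 0.0, 0
    for epoch in range(epochs):
        # Diminishing returns toward the ceiling, plus evaluation noise.
        score += 0.3 * (ceiling - score) + random.gauss(0, 0.05)
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
        if stale >= patience:  # early stopping: no more signal from this pool
            return best, epoch
    return best, epochs

print(simulate_rl(num_tasks=1_000, base_model_scale=1.0))
print(simulate_rl(num_tasks=100_000, base_model_scale=1.0))
```

In this toy version, adding compute past the stopping point buys nothing; only more or better tasks, or a larger base model, move the ceiling.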
Depending on how this works, it might be impossible to usefully spend $100M on RL training at all, or scaling up pretraining might have a disproportionately large effect on the quality of the optimally trained reasoning model built on it, or approximately no effect at all. Quantitative understanding of this is crucial for forecasting the consequences of the 2025-2028 compute scaleup. AI companies likely have some of this understanding, but it isn't public.
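To see how sharply the answer depends on the unknowns, here is a toy budget-split calculation under an assumed joint scaling law (the functional form and every constant are hypothetical): if RL gains are capped by task supply, almost all compute should go to pretraining; if they aren't, the optimal split flips.

```python
import numpy as np

# Hypothetical joint quality model (every term is an assumption):
#   quality = a * C_pre**alpha + min(b * C_rl**beta, ceiling)
# where `ceiling` stands in for the task-supply bound on RL gains.
def quality(c_pre, c_rl, ceiling, a=1.0, alpha=0.15, b=1.0, beta=0.20):
    return a * c_pre**alpha + min(b * c_rl**beta, ceiling)

budget = 1e25  # total FLOPs, a stand-in for a ~$100M training budget
fracs = np.linspace(0.01, 0.99, 99)  # fraction of compute given to RL

for ceiling in (50.0, float("inf")):  # bounded vs. unbounded RL gains
    scores = [quality(budget * (1 - f), budget * f, ceiling) for f in fracs]
    best = fracs[int(np.argmax(scores))]
    print(f"ceiling={ceiling}: optimal RL fraction of compute ~ {best:.2f}")
```

With these invented constants, a finite ceiling pushes the optimal RL share to the minimum of the grid, while an unbounded RL term pushes it to the maximum; the real exponents and the real ceiling are exactly the quantities that aren't public.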