Thanks Jacob. It is less a mathematical mistake and more me trying to make a qualitative connection between the observed poor scaling of RL training and the theoretical mechanism of poor information efficiency I’d just written about, both of which look very big. I agree that the theoretical explanation doesn’t seem to be quite the right shape to explain the empirical issue.
Of your potential reasons, I do think longer episodes is part of it. The R1 paper has a chart on page 8 showing that, without trying to affect episode lengths, they increased roughly linearly from 500 tokens to ~9,000 tokens over 8,000 episodes, i.e. about 1 extra token per episode on average. Thus the information efficiency was going down linearly with episode count during training. It is a bit tricky to compare this with the o1 chart, whose x-axis is both logarithmic and measuring training compute rather than episode number. Since compute per episode is roughly proportional to episode length, cumulative training compute would be growing as the square of the number of episodes, so I think the information efficiency should be declining as 1/episodes = 1/root(compute). And I think that just accounts for the power being 0.5 lower than pretraining, rather than the 3 that I’m claiming. (But it’s late, and I haven’t checked this through on paper.)
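As a rough sanity check on that square relationship, here is a toy calculation (just a sketch: the 500-token base and ~1 token per episode growth are read off the R1 chart, the assumption that compute per episode is proportional to episode length is my simplification, and the function name is mine):

```python
# Toy model: episode length starts at ~500 tokens and grows by ~1 token per
# episode (read off the R1 chart); compute per episode is assumed proportional
# to episode length, so total tokens processed is a proxy for training compute.
def cumulative_compute(n_episodes, base_len=500, growth=1.0):
    """Total tokens processed after n_episodes episodes."""
    # sum over i = 1..n of (base_len + growth * i)
    return base_len * n_episodes + growth * n_episodes * (n_episodes + 1) / 2

baseline = cumulative_compute(8_000)
for n in (8_000, 80_000, 800_000):
    print(f"{n:>7} episodes -> {cumulative_compute(n) / baseline:,.0f}x the compute")
```

So 10x the episodes already costs ~90x the compute here, approaching the pure square law as the quadratic term takes over, which is where the 1/root(compute) comes from.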
I do think you are right that the information inefficiency can’t explain the whole issue, but it might be able to explain part of it: a shift to the left can’t explain a line with a different slope, but a mechanism that changes the slope part of the way there could be part of the explanation.
Actually, here is a slightly simpler way to think about it. How many more training steps do you do with RL when you 100x the compute? Given the linear episode-length growth, you only do root(100) = 10x the number of training steps. So if capability gain were linear in the log of the number of training steps, it would grow as log(root(compute)) = log(compute)/2, whereas for pretraining it would grow as log(compute). So if inference scaling were going as well as pretraining scaling (contra the 3/2 estimate I appealed to in my piece), then the theoretical explanation from information inefficiency could exactly account for the observed scaling behaviour.
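To spell the factor of 2 out numerically (same toy assumptions as above, plus the assumption that capability is linear in the log of the number of training steps; the numbers are purely illustrative):

```python
import math

# Under linear episode-length growth, steps ~ root(compute); with a fixed cost
# per step (as in pretraining), steps ~ compute.  If capability is linear in
# log(steps), each 100x of compute then buys half as much capability with RL.
for compute in (1e9, 1e11, 1e13):               # successive 100x jumps in compute
    rl_steps = math.sqrt(compute)               # growing episode lengths
    pretrain_steps = compute                    # fixed cost per step
    print(f"compute {compute:.0e}: "
          f"log10(RL steps) = {math.log10(rl_steps):.1f}, "
          f"log10(pretraining steps) = {math.log10(pretrain_steps):.1f}")
# The RL column goes up by 1 per row; the pretraining column goes up by 2.
```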
I’m not sure this is right (there were a couple of biggish assumptions there) but it does feel closer to being a larger part of the actual explanation.
Nice observation, and I agree with your calculation that linear episode-length growth would account for a worse scaling exponent by a factor of 2 (or, more generally, episode length growing with exponent k in the episode count would account for a worse scaling exponent by a factor of k+1).
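For anyone who wants the general case spelled out, the substitution is the same (a sketch, writing n for episode count and C for cumulative compute, and assuming compute per episode scales with episode length):

```latex
\text{length}(n) \propto n^{k}
\;\Rightarrow\;
C \propto \sum_{i \le n} i^{k} \propto n^{k+1}
\;\Rightarrow\;
n \propto C^{1/(k+1)},
\qquad
\text{capability} \propto \log n = \tfrac{1}{k+1}\log C .
```

With k = 0 (constant episode length) this recovers the pretraining exponent, and k = 1 gives the factor of 2 above.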
Note also that this suggests a potential remedy, namely controlling episode length, but there is less incentive to apply this when data is more of a constraint than compute.