Actually, here is a slightly simpler way to think about it. How many more training steps do you do with RL when you 100x the compute? Given the linear episode length growth, you only do root(100) = 10x the number of training steps. So if capability gain were linear in the log of the number of training steps, it would grow as log(root(compute)) = log(compute)/2, whereas for pretraining it would grow as log(compute). So if inference scaling were going as well as pretraining scaling (contra the 3/2 estimate I appealed to in my piece), then the information-inefficiency theoretical explanation could exactly account for the observed scaling behaviour.
I'm not sure this is right (there were a couple of biggish assumptions there), but it does feel closer to being a larger part of the actual explanation.
Nice observation, and I agree with your calculation that linear episode length growth would account for a worse scaling exponent by a factor of 2 (or more generally, episode length growing with exponent k would account for a worse scaling exponent by a factor of k+1).
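The arithmetic here can be sanity-checked numerically. This is just my own sketch, assuming total compute scales as the sum of episode lengths (so episode length growing as steps^k gives compute ~ steps^(k+1)); the function name `steps_for_compute` is made up for illustration:

```python
import math

# Assumption: if episode length grows as steps^k, total compute is roughly
# sum_{s=1}^{S} s^k ~ S^(k+1), so steps ~ compute^(1/(k+1)), and a capability
# gain linear in log(steps) is worse than log(compute) by a factor of k+1.

def steps_for_compute(compute, k):
    # Invert compute ~ steps^(k+1), dropping constant factors.
    return compute ** (1.0 / (k + 1))

# Linear episode length growth (k = 1): 100x compute -> only 10x steps.
assert math.isclose(steps_for_compute(100, 1), 10.0)

# The scaling exponent is worse by a factor of k + 1:
for k in (0, 1, 2):
    gain_ratio = math.log(steps_for_compute(1e6, k)) / math.log(1e6)
    assert math.isclose(gain_ratio, 1.0 / (k + 1))
```

With k = 0 (constant episode length) the factor is 1, recovering the pretraining-like case; k = 1 gives the factor of 2 discussed above.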
Note also that this suggests a potential remedy, namely controlling episode length, but there is less incentive to apply this when data is more of a constraint than compute.