I’ve also looked at the results for o1-preview vs o1 vs o3 on Codeforces, and each of those models appears to be about 1 SD better than the prior model. On AIME, we likewise see each model doing around 1 SD better than the prior model. OpenAI said that o3 used 10x more compute than o1 (potentially somewhat more effective compute due to algorithmic/data improvement[7]), and it seems pretty plausible that o1-preview vs o1 is also roughly a 10x increase in effective RL compute. So this matches the guess of 1 SD per OOM. (These gaps appear to be more consistent with an SD model than with a rank-ordering model.)
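As a rough sanity check of that arithmetic (a sketch only; the ~1 SD and ~10x figures are just the rough numbers cited above, not precise measurements):

```python
import math

# Rough sanity check of the "1 SD per OOM" guess, using the figures cited
# above: each o-series step looks ~1 SD better on Codeforces/AIME, and each
# step is (plausibly) ~10x more RL compute.
sd_gain_per_step = 1.0          # observed improvement per model step (rough)
compute_ratio_per_step = 10.0   # stated/assumed compute scale-up per step

ooms_per_step = math.log10(compute_ratio_per_step)   # = 1 OOM
print(sd_gain_per_step / ooms_per_step)              # ~1 SD per OOM of RL compute
```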
You’re only talking about RL compute here, right? If so, this seems like an underestimate of the effective compute → SD increase.
E.g. if o3's RL compute is ~equal to its pretraining compute (I think it's probably not equal, and is unlikely to be like 5x higher than pretraining), then o1-preview's RL step would only be ~1% of its total compute budget, and o1's would be ~10%. I'm inclined to think it's not a good idea to neglect pretraining compute here, since it gives the models relevant skills that then don't take much RL to unlock.
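Concretely, here's that arithmetic as a small sketch, under the illustrative assumption above (o3's RL compute equal to its pretraining compute, and each o-series step ~10x more RL compute):

```python
# Illustrative assumption, not a claim about the actual training runs:
# pretraining compute is normalized to 1.0 and o3's RL compute matches it.
pretrain = 1.0
rl = {"o1-preview": 0.01, "o1": 0.1, "o3": 1.0}

for model, r in rl.items():
    frac = r / (pretrain + r)
    print(f"{model}: RL is {frac:.0%} of total training compute")
# o1-preview: ~1%, o1: ~9%, o3: 50%
```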
Yes, but I expect it’s only a mild underestimate, as the returns probably mostly come from scaling up RL for that task. As in, if you had scaled up RL and pretraining at the compute-optimal ratios, I think you would have seen similar numbers, because (for this task) o1 effectively used way less RL than would be optimal (again, for this task), so most of the gains come from scaling up RL rather than scaling both in parallel.
Not super sure about this, and naively, I think there could be an argument in the opposite direction (that scaling up RL compute was relatively low-hanging and we should typically expect lower returns). But overall, my sense is that RL is driving most of the improvement on this sort of competitive programming task, such that just looking at RL compute is actually pretty reasonable.
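To make that "mild underestimate" claim concrete, here's a toy model (the coefficients are arbitrary illustrative assumptions, not estimates of anything real): if marginal returns on this task come mostly from RL compute, then scaling RL alone vs. scaling RL and pretraining together gives similar gains, so the RL-only slope is only mildly below the full effective-compute slope.

```python
import math

# Toy returns model (purely illustrative): performance on this kind of
# competitive-programming task improves with both pretraining and RL compute,
# but the marginal returns are assumed to be dominated by RL (a_rl >> a_pre).
def sd(pretrain, rl, a_rl=1.0, a_pre=0.2):
    return a_rl * math.log10(rl) + a_pre * math.log10(pretrain)

# Case 1: scale only RL by 10x (roughly the o1 -> o3 situation as described).
gain_rl_only = sd(1.0, 1.0) - sd(1.0, 0.1)

# Case 2: scale RL and pretraining together by 10x each (scaling both in parallel).
gain_both = sd(10.0, 1.0) - sd(1.0, 0.1)

print(gain_rl_only)  # 1.0 SD from the RL term alone
print(gain_both)     # 1.2 SD when pretraining also scales; only mildly larger
```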