Yes, but I expect it’s only a mild underestimate, as the returns probably mostly come from scaling up RL for that task. As in, if you had scaled up RL and pretraining at the compute-optimal ratios, I think you would have seen similar numbers, because (for this task) o1 effectively used way less RL compute than would be optimal (again, for this task), so most of the gains come from scaling up RL rather than from scaling both in parallel.
Not super sure about this, and naively, I think there could be an argument in the opposite direction (that scaling up RL compute was relatively low-hanging fruit and we should typically expect lower returns going forward). But overall, my sense is that RL is driving most of the improvement on this sort of competitive programming task, such that just looking at RL compute is actually pretty reasonable.