GPT-5 probably isn’t based on a substantially better pretrained model, which is some evidence that OpenAI thinks the marginal returns from pretraining are pretty weak relative to the returns from RL.
The model seems to be “small”, but that doesn’t necessarily mean it got less pretraining compute (in the form of overtraining) than RLVR compute. There are still no papers I’m aware of on what the compute optimal (or GPU-time optimal) pretraining:RLVR ratio might be. Matching the GPU-time of pretraining and RLVR results in something like a 4:1 ratio in FLOPs, which would only be compute optimal (or GPU-time optimal) by unlikely coincidence.
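As a rough illustration of where a 4:1 figure like that could come from, here is a minimal sketch, assuming RLVR’s generation-heavy workload achieves much lower effective FLOPs utilization per GPU-hour than pretraining. The 40% and 10% utilization numbers are illustrative assumptions chosen to land near 4:1, not measurements:

```python
# Sketch: converting a GPU-time split into a pretraining:RLVR FLOPs ratio.
# The utilization numbers are assumptions; real values depend on hardware,
# model size, and the details of the RL setup.

def flops_ratio(gpu_time_pretrain, gpu_time_rlvr,
                util_pretrain=0.40,  # assumed compute utilization during pretraining
                util_rlvr=0.10):     # assumed effective utilization during RLVR rollouts/updates
    """Pretraining:RLVR ratio in FLOPs for a given split of GPU-time."""
    return (gpu_time_pretrain * util_pretrain) / (gpu_time_rlvr * util_rlvr)

# Equal GPU-time for the two phases gives roughly 4:1 in FLOPs under these assumptions.
print(flops_ratio(1.0, 1.0))  # 4.0
```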
If the optimal ratio of pretraining:RLVR is something like 1:10 (in FLOPs), then there is little need to overtrain even smaller models. But it could also be more like 40:1, in which case overtraining becomes a must, if inference cost/speed and the HBM capacity of legacy 8-chip servers force the param count below what would be compute optimal given the available training compute and the HBM capacity of GB200 NVL72. A sketch of this contrast follows.
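To make the 1:10 vs 40:1 contrast concrete, here is a minimal sketch using the rough Chinchilla heuristics C ≈ 6·N·D and D ≈ 20·N, so the compute-optimal param count for a pretraining budget C_pt is about sqrt(C_pt/120). The total FLOPs budget and the 400B param cap standing in for the inference/HBM constraint are illustrative assumptions:

```python
# Sketch: how the pretraining:RLVR split interacts with a param-count cap imposed
# by inference cost/speed and HBM capacity. All concrete numbers are assumptions.

def pretraining_plan(total_flops, ratio_pt_to_rl, param_cap):
    """Return (params, tokens, tokens_per_param) for a size-capped pretraining run."""
    c_pt = total_flops * ratio_pt_to_rl / (ratio_pt_to_rl + 1)  # pretraining share of FLOPs
    n_opt = (c_pt / 120) ** 0.5        # compute-optimal params (from C ~ 6*N*D, D ~ 20*N)
    n = min(n_opt, param_cap)          # serving constraints cap the param count
    d = c_pt / (6 * n)                 # tokens trainable within the pretraining budget
    return n, d, d / n                 # tokens/param well above ~20 means overtraining

total = 1e26   # assumed total training FLOPs budget
cap = 4e11     # assumed 400B param cap from inference cost/speed and HBM capacity

for r in (1 / 10, 40):  # the two candidate pretraining:RLVR ratios discussed above
    n, d, tpp = pretraining_plan(total, r, cap)
    print(f"ratio {r}: params {n:.2e}, tokens {d:.2e}, tokens/param {tpp:.0f}")
```

Under these assumptions the 1:10 case stays at the compute-optimal ~20 tokens/param (the cap never binds), while the 40:1 case hits the cap and is forced to overtrain the capped model by roughly 5x.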