GPT-5 probably isn’t based on a substantially better pretrained model, which is some evidence that OpenAI thinks the marginal returns from pretraining are pretty weak relative to the returns from RL.
The model seems to be “small”, but that doesn’t necessarily mean it got less pretraining compute (in the form of overtraining) than RLVR compute. There are still no papers I’m aware of on what the compute optimal (or GPU-time optimal) pretraining:RLVR ratio might be. Matching the GPU-time of pretraining and RLVR results in something like a 4:1 ratio in FLOPs, which would only be compute optimal (or GPU-time optimal) by unlikely coincidence.
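As a rough illustration of where a 4:1 figure like that could come from, here is a minimal sketch, assuming RLVR’s generation-heavy workload achieves much lower effective FLOPs utilization per GPU-hour than pretraining. The 40% and 10% utilization numbers are illustrative assumptions chosen to land near 4:1, not measurements:

```python
# Sketch: converting a GPU-time split into a pretraining:RLVR FLOPs ratio.
# The utilization numbers are assumptions; real values depend on hardware,
# model size, and the details of the RL setup.

def flops_ratio(gpu_time_pretrain, gpu_time_rlvr,
                util_pretrain=0.40,  # assumed compute utilization during pretraining
                util_rlvr=0.10):     # assumed effective utilization during RLVR rollouts/updates
    """Pretraining:RLVR ratio in FLOPs for a given split of GPU-time."""
    return (gpu_time_pretrain * util_pretrain) / (gpu_time_rlvr * util_rlvr)

# Equal GPU-time for the two phases gives roughly 4:1 in FLOPs under these assumptions.
print(flops_ratio(1.0, 1.0))  # 4.0
```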
If the optimal ratio of pretraining:RLVR is something like 1:10 (in FLOPs), then there is little need to overtrain even smaller models. But it could also be more like 40:1, in which case overtraining becomes a must, if inference cost/speed and the HBM capacity of legacy 8-chip servers force the param count below what would be compute optimal given the available training compute and the HBM capacity of GB200 NVL72. A sketch of this contrast follows.
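To make the 1:10 vs 40:1 contrast concrete, here is a minimal sketch using the rough Chinchilla heuristics C ≈ 6·N·D and D ≈ 20·N, so the compute-optimal param count for a pretraining budget C_pt is about sqrt(C_pt/120). The total FLOPs budget and the 400B param cap standing in for the inference/HBM constraint are illustrative assumptions:

```python
# Sketch: how the pretraining:RLVR split interacts with a param-count cap imposed
# by inference cost/speed and HBM capacity. All concrete numbers are assumptions.

def pretraining_plan(total_flops, ratio_pt_to_rl, param_cap):
    """Return (params, tokens, tokens_per_param) for a size-capped pretraining run."""
    c_pt = total_flops * ratio_pt_to_rl / (ratio_pt_to_rl + 1)  # pretraining share of FLOPs
    n_opt = (c_pt / 120) ** 0.5        # compute-optimal params (from C ~ 6*N*D, D ~ 20*N)
    n = min(n_opt, param_cap)          # serving constraints cap the param count
    d = c_pt / (6 * n)                 # tokens trainable within the pretraining budget
    return n, d, d / n                 # tokens/param well above ~20 means overtraining

total = 1e26   # assumed total training FLOPs budget
cap = 4e11     # assumed 400B param cap from inference cost/speed and HBM capacity

for r in (1 / 10, 40):  # the two candidate pretraining:RLVR ratios discussed above
    n, d, tpp = pretraining_plan(total, r, cap)
    print(f"ratio {r}: params {n:.2e}, tokens {d:.2e}, tokens/param {tpp:.0f}")
```

Under these assumptions the 1:10 case stays at the compute-optimal ~20 tokens/param (the cap never binds), while the 40:1 case hits the cap and is forced to overtrain the capped model by roughly 5x.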