Interesting, but I think collapsing total training compute into a single scalar is inappropriate here. If you draw a line through GPT-2, 3, 3.5, and 4 on that plot, the RLVR models (R1 and 3.7 Sonnet) sit above the trend line. Since we don’t really know the optimal ratio of pretraining to RLVR compute, nor how that ratio scales, a total-compute scalar is missing a lot of important information.