Interesting, but I think collapsing total training compute into a single scalar is inappropriate here. If you draw a line through GPT-2, 3, 3.5, and 4 on that plot, the RLVR models (R1 and 3.7 Sonnet) sit above the trend line. Since we don’t really know the optimal ratio of pretraining to RLVR compute, nor how that ratio scales, a total-compute scalar is missing a lot of important information.