And it's unclear how much of the improved reasoning comes from using a different base model vs. different post-training. It's possible R1-like training didn't work for models below GPT-4 level, and that the same training only started working at GPT-4 level (at which point you can iterate from a working prototype, or simply distill to get it working in weaker models). So it might work even better at the next level of base-model scale, without necessarily changing the RL part all that much.