RLHF. Doesn’t fit: the earliest RLHF’d model, InstructGPT, was released in 2022, well before the change in trend.
Chain-of-Thought. This fits better, but not perfectly. The piecewise linear fit shows that the faster trend started around February or March 2024. o1-preview, the first model that used CoT natively, was released in September, many months after the change in trend, and several non-CoT models are on the faster trend. Even if the estimated date of change is off by a month or two, it would still mean that the trend changed before o1-preview was released, and several non-CoT models would still be on the faster trend. I’m leaning towards this explanation as the other two seem much less likely.
Some secret sauce that the labs are very good at hiding. This seems unlikely. If it were important enough to change the trend permanently, as opposed to offering a one-time improvement, it would be common knowledge by now.
To my understanding, RLHF is primarily for making models obedient and for producing sanitized HHH outputs. GRPO was published in 2024, opening the field for similar RLVR tactics; I’d say that’s the cleanest fit and a much better candidate.
I would also say, looking at the plot, that you could safely place the breakpoint further forward if that makes more sense given our real-world priors, up to around 2024.5.
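To make the breakpoint-placement point concrete: one simple way to estimate such a breakpoint is a grid search over candidate dates, fitting a continuous two-segment linear model at each and keeping the one with the lowest squared error. This is a minimal sketch on synthetic data; the function, the synthetic trend, and the candidate grid are all illustrative assumptions, not the actual fit from the plot under discussion.

```python
import numpy as np

def fit_piecewise(x, y, candidates):
    """Grid-search a continuous two-segment linear fit; return (breakpoint, sse)."""
    best = (None, np.inf)
    for b in candidates:
        # Design matrix: intercept, slope before the break, extra slope after it.
        # max(x - b, 0) adds slope only past the candidate breakpoint, so the
        # fitted line stays continuous at b.
        X = np.column_stack([np.ones_like(x), x - b, np.maximum(x - b, 0.0)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = float(np.sum((X @ coef - y) ** 2))
        if sse < best[1]:
            best = (b, sse)
    return best

# Illustrative synthetic data: slope doubles at x = 2024.2 (decimal years).
rng = np.random.default_rng(0)
x = np.linspace(2022.0, 2025.0, 60)
y = np.where(x < 2024.2,
             0.5 * (x - 2022.0),
             0.5 * 2.2 + 1.0 * (x - 2024.2)) + rng.normal(0, 0.05, x.size)

b, _ = fit_piecewise(x, y, np.arange(2023.0, 2024.8, 0.05))
print(round(b, 2))  # recovers a breakpoint near 2024.2
```

The flat-bottomed error curve you typically get from such a search is exactly why the breakpoint can often be slid several months in either direction without much loss of fit, which is what makes "up to around 2024.5" defensible.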