A very interesting paper quantifying the conventional wisdom that a large degree of progress is from inference scaling.
You have got access to GPT-3 and 3.5 from OpenAI, but have you tried to get the same for now publicly unavailable Claude models for a better coverage of the first half of 2025?
I have not seen any research backing this claim up, have you?