This is neat! I like the idea of isolating technical progress. I’m curious whether you’ve tried this analysis on more benchmarks, considering that we found significant variation in slope across benchmarks in https://epoch.ai/data-insights/llm-inference-price-trends?insight-option=All+benchmarks
Yeah, I tried running the code on SciCode, GPQA, and HLE. Overall, the results were somewhat similar but much noisier. Using method 2, we got very similar results but with lower HLE growth. Using method 1, we got somewhat lower growth rates on SciCode and higher growth rates on GPQA Diamond at the high end (though given the way I constructed the frontier, there were only two points in the 70+ bucket).