Researcher at MIT CSAIL.
Hans Gundlach
Yeah, I tried running the code on SciCode, GPQA, and HLE. Overall, the results were somewhat similar but much noisier. Using method 2, we got very similar results but with lower HLE growth. Using method 1, we got somewhat lower growth rates on SciCode and higher growth rates on GPQA Diamond at the high end (though, given the way I constructed the frontier, there were only two points in the 70+ bucket).
Thanks so much for the feedback!
1. I agree: I think many providers have a blended price that is still optimized for low latency. We hope that selecting the lowest-price provider among all providers of a given model mitigates this (or at least selects a consistent set of providers that make consistent choices).
2. I'd be really interested in such data. I don't know of any such empirical results, but Erdil et al.'s Inference Economics has a good theoretical model.
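For concreteness, a minimal sketch of that selection rule, keeping only the cheapest provider's quote for each model (the field names and prices here are hypothetical, not from our dataset):

```python
# Hypothetical provider quotes; in practice these would come from
# scraped pricing data. All values below are illustrative only.
quotes = [
    {"model": "model-a", "provider": "p1", "usd_per_mtok": 0.50},
    {"model": "model-a", "provider": "p2", "usd_per_mtok": 0.30},
    {"model": "model-b", "provider": "p1", "usd_per_mtok": 1.20},
]

def cheapest_per_model(quotes):
    """For each model, keep only the lowest-price provider's quote,
    so cross-provider price/latency trade-offs are made consistently."""
    best = {}
    for q in quotes:
        cur = best.get(q["model"])
        if cur is None or q["usd_per_mtok"] < cur["usd_per_mtok"]:
            best[q["model"]] = q
    return best

print(cheapest_per_model(quotes))
```

With the sample data above, model-a's p1 quote is dropped in favor of p2's cheaper one, so latency-optimized premium pricing from any single provider does not dominate the series.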
I just want to make it clear that both our paper and Epoch's paper address innovations that occurred from 2012 to 2023 (and only the first half of 2023). We are aware of MLA, the Muon optimizer, long-context unlocks, and RL, and we think they are important contributors. However, all of these innovations are explicitly outside the scope of our current paper, which seeks to account for Epoch's estimates in that time period.