Someone referred me back to this post for comment, so I want to share a couple of updates on how we think about training run lengths at Epoch.
First, we now have better data. Across notable models, training run lengths have grown by around 30%/year over the last decade. Naively extrapolated, this implies roughly 3x longer training runs by the end of the decade. Recent large training runs often take up to 90 days (e.g. Llama 3), so this would put us at roughly nine-month training runs by then.
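For concreteness, here is that extrapolation spelled out as a minimal sketch. The baseline and horizon are my illustrative assumptions, not Epoch estimates:

```python
# Naive extrapolation of training run lengths, assuming ~30%/year growth
# sustained from a ~90-day baseline (e.g. Llama 3). The baseline year and
# horizon are illustrative assumptions.
growth_rate = 1.30   # ~30%/year growth in training run length
baseline_days = 90   # recent large runs (e.g. Llama 3)
years_ahead = 4      # roughly "end of the decade" from a ~2025/26 baseline

multiplier = growth_rate ** years_ahead          # ~2.9x
projected_days = baseline_days * multiplier      # ~257 days
print(f"~{multiplier:.1f}x longer: ~{projected_days:.0f} days "
      f"(~{projected_days / 30:.0f} months)")    # ~9 months
```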
Second, I still believe the argument given in the original post is coherent and makes for a compelling upper bound, after accounting for uncertainty in the relevant trends.
This is not the only consideration that goes into deciding how long to train for. In practice, my understanding is that developers are mostly weighing the improvement they see from further training against the costs of a delayed release in terms of attention and market share. But I still expect the upper bound of ~a year to be roughly binding, at least while hardware and algorithmic improvements continue progressing as fast as in recent years.
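To illustrate where a bound like this comes from at all, here is a stylized toy model (my illustration, not the original post's exact calculation): if the hardware and software available for a fixed budget improve at a continuous annual rate g, a run that starts later trains on a better stack, so past some length you get more effective compute by waiting. In this model the optimal run length works out to 1/g.

```python
import math

# Toy model (my illustration): a run that starts at time t and trains
# until a fixed deadline T gets effective compute
#     C(t) = exp(g * t) * (T - t),
# where g is the continuous annual growth rate of hardware + algorithmic
# efficiency and the stack is frozen when the run starts. Maximizing C
# over t gives an optimal run length of L* = 1/g, independent of T.
def optimal_run_length_months(annual_multiplier: float) -> float:
    g = math.log(annual_multiplier)  # continuous growth rate per year
    return 12.0 / g                  # L* = 1/g years, in months

# Illustrative combined rates of hardware + algorithmic improvement:
for mult in (2.0, 3.0, 4.0):
    print(f"{mult:.0f}x/year -> optimal run ~"
          f"{optimal_run_length_months(mult):.0f} months")
```

At a combined effective improvement of ~3x/year, this toy calculation lands near the one-year mark; faster progress shortens the optimal run, slower progress lengthens it.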
Quick comment: this is not correct. As of now, we have not evaluated Grok 4 on FrontierMath Tier 4 questions. Our preliminary evaluation was conducted only with Tier 1-3 questions.