It’s interesting that task length is better correlated with release date than with training compute; I was not expecting that.
It’s partly because we filtered for frontier models. If you instead had models of all different sizes, each trained compute-optimally at its scale, the correlation between time horizon and compute would plausibly be much stronger.
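To make that selection effect concrete, here is a minimal synthetic sketch of the restriction-of-range mechanism. Everything in it is an assumption for illustration, not METR's data or analysis: the linear log-horizon vs. log-compute relationship, the noise level, and the compute ranges are all made up. It just shows that if horizon is driven by compute plus noise, a sweep across many orders of magnitude of compute yields a strong correlation, while restricting to a narrow frontier band of compute weakens it.

```python
import numpy as np

rng = np.random.default_rng(0)

def horizon_compute_corr(log_compute):
    # Assumed toy relationship: log time horizon scales linearly with
    # log training compute, with fixed per-model noise.
    log_horizon = 0.5 * log_compute + rng.normal(0, 0.8, log_compute.size)
    return np.corrcoef(log_compute, log_horizon)[0, 1]

# Hypothetical full sweep: compute-optimal models spanning ~8 orders of
# magnitude of training compute (log10 FLOP, made-up range).
full_sweep = rng.uniform(18, 26, 500)

# Frontier-only sample: compute confined to a narrow band near the top,
# mimicking the effect of filtering for frontier models.
frontier_only = rng.uniform(25, 26, 500)

print("full sweep corr:", round(horizon_compute_corr(full_sweep), 2))    # ~0.8
print("frontier corr:  ", round(horizon_compute_corr(frontier_only), 2)) # ~0.2
```

With the same underlying horizon-compute relationship, the frontier-only sample shows a much weaker correlation simply because the compute range is compressed relative to the noise.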