The way METR time horizons tie into AI 2027 is very narrow: as a trend not even necessarily on coding/software-engineering skills in general, but on machine learning engineering specifically. I think that is hard to attack except by claiming that the trend will taper off. AI 2027 does not require unrealistic generalisation.
The reason why I think that time horizons are much more solid evidence of AI progress than earlier benchmarks is that the calculated time horizons explain the trends in AI-assisted coding over the last few years very well. For example, it is not by chance that "vibe coding" became a thing exactly when it did.
I have computed time horizon trends for more general software engineering tasks (i.e. with a bigger context) and my preliminary results point towards a logistic trend, i.e. the exponential is already tapering off. However, I am still pretty uncertain about that.
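The exponential-vs-logistic distinction above can be made concrete by fitting both functional forms to (release date, time horizon) points and comparing fit quality. A minimal sketch, with entirely made-up data standing in for real measured time horizons:

```python
# Sketch: compare an exponential fit against a logistic fit on
# hypothetical time-horizon data (x = years since some start date,
# y = task time horizon in minutes). The numbers are illustrative only.
import numpy as np
from scipy.optimize import curve_fit

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([1.0, 2.1, 3.9, 7.5, 13.0, 19.0, 24.0])  # growth slows late

def exponential(t, a, b):
    # Pure exponential growth: a * exp(b * t)
    return a * np.exp(b * t)

def logistic(t, k, r, t0):
    # S-curve with carrying capacity k, rate r, midpoint t0
    return k / (1.0 + np.exp(-r * (t - t0)))

p_exp, _ = curve_fit(exponential, x, y, p0=[1.0, 1.0])
p_log, _ = curve_fit(logistic, x, y, p0=[30.0, 2.0, 1.5], maxfev=10000)

def aic(yhat, n_params):
    # Akaike information criterion from residual sum of squares;
    # penalises the logistic's extra parameter.
    rss = np.sum((y - yhat) ** 2)
    return len(y) * np.log(rss / len(y)) + 2 * n_params

aic_exp = aic(exponential(x, *p_exp), 2)
aic_log = aic(logistic(x, *p_log), 3)
print("AIC exponential:", aic_exp)
print("AIC logistic:", aic_log)
```

On data whose growth is genuinely tapering, the logistic fit yields the lower AIC despite its extra parameter; on genuinely exponential data the comparison flips, which is one way to quantify "the exponential is already tapering off" rather than eyeballing it.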
I predict this is basically due to noise, or at best a very short-lived trend. It is similar to the purported faster RL-scaling trend of a 4-month doubling time on certain tasks, which is basically driven by good scaffolding (which is what RL-on-CoTs was mostly shown to be) rather than by the creation of new capabilities.
Very possible.
I plan to watch this a bit longer and also analyse how the trend changes with repo size.