(written for a Twitter audience)
Has AI progress slowed down? I’ll write some personal takes and predictions in this post.
The main metric I look at is METR’s time horizon, which measures the length of tasks agents can perform. It has been doubling for more than 6 years now, and might have sped up recently.
By measuring the length of tasks AI agents can complete, we can get a continuous metric of AI capabilities.
Since 2019, the time horizon has been doubling every 7 months. But since 2024, it’s been doubling every 4 months. What if we irresponsibly extrapolated these to 2030?
If AI progress continues at its recent rate, we get AI systems which can do one month (167 hours) of low-context SWE work by the end of 2027. If AI progress continues at the long-run historical rate, we get them by the end of 2029 instead.
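To make the extrapolation arithmetic concrete, here is a minimal sketch in Python. The anchor point (a ~1.5-hour 50% time horizon around June 2025) is an assumption I'm adding for illustration, not METR's actual fitted value, so the crossing dates only roughly reproduce the end-of-2027 / end-of-2029 figures above.

```python
import math
from datetime import date, timedelta

# Illustrative extrapolation of the 50% time horizon. The anchor below is an
# assumed round number for illustration, not METR's actual fit.
ANCHOR_DATE = date(2025, 6, 1)   # assumed anchor date
ANCHOR_HOURS = 1.5               # assumed 50% time horizon at the anchor, in hours
TARGET_HOURS = 167.0             # one work-month, as used in the post

def crossing_date(doubling_months: float) -> date:
    """Date at which the horizon reaches TARGET_HOURS under a fixed doubling time."""
    doublings = math.log2(TARGET_HOURS / ANCHOR_HOURS)
    return ANCHOR_DATE + timedelta(days=doublings * doubling_months * 30.44)

print(crossing_date(4))  # recent trend (4-month doubling): ~Sep 2027 with these assumptions
print(crossing_date(7))  # long-run trend (7-month doubling): ~mid-2029 with these assumptions
```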
How to interpret one work-month? I’d say it’s something like the first project a new hire would do, or the type of work a researcher who just switched teams would be able to do in a month. Our time horizon metric currently doesn’t define high time horizons super sharply.
Changing the success rate threshold from 50% to 80% only shifts the extrapolation from recent progress by a few months, but shifts the extrapolation from the long-run historical rate by around a year.
I don’t think these lines should be extrapolated much past one work-month, as progress will likely speed up even more once AIs are automating significant parts of AI research. Additionally, bottlenecks identified by Epoch AI might slow down compute scaling around 2030. https://epoch.ai/blog/can-ai-scaling-continue-through-2030
Our task suite is currently composed of well-scoped, easily scoreable tasks, which makes them pretty different from the type of work done in the real world. This means we should be cautious when interpreting these extrapolations.
My best guess is that future models will be a closer fit to the extrapolation from recent progress than the extrapolation from long-run progress. But even the more conservative trend implies that AIs will be doing month-long tasks by the end of the decade.
More concretely, my median is that AI research will be automated by the end of 2028, and AI will be better than humans at >95% of current intellectual labor by the end of 2029.
The recent acceleration in the METR time horizon trend is plausibly due to the scale of RLVR rapidly catching up to the scale of pretraining. It has almost caught up already for Grok 4 and possibly GPT-5, except these were probably not yet using GB200 NVL72, which will add some efficiency for RLVR (relative to pretraining) compared to the older 8-chip servers. If the acceleration from RLVR stops soon, then, looking at the log-time plots, its cumulative effect is just to push the longer-term trend in time horizons forward by less than a year, a one-time effect.
The recent trend in scaling of training compute will continue until 2026 (when the 1 GW datacenter campuses for a single AI company will be completed, such as the Crusoe/Oracle/OpenAI Abilene site), which will be visible in AIs training on these systems through 2027. There is currently not enough talk about 5 GW datacenter campuses by 2028 (for a single AI company) to be confident the trend continues at the current pace past 2026, though they'll probably still arrive in 2029-2031. This puts the performance the older trend predicts for 2029 (a year after the 5 GW campuses, if hypothetically still on-trend), shifted 1 year forward due to RLVR (so we need to look at the trend's prediction for 2030 instead), at 2030-2032 in reality. That is still above the one-month threshold at a 50% success rate, but below it at an 80% success rate.
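To spell out that last step, here is a minimal sketch under the upthread extrapolation (the 7-month-doubling trend reaching one work-month at 50% success around the end of 2029). The assumption that the 80% horizon is roughly 4x shorter than the 50% one is mine, added for illustration, not a figure from the post.

```python
# Sketch of the "prediction for 2030" step. Assumes the 7-month-doubling trend
# hits one work-month (167 h) at 50% success around end of 2029 (from upthread),
# and that the 80% horizon is ~4x shorter than the 50% one (illustrative ratio).
WORK_MONTH_HOURS = 167
DOUBLING_MONTHS = 7

h50_end_2030 = WORK_MONTH_HOURS * 2 ** (12 / DOUBLING_MONTHS)  # ~548 h, above the threshold
h80_end_2030 = h50_end_2030 / 4                                # ~137 h, below the threshold

print(f"50%: {h50_end_2030:.0f} h, 80%: {h80_end_2030:.0f} h vs {WORK_MONTH_HOURS} h threshold")
```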
After that, scaling probably slows down even further. Not even for the Epoch reasons, but rather because a 5 GW datacenter campus would already cost about $140bn in compute equipment alone, plus an additional $60bn to construct the buildings, cooling, and power infrastructure (here's another datapoint for the ~$12bn per GW estimate for the datacenter without compute equipment).
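Spelling out the cost arithmetic with the figures above:

```python
# Cost arithmetic for a hypothetical 5 GW campus, using the figures in the post.
CAMPUS_GW = 5
COMPUTE_EQUIPMENT_BN = 140       # $bn, compute equipment alone
SITE_COST_PER_GW_BN = 12         # $bn per GW for buildings, cooling, power

site_cost = CAMPUS_GW * SITE_COST_PER_GW_BN    # 60 ($bn), matching the $60bn above
total_cost = COMPUTE_EQUIPMENT_BN + site_cost  # 200 ($bn) all-in
print(f"site: ${site_cost}bn, total: ${total_cost}bn")
```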
Alas, the acceleration might have been the consequence not just of scaling up RLVR, but of something else. Nine days ago I remarked that “The time horizon of base LLMs experienced a slowdown or plateau[1] between GPT-4 (5 minutes, Mar ’23) and GPT-4o (9 min, May ’24),” and a similar semi-plateau was experienced by DeepSeek, implying that the acceleration could be driven by another undisclosed breakthrough.
Daniel Kokotajlo also thinks that “we should have some credence on new breakthroughs e.g. neuralese,[2] online learning,[3] whatever. Maybe like 8%/yr? Of a breakthrough that would lead to superhuman coders within a year or two, after being appropriately scaled up and tinkered with.”
There is also Gemini Diffusion, previewed around three months ago, already known to be OOMs faster[4] and likely to have interpretability problems. What if Gemini Diffusion is released to the public in about 1-3 months[5] and beats lots of models of similar compute class on various benchmarks?
While GPT-4.5 has a time horizon between 30 and 40 minutes, unlike GPT-4o it was a MoE and was trained on CoTs.
Or ideas like Knight Lee’s proposal, which make the model more interpretable and nudgeable than neuralese while offering less of a capabilities boost. What if Lee-like architectures are used in Agent-2-level systems and neuralese is used in Agent-3+?
However, I fail to understand how online learning boosts capabilities.
If diffusion models use OOMs less compute than traditional LLMs, can training runs of diffusion models be made similarly cheaper?
For comparison, o1 was previewed on Sep 12, 2024 and released on Dec 5, 2024; o3 was previewed on Dec 20, 2024 and released on Apr 16, 2025. So a model is likely to be released 3-4 months after its preview. There is also GPT-5-thinking, which Zvi, quoting VictorTaelin, compares with o4. If that’s true, then o4-mini was released 4 months ahead of o4. o3-mini was released about a month after o3’s preview, implying that o4 could have been previewed 5 months before its release. If Gemini Diffusion is released 6 months after its preview, then it will be released 3 months from now.
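For what it's worth, the preview-to-release gaps can be computed directly from the dates above (the Gemini Diffusion timing is, of course, pure speculation):

```python
from datetime import date

# Preview-to-release gaps for the OpenAI models mentioned above.
gaps = {
    "o1": (date(2024, 9, 12), date(2024, 12, 5)),
    "o3": (date(2024, 12, 20), date(2025, 4, 16)),
}
for name, (preview, release) in gaps.items():
    print(f"{name}: {(release - preview).days} days")  # o1: 84 days (~3 mo), o3: 117 days (~4 mo)

# If Gemini Diffusion (previewed ~3 months ago) instead follows a ~6-month gap,
# that would put a public release roughly 3 months from now, as speculated above.
```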
It seems like you think AI research will be (fully?) automated once AIs are at 80% reliability on 1 month long benchmarkable / easy-to-check SWE tasks. (Or within 1 year of 50% reliability on 1 month tasks.) Isn’t this surprisingly fast? I’d guess there is a decently large gap between 1 month tasks and “fully automated AI R&D”. Partial automation will speed things up, but will it speed things up this much?
I think a 1-year 50%-time-horizon is very likely not enough to automate AI research. The reason I think AI research will be automated by EOY 2028 is speedups from partial automation, as well as leaving open the possibility of additional breakthroughs occurring naturally.
A few considerations that make me think the doubling time will get faster:
AI speeding up AI research probably starts making a dent in the doubling time (making it at least 10% faster) by the time we hit 100hr time horizons (although it’s pretty hard to reason about the impacts here)
I think I place some probability on the “inherently superexponential time horizons” hypothesis. The reason is that, to me, 1-month coherence, 1-year coherence, and 10-year coherence (of the kind performed by humans) seem like extremely similar skills and will thus be learned in quick succession.
It’s plausible reasoning models decreased the doubling time from 7 months to 4 months. It’s plausible we get another reasoning-shaped breakthrough.
So my best guesses for the 50% and 80% time horizons at EOY 2028 are more like 10 years and 3 years or something. But past ~2027 I care more about how much AI R&D is being automated than about the time horizon itself (partially because I have FUD about what N-year tasks should even look like by definition).
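To illustrate why the superexponential hypothesis matters for these numbers, here is a toy model (my own construction, not the commenter's actual reasoning): starting from an assumed ~2-hour horizon in mid-2025, a fixed 4-month doubling time reaches only about 1.5 work-years by EOY 2028, whereas letting each successive doubling come ~4% faster (an arbitrary illustrative rate) lands around 10 work-years.

```python
# Toy model only: how a shrinking doubling time changes the EOY-2028 horizon.
# The 2-hour mid-2025 anchor and the 4%-per-doubling speedup are assumptions.
WORK_MONTH_HOURS = 167
START_HOURS = 2.0        # assumed 50% horizon around mid-2025
MONTHS_AVAILABLE = 42    # mid-2025 to end of 2028

def horizon_after(months: float, doubling_months: float, shrink: float) -> float:
    """Horizon after `months`, where the k-th doubling takes doubling_months * shrink**k."""
    hours, elapsed, k = START_HOURS, 0.0, 0
    while True:
        step = doubling_months * shrink**k
        if elapsed + step > months:
            return hours * 2 ** ((months - elapsed) / step)  # partial final doubling
        hours, elapsed, k = hours * 2, elapsed + step, k + 1

for shrink in (1.0, 0.96):
    h = horizon_after(MONTHS_AVAILABLE, 4, shrink)
    print(f"shrink={shrink}: ~{h / WORK_MONTH_HOURS:.0f} work-months")
# shrink=1.0  -> ~17 work-months (~1.4 work-years)
# shrink=0.96 -> ~125 work-months (~10 work-years)
```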
There’s something I think is usually missing from time-horizon discussions, which is that the human brain seems to operate on a very long time horizon for entirely different reasons. The story for LLMs looks like this: LLMs become better at programming tasks, and therefore become capable of doing (in a relatively short amount of time) tasks that would take humans increasingly long to do. Humans, instead, can just do stuff for a lifetime, and we don’t know where the cap is; our brain has ways to manage its memories depending on how often they are recalled, and probably other ways to keep itself coherent over long periods. It’s a completely different sort of thing! This makes me think the trend here isn’t very “deep”. The line will continue to go up as LLMs become better and better at programming, and then it will slow down as capability gains generally slow, due to training compute bottlenecks and limited inference compute budgets. On the other hand, I think it’s pretty dang likely that we get a drastic trend break in the next few years (i.e., the graph essentially loses its relevance) once we crack the actual mechanisms and capabilities behind continuous operation: for example, continual learning, clever memory management, and similar things we might be completely missing at the moment, even as concepts.
If in-context learning from long context can play the role of continual learning, AIs with essentially current architectures may soon match human capabilities in this respect. Long contexts will get more feasible as hardware gets more HBM per scale-up world (a collection of chips with good networking that can act as a very large composite chip for some purposes). This year we are moving from the 0.64-1.44 TB of 8-chip servers to the 14 TB of GB200 NVL72; next year it's 20 TB with GB300 NVL72; and then in 2028 Rubin Ultra NVL576 will have 147 TB (100x what was feasible in early 2025, if you are not Google).
At 40K tokens per day (as a human might read or think), 1 month is only 1.2M tokens, and 3 years is only 44M tokens, merely 44x more than what is currently being offered. One issue, of course, is that this gives up some key AI advantages: it doesn't offer a straightforward way of combining learnings from many parallel instances into one mind. And it's not completely clear that it will work at all, but there is also no particular reason that it won't, or that it will even need anything new invented, other than more scale and some RLVR that incentivises making good use of contexts this long.
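The arithmetic in that first sentence, spelled out (the 40K tokens/day rate is the comment's assumption, and the 44x figure implies a current context of about 1M tokens):

```python
# Token-budget arithmetic for a human-like reading/thinking rate.
TOKENS_PER_DAY = 40_000
CURRENT_CONTEXT = 1_000_000     # roughly what long-context models offer today

month_tokens = TOKENS_PER_DAY * 30             # 1,200,000 -> the "1.2M tokens" above
three_year_tokens = TOKENS_PER_DAY * 365 * 3   # 43,800,000 -> the "~44M tokens" above
print(three_year_tokens / CURRENT_CONTEXT)     # ~44x current context
```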
My biggest question mark on this would be medical diagnosis, which I would classify as intellectual labor. Not because I don’t think AI could be capable of medical diagnosis (heck, I think GPT-5 is probably better than most GPs at most tasks now), but because a lack of sufficient training data combined with a heavy resistance to its use might simply make it hard to reach or properly measure this goal.