Will AI Progress Accelerate or Slow Down? Projecting METR Time Horizons
Link post
Some argue that AI progress will speed up as AIs help with their own development. Some argue that we will hit a wall. Will progress be smooth, or punctuated by sudden leaps?
Using the length of tasks that AIs can complete—their time horizon—as a measure of their general capability, this post attempts to shed some light on these questions.
Before reading further, I recommend checking out METR’s evaluations of time horizon in software engineering, if you have not done so already.
METR estimates the task duration (human expert completion time) for software engineering tasks that AIs can complete with 50% success rate (the 50% time horizon), and plots the results in a graph:
(Source: Task-Completion Time Horizons of Frontier AI Models)
METR time horizon is arguably one of the most useful measures for predicting future AI capabilities, and is used in notable forecast initiatives like the AI 2027 scenario and the AI Futures Model. Unlike most benchmarks, there is no ceiling on performance[1]. It correlates strongly with other capability measurements such as the Epoch Capabilities Index, and AI software engineering skill is indicative of how useful AIs are for AI R&D.
Assuming that METR time horizon is a good proxy for AI progress, how should it be projected into the future?
The longer trend line (dashed green line in the figure above) suggests that the 50% time horizon is doubling every 196 days (~6.5 months). However, if we include only models released since 2024, the doubling time drops to just 89 days (~3 months).
Should we expect future progress to follow this faster trend line? Perhaps there will be additional shifts to even faster doubling times (let’s call this the Segmented Exponential projection). The trend could also revert to the longer trend line (Revert to 6.5 Months projection), or become superexponential as AIs improve themselves (Smooth Superexponential). This figure illustrates these scenarios:
Will reality follow one of these projections?
These are not the only possibilities, of course. For example, the trend might revert to 6.5-month doubling time, then undergo another sudden shift in pace, which then turns superexponential.
(Note that the Segmented Exponential scenario may be difficult to distinguish from the Smooth Superexponential scenario in practice, since measurement noise could make both appear as superexponential progress.)
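These projections are easy to make concrete. Below is a minimal sketch in Python; the starting horizon, the segment boundary, and the doubling-time shrink factor are illustrative assumptions, not fitted values:

```python
import math

def horizon_minutes(days, h0=60.0, scenario="revert"):
    """Project the 50% time horizon (in minutes) `days` from a starting
    point where it equals h0. All parameters are illustrative."""
    if scenario == "revert":
        # Revert to 6.5 Months: doubling every 196 days
        return h0 * 2 ** (days / 196)
    if scenario == "segmented":
        # Segmented Exponential: 89-day doublings, then a hypothetical
        # shift to 45-day doublings after two years
        if days <= 730:
            return h0 * 2 ** (days / 89)
        return horizon_minutes(730, h0, "segmented") * 2 ** ((days - 730) / 45)
    if scenario == "superexponential":
        # Smooth Superexponential: each doubling takes 90% as long as the
        # previous one, starting from 89 days; the horizon diverges at
        # 89 / (1 - 0.9) = 890 days
        T0, r = 89.0, 0.9
        x = 1 - days * (1 - r) / T0
        if x <= 0:
            return math.inf
        doublings = math.log(x) / math.log(r)
        return h0 * 2 ** doublings
    raise ValueError(scenario)
```

Plotting all three from the same starting point reproduces the qualitative picture above: the curves are nearly indistinguishable at first and only pull apart after a year or more, which is part of why the scenarios are hard to tell apart empirically.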
The rest of this post reviews the reasons why AI progress might speed up or slow down.
(I will focus on forces affecting the pace of development under relatively “normal” circumstances, setting aside events such as disruptions to the compute supply chain, regulations slowing development, or extremely large investments driven by international competition.)
Speed Up
AI Feedback Loops
AIs are steadily taking a more active role in their own development, enabling feedback loops where smarter AIs accelerate AI R&D. These loops include:
Data generation feedback loop, where AIs generate synthetic training data.
Coding feedback loop, where AIs automate coding tasks for AI R&D.
Research taste feedback loop, where AIs set research directions and select experiments[2].
Chip technology feedback loop, where AIs design better computer chips[3].
Chip production feedback loop, where AIs automate chip manufacturing.
Economic feedback loop, where AIs automate the broader economy.
The first three feedback loops (data generation, coding, and research taste) result in faster software improvements, while the chip technology and chip production loops result in more compute which can be used for AI research or deployment. The economic loop boosts investment in both software development and hardware.
Estimating exactly how active these loops currently are is out of scope for this article, but I’ll include some notes in a footnote[4].
Infinite Time Horizon in Finite Time
As METR themselves note in the original report, future AIs could have infinite time horizon:
If an artificial general intelligence (AGI) is capable of completing all tasks expert humans can with a success rate of at least X%, its X% time horizon will necessarily be infinite.
If this is to be achieved in finite time, the time horizon growth rate must eventually become superexponential.
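A simple way to see why: suppose, as in the Segmented Exponential picture taken to its limit, that each successive doubling of the horizon takes a constant fraction $r < 1$ of the time the previous doubling took. The total time needed for infinitely many doublings is then finite:

```latex
t_\infty \;=\; \sum_{k=0}^{\infty} T_0 \, r^k \;=\; \frac{T_0}{1-r} \;<\; \infty
```

With an initial doubling time of $T_0 = 89$ days and $r = 0.9$ (both illustrative), the horizon would diverge after $t_\infty = 890$ days. Any fixed doubling time, by contrast, never produces an infinite horizon in finite time.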
At least two mechanisms might drive such a development.
The first is considered in the AI 2027 Timelines forecast:
It seems like for humans the gap in difficulty between 1 month and 2 month tasks is lower than between 1 day and 2 days.
When you can already complete month-long tasks, you don’t need to learn many new skills to handle 2-month tasks within the same domain (e.g. software development), so the jump from 1→2 months should be easier than from 1→2 days. In other words, difficulty scales sublinearly with task duration.
The second mechanism is related but distinct: longer tasks are often more decomposable. In the words of Ajeya Cotra:
In other words, very few tasks feel intrinsically like year-long tasks, the way that writing one bash command feels like an intrinsically one-second task, or debugging one simple bug feels intrinsically like a one-hour task. Maybe a mathematician banging their head against a hard conjecture for a year before finally making a breakthrough is a “real” year-long task? But most many-person-month software projects in the real world sort of feel like they might be a bunch of few-week tasks in a trenchcoat, the way that a hundred-question elementary school math test is really 100 thirty-second tasks in a trenchcoat.
New Paradigms
So far, there has only been a single obvious deviation from the original, longer doubling time. This coincided with the paradigm shift from standard transformer models to reasoning models. Perhaps each major paradigm shift comes with its own, faster doubling time.
This would suggest that the doubling time will remain at the faster pace until a new breakthrough changes the field, at which point the doubling time will shorten further. (This hypothesis aligns well with the Segmented Exponential scenario discussed earlier.)
Such breakthroughs are rare—the transformer was introduced in 2017, reasoning models in 2024—and the next could be years away.
AI Teams
AIs have different strengths and weaknesses, meaning that a team of AIs may succeed where a single AI would fail.
This applies to time horizons as well. From METR’s report on GPT-5.1-Codex-Max:
In addition, to estimate a lower bound of the longest time horizons we are able to legitimately measure with our current task suite, we evaluated the performance of a “composite” agent which, for each task in our suite, performs as well as the best-performing agent in that task. This results in a time-horizon measurement of about 10 hours when considering our full task suite, or 15hrs 35m when we ablate the potentially problematic tasks. Note that this is significantly biased upward, because this picks whichever agent got “luckiest” on each task after seeing the results, and may include reward hacking runs. We do not think that there is a way to build a composite agent with anything close to this level of performance if you are only allowed one attempt per task.
The same report estimates that GPT-5.1-Codex-Max has a 50% time horizon of 2hr 42min on the full task suite, and 3hr 28min when excluding “potentially problematic tasks”. This was the longest time horizon so far.
The “composite” agent got roughly 4× the time horizon of any single AI, though as METR notes, this figure is biased upwards due to selecting for lucky agents.
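To make the “composite” construction concrete, here is a toy sketch; the agents, task lengths, and success outcomes are invented for illustration and are not METR’s data:

```python
# (minutes of human expert time, which agents succeeded) -- hypothetical
tasks = [
    (15,   {"a": True,  "b": True,  "c": True}),
    (60,   {"a": True,  "b": False, "c": True}),
    (240,  {"a": False, "b": True,  "c": False}),
    (960,  {"a": False, "b": False, "c": True}),
    (3840, {"a": False, "b": False, "c": False}),
]

def solved(agent):
    """Task lengths this single agent completed."""
    return [m for m, results in tasks if results[agent]]

def composite_solved():
    """Post-hoc composite: a task counts as solved if ANY agent solved it.
    This picks the 'luckiest' agent per task after seeing the results,
    which is why METR flags the resulting horizon as biased upward."""
    return [m for m, results in tasks if any(results.values())]

best_single = max(len(solved(a)) for a in "abc")   # best individual agent
composite = len(composite_solved())                # composite solves more
```

Even in this tiny example the composite’s success profile dominates every individual agent’s (4 of 5 tasks versus at most 3), so a fit of success rate against task length would assign it a longer 50% horizon.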
In the future, it may be more appropriate to estimate time horizons for AI teams, which would probably better represent how AIs will actually be deployed on long and complex tasks.
Transitioning from measuring single-AI time horizons to AI-team time horizons could produce a one-time jump without necessarily steepening the trend.
Slow Down
Reinforcement Learning Scales Poorly
Frontier AI development seems to have shifted toward reinforcement learning (RL) since the rise of reasoning models, which may have contributed to the recent rapid pace of progress (as noted in the New Paradigms section).
Toby Ord argues that this is largely because RL unlocks inference-scaling: AIs can think longer (use more inference compute) while completing a task in order to achieve better results.
Ord also argues that AI progress may slow as AIs approach the “top of the human-range and can no longer copy our best techniques”, at which point imitation-based training yields diminishing returns. RL may be necessary to push beyond the human frontier (which has already happened in many narrow domains, such as several games).
So RL scaling appears to be the way forward, both empirically and conceptually, and it largely works by enabling AIs to think longer while completing tasks. With that background, we can examine the main arguments for why RL scaling may not sustain the recent rapid pace:
Inference-scaling is expensive: Performance gains from scaling training compute apply to all future uses of an AI, but gains from scaling inference compute apply only to the single task at hand. While it improves performance, it may result in higher hourly costs than humans[5].
Counterargument: The cost for inference at a given capability level is falling rapidly over time. Jean-Stanislas Denain points out that software and hardware improvements, combined with AIs learning to reason more concisely, makes inference scaling more affordable.
Scaling RL by several orders of magnitude is no longer feasible: When RL used only a tiny fraction of training compute, it was easy to scale by several orders of magnitude. Now that RL requires significant compute (reportedly ~50% of training compute for Grok 4, based on an image in its launch video), such rapid scaling is no longer possible.
Counterargument: Denain argues that “RL scaling data is thin, and there’s likely been substantial compute efficiency progress in RL since o1 and o3.” While scaling RL compute by several orders of magnitude may be unfeasible, the effectiveness of RL compute usage may still improve dramatically (possibly by several orders of magnitude).
RL training is inefficient: When doing RL on tasks with long completion time, an AI may need to reason extensively before providing an answer. Toby Ord points out that this results in the model receiving very little feedback relative to the effort expended, making such RL very compute expensive. If further development requires RL on even longer tasks, progress could stall due to insufficient compute.
Counterargument: AIs learn much in pre-training, before being trained with RL. This means that the RL has a sound base to improve upon, with existing neural network connections that can be rewired towards completing RL tasks, making RL more efficient than it might first appear. It should also be possible to increase how much AIs learn per task (e.g. by scoring partial success, effectiveness of strategies, etc.). (See also the counterargument to the previous point, which applies here as well.)
Longer training runs: Even if there is sufficient compute, training on long tasks may require extensive wall-clock training time.
Counterargument: AIs can usually complete tasks much faster than humans. Tasks that would take days for human experts may take an AI only hours or minutes, making RL on such tasks much more feasible. RL for physical actuators (e.g. operating a robot) may still take time, but such training could largely be conducted in simulation.
Longer research iteration cycles: Experiments should take longer if they require the AI to complete long tasks, increasing the length of research iteration cycles.
Counterargument: Each experiment might yield correspondingly greater insight for AI R&D. (See also the counterargument for the previous point.)
RL may produce narrow capability gains: Ord argues that RL has a poor track record at instilling general capabilities. It has been used to train superhuman AIs at various games, for instance, but RL on one game doesn’t generally transfer to others.
Counterargument: RL appears to work well for improving general capabilities in reasoning models so far (though this could change as RL is scaled further). RL may also work better for general skills when applied to a model that already possesses highly general capabilities.
It’s possible that the transition to RL scaling will slow the pace of AI progress, but if so, we might have seen signs of a slowdown already. So far, we haven’t.
Return to Baseline
In trend extrapolation, the longer trend is often more robust. The recent rapid pace may be temporary[6]. Perhaps there are diminishing returns to R&D with reasoning models. Perhaps RL doesn’t scale well. Perhaps future paradigm shifts will, on average, yield doubling times closer to 7 months than 3 months.
Time Horizon Overestimation
METR time horizon is measured using standardized tasks that can be evaluated automatically, while real-world tasks are often messy—so time horizons likely overestimate real-world performance.
We can compare time horizons to other benchmarks designed to more closely match real-world task difficulty, such as the Remote Labor Index (RLI). The mean human completion time on RLI tasks is 28.9 hours, with roughly half taking 10 hours or less (see Figure 4 in the report). The highest score so far on this benchmark is just 4.17%, achieved by Opus 4.6, which has an estimated 50% time horizon of ~12 hours and an 80% time horizon of roughly 1 hour 10 minutes.
This suggests that “real-world” time horizon might be significantly lower than METR’s time horizon (though the discrepancy could also reflect a shift in domain, as METR measures software engineering skill while RLI includes projects from multiple sectors).
However, as time horizons increase, tasks grow more complex even when they remain easy to score automatically. This should make agentic capabilities increasingly necessary.
Consider an AI tasked with a coding project which would take a few hours for a human expert. It may succeed without proper documentation or testing, since the project is small enough to get away with it. But for a task requiring a month or more of expert effort? The AI would likely need to do everything properly—writing clean code, testing thoroughly—just as a human must be more careful on large codebases than small ones. Longer tasks should also demand more cross-domain capabilities, failure handling, and possibly coordination with other humans or AIs.
This should reduce time horizon overestimation from automated scoring and narrow evaluation tasks as time horizons grow longer.
(That said, the “real-world” time horizon might also increase rapidly, so this may only result in a temporarily less steep slope.)
Development Bottlenecks
AI progress may stall if major bottlenecks emerge. Some commonly discussed candidates include training data, energy, and compute.
Training compute used for frontier language models has grown by ~5× per year since 2020. How long can that rate be sustained? Can datacenters be built fast enough? Will there be enough energy to power them?
According to Epoch AI, it appears as though AI scaling can continue at a similar pace at least until ~2030 (though the analysis is from August 2024 and assumes 4× annual growth in training compute rather than 5×):
(Source: Can AI scaling continue through 2030?)
(“Latency wall” is another potential bottleneck discussed in the analysis, referring to a type of “speed limit” where extremely large models may require prohibitively long training time.)
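For a sense of scale, here is a quick back-of-the-envelope on what these growth rates compound to; the 5× and 4× annual rates are the ones quoted above, and the arithmetic is the only thing this sketch adds:

```python
def total_growth(annual_rate, years):
    """Cumulative multiplier on training compute after sustained growth."""
    return annual_rate ** years

# ~5x/year sustained from 2020 to 2030 would be a ~10-million-fold increase
print(total_growth(5, 10))   # 9765625

# The assumed rate matters: over 2024-2030, 4x/year vs 5x/year differ ~4x
print(total_growth(4, 6))    # 4096
print(total_growth(5, 6))    # 15625
```

Sustaining such growth for another decade would imply multipliers that strain plausible limits on investment, fab capacity, and energy, which is why the bottleneck analyses below matter.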
The compute bottleneck is also considered in the AI Futures Model:
For the leading AI company’s compute, we project a ∼2x slowdown in its growth rate by 2030, and further slowdowns afterward. This is due to investment and fab capacity constraints. We refer the reader to our supplementary materials for more details on our compute forecasts.
The model also considers other potential bottlenecks. For instance, ideas for increasing AI capabilities become harder to find over time.
A few additional bottlenecks deserving further scrutiny:
High-quality training data for complex tasks may take a long time to gather, as it requires human experts working on tasks that can take weeks or months.
Human overseer capacity could become a limiting factor if AIs cannot be trusted to oversee other AI systems during training and deployment.
Insufficient security could also delay progress, if AI companies must take extreme measures against model theft or other IP threats, forcing time-consuming security protocols or slowing development until security systems improve.
Finally, AI companies might deliberately delay the release of their most capable models, for several possible reasons:
They don’t want everyone to be aware of how dangerous their AIs are, which could increase anti-AI sentiment and provoke regulatory restrictions.
They are not confident they can detect and prevent all misuse.
They need more time for alignment training and safety testing.
They need more time for ensuring compliance with regulations.
They want to conceal information about their most capable models from competitors.
Delayed releases are perhaps more likely if one or very few companies are several months ahead and can afford to wait without being overtaken.
This actually occurs in the AI 2027 scenario, where the fictional leading lab OpenBrain decides to withhold its AI called Agent-2, which could “autonomously develop and execute plans to hack into AI servers, install copies of itself, evade detection, and use that secure base to pursue whatever other goals it might have”:
Knowledge of Agent-2’s full capabilities is limited to an elite silo containing the immediate team, OpenBrain leadership and security, a few dozen U.S. government officials, and the legions of CCP spies who have infiltrated OpenBrain for years.
Note that delayed releases will only produce the appearance of slower AI development. The METR time horizons trend could continue at its breakneck pace or accelerate significantly, while the general public remains blissfully ignorant.
Key Takeaways
The time horizons trend will probably become superexponential at some point, as the Infinite Time Horizon in Finite Time argument suggests.
When AIs develop sufficiently good research taste, the pace of self-improvement in taste will probably be the primary driver of further improvement.
In the near future, the RL Scales Poorly argument is probably the strongest case for slowdown, while Development Bottlenecks become more important around 2030.
METR’s time horizon appears to overestimate real-world performance, but also underestimates the performance of AIs deployed in teams.
Even after analyzing all these arguments, I find it difficult to project time horizons into the future with confidence. I still don’t know if one of the scenarios outlined earlier will prove correct. But I feel I understand the landscape better now, which counts for something.
If you have any insights I’ve missed, please comment!
Thank you for reading! If you found value in this post, please consider subscribing!
Data generation: There is little public information on how synthetic data is used for training frontier AIs, but I suspect it to be quite useful.
Coding: Both OpenAI and Anthropic claim that they used their own AIs to build their latest AI models, utilizing AI coding skill.
Research Taste: Current AIs may not be sophisticated enough to outperform human experts in suggesting experiments, though they are highly useful for research tasks such as finding and summarizing research articles.
Chip technology: Nvidia has experimented with AI assistants for their chip designers, while Google DeepMind developed an AI called AlphaChip to “accelerate and optimize chip design”.
Chip production: Robots are already being used in chip manufacturing, but the process could surely be further automated. At some level of robotic and AI capability, factories would not require human labor at all.
Economic: AIs automate parts of the economy, and some of the resulting economic growth is reinvested into AI R&D or hardware (for an analysis on this, see the GATE model). AI infrastructure investments (such as datacenters) are already in the hundreds of billions of US dollars.
Note that AIs use more inference while completing longer tasks, which doesn’t necessarily increase inference costs compared to humans (who would also need to spend more time on such tasks). See this post by Ryan Greenblatt.
Will AI Progress Accelerate or Slow Down? Projecting METR Time Horizons
Link post
Some argue that AI progress will speed up as AIs help with their own development. Some argue that we will hit a wall. Will progress be smooth, or punctuated by sudden leaps?
Using the length of tasks that AIs can complete—their time horizon—as a measure of their general capability, this post attempts to shed some light on these questions.
Before reading further, I recommend checking out METR’s evaluations of time horizon in software engineering, if you have not done so already.
METR estimates the task duration (human expert completion time) for software engineering tasks that AIs can complete with 50% success rate (the 50% time horizon), and plots the results in a graph:
(Source: Task-Completion Time Horizons of Frontier AI Models)
METR time horizon is arguably one of the most useful measures for predicting future AI capabilities, and is used in notable forecast initiatives like the AI 2027 scenario and the AI Futures Model. Unlike most benchmarks, there is no ceiling on performance[1]. It correlates strongly with other capability measurements such as the Epoch Capabilities Index, and AI software engineering skill is indicative of how useful AIs are for AI R&D.
Assuming that METR time horizon is a good proxy for AI progress, how should it be projected into the future?
The longer trend line (dashed green line in the figure above) suggests that the 50% time horizon is doubling every 196 days (~6.5 months). However, if we include only models released since 2024, the doubling time drops to just 89 days (~3 months).
Should we expect future progress to follow this faster trend line? Perhaps there will be additional shifts to even faster doubling times (let’s call this the Segmented Exponential projection). The trend could also revert to the longer trend line (Revert to 6.5 Months projection), or become superexponential as AIs improve themselves (Smooth Superexponential). This figure illustrates these scenarios:
Will reality follow one of these projections?
These are not the only possibilities, of course. For example, the trend might revert to 6.5-month doubling time, then undergo another sudden shift in pace, which then turns superexponential.
(Note that the Segmented Exponential scenario may be difficult to distinguish from the Smooth Superexponential scenario in practice, since measurement noise could make both appear as superexponential progress.)
The rest of this post reviews the reasons why AI progress might speed up or slow down.
(I will focus on forces affecting the pace of development under relatively “normal” circumstances, setting aside events such as disruptions to the compute supply chain, regulations slowing development, or extremely large investments driven by international competition.)
Speed Up
AI Feedback Loops
AIs are steadily taking a more active role in their own development, enabling feedback loops where smarter AIs accelerate AI R&D. These loops include:
Data generation feedback loop, where AIs generate synthetic training data.
Coding feedback loop, where AIs automate coding tasks for AI R&D.
Research taste feedback loop, where AIs set research directions and select experiments[2].
Chip technology feedback loop, where AIs design better computer chips[3].
Chip production feedback loop, where AIs automate chip manufacturing.
Economic feedback loop, where AIs automate the broader economy.
The first three feedback loops (data generation, coding, and research taste) result in faster software improvements, while the chip technology and chip production loops result in more compute which can be used for AI research or deployment. The economic loop boosts investment in both software development and hardware.
Estimating exactly how active these loops currently are is out-of-scope for this article, but I’ll include some notes in a footnote[4].
Infinite Time Horizon in Finite Time
As METR themselves note in the original report, future AIs could have infinite time horizon:
If this is to be achieved in finite time, the time horizon growth rate must eventually become superexponential.
At least two mechanisms might drive such a development.
The first is considered in the AI 2027 Timelines forecast:
When you can already complete month-long tasks, you don’t need to learn many new skills to handle 2-month tasks within the same domain (e.g. software development), so the jump from 1→2 months should be easier than from 1→2 days. In other words, difficulty scales sublinearly with task duration.
The second mechanism is related but distinct: longer tasks are often more decomposable. In the words of Ajeya Kotra:
New Paradigms
So far, there has only been a single obvious deviation from the original, longer doubling time. This coincided with the paradigm shift from standard transformer models to reasoning models. Perhaps each major paradigm shift comes with its own, faster doubling time.
This would suggest that the doubling time will remain at the faster pace until a new breakthrough changes the field, at which point the doubling time will shorten further. (This hypothesis aligns well with the Segmented Exponential scenario discussed earlier.)
Such breakthroughs are rare—the transformer was introduced in 2017, reasoning models in 2024—and the next could be years away.
AI Teams
AIs have different strengths and weaknesses, meaning that a team of AIs may succeed where a single AI would fail.
This applies to time horizons as well. From METR’s report on GPT-5.1-Codex-Max:
The same report estimates that GPT-5.2-Codex-Max has a 50% time horizon of 2hr 42min on the full task suite, and 3hr 28min when excluding “potentially problematic tasks”. This was the longest time horizon so far.
The “composite” agent got roughly 4× the time horizon of any single AI, though as METR notes, this figure is biased upwards due to selecting for lucky agents.
In the future, it may be more appropriate to estimate time horizons for AI teams, which would probably better represent how AIs will actually be deployed on long and complex tasks.
Transitioning from measuring single-AI time horizons to AI-team time horizons could produce a one-time jump without necessarily steepening the trend.
Slow Down
Reinforcement Learning Scales Poorly
Frontier AI development seems to have shifted toward reinforcement learning (RL) since the rise of reasoning models, which may have contributed to the recent rapid pace of progress (as noted in the New Paradigms section).
Toby Ord argues that this is largely because RL unlocks inference-scaling: AIs can think longer (use more inference compute) while completing a task in order to achieve better results.
Ord also argues that AI progress may slow as AIs approach the “top of the human-range and can no longer copy our best techniques”, at which point imitation-based training yields diminishing returns. RL may be necessary to push beyond the human frontier (which has already happened in many narrow domains, such as several games).
So RL scaling appears to be the way forward, both empirically and conceptually, and it largely works by enabling AIs to think longer while completing tasks. With that background, we can examine the main arguments for why RL scaling may not sustain the recent rapid pace:
Inference-scaling is expensive: Performance gains from scaling training compute applies to all future uses of an AI, but scaling inference compute applies only to a single task at a time. While it improves performance, it may result in higher hourly costs than humans[5].
Counterargument: The cost for inference at a given capability level is falling rapidly over time. Jean-Stanislas Denain points out that software and hardware improvements, combined with AIs learning to reason more concisely, makes inference scaling more affordable.
Scaling RL by several orders of magnitude is no longer feasible: When RL used only a tiny fraction of training compute, it was easy to scale by several orders of magnitude. Now that RL requires significant compute (reportedly ~50% of training compute for Grok 4, based on an image in its launch video), such rapid scaling is no longer possible.
Counterargument: Denain argues that “RL scaling data is thin, and there’s likely been substantial compute efficiency progress in RL since o1 and o3.” While scaling RL compute by several orders of magnitude may be unfeasible, the effectiveness of RL compute usage may still improve dramatically (possibly by several orders of magnitude).
RL training is inefficient: When doing RL on tasks with long completion time, an AI may need to reason extensively before providing an answer. Toby Ord points out that this results in the model receiving very little feedback relative to the effort expended, making such RL very compute expensive. If further development requires RL on even longer tasks, progress could stall due to insufficient compute.
Counterargument: AIs learn much in pre-training, before being trained with RL. This means that the RL has a sound base to improve upon, with existing neural network connections that can be rewired towards completing RL tasks, making RL more efficient than it might first appear. It should also be possible to increase how much AIs learn per task (e.g. by scoring partial success, effectiveness of strategies, etc.). (See also the counterargument to the previous point, which applies here as well.)
Longer training runs: Even if there is sufficient compute, training on long tasks may require extensive wall-clock training time.
Counterargument: AIs can usually complete tasks much faster than humans. Tasks that would take days for human experts may take an AI only hours or minutes, making RL on such tasks much more feasible. RL for physical actuators (e.g. operating a robot) may still take time, but such training could largely be conducted in simulation.
Longer research iteration cycles: Experiments should take longer if they require the AI to complete long tasks, increasing the length of research iteration cycles.
Counterargument: Each experiment might yield correspondingly greater insight for AI R&D. (See also the counterargument for the previous point.)
RL may produce narrow capability gains: Ord argues that RL has a poor track record at instilling general capabilities. It has been used to train superhuman AIs at various games, for instance, but RL on one game doesn’t generally transfer to others.
Counterargument: RL appears to work well for improving general capabilities in reasoning models so far (though this could change as RL is scaled further). RL may also work better for general skills when applied to a model that already possesses highly general capabilities.
It’s possible that the transition to RL scaling will slow the pace of AI progress, but if so, we might have seen signs of a slowdown already. So far, we haven’t.
Return to Baseline
In trend extrapolation, the longer trend is often more robust. The recent rapid pace may be temporary[6]. Perhaps there are diminishing returns to R&D with reasoning models. Perhaps RL doesn’t scale well. Perhaps future paradigm shifts will, on average, yield doubling times closer to 7 months than 3 months.
Time Horizon Overestimation
METR time horizon is measured using standardized tasks that can be evaluated automatically, while real-world tasks are often messy—so time horizons likely overestimate real-world performance.
We can compare time horizons to other benchmarks designed to more closely match real-world task difficulty, such as the Remote Labor Index (RLI). The mean human completion time on RLI tasks is 28.9 hours, with roughly half taking 10 hours or less (see Figure 4 in the report). The highest score so far is only 4.17% on this benchmark, achieved by Opus 4.6, which has an estimated 50% time horizon of ~12 hours and an 80% time horizon at 1h and 10 minutes.
This suggests that “real-world” time horizon might be significantly lower than METR’s time horizon (though the discrepancy could also reflect a shift in domain, as METR measures software engineering skill while RLI includes projects from multiple sectors).
However, as time horizons increase, tasks grow more complex even when they remain easy to score automatically. This should make agentic capabilities increasingly necessary.
Consider an AI tasked with a coding project that would take a human expert a few hours. It may succeed without proper documentation or testing, since the project is small enough to get away with it. But for a task requiring a month or more of expert effort? The AI would likely need to do everything properly—writing clean code, testing thoroughly—just as a human must be more careful on large codebases than small ones. Longer tasks should also demand more cross-domain capabilities, failure handling, and possibly coordination with other humans or AIs.
This should reduce time horizon overestimation from automated scoring and narrow evaluation tasks as time horizons grow longer.
(That said, the “real-world” time horizon might also increase rapidly, so this may only result in a temporarily less steep slope.)
Development Bottlenecks
AI progress may stall if major bottlenecks emerge. Some commonly discussed candidates include training data, energy, and compute.
Training compute used for frontier language models has grown by ~5× per year since 2020. How long can that rate be sustained? Can datacenters be built fast enough? Will there be enough energy to power them?
According to Epoch AI, it appears as though AI scaling can continue at a similar pace at least until ~2030 (though the analysis is from August 2024 and assumes 4× annual growth in training compute rather than 5×):
(Source: Can AI scaling continue through 2030?)
(“Latency wall” is another potential bottleneck discussed in the analysis, referring to a type of “speed limit” where extremely large models may require prohibitively long training time.)
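To make these growth rates concrete, a back-of-the-envelope sketch (the annual factors are from the text above; the six-year window, roughly 2024 to 2030, is illustrative):

```python
def compute_growth(years, annual_factor):
    """Cumulative multiplier on training compute after `years` of scaling."""
    return annual_factor ** years

# Total growth over six years under the two annual rates discussed above.
for rate in (4, 5):
    print(f"{rate}x/year for 6 years -> {compute_growth(6, rate):,}x total")
```

Sustaining 5× per year for six years means a roughly 15,000-fold increase in training compute, which is why datacenter construction and energy supply dominate the bottleneck discussion.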
The compute bottleneck is also considered in the AI Futures Model:
The model also considers other potential bottlenecks. For instance, ideas for increasing AI capabilities become harder to find over time.
A few additional bottlenecks deserving further scrutiny:
High-quality training data for complex tasks may take a long time to gather, as it requires human experts working on tasks that can take weeks or months.
Human overseer capacity could become a limiting factor if AIs cannot be trusted to oversee other AI systems during training and deployment.
Insufficient security could also delay progress if AI companies must take extreme measures against model theft or other IP threats, forcing time-consuming security protocols or slowing development until security systems improve.
Delayed Releases
As AI companies develop highly dangerous AIs (e.g. AIs that can help humans acquire biological weapons, or AIs that can replicate autonomously), they may hesitate to release these models to the general public, even as mere chatbots. Reasons include:
They don’t want everyone to be aware of how dangerous their AIs are, which could increase anti-AI sentiment and provoke regulatory restrictions.
They are not confident they can detect and prevent all misuse.
They need more time for alignment training and safety testing.
They need more time to ensure compliance with regulations.
They want to conceal information about their most capable models from competitors.
Delayed releases are perhaps more likely if one or a few companies are several months ahead and can afford to wait without being overtaken.
This actually occurs in the AI 2027 scenario, where the fictional leading lab OpenBrain decides to withhold its AI called Agent-2, which could “autonomously develop and execute plans to hack into AI servers, install copies of itself, evade detection, and use that secure base to pursue whatever other goals it might have”:
Note that delayed releases will only produce the appearance of slower AI development. The METR time horizons trend could continue at its breakneck pace or accelerate significantly, while the general public remains blissfully ignorant.
Key Takeaways
The time horizons trend will probably become superexponential at some point, as the Infinite Time Horizon in Finite Time argument suggests.
When AIs develop sufficiently good research taste, the pace of self-improvement in taste will probably be the primary driver of further improvement.
In the near future, the RL Scales Poorly argument is probably the strongest case for slowdown, while Development Bottlenecks become more important around 2030.
METR’s time horizon appears to overestimate real-world performance, but it also underestimates the performance of AIs deployed in teams.
Even after analyzing all these arguments, I find it difficult to project time horizons into the future with confidence. I still don’t know if one of the scenarios outlined earlier will prove correct. But I feel I understand the landscape better now, which counts for something.
If you have any insights I’ve missed, please comment!
Thank you for reading! If you found value in this post, please consider subscribing!
Although measuring time horizons is becoming increasingly difficult and expensive over time, as longer evaluation tasks are required.
This is a core element of the AI Futures Model.
This feedback loop, as well as the chip production feedback loop, is discussed by Forethought in Three Types of Intelligence Explosion.
The data generation, coding, and research taste loops are all included in their notion of a software feedback loop.
Data generation: There is little public information on how synthetic data is used for training frontier AIs, but I suspect it to be quite useful.
Coding: Both OpenAI and Anthropic claim that they used their own AIs to build their latest AI models, utilizing AI coding skill.
Research Taste: Current AIs may not be sophisticated enough to outperform human experts in suggesting experiments, though they are highly useful for research tasks such as finding and summarizing research articles.
Chip technology: Nvidia has experimented with AI assistants for their chip designers, while Google DeepMind developed an AI called AlphaChip to “accelerate and optimize chip design”.
Chip production: Robots are already being used in chip manufacturing, but the process could surely be further automated. At some level of robotic and AI capability, factories would not require human labor at all.
Economic: AIs automate parts of the economy, and some of the resulting economic growth is reinvested into AI R&D or hardware (for an analysis on this, see the GATE model). AI infrastructure investments (such as datacenters) are already in the hundreds of billions of US dollars.
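A minimal toy version of such a reinvestment loop, with all parameters invented purely for illustration (this is not the GATE model, just a sketch of the feedback structure):

```python
def reinvestment_loop(steps, capability=1.0, econ_output=1.0,
                      automation_gain=0.3, reinvest_frac=0.2):
    """Toy feedback loop: AI capability boosts economic output,
    part of which is reinvested to boost capability further."""
    history = []
    for _ in range(steps):
        econ_output *= 1 + automation_gain * capability  # AI-driven growth
        capability *= 1 + reinvest_frac * econ_output    # reinvested gains
        history.append((capability, econ_output))
    return history

for i, (cap, econ) in enumerate(reinvestment_loop(5), start=1):
    print(f"step {i}: capability={cap:.2f}, output={econ:.2f}")
```

Because each variable multiplies the growth of the other, even modest parameters produce faster-than-exponential growth in this toy setup, which is the qualitative behavior the economic feedback argument points to.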
Note that AIs use more inference while completing longer tasks, which doesn’t necessarily increase inference costs compared to humans (who would also need to spend more time on such tasks). See this post by Ryan Greenblatt.
This argument was partially inspired by this post by johncrox on LessWrong.