Reasons time horizon is overrated and misinterpreted:
(This post is now live on the METR website in a slightly edited form)
In the 9 months since the METR time horizon paper was published (during which AI time horizons have increased by ~6x), it has generated lots of attention as well as various criticism on LW and elsewhere. As one of the main authors, I think much of the criticism is a valid response to misinterpretations, and want to list my beliefs about the limitations of our methodology and of time horizon more broadly. This is not a complete list, but rather whatever I thought of in a few hours.
Time horizon is not the length of time AIs can work independently
Rather, it’s the amount of serial human labor they can replace with a 50% success rate. When AIs solve tasks they’re usually much faster than humans.
Time horizon is not precise
When METR says “Claude Opus 4.5 has a 50% time horizon of around 4 hrs 49 mins (95% confidence interval of 1 hr 49 mins to 20 hrs 25 mins)”, we mean those error bars. They were generated by bootstrapping over tasks, so when a resample happens to draw a harder subset of tasks, our code spits out a value below 1h49m; that happens about 2.5% of the time. I really have no idea whether Claude’s “true” time horizon is 3.5h or 6.5h.
Error bars have historically been a factor of ~2 in each direction, worse with current models like Opus 4.5 as our benchmark begins to saturate.
Because model performance is correlated, error bars for relative comparisons between models are a bit smaller. But it still makes little sense to care about whether a model is just below frontier, 10% above the previous best model, or 20% above.
Time horizon differs between domains by orders of magnitude
The original paper measured it on mostly software and research tasks. Applying the same methodology in a follow-up found that time horizons are fairly similar for math, but 40-100x lower for visual computer use tasks, due to eg poor perception.
Claude 4.5 Sonnet’s real-world coffee-making time horizon is only ~2 minutes
Time horizon does not apply to every task distribution
On SWE-Lancer, OpenAI observed that a task’s monetary value (which should be a decent proxy for engineer-hours) doesn’t correlate with a model’s success rate. I still don’t know why this is.
Benchmark vs real-world task distribution
We’re making tasks just ahead of what we expect future models to be able to do, and benchmark construction has many design choices.
We try to make tasks representative of the real world, but as in any benchmark, there are inherent tradeoffs between realism, diversity, fixed costs (implementation), and variable costs (ease of running the benchmark). Inspect has made this easier but there will obviously be factors that cause our benchmarks to favor or disfavor models.
Because anything automatically gradable can be an RL environment, and models are extensively trained using RLVR [1], making gradable tasks that don’t overestimate real-world performance at all essentially means building more realistic RLVR environments than the labs themselves use, which is hard.
Figure 1: What it feels like making benchmarks before frontier models saturate them
Our benchmarks differ from the real world in many ways, some of which are discussed in the original paper.
Low vs high context (low-context tasks are isolated and don’t require prior knowledge about a codebase)
Well-defined vs poorly defined
“Messy” vs non-messy tasks (see section 6.2 of original paper)
Different conventions around human baseline times could affect time horizon by >1.25x.
I think we made reasonable choices, but there were certainly judgement calls here; the most important thing was to be consistent.
Baseliner skill level: Our baseliner pool was “skilled professionals in software engineering, machine learning, and cybersecurity”, but top engineers, e.g. lab employees, would be faster.
We didn’t incorporate failed baselines into time estimates because baseliners often failed for non-relevant reasons. If we used survival analysis to interpret an X-hour failed baseline as information that the task takes >X hours, we would increase measured task lengths.
When a task had multiple successful baselines we aggregated these using the geometric mean. Baseline times have high variance, so using the arithmetic mean would increase averages by ~25%.
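For intuition, here’s a minimal sketch of how that aggregation choice plays out, with made-up baseline times (not our data):

```python
import numpy as np

# Hypothetical baseline times (minutes) for one task from three baseliners.
# Illustrative numbers only, not actual METR data.
times = np.array([45.0, 70.0, 160.0])

arithmetic = times.mean()                 # pulled upward by the slow outlier
geometric = np.exp(np.log(times).mean())  # equivalent to scipy.stats.gmean(times)

print(f"arithmetic mean: {arithmetic:.0f} min")  # ~92 min
print(f"geometric mean:  {geometric:.0f} min")   # ~80 min
```

Since the rest of the analysis works with log task lengths, the geometric mean is also the natural aggregation on that scale.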
A 50% time horizon of X hours does not mean we can delegate tasks under X hours to AIs.
Some (reliability-critical and poorly verifiable) tasks require 98%+ success probabilities to be worth automating
Doubling the time horizon does not double the degree of automation. Even if the AI requires half as many human interventions, it will probably fail in more complex ways requiring more human labor per intervention.
To convert time horizons to research speedup, we need to measure how much time a human spends prompting AIs, waiting for generations, checking AI output, writing code manually, etc. when doing an X hour task assisted by an AI with time horizon Y hours. Then we plug this into the uplift equation. This process is nontrivial and requires a much richer data source like Cursor logs or screen recordings.
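As a toy illustration of that conversion, here’s a sketch that assumes the simple Amdahl-style speedup 1/(1-f) mentioned in the comments below, with made-up time fractions:

```python
# Toy sketch of turning assisted-workflow time measurements into a speedup.
# All numbers are made up; the real version needs data like Cursor logs or
# screen recordings.

hours_unassisted = 8.0   # hypothetical: the task takes 8 h without AI
hours_prompting  = 0.5   # human time spent prompting / waiting on generations
hours_reviewing  = 1.5   # human time checking and fixing AI output
hours_manual     = 2.0   # parts still written by hand

hours_assisted = hours_prompting + hours_reviewing + hours_manual
speedup = hours_unassisted / hours_assisted       # 2x in this toy example

# Equivalent "uplift fraction" f: the share of the original labor that no
# longer needs human time. Then speedup == 1 / (1 - f).
f = 1 - hours_assisted / hours_unassisted
print(f"speedup = {speedup:.2f}x, f = {f:.2f}")
```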
20% and 80% time horizons are kind of fake because there aren’t enough parameters to fit them separately.
We fit a two-parameter logistic model, which doesn’t fit the top and bottom of the success curve separately, so improving performance on 20%-horizon tasks can lower the 80% horizon.
It would be better to use some kind of spline with logit link and monotonicity constraint. The reasons we haven’t done this yet: (a) 80% time horizon was kind of an afterthought/robustness check, (b) we wanted our methods to be easily understandable, (c) there aren’t enough tasks to fit more than a couple more parameters, and (d) anything more sophisticated than logistic regression would take longer to run, and we do something like 300,000 logistic fits (mostly for bootstrapped confidence intervals) to reproduce the pipeline. I do recommend doing this for anyone who wants to measure higher quantiles and has a large enough benchmark to do so meaningfully.
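For concreteness, here’s a minimal sketch of the two-parameter fit on synthetic data (not our pipeline or our tasks), showing how the 50% and 80% horizons are both read off the same fitted curve:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic runs: task lengths in minutes and whether the model succeeded.
lengths = np.exp(rng.uniform(np.log(1), np.log(960), size=400))  # 1 min to 16 h
true_p = 1 / (1 + (lengths / 120) ** 1.2)                        # made-up "true" curve
success = rng.random(400) < true_p

# Two-parameter logistic model: P(success) = sigmoid(a + b * log2(length)).
X = np.log2(lengths).reshape(-1, 1)
fit = LogisticRegression(C=1e6).fit(X, success)  # C=1e6 ~ unregularized
a, b = fit.intercept_[0], fit.coef_[0, 0]

def horizon(q):
    """Task length (minutes) at which the fitted success probability equals q."""
    logit_q = np.log(q / (1 - q))
    return 2 ** ((logit_q - a) / b)

print(f"50% horizon: {horizon(0.5):.0f} min, 80% horizon: {horizon(0.8):.0f} min")
# A single slope b governs the whole curve, so the 20% and 80% horizons are
# rigid functions of the 50% horizon and that slope, not independent estimates.
```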
Time horizons at 99%+ reliability levels cannot be fit at all without much larger and higher-quality benchmarks.
Measuring 99% time horizons would require ~300 highly diverse tasks in each time bucket. If the tasks are not highly diverse and realistic, we could fail to sample the type of task that would trip up the AI in actual use.
The tasks also need <<1% label noise. If they’re broken/unfair/have label noise, the benchmark could saturate at 98% and we would estimate the 99% time horizon of every model to be zero.
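A rough back-of-envelope behind those two numbers (my own arithmetic, not from the paper):

```python
import math

# Distinguishing a 99% success rate from, say, 97% requires the binomial
# standard error at p = 0.99 to be well under one percentage point.
n, p = 300, 0.99
se = math.sqrt(p * (1 - p) / n)
print(f"standard error with {n} tasks: {se:.2%}")  # ~0.57%, i.e. ~±1.1% at 95%

# Label noise caps measurable reliability: if ~2% of tasks are broken or
# unfairly graded, even a perfect model tops out around 98%, the fitted curve
# never reaches 99%, and every model's estimated 99% horizon collapses to zero.
broken = 0.02
print(f"ceiling on observed success rate: {1 - broken:.0%}")
```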
Speculating about the effects of a months- or years-long time horizon is fraught.
The distribution of tasks from which the suite is drawn is not super well-defined, and so different reasonable extrapolations could get quite different time-horizon trends.
One example: all of the tasks in METR-HRS are self-contained, whereas most months-long tasks humans do require collaboration.
If an AI has a 3-year time horizon, does this mean it can competently substitute for a human on a 3-year project with the same level of feedback from a manager, or that it can do the human’s job completely independently? We have no tasks involving multi-turn interaction with a human, so there is no right answer.
There is a good argument that AGI would have an infinite time horizon and so time horizon will eventually start growing superexponentially. However, the AI Futures timelines model is highly sensitive to exactly how superexponential future time horizon growth will be, which we have little data on. This parameter, “Doubling Difficulty Growth Factor”, can change the date of the first Automated Coder AI between 2028 and 2050.
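To see why that parameter dominates, here’s a toy simulation of a trend whose doubling time shrinks (or grows) by a constant factor with each doubling. It is my own simplification with made-up start and threshold values, not the actual AI Futures model:

```python
# Toy model: the time horizon doubles repeatedly; each doubling takes
# `growth_factor` times as long as the previous one (< 1 is superexponential,
# 1 is pure exponential, > 1 is subexponential). All numbers are illustrative.

def years_until_threshold(start_hours, threshold_hours, doubling_years, growth_factor):
    horizon, years = start_hours, 0.0
    while horizon < threshold_hours:
        years += doubling_years
        doubling_years *= growth_factor
        horizon *= 2
        if years > 100:                # effectively "never" on these timescales
            return float("inf")
    return years

start = 5.0            # roughly today's 50% horizon in hours
threshold = 40 * 167   # a made-up "Automated Coder" horizon (~40 work-months)
for g in (0.8, 0.9, 1.0, 1.1):
    t = years_until_threshold(start, threshold, doubling_years=0.55, growth_factor=g)
    print(f"growth factor {g}: threshold reached in ~{t:.1f} years")
# Small changes to the growth factor swing the arrival date by many years.
```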
Despite these limitations, what conclusions do I still stand by?
The most important numbers to estimate were the slope of the long-run trend (one doubling every 6-7 months) and a linear extrapolation of this trend predicting when AIs would reach 1 month / 167 working hours time horizon (2030), not the exact time horizon of any particular model. I think the paper did well here.
Throughout the project we did the least work we could to establish a sufficiently robust result, because task construction and baselining were both super expensive. As a result, the data are insufficient to do some secondary and subset analyses. I still think it’s fine but have increasing worries as the benchmark nears saturation.
Without SWAA the error bars are super wide, and SWAA is lower quality than some easy (non-software) benchmarks like GSM8k. This might seem worrying, but it’s fine because it doesn’t actually matter for the result whether GPT-2’s time horizon is 0.5 seconds or 3 seconds; the slope of the trend is pretty similar. All that matters is that we can estimate it at all with a benchmark that isn’t super biased.
Some tasks have time estimates rather than actual human baselines, and the tasks that do have baselines have few of them. This is statistically ok because in our sensitivity analysis, adding IID baseline noise had minimal impact on the results, and the range of task lengths (spanning roughly a factor of 10,000) means that even baselining error correlated with task length wouldn’t affect the doubling time much (see the back-of-envelope sketch below).
However, Tom Cunningham points out that most of the longer tasks don’t have baselines, so if we’re systematically over/under-estimating the length of long tasks we could be misjudging the degree of acceleration in 2025.
The paper had a small number of tasks (only ~170) because we prioritized quality over quantity. The dataset size was originally fine but is now becoming a problem, as we lack longer (2h+) tasks to evaluate future models.
I think we’re planning to update the task suite soon to include most of the HCAST tasks (the original paper had only a subset) plus some new tasks. Beyond this, we have various plans to continue measuring AI capabilities, both through benchmarks and other means like RCTs.
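Here’s the back-of-envelope referenced above for why baselining error correlated with task length has limited leverage on the doubling time (my arithmetic, not from the paper):

```python
import math

# Suppose the longest tasks' baselines were systematically off by a factor of 2
# relative to the shortest ones. The task lengths span a factor of ~10,000
# (~13.3 doublings), so the measured range stretches or shrinks by only about
# one doubling, changing the fitted slope (and hence the doubling time) by
# roughly 1/13.
length_range_doublings = math.log2(10_000)   # ~13.3
bias_doublings = math.log2(2)                # one doubling of systematic error
print(f"relative error in doubling time: ~{bias_doublings / length_range_doublings:.0%}")
```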
[1] see eg DeepSeek R1 paper: https://arxiv.org/abs/2501.12948
I basically agree with everything you say here and wish we had a better way to try to ground AGI timelines forecasts. Do you recommend any other method? E.g. extrapolating revenue? Just thinking through arguments about whether the current paradigm will work, and then using intuition to make the final call? We discuss some methods that appeal to us here.
Note that we allow it to go subexponential, so actually it can push the date arbitrarily far into the future if you really want it to. Also, dunno what’s happening with Eli’s parameters, but with my parameter settings, setting the doubling difficulty growth factor to 1 (i.e. a pure exponential trend, neither super- nor subexponential) gets to AC in 2035. (Though I don’t think we should put much weight on this number, as it depends on other parameters which are subjective & important too, such as the horizon length that corresponds to AC, which people disagree a lot about.)
The simple model I mentioned on Slack (still WIP, hopefully to be written up this week) tracks capability directly in terms of labor speedup and extrapolates that. Of course, for a more serious timelines forecast you have to ground it in some data.
Here’s what I said to Eli on Slack; I don’t really have more thoughts since then
we can get f_2026 [uplift fraction in 2026] from
transcripts of realistic Cursor usage + success judge + difficulty judge calibrated on tasks of known lengths
uplift study
asking lab people about their current uplift (since parallel uplift and 1/(1-f) are equivalent in the simple model)
v [velocity of automation as capabilities improve] can be obtained by
guessing the distribution of tasks, using time horizon, maybe using a correction factor for real vs benchmark time horizon
multiple uplift studies over time
comparing older models to newer ones, or having them try things people use 4.5 opus for
listing how many things get automated each year
Nice. Yeah I also am excited about coding uplift as a key metric to track that would probably make time horizons obsolete (or at least, constitute a significantly stronger source of evidence than time horizons). We at AIFP don’t have capacity to estimate the trend in uplift over time (I mean we can do small-N polls of frontier AI company employees...) but we hope someone does.
My understanding is that you can still have a similarly unattractive issue with the 50% time horizon, where performing better at long task lengths can reduce the 50% time horizon by making the fitted slope less steep, but it doesn’t seem to be as large an issue as with the 20% and 80% horizons.
Yep! Here’s an example where the 50% horizon and 80% horizon can be lower for an agent whose success profile dominates another agent (i.e. higher success rate at all task lengths), even for
(1) monotone nonincreasing success rates (i.e. longer tasks are harder)
(2) success rate of 1 at minimum task length
(3) success rate of 0 at maximum task length
before points are
[(0,1), (1, 1/15), (2, 0), (3,0)]
after points are
[(0,1), (1, 0.1), (2, 0.1), (3, 1/15)]
https://www.desmos.com/calculator/nqwn6ofmzq
I doubt that reducing the 50% TH is likely. Aside from Claude Opus 4.5, the four other[1] historic METR graphs I can easily find (GPT-5.1 Codex Max and the trio of GPT-5, Grok 4, o3) display similarly sharp slopes in the region close to the 50% time horizon. Imagine a model which solves a task of length t with probability P = 0.95 / (1 + (t/t_hor)^α). Fitting such a model on the METR benchmark would be very unlikely to lower the 50% horizon well below t_hor unless t_hor was close to the simplest tasks. But it would likely lower the 80% TH (think of Grok 4 and Claude Opus 4.5) and, if t_hor is close to the hardest tasks, elevate the 50% TH.
IIRC METR compiled a list of graphs for the 12 pre-o3 models which were SOTA at the time of release, but I can’t find it. UPD: found it.
Hi Thomas,
I’m Niels, a video journalist at KRO-NCRV/Pointer in the Netherlands. I’ve been following your work on AI time horizons, and I’m building a piece around the question of why AI progress looks so different on clean benchmarks versus messy, real-world tasks.
The messiness analysis in your Long Tasks paper — and the performance drop it reveals — is exactly what I’d like to discuss. I think it’s one of the key underreported findings in recent AI research.
Would you be up for a 20-minute video call? Happy to work around your schedule.
Thanks,
Niels
KRO-NCRV / Pointer
As for the AGI’s superexponential horizon, Kokotajlo changed his mind on that point (see also my argument for the same possibility). Additionally, I expect that Claude Opus 4.5’s “high 50% time horizon” could have been caused by Claude’s failures on simple tasks. What horizon would be displayed by an agent that outsources all tasks with a less-than-x-minute horizon to GPT-5.1 Codex Max while giving tasks of x minutes or longer to Claude? And what about preliminary estimates of the 50% and 80% TH of GPT-5.2 (Codex?) and/or Gemini 3 Pro based on the task suite which you have already created? I hope that this might allow us to extract a few more bits of evidence...