I’ve been puzzling about the meaning of horizon lengths and whether to expect trends to be exponential or superexponential. Also how much R&D acceleration we should expect to come from what horizon length levels—Eli was saying something like “90%-horizons of 100 years sound about right for Superhuman Coder level performance” and I’m like “that’s insane, I would have guessed 80%-horizons of 1 month.” How to arbitrate this dispute?
This appendix from METR’s original paper seems relevant. I’m going to think out loud below.
OK so, how should we define horizon length? On one way of defining it, it’s inherently pegged to what human experts can do. E.g. arguably, METR’s HCAST benchmark is constructed by selecting tasks that human experts can do, and labelling them with horizon lengths based on how long it takes human experts to do them. Thus an arbitrarily extended HCAST (with longer and longer, more difficult tasks) would still only have tasks in it that human experts can do. Thus, superintelligent AI would have infinite horizon length. Thus, the trend must be superexponential, because it needs to get to infinity in finite time (unless you think ASI is impossible).
But maybe that’s not the only way of thinking about it. Maybe instead we should think of some bigger set of tasks and a bigger set of skills, such that some of them simply can’t be done by any human. And we should think of HCAST as sampling from that bigger set. And it’s just a minor methodological issue that the current version of HCAST only includes tasks humans can do; the proper way to think about it—and the proper way to think about how to extend it—would be to sample tasks in some unbiased way that will gradually include more tasks over time that are harder and harder and that fewer and fewer humans can do, eventually including tasks that no human can do.
In this case, horizon length only goes to infinity if and when there is an AI which can do all tasks from the bigger set. Depending on how ambitiously you define that bigger set, this may not be possible even in principle. But at any rate, an AI system which is exactly as good as the best humans at everything while being faster and cheaper (minimum possible superintelligence) would still fail at a bunch of tasks from this bigger set, so it would still have a finite horizon length. So even if the underlying trend were superexponential, it wouldn’t be going to infinity anytime soon, and maybe it would make sense to model it as exponential.
So which of these two ways is the better way of thinking about it?
Enter the above picture from the original paper.
While they are filtering to only include tasks humans can do, it seems like fewer and fewer humans can do the tasks as they get longer. So maybe that means it’s appropriate to imagine extrapolating HCAST in the second way, the way in which it eventually gets to tasks that no human can do. On that view, individual humans have a finite horizon length, the max of all humans (i.e. for each task, pick the world’s best human to do that task) also has a finite horizon length, and ASI would probably also have a finite horizon length.
So then the question becomes, what are those horizon lengths? What is the horizon length (on coding tasks) of the best frontier AI company engineers? What is the horizon length (on coding tasks) of the max of all humans?
Well, according to the image above, it seems like the horizon length (on coding tasks) of METR’s baseliners is… about one and a half hours!!!
Further complication: That’s the horizon length of baseliners given 8 hours; had they been given more time, they would have performed better. Similarly perhaps, we shouldn’t talk about the horizon length of the max of all humans, or ASI, or any given AI system, in isolation, but rather relative to a time/compute/etc. budget. (Or perhaps more cleanly, the allotted time/compute/etc. budget should be considered part of the task.)
Further complication: As METR points out in the screenshot, they condition on success when defining the time horizon of a task. So e.g. if they have 10 baseliners try a task, each allotted 8 hours, and only 2 succeed, then the task length is defined as the average of how long it took those 2 people. But this is a bit crazy, because it guarantees that the horizon length of the task will be less than 8 hours even in cases where it arguably should be more! E.g. suppose all 10 people would have succeeded at the task given more time, with a median time-to-succeed of 20 hours, but the top 2 people, with a combination of skill and luck, managed to do it in 3 and 6 hours respectively. Then the task would be labelled as a 4.5-hour task even though the median baseliner needs 20 hours to do it.
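To make the bias concrete, here is a minimal sketch of the two labelling rules, using the made-up numbers from this example (the times for the eight baseliners who would have run over the 8-hour clock are invented for illustration):

```python
import numpy as np

CLOCK_LIMIT_H = 8.0  # baseliners were allotted 8 hours

# Hypothetical times-to-success for 10 baseliners if given unlimited time;
# under the 8-hour clock, only the first two actually finish.
true_times_h = np.array([3.0, 6.0, 14.0, 18.0, 20.0, 20.0, 21.0, 22.0, 25.0, 30.0])

succeeded = true_times_h <= CLOCK_LIMIT_H

# METR-style label: average time among the successful baseliners only.
metr_label = true_times_h[succeeded].mean()    # (3 + 6) / 2 = 4.5 hours

# Counterfactual label: median time if everyone could run to completion.
unconditional_label = np.median(true_times_h)  # 20 hours

print(f"success-conditioned label: {metr_label:.1f} h")
print(f"unconditional median:      {unconditional_label:.1f} h")
# The success-conditioned label can never exceed the 8-hour clock limit,
# even when the typical baseliner would need far longer.
```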
At this point there are enough complications that I’m confused and want to take a break and return to the matter later. I’d be grateful for any comments shedding light on the matter.
I talked to the AI Futures team in person and shared roughly these thoughts:
Time horizon as measured in the original paper is underspecified in several ways.
Time horizon varies by domain, and AI companies have multiple types of work. It is not clear whether HCAST time horizon will stay a constant factor longer than realistic time horizon, but it’s a reasonable guess.
As I see it, task lengths for time horizon should be something like the average amount of labor spent on each instance of a task by actual companies, and all current methodologies are approximations of this.
To convert time horizon to speedup, you would need to estimate the average labor involved in supervising an AI on a task that would take a human X hours and the AI can do with reliability Y, which we currently don’t have data on.
As I see it, time horizon is in theory superexponential, as it has to go to infinity when we get AGI / superhuman coder. But the current data is not good enough to just fit a superexponential and get a timelines forecast (see the sketch below): it could already be superexponential, or it could only go superexponential after time horizon hits 10 years.
Cursor and the Claude Code team probably already have data that tracks the speed of generations, plus how long humans spend reading AI code, correcting AI mistakes, and supervising AI in other ways, from which one could construct a better forecast.
It is also unclear what speedup an AI with infinite software time horizon would bring to AI R&D, because this would depend on its speed at doing existing tasks, how many useful novel tasks it invents that humans can’t do, and its ability to interface with non-software parts of the business.
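On the superexponential point above, here is a minimal sketch (on synthetic data, not METR’s measurements) of why a short noisy series can’t settle the question: an exponential is a straight line in log2(horizon), one simple superexponential adds a quadratic term, and both fit about equally well.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
t = np.linspace(0, 6, 13)                    # years since an arbitrary start
log2_h = -2.0 + t * (12 / 7)                 # true trend: 7-month doubling time
log2_h += rng.normal(0, 0.4, size=t.shape)   # scatter comparable to the real data

def exp_model(t, a, b):
    # constant doubling time: log2(horizon) is linear in t
    return a + b * t

def superexp_model(t, a, b, c):
    # shrinking doubling time: log2(horizon) curves upward when c > 0
    return a + b * t + c * t**2

p_exp, _ = curve_fit(exp_model, t, log2_h)
p_sup, _ = curve_fit(superexp_model, t, log2_h)

rss_exp = np.sum((log2_h - exp_model(t, *p_exp)) ** 2)
rss_sup = np.sum((log2_h - superexp_model(t, *p_sup)) ** 2)
print(f"exponential RSS:       {rss_exp:.2f}")
print(f"superexponential RSS:  {rss_sup:.2f}")
print(f"fitted quadratic term: {p_sup[2]:+.3f}")
# The quadratic term is poorly constrained at this noise level, so the fit
# alone cannot distinguish "already superexponential" from "plain exponential".
```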
it has to go to infinity when we get AGI / superhuman coder.

This isn’t necessarily true: even an AGI or a superhuman coder might get worse at tasks-that-take-humans-longer compared to tasks-that-take-humans-shorter (this seems pretty likely given constant-error-rate considerations). An extremely capable AI might be, say, 99.999% reliable on 1-hour tasks but only 99.9% reliable on 10,000-hour tasks, in which case the logistic fit still has an intercept with 50%; it’s just a very high number.
In order for the 50% intercept to approach infinity, you’d need a performance curve which approaches a flat line, and this seems very hard to pull off and probably requires wildly superhuman AI.
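For concreteness, here is what those two hypothetical reliability numbers imply under a logistic fit in log task length (a sketch; METR fits a logistic in log task length, and the two reliabilities are the ones from the comment above):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

h1, p1 = 1.0, 0.99999        # 99.999% reliable on 1-hour tasks
h2, p2 = 10_000.0, 0.999     # 99.9% reliable on 10,000-hour tasks

# Model: logit(success) = a - b * log10(task length). Two points pin
# down the line exactly.
b = (logit(p1) - logit(p2)) / (math.log10(h2) - math.log10(h1))
a = logit(p1)                # since log10(h1) = 0

h50 = 10 ** (a / b)          # task length where the curve crosses 50%
print(f"50% time horizon: {h50:.3g} hours (~{h50 / 8766:.0f} years)")
# ~1e10 hours, on the order of a million years: finite, but so far out
# that it is empirically hard to distinguish from an infinite horizon.
```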
Under the logistic methodology where we don’t actually have long enough tasks to measure the 50% point, sure. But if we actually have years-long tasks, a true superhuman coder should be able to do them more reliably than humans, which is more than 50% if we filter the problem distribution to things humans can do with more than about 50% probability. There are other methodologies that I think are more meaningful, where it might also make sense to have the SC’s time horizon be infinity.
The recent trend does not look superexponential though, right?
It briefly looked like the slope had increased with reasoning models, but at a glance the older trend better predicted Grok 4 and GPT-5.
Too early to tell IMO.
I disagree that the old trend better predicted Grok 4 and GPT-5. Here’s my plot (source, interactive) with the trendlines from METR’s time horizons paper: orange is the 2022-2025 trend with a 7-month doubling time, red is the 2024-2025 trend with a 4-month doubling time.
Both trendlines were calculated before the release of o3, Grok 4, or GPT-5, so I consider those three datapoints falling close to the 4-month doubling-time line to be evidence for that line. Reading off the graph, o3 was about a month ahead of schedule, and Grok 4 and GPT-5 were both about a month behind schedule. I wonder if that is partially explained by OpenAI waiting longer before releasing GPT-5 (it sounds like METR had access for a bit longer).
Those points aren’t close to the 4-month doubling-time line. The line is way above them. A month behind schedule is a lot when your schedule is a 4-month doubling time.
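For what it’s worth, the arithmetic behind “a month is a lot”:

```python
# One month behind schedule means the observed horizon equals the
# trendline's prediction divided by one month's worth of doubling.
factor_4mo = 2 ** (1 / 4)   # 4-month doubling time
factor_7mo = 2 ** (1 / 7)   # 7-month doubling time
print(f"4-month trend: observed = predicted / {factor_4mo:.2f} (~16% below the line)")
print(f"7-month trend: observed = predicted / {factor_7mo:.2f} (~9% below the line)")
```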
To be fair, they also don’t look that close to the slower (6-month?) doubling-time line; I guess we’re still on a slightly faster trend. I’m probably seeing what I expected to see here; I expected the slope to level off, and it’s easy for me to read that off the graph even though it’s not really clear yet.
Eli was saying something like “90%-horizons of 100 years sound about right for Superhuman Coder level performance”

To be clear, this is on (a theoretical extrapolated version of) METR HCAST, not the real-world distribution of software engineering projects.
Also, to remind others of the definition of superhuman coder, it’s a pretty high bar:

Superhuman coder (SC): An AI system for which the company could run with 5% of their compute budget 30x as many agents as they have human research engineers, each of which is on average accomplishing coding tasks involved in AI research (e.g. experiment implementation but not ideation/prioritization) at 30x the speed (i.e. the tasks take them 30x less time, not necessarily that they write or “think” at 30x the speed of humans) of the company’s best engineer. This includes being able to accomplish tasks that are in any human researchers’ area of expertise.
Exponential and superexponential are not the only plausible options here.
I’m still expecting a sigmoid.
Exponential is starting to look more convincing though.
The way I’m thinking about it is, various inputs are scaling exponentially (e.g. compute, data, human labor) to produce the trend we see. Obviously, when input scaling slows down, progress will slow too, creating a sort of sigmoid.
But I’m interested in what happens if the inputs continue to scale at approximately the same rate. I’ll then adjust up or down from there depending on a separate projection of how the inputs will scale or slow.
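Here is a toy rendering of that inputs story, just to show the shape of the claim; every constant is illustrative and nothing is calibrated to real scaling data:

```python
import numpy as np

years = np.linspace(0, 10, 11)
RATE = 2.0   # assumed input doublings per year while scaling continues
CAP = 16.0   # assumed ceiling on total input doublings (capex/energy/data)
K = 0.8      # assumed horizon doublings per input doubling

# Inputs double at a steady rate early on, then saturate at the cap.
log2_inputs = CAP * (1 - np.exp(-RATE * years / CAP))
log2_horizon = K * log2_inputs

for y, h in zip(years, log2_horizon):
    print(f"year {y:4.1f}: log2(horizon) = {h:5.2f}")
# Early rows climb almost linearly (an exponential horizon trend); later
# rows flatten as input scaling slows, giving the "sort of sigmoid".
```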
Do you expect a sigmoid even if the inputs continue to scale at approximately the same rate? If so, why? Do you think there is some inherent limit on the agentic coding task horizon length of AIs trained within roughly the current paradigm?
The current paradigm seems likely to become a sigmoid before the 8-hour mark. I tried to place a bet on this with the AI 2027 authors but no one accepted. (At the moment I’m cash-poor, so it would have to be a small amount, or just provide me liquidity on the Manifold market I linked.)
A priori I expect the current model to fail at continual learning: https://www.lesswrong.com/posts/vvgND6aLjuDR6QzDF/my-model-of-what-is-going-on-with-llms
This is mostly because I don’t expect “neuralese” to just work, since tokens are too low-bandwidth; and indeed it has not worked yet (?), despite one paper (which was cited in AI 2027) that apparently hasn’t replicated. I also think reinforcement learning remains hard even if you have a good predictive model of text on the internet.
I expect these problems to kick in around the 4-16 hour mark, when it’s necessary to build a sophisticated mental model of a particular problem.
We’re already seeing that LLMs may not speed up developer productivity and do not produce usable PRs on large projects. These limitations will perhaps become blocking at time horizons slightly longer than the current SOTA, when you can’t just one-shot a project.
On the other hand, the slower exponential trend has proven more robust than I expected. I think that the faster exponential for reasoning models is dying off (with Grok 4 and GPT-5) and the slower exponential dies off next, but I am becoming slightly less confident of this on balance.
Oh cool, yeah I’m happy to bet with you on that I think. What would the exact terms be? 8+ hour horizon length AI by mid-2027?
I sure hope neuralese doesn’t work, and agree it hasn’t been working so far. In general I really hope you are right about the incoming wall that deep learning is about to hit, but I don’t think you are; deep learning has smashed through so many alleged walls recently.
That’s way below the slower exponential; I think I need slightly more favorable terms for 50:50.
I would take that bet if EITHER the resolution is end of March 2027 rather than mid-2027 OR the number is 12+, both of which are still substantially below the slower exponential.
Since all these dates are pretty far out I’d bet, say, 250 USD.
I also hope I’m right, but I don’t necessarily expect deep learning to hit a wall, only the current paradigm within deep learning.
Is it way below? Eyeballing the original graph from the paper, it seems like it would just be slightly below the slower exponential:
Anyhow I’m happy to take the bet at 50:50 odds for 8 hours at end of March 2027. I’m not confident I’ll win, but I think I’m somewhat more likely than not to win. 250 USD sounds good to me.
Right, I am projecting the slower exponential growth rate forward starting from this point.
I am also not confident that I will win, but I accept. Let’s make a prediction market to track this specific bet (I am curious what the Manifold odds will look like).
Deal. Thanks!
Please let me know soon if you want any part of the description modified.
The 4-month doubling trend implies getting an 8h+ horizon length by early 2026 and an order of magnitude more by mid-2027. If the best time horizon length in mid-2027 were 9h, would you feel like you had won the argument, even if you had won the bet?
I interpret my opponent here as saying things will go slower than the 7-month doubling trend, not as saying that things will go slower than the 4-month doubling trend.
That said, if it’s 9h then it’s basically right on the line between winning and losing the bet, only a very weak win, so no, I wouldn’t really think of myself as having won the argument.
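For reference, a sketch of the extrapolation arithmetic in this exchange. The anchor is an assumption (a frontier model at roughly a 2.25-hour 50%-horizon around August 2025, eyeballed from METR’s plot); the rest is just compounding doubling times:

```python
from datetime import date

H0, T0 = 2.25, date(2025, 8, 1)   # assumed anchor: horizon in hours, date

def horizon(on: date, doubling_months: float) -> float:
    months = (on.year - T0.year) * 12 + (on.month - T0.month)
    return H0 * 2 ** (months / doubling_months)

for when in [date(2026, 3, 31), date(2027, 3, 31), date(2027, 6, 30)]:
    print(f"{when}: 4-month trend ~{horizon(when, 4):6.1f} h, "
          f"7-month trend ~{horizon(when, 7):5.1f} h")
# The 4-month trend passes 8h around early 2026 and reaches ~100h by
# mid-2027; the 7-month trend puts end-of-March 2027 near 15h, so the
# 8h bet threshold sits below both lines' central projections.
```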
That’s also my impression: https://www.lesswrong.com/posts/KrgBkqeChtAWuPsLP/what-llms-lack