Edit: Full post here with 9 domains and updated conclusions!
Cross-domain time horizon:
We know AI time horizons (human time-to-complete at which a model has a 50% success rate) on software tasks are currently ~1.5hr and doubling every 4-7 months, but what about other domains? Here’s a preliminary result comparing METR’s task suite (orange line) to benchmarks in other domains, all of which have some kind of grounding in human data:
Observations
The time horizon on agentic computer use (OSWorld) is ~100x shorter than in other domains. Tesla self-driving (tesla_fsd), scientific knowledge (gpqa), math contests (aime), video understanding (video_mme), and software (hcast_r_s) all have roughly similar horizons.
My guess is this means models are good at taking in information from a long context but bad at acting coherently. Most work requires the kind of agency OSWorld tests, which may be why AIs can’t do the average real-world 1-hour task yet.
There are likely other domains that fall outside this cluster; these are just the five I examined.
Note the original version had a unit conversion error that gave 60x too high horizons for video_mme; this has been fixed (thanks @ryan_greenblatt).
Rate of improvement varies significantly; math contests have improved ~50x in the last year but Tesla self-driving only 6x in 3 years (see the quick doubling-time sketch below).
HCAST is middle of the pack on both horizon length and rate of improvement.
Note this is preliminary and uses a new methodology so there might be data issues. I’m currently writing up a full post!
Is this graph believable? What do you want to see analyzed?
edit: fixed Video-MME numbers
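For reference, here is a quick back-of-the-envelope conversion of those growth factors into doubling times (a minimal sketch; the only inputs are the rough figures quoted above):

```python
import math

def doubling_time_months(growth_factor: float, period_months: float) -> float:
    """Convert an overall horizon growth factor over some period into a doubling time."""
    return period_months / math.log2(growth_factor)

print(doubling_time_months(50, 12))  # math contests: ~50x in 1 year -> ~2.1 months per doubling
print(doubling_time_months(6, 36))   # Tesla FSD: ~6x in 3 years -> ~14 months per doubling
print(2 ** (12 / 7), 2 ** (12 / 4))  # software's 4-7 month doubling implies ~3.3x-8x per year
```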
I’d have guessed that poor performance on OSWorld is mostly due to poor vision and mouse manipulation skills, rather than insufficient ability to act coherently.
I’d guess that typical self-contained 1-hour tasks (as in, a human professional could do one in 1 hour with no prior context except context about the general field) also often require vision or non-text computer interaction, and if they don’t, I bet the AIs actually do pretty well.
I’m skeptical and/or confused about the Video-MME results:
You show Gemini 2.5 Pro’s horizon length as ~5000 minutes or ~80 hours. However, the longest videos in the benchmark are 1 hour long (in the long category they range from 30 min to 1 hr). Presumably you’re trying to back out the 50% horizon length using some assumptions, and then because Gemini 2.5 Pro’s performance is 85%, you back out an 80-160x multiplier on the horizon length! This feels wrong/dubious to me if it is what you are doing.
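For concreteness, here is roughly how that kind of extrapolation behaves if one assumes success is logistic in log2(task length); the slope values below are made up, and this is not necessarily what was actually done:

```python
import math

def horizon_multiplier(success_rate: float, slope_per_doubling: float) -> float:
    """Assume p = sigmoid(slope * (log2(H) - log2(t))) for 50% horizon H and task length t.
    Solving for H gives H = t * 2**(logit(p) / slope); return that multiplier on t."""
    logit = math.log(success_rate / (1 - success_rate))
    return 2 ** (logit / slope_per_doubling)

# 85% success on ~1-hour videos: the implied 50% horizon depends heavily on the assumed slope.
for slope in (1.0, 0.5, 0.25):
    print(slope, round(horizon_multiplier(0.85, slope), 1))  # ~3.3x, ~11.1x, ~122.6x
```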
Based on how long these horizon lengths are, I’m guessing you assumed that answering a question about a 1-hour video takes a human 1 hour. This seems very wrong to me. I’d bet humans can typically answer these questions much faster by panning through the video looking for where the question might be answered and then looking at just that part. Minimally, you can sometimes answer the question by skimming the transcript, and it should be possible to watch at 2x/3x speed. I’d guess the 1-hour video tasks take more like 5-10 min for a well-practiced human, and I wouldn’t be surprised by much shorter.
For this benchmark, (M)LLM performance seemingly doesn’t vary much with video duration, which undercuts horizon length (at least horizon length based on video length) as a good measure on this dataset!
There was a unit conversion mistake; it should have been 80 minutes. Now fixed.
Besides that, I agree with everything here; these will all be fixed in the final blog post. I already looked at one of the 30m-1h questions and it appeared to be doable in ~3 minutes with the ability to ctrl-f transcripts, but would take longer without transcripts (unknown how much longer).
In the next version I will probably use the no-captions AI numbers and measure myself without captions to get a rough video speed multiplier, then possibly do better stats that separate out domains with strong human-time-dependent difficulty from domains without (like this and SWE-Lancer).
No captions feels very unnatural because both LLMs and humans could first apply relatively dumb speech-to-text tools.
New graph with better data; formatting is still wonky, though. Colleagues say it reminds them of a subway map.
With individual question data from Epoch, and making an adjustment for human success rate (adjusted task length = avg human time / human success rate), AIME looks closer to the others, and it’s clear that GPQA Diamond has saturated.
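To make that adjustment concrete (a minimal sketch; the example numbers are made up):

```python
def adjusted_task_length(avg_human_minutes: float, human_success_rate: float) -> float:
    """Human-success-rate adjustment: a task humans often fail counts as longer than its raw time."""
    return avg_human_minutes / human_success_rate

# e.g. a hypothetical contest problem averaging 30 human-minutes with a 40% human solve rate:
print(adjusted_task_length(30, 0.40))  # 75.0 minutes
```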
Can you explain what a point on this graph means? Like, if I see Gemini 2.5 Pro Experimental at 110 minutes on GPQA, what does that mean? It takes an average bio+chem+physics PhD 110 minutes to get a score as high as Gemini 2.5 Pro Experimental?
There is a decreasing curve of Gemini success probability vs average human time on questions in the benchmark, and the curve intersects 50% at roughly 110 minutes.
Basically it’s trying to measure the same quantity as the original paper (https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/) but the numbers are less accurate since we have less data for these benchmarks.
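As a minimal sketch of how such a point can be read off (the per-question data below is made up, and the exact fitting details in the post may differ): fit a logistic of model success against log2(adjusted human minutes) and solve for where it crosses 50%.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up per-question data: adjusted human time (minutes) and whether the model answered correctly.
human_minutes = np.array([5, 10, 20, 40, 80, 160, 320, 640])
model_correct = np.array([1, 1, 1, 1, 1, 0, 1, 0])

# Logistic regression of success on log2(human time).
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_correct)

# p = 0.5 where coef * x + intercept = 0, so the 50% point in log2-minutes is -intercept / coef.
x50 = -clf.intercept_[0] / clf.coef_[0, 0]
print(f"50% horizon ≈ {2 ** x50:.0f} minutes")
```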
I wish I had thought to blind myself to these results and try to predict them in advance. I think I would have predicted that Tesla self-driving would be the slowest and that AIME would be the fastest. Not confident though.
(Solving difficult math problems is just about the easiest long-horizon task to train for,* and in the last few months we’ve seen OpenAI especially put a lot of effort into training this.)
*Only tokens, no images. Also no need for tools/plugins to the internet or some code or game environment. Also you have ground-truth access to the answers, so it’s impossible to reward hack.
For graphs like these, it obviously isn’t important how the worst or mediocre competitors are doing, but how the best one is. It doesn’t matter who’s #5. Tesla self-driving is a longstanding, notorious failure. (And apparently is continuing to be a failure, as they continue to walk back the much-touted Cybertaxi launch, which keeps shrinking like a snowman in hell, now down to a few invited users in a heavily-mapped area with teleop.)
I’d be much more interested in Waymo numbers, as that is closer to SOTA, and they have been ramping up miles & cities.
I would love to have Waymo data. It looks like it’s only available since September 2024, so I’ll still need to use Tesla for the earlier period. More critically, they don’t publish disengagement data, only crash/injury data. There are Waymo claims of things like 1 disengagement every 17,000 miles, but I don’t believe them without a precise definition of what this number represents.
You could add cooking tasks with robots.
For some reason, all current benchmarks, with the sole exception of OSWorld[1], now seem to differ by a factor of less than 3. Does this imply that progress on every benchmark is likely to slow down?
OSWorld resembles a physical task, which LLMs tend to fail. However, the article about LLMs failing basic physical tasks was written on April 14, before the pre-release of Gemini Diffusion. Mankind has yet to determine how well diffusion-based LLMs deal with physical tasks.
The ‘regression to the mean’ pattern is striking: domains with a lower starting point have been growing faster, and those that started with a longer horizon have mostly been growing more slowly.
I wonder if that pattern of catchup growth & drag on leaders will mostly hold up over more time and with a larger set of task types.
Just to be sure: as in the METR results, ‘horizon’ here means ‘the time needed to complete the task for humans with appropriate expertise’, correct? I assume so but it would be useful to make that explicit (especially since many people who skimmed the METR results initially got the impression that it was ‘the time needed for the model to complete the task’).
I would love to see an AI safety R&D category.
My intuition is that quite a few crucial AI safety R&D tasks are probably much shorter-horizon than AI capabilities R&D, which should be very helpful for automating AI safety R&D relatively early. E.g. the compute and engineer-hours spent on pretraining (where most capabilities [still] seem to be coming from) are a few OOMs larger than those spent on fine-tuning (where most intent-alignment seems to be coming from).
This expanded list is great, but is still conspicuously missing white-collar work. Software was already the basis for the trend, so the only new one here that seems to give clear information on human labor impacts would be tesla_fsd.
(And even there replacing human drivers with AI drivers doesn’t seem like it would change much for humanity, compared to lawyers/doctors/accountants/sales/etc.)
Is it the case that for most non-software white-collar work, agents can only do ~10-20 human-minute tasks with any reliability, so the doubling time is hard to measure?
I, too, would like to know how long it will be until my job is replaced by AI; and what fields, among those I could reasonably pivot to, will last the longest.
Nice, this jibes with my impression of all the LLM Plays Pokemon findings; I’d have been surprised if it were otherwise.
I really like this direction! It feels a bit like looking at other data to verify the trend lines, which is quite nice.
I was wondering if there’s an easy way for you to look at the amount of doubling per compute/money spent over time for the different domains, to see if the differences are even larger? It might be predictive as well: if we can see that Tesla has spent a lot on self-driving but hasn’t been able to make much progress compared to the rest, that might tell us the task is harder than the others.
I think Vladimir Nesov wrote somewhere about different investment thresholds depending on capability returns, so that would be very interesting to see an analysis of! (What doubling per compute says about different investment strategies at different phases, and it being an important variable for determining investment phase transitions, e.g. bear or bull markets.)