o3's time horizon is ~1.5 hours vs Claude 3.7's 54 minutes, and its position above the long-term trend is statistically significant. It's been less than 2 months since the release of Claude 3.7. If the time horizon continues doubling every 3.5 months, as it has over the last year, we have only about 12 months until it hits 16 hours and we are unable to measure it with HCAST.
My guess is that future models' time horizons will double every 3-4 months for well-defined tasks (HCAST, RE-Bench, most automatically scorable tasks) that labs can RL on, while capability on more realistic tasks will follow the long-term 7-month doubling time.
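For concreteness, here's a minimal sketch of that extrapolation arithmetic (my own illustration, assuming pure exponential growth; the 1.5-hour starting point, the 16-hour HCAST ceiling, and the 3.5- and 7-month doubling times are just the numbers quoted above, not a METR projection):

```python
import math

def months_to_reach(target_hours, start_hours, doubling_months):
    """Months until the time horizon reaches target_hours at a fixed doubling time."""
    doublings_needed = math.log2(target_hours / start_hours)
    return doublings_needed * doubling_months

# o3's ~1.5-hour horizon, with 16 hours as a rough ceiling for what HCAST can measure
print(round(months_to_reach(16, 1.5, 3.5)))  # ~12 months at the recent 3.5-month doubling time
print(round(months_to_reach(16, 1.5, 7.0)))  # ~24 months at the long-run 7-month doubling time
```

Under those assumptions, the recent 3.5-month doubling rate reaches the 16-hour mark in roughly a year, while the long-run 7-month rate would take about two years.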
What’s your basis for expecting “well-defined tasks” vs. “realistic tasks” to have very different doubling times going forward? Is the idea that the recent acceleration seems to be specifically due to RL, and RL will be applicable to well-defined tasks but not realistic tasks?
This seems like an extremely important question, so if you have any further thoughts / intuitions / data to share, I’d be very interested.
Yes. RL will at least be more applicable to well-defined tasks. Some intuitions:
In my everyday work, the gap between models' ability on well-defined tasks and their ability to work with the METR codebase is growing
A 4-month doubling time is faster than the rate of progress in most other domains, realistic or not
Recent models really like to reward hack, suggesting that RL can instill behaviors that aren't relevant to realistic tasks
This trend will break at some point, e.g. when labs get better at applying RL to realistic tasks, or when RL hits diminishing returns, but I have no idea when
I thank y’all for rapidly replicating and extending this eval. This is the most important eval extant. Units are truly comparable, and it’s directly connected to the questions of “coding for ML/AI research” and “long-horizon agency” that seem cruxy for short timelines. I did not expect @Daniel Kokotajlo to be right about the superexponentiality so quickly.
My long-timeline probability mass is increasingly dependent on “this doesn’t generalize past formally verifiable domains + formally verifiable domains are insufficient to substantially automate AI algorithmic progress” or “somehow this progress doesn’t extend to the arbitrarily messy and novel real world.” But it ain’t looking good.
Thanks for re-running the analysis!
I agree that RE-Bench aggregate results should be interpreted with caution, given the low sample size. Let’s focus on HCAST instead.
A few questions:
Would someone from the METR team be able to clarify the updates to the HCAST task set? The exec summary states: “While these measurements are not directly comparable with the measurements published in our previous work due to updates to the task set”. Was Claude 3.7 Sonnet retested on the updated HCAST test set?
On HCAST, o3 and o4-mini get a 16M token limit vs 2M for Claude 3.7 Sonnet (if my reading of the paper is correct). Do we know how Claude would do if given a higher token budget? Maybe this isn’t relevant if it never gets close to the budget and submits answers well before hitting the limit? I want to make sure the improvements aren’t just due to shifting token budgets.
(source: I work at METR)
Thanks for the questions!
Updates to HCAST are just generally newer tasks, clarifications, bug fixes, etc., with no specific change in direction or focus. Any given plot only includes data from the same task set version, so yes, Claude 3.7 Sonnet was retested on the updated HCAST, and that’s the number shown on the headline bar chart. In contrast, the “trendline plot” with the o3 and o4-mini additions (posted to twitter) shows only results on the original HCAST from the original trendline paper (we also ran o3 and o4-mini on the older task set version so that we could put them on the trendline).
The 2M limit was originally chosen as “high enough to not be a bottleneck”; see “Performance of current agents seems to plateau quite early” here. So I mostly do not expect that increasing the token budget would meaningfully improve performance, but that choice may be outdated by now. A team member has been working on re-evaluating it, and I think we may have an update on this soon.
Thanks Lucas!