Thanks for re-running the analysis!
I agree that RE-bench aggregate results should be interpreted with caution, given the low sample size. Let’s focus on HCAST instead.
A few questions:
Would someone from the METR team be able to clarify the updates to the HCAST task set? The exec summary states: “While these measurements are not directly comparable with the measurements published in our previous work due to updates to the task set”. Was Claude 3.7 Sonnet retested on the updated HCAST task set?
On HCAST, o3 and o4-mini get a 16M-token limit vs. 2M for Claude 3.7 Sonnet (if my reading of the paper is correct). Do we know how Claude would do if given a higher token budget? Maybe this isn’t relevant if it never gets close to the budget and submits answers well before hitting the limit, but I want to make sure the improvements aren’t just due to the differing token budgets.
(source: I work at METR)
Thanks for the questions!
The updates to HCAST are just generally newer tasks, clarifications, bug fixes, etc.; there was no specific change in direction or focus. Any given plot only includes data from a single task set version, so yes, Claude 3.7 Sonnet was retested on the updated HCAST, and that’s the number shown in the headline bar chart. In contrast, the “trendline plot” with the o3 and o4-mini additions (posted to Twitter) shows only results on the original HCAST from the original trendline paper, including for o3 and o4-mini: we also ran them on the older task set version so that we could put them on the trendline.
The 2M limit was originally chosen to be “high enough to not be a bottleneck”. See “Performance of current agents seems to plateau quite early” here. So I mostly do not expect that increasing the token budget would meaningfully improve performance, but that choice is possibly outdated by now. A team member has been working on re-evaluating it, and I think we may have an update on this soon.
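To make the “not a bottleneck” check concrete, here is a minimal sketch of the kind of analysis involved: measuring what fraction of runs actually come close to the token budget. If that fraction is near zero, raising the budget is unlikely to change scores. All data, field names, and the helper function here are hypothetical, not METR’s actual pipeline.

```python
# Hypothetical check: how many runs come within a given fraction of the budget?

TOKEN_BUDGET = 2_000_000  # the 2M-token limit discussed above

# Hypothetical per-run token usage, as might be pulled from agent transcripts.
runs = [
    {"task": "t1", "tokens_used": 180_000},
    {"task": "t2", "tokens_used": 1_950_000},
    {"task": "t3", "tokens_used": 420_000},
]

def near_budget_fraction(runs, budget, threshold=0.9):
    """Fraction of runs that used at least `threshold` of the token budget."""
    near = [r for r in runs if r["tokens_used"] >= threshold * budget]
    return len(near) / len(runs)

print(f"{near_budget_fraction(runs, TOKEN_BUDGET):.0%} of runs were near the budget")
```

If most runs submit well under the limit (as the reply above suggests), the budget difference between models shouldn’t be driving the headline results.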
Thanks Lucas!