(source: I work at METR)
Thanks for the questions!
Updates to HCAST are mostly newer tasks, clarifications, bug fixes, and so on, with no specific change in direction or focus. Any given plot only includes data from a single task set version, so yes, Claude 3.7 Sonnet was retested on the updated HCAST, and that’s the number shown on the headline bar chart. In contrast, the “trendline plot” with the o3 and o4-mini additions (posted to Twitter) shows only results on the original HCAST from the original trendline paper. That includes o3 and o4-mini: we also ran them on the older task set version so that we could put them on the trendline.
The 2M token limit was originally chosen as “high enough to not be a bottleneck”. See “Performance of current agents seems to plateau quite early” here. So I mostly do not expect that increasing the token budget would meaningfully improve performance. That said, the choice is possibly outdated by now. A team member has been working on re-evaluating it, and I think we may have an update soon.
Thanks Lucas!