@Thomas Kwa will we see task length evaluations for Claude Opus 4 soon?
Anthropic reports that Claude can work coherently on software engineering tasks for hours, but it’s not clear whether this means it can actually perform tasks that would take a human hours. I am slightly suspicious because they reported that Claude was making better use of memory on Pokémon, but this did not actually cash out as improved play. This seems like a fairly decisive test of my prediction that task lengths would stagnate at this point; if it does succeed at hours-long tasks, I will want to see a careful evaluation of which tasks may or may not have been leaked, whether the tasks are cleaner than typical hours-long software engineering tasks, etc.
I don’t run the evaluations, but we probably will; no timeframe yet, though, as we would need to do elicitation first. Claude’s SWE-bench Verified scores suggest that it will be above 2 hours on the METR task set; the benchmarks are pretty similar apart from their different time annotations.
That’s a bit higher than I would have guessed. I compared the known data points that have both SWE-bench scores and METR medians (Sonnet 3.5, 3.6, 3.7; o1, o3, o4-mini) and got an r^2 = 0.96 fit assuming a linear relationship between log(METR median) and log(SWE-bench error rate).
That gives an estimate more like 110 minutes for a SWE-bench score of 72.7%, which works out to a Sonnet doubling time of ~3.3 months. (If I throw out o4-mini, the estimate is ~117 minutes, still below 120.)
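(Spelling out that doubling-time arithmetic under my assumptions: METR measured Claude 3.7 Sonnet’s 50% horizon at roughly an hour (~59 min), and the two releases are about 3 months apart, so doubling time ≈ 3 mo × ln(2) / ln(110/59) ≈ 3.3 months.)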
It would also imply that an 85% SWE-bench score corresponds to something like a 6-6.5 hour METR median.
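Here’s a minimal sketch of that fit (my reconstruction, not actual analysis code); the score and horizon arrays hold illustrative placeholder values, so substitute the reported SWE-bench Verified scores and METR 50% time horizons for the six models:

```python
import numpy as np

# (SWE-bench Verified score, METR 50% time horizon in minutes) per model,
# in the order Sonnet 3.5, 3.6, 3.7, o1, o3, o4-mini. These are
# illustrative placeholders -- substitute the actually reported figures.
swe_scores = np.array([0.33, 0.49, 0.62, 0.49, 0.69, 0.68])
metr_median_min = np.array([18.0, 28.0, 59.0, 39.0, 92.0, 78.0])

# Fit log(METR median) as a linear function of log(SWE-bench error rate).
x = np.log(1.0 - swe_scores)
y = np.log(metr_median_min)
slope, intercept = np.polyfit(x, y, 1)

# R^2 of the fit.
residuals = y - (slope * x + intercept)
r_squared = 1.0 - residuals.var() / y.var()

def predict_minutes(swe_score: float) -> float:
    """Predicted METR 50% time horizon (minutes) at a given SWE-bench score."""
    return float(np.exp(slope * np.log(1.0 - swe_score) + intercept))

print(f"r^2 = {r_squared:.2f}")
print(f"72.7% SWE-bench -> {predict_minutes(0.727):.0f} min predicted median")
print(f"85.0% SWE-bench -> {predict_minutes(0.85):.0f} min predicted median")
```

Fitting in log-log space amounts to assuming a power law between the METR horizon and the SWE-bench error rate, which is why moving from 72.7% to 85% more than triples the predicted median.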
Since reasoning trace length increases with more steps of RL training (unless intentionally constrained), the underlying scaling of RL training by AI companies will probably be observable in the form of longer reasoning traces. Claude 4 is more obviously a pretrained-model update rather than necessarily a major RLVR update (compared to Claude 3.7), and coherent long-task performance seems like something that would greatly benefit from RLVR if it applies at all (which it plausibly does).
So I don’t particularly expect Claude 4 to be much better on this metric, but some later Claude ~4.2-4.5 update with more RLVR post-training, released in a few months, might do much better.
We can still check whether it lies on the slower exponential curve projected from before reasoning models were introduced.
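One way to run that check, as a minimal sketch: the ~7-month doubling time is METR’s published figure for the pre-reasoning-model trend, but the anchor horizon and dates below are illustrative assumptions, not measured values.

```python
# Project the pre-reasoning-model exponential forward and see where a new
# release date falls on it. Anchor values are assumed for illustration.
from datetime import date

T_DOUBLE_MONTHS = 7.0           # METR's ~7-month doubling for the older trend
anchor_date = date(2024, 6, 1)  # assumed anchor date (pick a pre-reasoning model)
anchor_horizon_min = 30.0       # assumed 50% time horizon at the anchor, minutes

def projected_horizon_min(d: date) -> float:
    """50% time horizon (minutes) the old trend predicts for date d."""
    months_elapsed = (d - anchor_date).days / 30.44
    return anchor_horizon_min * 2.0 ** (months_elapsed / T_DOUBLE_MONTHS)

# Where the slower curve says a model released in late May 2025 should land:
print(f"{projected_horizon_min(date(2025, 5, 22)):.0f} min")
```

If the measured horizon sits well above this projection, the reasoning-era speedup is still intact; if it falls back onto the curve, that supports the stagnation story.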
Sure, but trends like this only say anything meaningful across multiple years; any one datapoint adds almost no signal, in either direction. This is what makes scaling laws much more predictive, even as they are predicting the wrong things. So far there are no published scaling laws for RLVR; the literature is still developing a non-terrible, stable recipe for the first few thousand training steps.