Are you implying they have over a 4-hour task length? I’m confused about what you’re updating on.
I think that ~4 hours is the right estimate of the task length Sonnet 4.5 consistently succeeds at with the harness that’s built into Claude Code.
Some context: most of the people I interact with are bearish on model progress. Not the LW crowd, though, I suppose. I forget to factor this in when I post here.
re: METR’s study: I’m saying there’s acceleration in model progress. The acceleration seems partly due to architecture changes being easy to find, test, and scale (e.g. variants on attention, modifications to sampling), and to the fact that these changes don’t make models more expensive to train or run inference on; if anything, they seem to make models cheaper to train (e.g. MoEs, native sparse attention, DeepSeek sparse attention).
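To make the “cheaper, not more expensive” point concrete, here’s a rough back-of-envelope sketch. All the sizes below (hidden width, expert count, window length, etc.) are made up for illustration and aren’t tied to any particular model; the point is just that MoE routing only runs top-k experts per token, and a sparse attention pattern only scores a window of keys instead of the full context, so active compute per token drops even as total parameters or context length grow.

```python
# Illustrative back-of-envelope arithmetic (assumed, made-up sizes; not any specific model):
# why MoE routing and sparse attention reduce compute per token.

d_model = 4096            # hidden size (assumed)
n_experts = 64            # experts in the MoE FFN (assumed)
top_k = 2                 # experts routed per token (assumed)
expert_d_ff = 2048        # per-expert inner width (assumed)
seq_len = 32_768          # context length (assumed)
window = 2_048            # keys each query attends to under a sparse pattern (assumed)

# --- FFN / MoE ---
# A matmul against P parameters costs roughly 2*P FLOPs per token.
ffn_params_per_expert = 2 * d_model * expert_d_ff       # up-projection + down-projection
total_moe_params = n_experts * ffn_params_per_expert    # what you store and train
active_moe_flops = top_k * 2 * ffn_params_per_expert    # what you actually run per token

# A dense FFN with the same total parameter count runs all of it on every token.
dense_equiv_flops = 2 * total_moe_params

print(f"MoE active FLOPs/token:      {active_moe_flops:.2e}")
print(f"Dense FFN, same param count: {dense_equiv_flops:.2e}")
print(f"FFN compute reduction:       {dense_equiv_flops / active_moe_flops:.0f}x (= n_experts / top_k)")

# --- Attention ---
# Per query token, scoring and mixing over K attended positions costs roughly 4*d_model*K FLOPs.
dense_attn_flops = 4 * d_model * seq_len
sparse_attn_flops = 4 * d_model * window
print(f"Attention per query, dense vs sparse: "
      f"{dense_attn_flops:.2e} vs {sparse_attn_flops:.2e} "
      f"({seq_len / window:.0f}x fewer attended positions)")
```

This is only the FLOP-counting side of the argument; real speedups depend on memory traffic, routing overhead, and how well the sparse pattern preserves quality.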
Prior to this announcement (plus DeepSeek’s drop today), I thought that increasing the task length models can succeed on would’ve required more non-trivial architecture updates: something on the scale of DeepSeek’s RL on reasoning chains, or the use of FlashAttention to scale training FLOPs by an OOM.