I want to clarify that this is Anthropic choosing to display a testimonial from one of their customers on the release page. Anthropic’s team is editorializing, but I don’t think they intended to communicate that the models are consistently completing tasks that require tens of hours (yet).
I included the organization and person making the testimonial to try to highlight this, but I should’ve made it explicit in the post.
re: how this updated my model of AI progress:
We don’t quite know what the updates were that improved this model’s long-context ability, but it doesn’t seem to have been a dramatic algorithmic or architectural change. It also doesn’t seem to have been OOMs more spend on training (in either FLOPs or dollars).
If I had to guess, it might be something like an attention variant (like this one by DeepSeek) that was “cheap” to re-train an existing model with, maybe plus some improvements to their post-training RL setup. Whatever it was, it seems to have been (relatively) cheap.
These improvements are being found faster than I expected (based on METR’s timeline) and for cheaper than I anticipated (I predicted that going from 2 hrs to 4 hrs would’ve needed some factor increase in additional training plus 2-3 non-trivial algorithmic improvements). I suppose this is something Anthropic is capable of in a few months, but to then verify it and deploy it at scale in under half a year is quite impressive and not what I would’ve predicted.
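To make “faster than I expected” concrete, here is a rough back-of-the-envelope sketch of the implied doubling time. The ~7-month figure is METR’s headline trend; the ~2-hour starting point and the ~5-month gap between releases are illustrative assumptions of mine, not numbers from the post.

```python
from math import log2

# Back-of-the-envelope check of "faster than METR's trend".
# Assumed, illustrative inputs (not the author's exact numbers):
prev_hours = 2.0        # rough task length of the previous model
new_hours = 4.0         # rough task length attributed to Sonnet 4.5
months_between = 5.0    # rough gap between the two releases

doublings = log2(new_hours / prev_hours)            # = 1.0 doubling
implied_doubling_time = months_between / doublings  # ~5 months per doubling

metr_trend_months = 7.0  # METR's reported ~7-month doubling time
print(f"implied doubling time: {implied_doubling_time:.1f} months "
      f"(METR trend: ~{metr_trend_months:.0f} months)")
```

Under those assumed inputs the implied doubling time comes out shorter than METR’s trend line, which is the sense in which this looks like acceleration rather than business as usual.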
> Are you implying they have over 4 hour task length? I’m confused about what you’re updating on.
I think that a ~4-hour task length is the right estimate for tasks that Sonnet 4.5 consistently succeeds on with the harness that’s built into Claude Code.
Some context: most of the people I interact with are bearish on model progress. Not the LW crowd, though, I suppose. I forget to factor this in when I post here.
re: METR’s study: I’m saying there’s acceleration in model progress. The acceleration seems partly due to architectural changes being easy to find, test, and scale (e.g. variants on attention, modifications to sampling). And these changes don’t make models more expensive to train or run inference on; if anything, they seem to make them cheaper to train (e.g. MoEs, native sparse attention, DeepSeek sparse attention).
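For intuition on why attention variants are cheap to prototype, here is a minimal toy sketch of top-k sparse attention in PyTorch. This is my illustration, not DeepSeek’s actual method; real sparse-attention schemes avoid computing most of the pruned scores in the first place, while this toy computes them all and just masks, so it only illustrates the idea. The shapes and `top_k` value are arbitrary.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=64):
    """Toy sparse attention: each query attends only to its top_k
    highest-scoring keys. q, k, v: (batch, seq_len, dim).
    Illustrative only; real schemes avoid computing the pruned scores."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, L, L)
    top_k = min(top_k, scores.shape[-1])
    idx = scores.topk(top_k, dim=-1).indices                # (B, L, top_k)
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, idx, 0.0)          # keep only top_k keys per query
    weights = F.softmax(scores + mask, dim=-1)
    return weights @ v

B, L, D = 1, 1024, 64
q, k, v = (torch.randn(B, L, D) for _ in range(3))
out = topk_sparse_attention(q, k, v, top_k=64)
print(out.shape)  # torch.Size([1, 1024, 64])
# Dense attention mixes over L*L query-key pairs; keeping top_k per query
# drops that to L*top_k (here 64/1024, a 16x reduction in attended pairs).
```

The point of the sketch is just that a change like this slots in at the attention layer without touching the rest of the stack, which is one reason such variants are relatively cheap to try and, plausibly, to retrofit onto an existing model.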
Prior to this announcement (plus DeepSeek’s drop today), I thought increasing the task length that models can succeed on would have required more non-trivial architecture updates: something on the scale of DeepSeek’s RL on reasoning chains, or the use of FlashAttention to scale training FLOPs by an OOM.