I mean, if true, or even close to true, it would totally change my model of AI progress.
If the task length is like 4 hours then I’ll just update towards not believing Anthropic about this stuff.
You are misunderstanding what METR time-horizons represent. The time-horizon is not simply the length of time for which the model can remain coherent while working on a task (or anything which corresponds directly to such a time-horizon).
We can imagine a model with the ability to carry out tasks indefinitely without losing coherence but which had a METR 50% time-horizon of only ten minutes. This is because the METR task-lengths are a measure of something closer to the complexity of the problem than the length of time the model must remain coherent in order to solve it.
Now, a model’s coherence time-horizon is surely a factor in its performance on METR’s benchmarks. But intelligence matters too. Because the coherence time-horizon is not the only factor in the METR time-horizon, your leap from “Anthropic claims Claude Sonnet can remain coherent for 30+ hours” to “If its METR time-horizon is not in that ballpark that means Anthropic is untrustworthy” is not reasonable.
You see, the tasks in the HCAST task set (or whatever task set METR is now using) tend to be tasks some aspect of which cannot be found in any much shorter task. That is, a one-hour task won't be "write a programme which quite clearly just requires solving ten simpler tasks, each of which would take about six minutes to solve". There tends to be an overarching complexity to the task.
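To make this concrete, here's a rough sketch of how a 50% time-horizon gets estimated. This is not METR's actual code and the run data is invented; the point is just the shape of the measurement: fit success probability against log human task length and read off where the curve crosses 50%.

```python
# Toy sketch (not METR's code; data invented): estimate a 50% time-horizon
# by fitting success probability against log2(human task length in minutes).
import numpy as np
from sklearn.linear_model import LogisticRegression

# (human task length in minutes, did the model succeed?)
runs = [(2, 1), (4, 1), (8, 1), (15, 1), (30, 0), (60, 1),
        (120, 0), (240, 0), (480, 0)]
X = np.log2([t for t, _ in runs]).reshape(-1, 1)
y = np.array([s for _, s in runs])

fit = LogisticRegression().fit(X, y)
# The 50% horizon is where the fitted log-odds cross zero: coef * log2(t) + intercept = 0.
horizon_minutes = 2 ** (-fit.intercept_[0] / fit.coef_[0, 0])
print(f"50% time-horizon ~= {horizon_minutes:.0f} human-minutes")
```

A model that never loses coherence but fails the genuinely complex hour-plus tasks still comes out with a short horizon on a fit like this, which is the point.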
I think that’s reasonable, but Anthropic’s claim would still be a little misleading if it can only do tasks that take humans 2 hours. Like, if you claim that your model can work for 30 hours, people don’t have much way to conceptualize what that means except in terms of human task lengths. After all, I can make GPT-4 remain coherent for 30 hours by running it very slowly. So what does this number actually mean?
I want to clarify that this is Anthropic choosing to display a testimonial from one of their customers on the release page. Anthropic’s team is editorializing, but I don’t think they intended to communicate that the models are consistently completing tasks that require 10s of hours (yet).
I included the organization and person making the testimonial to try and highlight this, but I should’ve made it explicit in the post.
re: how this updated my model of AI progress:
We don’t quite know what the updates were that improved this model’s ability at long contexts, but it doesn’t seem like it was a dramatic algorithm or architecture update. It also doesn’t seem like it was OOMs more spend in training (either in FLOPs or dollars).
If I had to guess, it might be something like an attention variant (like this one by DeepSeek) that was “cheap” to re-train an existing model with. Maybe some improvements with their post-training RL setup. Whatever it was—it seems (relatively) cheap.
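To be concrete about the kind of thing I mean by "an attention variant" (this is purely illustrative, a generic top-k sparsification, not DeepSeek's actual mechanism): something with roughly this shape can in principle be swapped into an existing model and fine-tuned, rather than requiring a from-scratch pretrain.

```python
# Illustrative toy only: generic top-k sparse attention, to show the shape of
# a drop-in attention variant. Not DeepSeek's mechanism. Note this toy still
# materializes the full score matrix, so it shows the selection pattern but
# not the efficiency win a real kernel would give you.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=64):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    kth = scores.topk(min(top_k, scores.shape[-1]), dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < kth, float("-inf"))  # keep top-k keys per query
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 256, 64)
print(topk_sparse_attention(q, k, v, top_k=32).shape)  # torch.Size([1, 8, 256, 64])
```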
These improvements are being found faster than I expected (based on METR's timeline) and for cheaper than I anticipated (I predicted going from 2 hrs to 4 hrs would've needed some factor increase in additional training plus 2-3 non-trivial algorithmic improvements). I suppose this is something Anthropic is capable of in a few months, but to verify it and deploy it at scale in under half a year is quite impressive and not what I would've predicted.
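For reference, the back-of-the-envelope behind that expectation (my own numbers; METR's headline estimate is a doubling time of roughly seven months, so treat that figure as an assumption):

```python
# Back-of-the-envelope: how long one 2h -> 4h doubling "should" take on-trend.
import math

doubling_time_months = 7            # assumed: METR's ~7-month doubling estimate
start_hours, target_hours = 2, 4

doublings = math.log2(target_hours / start_hours)      # 1.0
months_on_trend = doublings * doubling_time_months     # 7.0
print(f"{doublings:.1f} doubling(s) -> ~{months_on_trend:.0f} months if on-trend")
```

That on-trend figure is the baseline I'm comparing the "a few months, deployed at scale in under half a year" timeline against.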
Are you implying they have over 4 hour task length? I’m confused about what you’re updating on.
I think that ~4 hour task length is the right estimate for tasks that Sonnet 4.5 consistently succeeds on with the harness that's built into Claude Code.
Some context: most of the people I interact with are bearish on model progress. Not the LW crowd, though, I suppose. I forget to factor this in when I post here.
re: METR’s study: I’m saying there’s acceleration in model progress. The acceleration seems partly due to the architecture changes being easy to find, test, and scale (e.g. variants on attention, modifications to sampling). And these changes do not make models more expensive to train and inference; if anything they seem to make them cheaper to train (e.g. MoEs, native sparse attention, DeepSeek sparse attention).
Prior to this announcement (plus DeepSeek’s drop today), I thought increasing the task-length that models can succeed on would’ve required more non-trivial architecture updates. Something on the scale of DeepSeek’s RL on reasoning chains or the use of FlashAttention to scale training FLOPs by an OOM.