You are misunderstanding what METR time-horizons represent. The time-horizon is not simply the length of time for which the model can remain coherent while working on a task (or anything which corresponds directly to such a time-horizon).
We can imagine a model with the ability to carry out tasks indefinitely without losing coherence but which had a METR 50% time-horizon of only ten minutes. This is because the METR task-lengths are a measure of something closer to the complexity of the problem than the length of time the model must remain coherent in order to solve it.
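To make that concrete, here is a toy sketch in Python. It assumes, as I understand METR's approach, that the 50% time-horizon is the human task length at which a logistic curve fitted against the log of task length crosses 50% success. The function names, the slope value, and the ten-minute figure are all made up for illustration and are not METR's code or data.

```python
import math


def success_probability(human_minutes, horizon_minutes, slope=1.0):
    """Toy success curve: logistic in log2 of the human task length.

    Hypothetical model, not METR's fitted curve. By construction it equals
    0.5 when human_minutes == horizon_minutes. Note that the model's
    wall-clock running time is not an input at all.
    """
    x = math.log2(horizon_minutes) - math.log2(human_minutes)
    return 1.0 / (1.0 + math.exp(-slope * x))


def fifty_percent_horizon(intercept, slope):
    """Task length t (minutes) at which sigmoid(intercept - slope * log2(t)) = 0.5."""
    return 2.0 ** (intercept / slope)


# A hypothetical model that never loses coherence but has a 10-minute horizon.
HORIZON = 10.0
for minutes in (1, 10, 60, 120):
    p = success_probability(minutes, HORIZON)
    print(f"{minutes:>4} min task: success probability {p:.3f}")
```

Nothing in the toy curve depends on how long the model can keep running. Giving this hypothetical model 30 hours instead of 30 minutes leaves its horizon at ten minutes, because the horizon is set by how success falls off with task difficulty.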
Now, a model’s coherence time-horizon is surely a factor in its performance on METR’s benchmarks. But intelligence matters too. Because the coherence time-horizon is not the only factor in the METR time-horizon, your leap from “Anthropic claims Claude Sonnet can remain coherent for 30+ hours” to “If its METR time-horizon is not in that ballpark that means Anthropic is untrustworthy” is not reasonable.
You see, the tasks in the HCAST task set (or whatever task set METR is now using) tend to involve some aspect which cannot be found in much shorter tasks. That is, a task of length one hour won’t be “write a programme which quite clearly just requires solving ten simpler tasks, each of which would take about six minutes to solve”. There tends to be an overarching complexity to the task.
I think that’s reasonable, but Anthropic’s claim would still be a little misleading if the model can only do tasks that take humans 2 hours. Like, if you claim that your model can work for 30 hours, people don’t have much way to conceptualize what that means except in terms of human task lengths. After all, I can make GPT-4 remain coherent for 30 hours by running it very slowly. So what does this number actually mean?