The 50% reliability mark on METR is interpreted wrong. A long 50% time horizon is more useful than it seems because a 50% failure rate doesn’t mean 50% of the time your output is useful and 50% of the time it’s worthless.
For shorter tasks, this maybe true, since fixing a short task takes as much time as just doing it yourself, but for longer tasks, among the 50% of failures, it’s more like 30% of the time you need to nudge it a bit, 10% of the time you need to go to another model, final 10% you need to sit down and take your time to debug.
Even the final 10% where I need to debug, sometimes looking at the model’s attempt teaches me a trick or two about a tool I didn’t know about, or an obscure feature of a tool I did know about.
On the flip side, the 50% “success” rate is more like 30% “you need substantial follow-up in the form of additional instructions or manual edits to get something production ready”, 10% “it’s not production ready but in a way you can solve with a new workflow/skill step that is always safe or a noop”, 10% “it’s fine to deploy as-is”.
The 50% reliability mark on METR is interpreted wrong. A long 50% time horizon is more useful than it seems because a 50% failure rate doesn’t mean 50% of the time your output is useful and 50% of the time it’s worthless.
For shorter tasks, this maybe true, since fixing a short task takes as much time as just doing it yourself, but for longer tasks, among the 50% of failures, it’s more like 30% of the time you need to nudge it a bit, 10% of the time you need to go to another model, final 10% you need to sit down and take your time to debug.
Even the final 10% where I need to debug, sometimes looking at the model’s attempt teaches me a trick or two about a tool I didn’t know about, or an obscure feature of a tool I did know about.
On the flip side, the 50% “success” rate is more like 30% “you need substantial follow-up in the form of additional instructions or manual edits to get something production ready”, 10% “it’s not production ready but in a way you can solve with a new workflow/skill step that is always safe or a noop”, 10% “it’s fine to deploy as-is”.