If we adjust for the 5-18x speed improvement measured for experienced workers, and target an 80% task success rate, that pushes the timeline out by over three years.
I don’t think this is a good interpretation of the 5-18x multiplier. In particular, I think the “acquire context” multiplier will shrink as tasks get longer.
Like, the task of “get a bunch of context, then complete this 1-month task” will typically take humans who don’t already have the context 2 months, maybe 6 months in particularly tricky domains, not 5-18 months. So maybe you add a doubling or so, adding more like 7 months to the timeline.
Another way to put this is that the 5-18x multiplier is an artifact of taking months of context and applying them to a short task (maybe 10 minutes or 1 hour). If instead you take a 1-month task that requires a bunch of context (e.g., implement this optimization pass in LLVM), already having the context is probably only a factor of 2-4 (or basically no multiplier, depending on the task). (There is probably an additional experience effect where people who are, e.g., experienced with compilers will be faster, but this feels separate to me and doesn’t cleanly apply to the AIs.)
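For concreteness, here is the arithmetic as a minimal sketch. The ~7-month doubling time is an assumption (it’s the figure implied by “add a doubling or so, adding more like 7 months” above); the key step is that a constant speed multiplier of m translates into log2(m) extra doublings on the timeline:

```python
import math

# Assumed horizon-length doubling time, implied by "a doubling ~ 7 months" above.
DOUBLING_TIME_MONTHS = 7

def added_timeline_months(context_multiplier: float) -> float:
    """Extra months on the timeline if lacking context costs a
    `context_multiplier`x slowdown: log2(multiplier) extra doublings."""
    return math.log2(context_multiplier) * DOUBLING_TIME_MONTHS

for m in (2, 4, 5, 18):
    print(f"{m:>2}x multiplier -> ~{added_timeline_months(m):.0f} extra months")
# 2x -> ~7, 4x -> ~14, 5x -> ~16, 18x -> ~29
```

So applying the full 5-18x multiplier adds roughly 16-29 months, whereas a 2-4x multiplier on genuinely long tasks adds more like 7-14.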
To be clear, the cost of acquiring context will still substantially reduce the usefulness of AIs with shorter horizon lengths (like 8-32 hours), cutting down the AI R&D multipliers we see along the way.
(I agree the 80% task success rate is needed and pushes out the timeline some.)
Agreed that we should expect the performance difference between high- and low-context human engineers to diminish as task sizes increase. Also agreed that the right way to account for that might be to simply discount the 5-18x multiplier when projecting forwards, but I’m not entirely sure. I did think about this before writing the post, and I kept coming back to the view that when we measure Claude 3.7 as having a 50% success rate at 50-minute tasks, or o3 at 1.5-hour tasks, we should substantially discount those timings, since those task lengths are measured against low-context humans.

On reflection, I suppose the counterargument is that this makes the measured doubling times look more impressive: (plausibly) if we look at a pair of tasks that take low-context people 10 and 20 minutes respectively, the time ratio for realistically high-context people might be more than 2x. But I could imagine this playing out in other ways as well. For example, maybe we aren’t yet looking at task sizes where people have time to absorb a significant amount of context, so as the models climb from 1- to 4- to 16- to 64-minute tasks, the humans they’re being compared against aren’t yet benefiting from context-learning effects.
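To see how that counterargument works mechanically, here is a toy model with purely made-up numbers: if the low-context/high-context speed ratio decays with task length, then a pair of tasks at a 2x low-context time ratio sits at a more-than-2x high-context ratio. The decay curve below is hypothetical, chosen only for illustration:

```python
# Toy model: the low-context/high-context speed ratio shrinks with task length.
# Hypothetical decay: ~8x on 10-minute tasks, falling toward a 2x floor on long ones.
def context_multiplier(low_context_minutes: float) -> float:
    return max(2.0, 8.0 * (10 / low_context_minutes) ** 0.25)

def high_context_minutes(low_context_minutes: float) -> float:
    return low_context_minutes / context_multiplier(low_context_minutes)

# A pair of tasks taking low-context people 10 and 20 minutes:
t1, t2 = high_context_minutes(10), high_context_minutes(20)
print(f"high-context times: {t1:.2f} and {t2:.2f} min, ratio {t2 / t1:.2f}x")
# ratio ~2.4x > 2x: each doubling of low-context task length is *more* than
# a doubling in high-context terms, making measured doubling times look more impressive.
```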
One always wishes for more data – in this case, more measurements of human task completion times with high and low context, on more problem types and a wider range of time horizons...