FWIW, that’s not a crux for me. I can totally see METR’s agency-horizon trend continuing, such that 21 months later, the SOTA model beats METR’s 8-hour tests. What I expect is that this won’t transfer to real-world performance: you wouldn’t be able to plop that model into a software engineer’s chair, prompt it with the information in the engineer’s workstation, and get one workday’s worth of output from it.
At least, not reliably and not in the general-coding setting. It’s possible this sort of performance would be achieved in some narrow domains, and that this would happen once in a while on any task. (Indeed, I think that’s already the case?) And I do expect nonzero extension of general-purpose real-world agency horizons. But what I expect is slower growth, with the real-world performance increasingly lagging behind the performance on the agency-horizon benchmark.