I originally thought the METR results meant that this year or next might be the breakthrough moment for AI coding agents. My reasoning was that, if the trend holds, AI coding agents will be able to complete tasks that take several hours with a certain probability of success, which would suddenly make the overhead and cost of using an agent economically viable.
I have now realised that this argument has a big hole: all the METR tasks are timed for unaided humans, i.e. humans working without the help of LLMs. This means that, especially for the tasks that AI coding agents can complete successfully, the time a human aided by LLMs would actually need is much shorter.
I’m not sure how many doublings of task completion time this buys before AI coding agents take over a large part of coding, but the farther we extrapolate from the existing data points, the higher the uncertainty about whether the trend will hold.
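A rough way to put a number on this is a back-of-the-envelope sketch under a strong simplifying assumption: LLM help speeds up a human by one uniform factor across tasks. If an aided human needs only 1/s of the unaided time, the agent has to reach tasks that are s times longer (in unaided-human time) to save the same amount of actual human work, which costs log2(s) extra doublings. The function names, the speedup factors, and the doubling period below are all hypothetical placeholders, not figures from the METR study.

```python
import math


def doublings_offset(aided_speedup: float) -> float:
    """Extra time-horizon doublings needed to offset an AI-aided human speedup.

    aided_speedup is a hypothetical factor: 2.0 means a human with LLM help
    needs half the time an unaided human would need for the same task.
    """
    if aided_speedup <= 0:
        raise ValueError("aided_speedup must be positive")
    return math.log2(aided_speedup)


def delay_months(aided_speedup: float, doubling_months: float) -> float:
    """Calendar delay implied by the extra doublings.

    doubling_months is an assumed doubling period for the trend, supplied by
    the caller; this only holds if the trend keeps its pace.
    """
    return doublings_offset(aided_speedup) * doubling_months


if __name__ == "__main__":
    # Illustrative speedup factors only; the real value is unknown and
    # probably varies a lot from task to task.
    for speedup in (1.5, 2.0, 4.0):
        extra = doublings_offset(speedup)
        print(f"{speedup}x aided speedup -> {extra:.2f} extra doublings")
```

Calling `delay_months(4.0, 6.0)`, for example, would translate a hypothetical 4x aided speedup and a placeholder 6-month doubling period into a calendar delay. The uniform-speedup assumption is the weak point: the speedup is likely largest on exactly the tasks agents can already complete.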
Estimating task completion times for AI-aided humans would have been an interesting addition to the study. Correlating the time savings from AI support with the probability that AI coding agents complete a task might have made it possible to predict the actual economic competitiveness of AI coding agents in the near future.