I conducted an exercise at METR to simulate what our work would be like in 2027, when we have 200 hour time horizon AIs. Some observations:
The pace of research was much faster than today, something like 3x. I would guess that speedup goes as time horizon to the 0.3 or 0.4 power, though we didn’t run the game with enough fidelity to tell (a rough check of what that exponent implies is sketched below).
No time to develop ideas before implementing: Agents implement ideas as soon as you think of them, so rather than ideating for days at a time, you can make an MVP in a couple of hours and revise. If the task isn’t near the limit of agent capabilities, you spend all your time understanding results; if it is, you spend all your time checking its work.
Keeping agents fed overnight: Overnight, agents can do maybe 200 human hours of work, but only for very agent-shaped tasks, so researchers need to deliberately sequence projects such that very long tasks suitable for agents happen overnight, e.g. optimizing a well-defined metric.
Prioritization and organization are bottlenecks: If agents can execute all your ideas nearly as fast as you can prompt them, there’s no point in implementing only your best idea. It might be better to implement your top three ideas all in parallel, but this makes it harder to stay organized. Even with AI-written dashboards to optimally help humans understand, the complexity of projects will probably go up in a way that makes projects much harder to manage.
Anything that takes serial time will no longer happen in parallel with execution, but rather becomes a serial bottleneck. Perhaps the vast majority of total project time will be taken by things like human data, ML experiments, and feedback (from peers, managers, and especially external advisors).
You can read more at the METR blog.
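A rough back-of-the-envelope check of the power-law guess above. The ~4 hour current time horizon used here is an assumption for illustration, not a number from the exercise; the point is just what exponents of 0.3 to 0.4 imply once the horizon reaches 200 hours.

```python
# Rough check of the "speedup ~ (time horizon)^0.3-0.4" guess above.
# Assumption (not from the post): speedup is normalized to 1x at today's
# time horizon, taken here to be roughly 4 human-hours.
current_horizon = 4.0     # hours; assumed, for illustration only
future_horizon = 200.0    # hours, as in the exercise

for exponent in (0.3, 0.4):
    speedup = (future_horizon / current_horizon) ** exponent
    print(f"exponent {exponent}: ~{speedup:.1f}x")
# exponent 0.3: ~3.2x
# exponent 0.4: ~4.8x
# Both land in the ballpark of the ~3x pace observed in the game.
```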
One thing that stood out to me from the post was this line: “With less verifiable tasks, the gamemaster decides how successful the AI is.” This seems like a major assumption which could change the takeaways.
I’ve been experimenting with using Opus to implement experiments with only high-level feedback. I often find, once I dig into the low-level details, that Opus has made some very poor decisions that would not be observable from the final outputs alone, and that my uplift has been significantly reduced by the time I am satisfied with the quality of the experiment.
I’m not sure if these problems will be solved by just instructing the AI to provide a proof of correctness or a summary report, although this probably depends on the specific domain. For less verifiable tasks, verification could become a huge bottleneck that scales with task complexity and significantly reduces uplift.
Thank you for sharing!
I don’t quite understand how exactly the game was played. Did players just write down what the AI accomplished in the spreadsheet to simulate the AI working on a task? If so, how do you simulate a 200 hour time horizon? E.g. how do you decide that one piece of work done by the AI could be done by a 200 hour time horizon AI and another could not?
Regarding predictions to loosen the bottleneck:
Agent’s best-guess about what comments you’d get from Beth, Hjalmar, Ajeya.
Agent’s best-guess about survey results, if you launched the survey.
Agent’s best-guess about how this will be received on Twitter.
These just feel like basic sanity checking. I doubt AI would be very good at such predictions. Is this the intention? Or am I missing something?
Yes, I (as GM) was constantly monitoring the spreadsheet, asking players to explain their actions, and deciding whether the AI would succeed. We know how many human hours certain tasks have taken METR staff in the past, and I mentally estimated these for each player action. A 200h task that was as clean/benchmarky/verifiable as HCAST/RE-Bench tasks would succeed with a 50% chance or have ~1 big mistake on average. Based on 80% time horizons being ~5 times shorter, a clean 40 human hour task would have an 80% success chance.
In an earlier version, I assigned a “messiness score” from 0 to 5, set the effective task length = human time / 2^messiness, and rolled for AI success. But this was too cumbersome to do for 3 players with 16+ actions each, and the messiness score was subjective anyway, so the players just made a quick intuitive judgement and ran it by me.
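A minimal sketch of how that adjudication rule could be mechanized, assuming a logistic success curve in log2 of task length calibrated to the two numbers above (50% at 200h, 80% at 40h). The messiness penalty is applied as a factor-of-two difficulty hit per point; that is my reading of the halving rule described above, chosen so that messier tasks come out harder.

```python
import math
import random

# Sketch of the GM's success roll, assuming a logistic success curve in
# log2(task length), calibrated to the two data points given above:
# a clean 200h task succeeds ~50% of the time, a clean 40h task ~80%.
H50 = 200.0   # 50%-success horizon, human hours
H80 = 40.0    # 80%-success horizon, human hours (~5x shorter)
SLOPE = math.log(0.8 / 0.2) / (math.log2(H50) - math.log2(H80))   # ~0.6 per doubling

def success_prob(human_hours: float, messiness: int = 0) -> float:
    """P(the AI finishes the task without a big mistake).

    Each messiness point (0-5) is treated as a factor-of-two difficulty
    penalty -- an assumption about the direction of the halving rule
    described above, applied so that messier tasks are harder.
    """
    effective_hours = human_hours * 2 ** messiness
    x = SLOPE * (math.log2(H50) - math.log2(effective_hours))
    return 1.0 / (1.0 + math.exp(-x))

def roll_for_success(human_hours: float, messiness: int = 0) -> bool:
    return random.random() < success_prob(human_hours, messiness)

# Sanity checks: success_prob(200) ≈ 0.50, success_prob(40) ≈ 0.80
```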
Agents are starting to be good at this kind of thing if you give them enough context. With access to every google doc Beth Barnes has ever written, and the ability to update its custom instructions with every piece of Beth’s actual feedback, my guess is the Beth simulator agent will be good enough to overlap 50%+ with Beth’s actual feedback, which makes this a crucial step when actual Beth has 3x as many things going on.
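A hypothetical sketch of what such a reviewer simulator could look like; this is not anything METR has built, and `call_llm` is a placeholder rather than a real API. The idea is just to ground the agent in everything the reviewer has written and fold each piece of their actual feedback back into its context.

```python
from pathlib import Path

def call_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for a real model call")

class ReviewerSimulator:
    def __init__(self, corpus_dir: str, feedback_log: str = "actual_feedback.txt"):
        # corpus_dir: a folder of the reviewer's exported docs, as plain text
        self.corpus = "\n\n".join(p.read_text() for p in Path(corpus_dir).glob("*.txt"))
        self.feedback_log = Path(feedback_log)

    def predict_feedback(self, draft: str) -> str:
        past = self.feedback_log.read_text() if self.feedback_log.exists() else ""
        prompt = (
            "You are simulating a specific reviewer. Their past writing:\n"
            f"{self.corpus}\n\n"
            "Feedback they actually gave on earlier drafts (treat as ground truth "
            f"about their taste):\n{past}\n\n"
            f"Predict the comments they would leave on this draft:\n{draft}"
        )
        return call_llm(prompt)

    def record_actual_feedback(self, draft: str, feedback: str) -> None:
        # Fold the reviewer's real comments back in, so predictions improve over time.
        with self.feedback_log.open("a") as f:
            f.write(f"\n--- draft ---\n{draft}\n--- their feedback ---\n{feedback}\n")
```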
I think (1) is quite important context and it would be nice to make it discoverable in the main article.
This makes me realize that we really need the AI-written dashboard you are talking about.
This post in general has so many AI startup ideas embedded into it. The general feeling I get is that we really need an AI IDE (which is to an AI workflow what a regular IDE is to a coding workflow). All of the plans, AI task results, “short term utility functions” etc. would require a really specialized UI to keep track of while minimizing friction and thus maximizing productivity.
I’ve had much the same thoughts previously. Work is gonna be weird.