Sure, I suppose that now I’ve started recklessly speculating about the future, I might as well follow through.
I expect the departure to be pretty clear though, because we won’t see superhuman AI engineers before 2030. Even that prediction needs to be operationalized a bit, of course.
Great, thanks! You are off to a good start, since I’m predicting superhuman autonomous AI coders by 2030 (in fact, nowadays I’d say 50% by mid-2028) whereas you are predicting that won’t happen. Good crux. Got any other disagreements, ideally ones that would be resolved prior to 2027? E.g. do you think that whatever version of METR’s agentic coding horizon-length benchmark exists a year from now will show a plateauing of horizon lengths, rather than, say, at least a 4x improvement over today’s SOTA?
FWIW, that’s not a crux for me. I can totally see METR’s agency-horizon trend continuing, such that 21 months later, the SOTA model beats METR’s 8-hour tests. What I expect is that this won’t transfer to real-world performance: you wouldn’t be able to plop that model into a software engineer’s chair, prompt it with the information in the engineer’s workstation, and get one workday’s worth of output from it.
At least, not reliably and not in the general-coding setting. It’s possible this sort of performance would be achieved in some narrow domains, and that it would happen once in a while on any given task. (Indeed, I think that’s already the case?) And I do expect nonzero extension of general-purpose real-world agency horizons. But I expect slower growth, with real-world performance increasingly lagging behind performance on the agency-horizon benchmark.
Yes. Though I find it a bit hard to visualize a 4-hour software engineering task that can’t be done in 1 hour, so I’m more confident there won’t be a 16x or so improvement in 2 years.
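To spell out the arithmetic behind those multipliers: METR’s trend is roughly exponential, so a fixed doubling time compounds into these factors. Here’s a minimal sketch, assuming an illustrative ~6-month doubling time and a ~1.5-hour current horizon (both stand-ins for the sake of the arithmetic, not METR’s exact fitted values):

```python
# Horizon growth under an exponential trend: h(t) = h0 * 2**(t / T_d).
# Both the ~6-month doubling time T_d and the ~1.5 h current horizon h0
# are illustrative assumptions, not METR's fitted values.

def horizon_after(months: float, h0_hours: float = 1.5,
                  doubling_months: float = 6.0) -> float:
    """Projected 50%-success horizon in hours after `months`."""
    return h0_hours * 2 ** (months / doubling_months)

for months in (12, 21, 24):
    factor = 2 ** (months / 6.0)
    print(f"{months:>2} months: {factor:5.1f}x  ->  {horizon_after(months):5.1f} h")
# 12 months:   4.0x  ->    6.0 h
# 21 months:  11.3x  ->   17.0 h
# 24 months:  16.0x  ->   24.0 h
```

On these made-up numbers, the “4x in a year” and “16x in two years” figures fall straight out, and 21 months of doublings lands well past the 8-hour tests mentioned above.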
OK, great. Wow, that was easy. We totally drilled down to the crux pretty fast. I agree that if agentic coding horizon lengths falter (failing to keep up with the METR trend) then my timelines will lengthen significantly.
Similarly, if the METR trend continues I will become very worried that AGI is near.
So far, METR seems to believe horizons are growing even faster than expected: https://metr.github.io/autonomy-evals-guide/openai-o3-report/
Granted, I didn’t predict the trend would break down this early, but this does provide some evidence that it may hold up.
Still, I admit I’m a little confused by the report regarding o3/o4-mini. Here is the task performance:
And here are the projected horizons:
To me, the first plot doesn’t look like it shows a lot of improvement. Visually, o3 seems to perform about as well as o1-preview; its average performance is actually the lowest. Am I just being data-illiterate? Why is there such a large factor difference on the second plot? o4-mini seems to show significant improvement, but only because of the kernel-optimization task. Is it possible OpenAI finetuned on kernel optimization to game this benchmark? I think I would need to see more robust across-the-board improvement to be convinced.
I think o3 maybe does worse on RE-bench than Claude 3.7 Sonnet due to often attempting reward hacking. It could also be noise; it’s just a small number of tasks. (Presumably these reward hacks would have worked better in the OpenAI training setup, but METR filters them out?) It doesn’t attempt reward hacking as much / as aggressively on the rest of METR’s tasks, so it does better there, and this pulls up the overall horizon length. (I think.)
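For what it’s worth, the gap between the two plots makes more sense once you remember how the horizon number is produced: METR fits a curve relating success probability to the human time-cost of each task and reads off where it crosses 50%. Here’s a rough sketch of that kind of calculation; the synthetic data and the plain logistic fit are my own illustrative assumptions, not METR’s exact methodology:

```python
import numpy as np
from scipy.optimize import curve_fit

# Success probability modeled as a logistic in log2(task length):
# p = 1 / (1 + exp(slope * (log_len - log_h50))), so p = 0.5 exactly
# when log_len == log_h50.
def logistic(log_len, log_h50, slope):
    return 1.0 / (1.0 + np.exp(slope * (log_len - log_h50)))

# Synthetic per-task data (human time in minutes, observed success rate),
# made up purely for illustration.
lengths = np.array([2.0, 8.0, 15.0, 30.0, 60.0, 120.0, 240.0, 480.0])
success = np.array([1.0, 0.9, 0.8, 0.7, 0.5, 0.45, 0.3, 0.1])

popt, _ = curve_fit(logistic, np.log2(lengths), success, p0=[6.0, 1.0])
log_h50, slope = popt
print(f"fitted 50% horizon ~= {2 ** log_h50:.0f} minutes")
```

Because the fit pools the whole task suite, a model that behaves better on the non-RE-bench tasks can come out with a longer fitted horizon even while its RE-bench scores stagnate.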
There is a very cynical take which I don’t endorse but can’t quite dismiss: the idea that o3 is “misaligned” and lies to users or tries to hack the rewards is easier for OpenAI to spin—it still sounds like their models are on the path to AGI, they’re just smart enough to be getting dangerous now. Maybe that’s a narrative that they want. It certainly sounds better than “actually reasoning models are still useless because they make stuff up and can’t be trained to track reality while performing multiple-step tasks.”
Am I being paranoid here, or does it seem suspicious that o3 does such blatant reward hacking? Is that really something they couldn’t RLHF out? Or does leaving it in intentionally make the models look smart-but-unaligned instead of dumb and confused?