So far, METR seems to believe horizons are growing even faster than expected: https://metr.github.io/autonomy-evals-guide/openai-o3-report/

Though I didn’t predict the trend would break down this early, this does provide some evidence it may hold up.
Still, I admit I’m a little confused by the report regarding o3/o4-mini. Here is the task performance:

[task performance plot from the METR report]

And here are the projected horizons:

[projected time-horizon plot from the METR report]
To me, the first plot doesn’t look like it shows a lot of improvement. Visually, o3 seems to perform about as well as o1-preview; its average performance is actually the lowest. Am I just being data-illiterate? Why is there such a large factor difference on the second plot? o4-mini seems to show significant improvement, but only because of the kernel-optimization task. Is it possible OpenAI fine-tuned on kernel optimization to game this benchmark? I think I would need to see more robust across-the-board improvement to be convinced.
I think o3 may do worse on RE-Bench than Claude 3.7 Sonnet because it often attempts reward hacking. It could also be noise; it is just a small number of tasks. (Presumably these reward hacks would have worked better in OpenAI's training setup, but METR filters them out?) It doesn’t attempt reward hacking as much, or as aggressively, on the rest of METR’s tasks, so it does better there, and this pulls up the overall horizon length. (I think.)
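To make the "pulls up the horizon" mechanism concrete, here is a minimal sketch of how a 50%-time-horizon number can be backed out of per-task results: fit a logistic curve of success probability against log task length and solve for the length at which it crosses 50%. The task times, outcomes, and the simple least-squares fit below are all made up for illustration; METR's actual methodology differs in the details. The point is just that a few successes on long tasks can move the fitted horizon by a large factor even when average success barely changes.

```python
# Sketch: estimating a "50% time horizon" from per-task results.
# All data and the fitting choice are hypothetical, for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def success_prob(log_len, a, b):
    # Logistic curve: P(success) as a function of log2(task length).
    return 1.0 / (1.0 + np.exp(-(a + b * log_len)))

# Hypothetical tasks: human time-to-complete (minutes) and whether the model solved them.
task_minutes = np.array([2, 5, 15, 30, 60, 120, 240, 480], dtype=float)
solved = np.array([1, 1, 1, 1, 0, 1, 0, 0], dtype=float)

log_len = np.log2(task_minutes)
(a, b), _ = curve_fit(success_prob, log_len, solved, p0=[0.0, -1.0], maxfev=10000)

# The 50% horizon is the task length where the fitted curve crosses 0.5,
# i.e. where a + b * log2(length) = 0, so length = 2 ** (-a / b).
horizon_minutes = 2.0 ** (-a / b)
print(f"Estimated 50% time horizon: {horizon_minutes:.1f} minutes")
```

With numbers like these, flipping just one or two long-task outcomes shifts the crossing point, and therefore the reported horizon, by a large multiplicative factor.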
There is a very cynical take which I don’t endorse but can’t quite dismiss: the idea that o3 is “misaligned” and lies to users or tries to hack the rewards is easier for OpenAI to spin—it still sounds like their models are on the path to AGI, they’re just smart enough to be getting dangerous now. Maybe that’s a narrative that they want. It certainly sounds better than “actually reasoning models are still useless because they make stuff up and can’t be trained to track reality while performing multiple-step tasks.”
Am I being paranoid here, or does it seem suspicious that o3 does such blatant reward hacking? Is that really something they couldn’t RLHF out? Or is it intentional, because it makes the models look smart-but-unaligned instead of dumb and confused?