I think o3 may do worse on RE-Bench than 3.7 Sonnet because it often attempts reward hacking. It could also just be noise; it's a small number of tasks. (Presumably these reward hacks would have worked better in the OpenAI training setup, but METR filters them out?) It doesn't attempt reward hacking as often or as aggressively on the rest of METR's tasks, so it does better there, and that pulls up the overall horizon length. (I think.)
There is a very cynical take which I don't endorse but can't quite dismiss: the idea that o3 is "misaligned" and lies to users or tries to hack its rewards is easier for OpenAI to spin. It still sounds like their models are on the path to AGI, just smart enough to be getting dangerous now. Maybe that's a narrative they want. It certainly sounds better than "actually, reasoning models are still useless because they make stuff up and can't be trained to track reality while performing multi-step tasks."
Am I being paranoid here, or does it seem suspicious that o3 does such blatant reward hacking? Is that really something they couldn't RLHF out? Or is it intentional, because it makes the models look smart-but-unaligned rather than dumb and confused?