My first impression of o3 (as available via Chatbot Arena) is that when I show it my AI scaling analysis comments (such as this and this), it responds with confident, unhinged speculation teeming with hallucinations, compared to the other recent models, which usually respond with bland rephrasings that get almost everything right, with a few minor hallucinations or reasonable misconceptions carried over from their outdated knowledge.
Don’t know yet if it’s specific to speculative/forecasting discussions, but it doesn’t look good (for faithfulness of arguments) when combined with good performance on benchmarks. Possibly stream-of-consciousness-style data is useful to write down within long reasoning traces and can add up to normality for questions with a short final answer, but it results in spurious details within confabulated summarized arguments for that answer (outside the hidden reasoning trace) that aren’t measured by hallucination benchmarks and so are allowed to get worse. Though in the o3 System Card, the hallucination rate also increased significantly compared to o1 (Section 3.3).
The system card also contains some juicy regression hidden within the worst-graph-of-all-time in the SWE-Lancer section:
If you can cobble together the will to work through the inane color scheme, it is very interesting to note that while the expected RL-able IC tasks show improvements, the Manager improvements are far less uniform; in particular, o1 (and 4o!) remains the stronger performer vs o3 when weighted by the (ahem, controversial) $$$-based benchmark. And this is all within the technical field of (essentially) system design, with verifiable answers, that the SWE-Lancer Manager benchmark represents.
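To make the $-weighting point concrete, here is a minimal sketch of how a dollar-weighted score (in the spirit of SWE-Lancer's "dollars earned" metric) can rank models differently from a raw pass rate. All payouts and pass/fail outcomes below are entirely hypothetical, chosen only for illustration; they are not numbers from the system card or the SWE-Lancer paper.

```python
# Minimal sketch: raw pass rate vs $-weighted score on the same task set.
# Payouts and outcomes are hypothetical, purely illustrative.

def pass_rate(results):
    """Fraction of tasks solved, each task counted equally."""
    return sum(solved for _, solved in results) / len(results)

def dollars_earned(results):
    """Total payout from solved tasks ($-weighted scoring)."""
    return sum(payout for payout, solved in results if solved)

# Hypothetical Manager-task results as (payout_usd, solved) pairs.
model_a = [(250, True), (500, True), (1000, True), (16000, False)]   # solves many cheap tasks
model_b = [(250, False), (500, False), (1000, True), (16000, True)]  # solves the expensive one

for name, results in [("model_a", model_a), ("model_b", model_b)]:
    print(f"{name}: pass rate {pass_rate(results):.0%}, earned ${dollars_earned(results):,}")

# model_a wins on raw pass rate (75% vs 50%), but model_b wins on dollars
# earned ($17,000 vs $1,750) -- the ordering a $-weighted benchmark reports.
```

So a model can look better on the headline pass-rate bars while still regressing on the dollar-weighted ranking, which is the pattern the graph seems to show for o3 vs o1 on the Manager tasks.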
So the finite set of activated weights is likely getting cannibalized further away from pre-training generality, towards the increasingly evident fragility of RL-ed tasks. However, I feel it is also decent evidence on the perennial question of activated weight size re: o1 vs o3, and that o3 is not yet the model designed to consume the extensive (yet expensive) shared world-size capacity of OAI’s shiny new NVL72 racks.
As a separate aside, it was amusing setting o3 off to work on re-graphing this data into a sane format, and observing the RL-ed tool-use fragility: it took 50+ tries over 15 minutes of repeatedly failed panning, cropping, and zooming operations for it to diligently work out an accurate data extraction, but work out an extraction it did. Inference scaling in action!