That sounds surprising. If it is ‘usually’ the case that o3 fails abysmally and 4o succeeds, then could you link to a pair of o3 vs 4o conversations showing that behavior on an identical prompt—preferably where the prompt is as short and simple as possible?