gwern comments on So You Think You’ve Awoken ChatGPT

gwern 22 Jul 2025 5:11 UTC
4 points
That sounds surprising. If it is ‘usually’ the case that o3 fails abysmally and 4o succeeds, then could you link to a pair of o3 vs 4o conversations showing that behavior on an identical prompt—preferably where the prompt is as short and simple as possible?