The account in that story contradicts other reports of o4-mini’s performance that I’ve seen, which are less openly enthusiastic, e.g. Greg Burnham:
OpenAI released o3 and o4-mini last week. I’ll use this post to review the new models against standard math benchmarks, as well as on some of the less-formal evaluations I’ve developed in previous posts. [...]
TL;DR: in terms of math, the new models are an incremental step forward compared to OpenAI’s previous models. Their progress comes from tightening up existing capabilities rather than breaking new ground. They are roughly tied with Gemini 2.5 Pro as “best math model”. From the anecdotal excitement I’ve seen about these new models, it seems like OpenAI achieved more progress in other, non-math areas. I wouldn’t be surprised if that trend continues.
and Daniel Litt (his opening message says he’s testing both o3 and o4-mini, but he later only talks about testing “it”; I presume that if he tested both, he’s referring to whichever got better results. This is supported by his remark later in the thread that “Incidentally, o4-mini-high seems a bit better than o3 for this kind of thing, though I’m not confident about that”):
In this thread I’ll record some brief impressions from trying to use o3/o4-mini (the new OpenAI models) for mathematical tasks. [...]
I asked it for comments on my latest paper. Here I was very underwhelmed; it hallucinated several results, and outlines of arguments were wrong. Several comments seemed not to refer correctly to anything in the paper. FWIW Gemini 2.5 pro seems to do much better at this task.
I tried a few of my standard test questions; here it did somewhat better than past OpenAI models, and comparable to or better than other SOTA models. One thing I found very impressive was its use of tools to find examples.
For example, I asked “does there exist a 2x2 integer matrix which is diagonalizable mod p for all p and over Q, but not over Z,” and it correctly found an example (rather than “proving” that the answer is “no,” which is typical for other models). [...]
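(To make the question concrete: one matrix with this property is M = [[0, 2], [0, 4]] — my own illustration, not necessarily the example the model found. Its eigenvalues 0 and 4 are distinct over Q; mod 2 it reduces to the zero matrix and mod odd p its minimal polynomial x(x−4) is squarefree; but its primitive eigenvectors (1,0) and (1,2) span a sublattice of index 2 in Z², so no unimodular change of basis diagonalizes it. A sketch of the mechanical check, using sympy:)

```python
from sympy import Matrix, Poly, Symbol, gcd, lcm, primerange

x = Symbol("x")
# Candidate matrix (my illustration, not necessarily the model's example).
M = Matrix([[0, 2], [0, 4]])

# 1) Diagonalizable over Q: eigenvalues 0 and 4 are distinct and rational.
assert M.is_diagonalizable()

# 2) Diagonalizable mod every prime p: mod 2 the matrix is the zero matrix
#    (already diagonal); mod odd p the minimal polynomial x(x-4) is squarefree.
for p in primerange(2, 100):
    if all(e % p == 0 for e in M):
        continue  # M reduces to the scalar (zero) matrix mod p
    mp = Poly(x * (x - 4), x, modulus=p)
    assert mp.gcd(mp.diff(x)).degree() == 0  # squarefree => diagonalizable

# 3) NOT diagonalizable over Z: build the matrix of primitive integer
#    eigenvectors and observe it is not unimodular (det = +-2, not +-1).
def primitive(v):
    v = v * lcm([entry.q for entry in v])  # clear denominators
    return v / gcd(list(v))               # divide out the common factor

P = Matrix.hstack(*[primitive(basis[0]) for _, _, basis in M.eigenvects()])
assert abs(P.det()) == 2  # every integer eigenbasis has det +-2, never a unit
print("diagonalizable over Q and mod every prime, but not over Z")
```

Any integer eigenbasis consists of multiples of (1, 0) and (1, 2), so its determinant is a multiple of 2; that is why the determinant check settles the question over Z.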
I asked for a sketch of Katz’s proof of the p-curvature conjecture for Gauss-Manin connections. The results were disappointing; it could not reliably get the statement correct, and it never correctly summarized the proof. This is a well-exposited argument, so I was surprised.
I asked it about the statistics of lines in a random cubic surface over a finite field; it correctly understood that this is really a question about the statistics of a certain transitive action of the finite group W(E_6). (Previous models also have understood this.)
Like Gemini 2.5 pro it attempted to do some computation of these statistics and then claimed to give a precise answer; this was completely hallucinated. That is, it made up a number p and wrote code that amounted to “print p,” claiming it was actually computing something. [...]
I asked it to solve some problems from the exam I’m administering today, and from some homeworks I’ve assigned this semester; it did very well (as did o1/o3-mini before it).
A few proofs had bluffs but it certainly would have received a passing grade, maybe even an A. FWIW I do think this is bad behavior—bluffing in proofs is generally a bad sign for research potential. It’s not truth-seeking. [...]
I asked for proofs of a few basic lemmas in a paper I’m writing. All of these were fatally flawed. [...]
I asked about the proofs of a few standard but non-trivial facts from representation theory. Here results were mixed; some arguments were excellent but some were nonsense.
For example, let G be a finite group and V a complex irreducible representation of G. Then dim(V) divides |G|. The proof is a slightly tricky argument using integrality of values of characters. It produced several wrong “elementary” arguments.
It correctly alluded to using algebraic integrality but I was never able to get it to give a correct proof. This is surprising because the answer is very googleable.
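(A toy sanity check of the divisibility statement, using two standard counting facts rather than the integrality proof itself: the number of irreducibles equals the number of conjugacy classes, and the squares of the degrees sum to |G|. For small groups like S3 and Q8 those constraints pin down the degrees, and one can then confirm each degree divides the group order. The helper below is my own illustration:)

```python
from itertools import combinations_with_replacement

def irrep_degrees(order, num_classes):
    """All multisets of positive integers d_i with exactly num_classes parts,
    containing 1 (the trivial representation) and with sum(d_i^2) == order."""
    return [
        combo
        for combo in combinations_with_replacement(range(1, order + 1), num_classes)
        if 1 in combo and sum(d * d for d in combo) == order
    ]

# S3: order 6, 3 conjugacy classes -> unique solution (1, 1, 2)
[degs_s3] = irrep_degrees(6, 3)
assert degs_s3 == (1, 1, 2)
assert all(6 % d == 0 for d in degs_s3)  # each degree divides |G| = 6

# Q8: order 8, 5 conjugacy classes -> unique solution (1, 1, 1, 1, 2)
[degs_q8] = irrep_degrees(8, 5)
assert degs_q8 == (1, 1, 1, 1, 2)
assert all(8 % d == 0 for d in degs_q8)  # each degree divides |G| = 8
```

This is only a consistency check on small cases, of course; the actual theorem needs the algebraic-integrality argument alluded to above.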
(Incidentally, o4-mini-high seems a bit better than o3 for this kind of thing, though I’m not confident about that.)
I asked it to prove a difficult theorem of Serre about tensor products of semisimple representations in positive characteristic. Here it correctly recognized and cited the result of Serre; unlike past models it didn’t try to give an elementary (wrong) proof without prodding.
When asked to sketch a proof, as before, the results were nonsense.