Mo Putera comments on Mo Putera’s Shortform

Mo Putera 23 May 2026 19:24 UTC
2 points
0
Episode #2 of the AI industrialisation of math series: GDM’s best Gemini 3.1 Pro-based agent autonomously resolved 9 open Erdős problems out of the full set of 353 in the open-source Formal Conjectures repo at a few hundred dollars per problem. (They added that “we do not intend USD as a comparison of AI and human mathematical labor.”) I’ve previously joked that the current SOTA on the “Erdos problems benchmark” is Terry Tao’s guess of 1-2%; now we can say that with 3.1 Pro it’s ⁹⁄₃₅₃ = 2.5% (Tao’s disclaimer #1 notwithstanding).
Outside of Erdős problems, the agents also resolved an open question in optimization theory with a proof departing from approaches in previous work, proved a variant of the graph reconstruction conjecture (one of the oldest open problems in combinatorics) and made progress on the full-strength conjecture, proved an open graph theory conjecture posed in 1996 by an automated conjecturing system called Graffiti, solved two open problems out of four attempted in algebraic geometry including one that had been a well-known open problem for 15 years, resolved problem #57 from Green’s well-known list of open conjectures, and resolved multiple conjectures of a certain form in quantum optics. Papers based on several of these results are currently in preparation.
We built AlphaProof Nexus with the belief that the future of mathematics lies in human-machine partnership, where interactive AI tools serve to expand a mathematician’s creative capacity. Our results support this vision. Our mathematician collaborators found that proof attempts by our agents enhanced their understanding of a problem, even when an agent could not prove the claim at hand. Because the sketches were formal, experts could focus on the unresolved subgoals rather than re-verifying the entire argument. Moreover, the agents were powerful tools for detecting misformalizations. These experiences suggest that AI-driven formal proof search can serve not only to solve problems but to deepen human understanding.
At present, our agents’ successes are concentrated in areas such as combinatorics, convex optimization, and number theory, where Lean’s mathematics library [mathlib2020] is mature and tasks often decompose into tractable subgoals. Even most Erdős problems remain out of reach, let alone problems that require extensive new theory. Additionally, our agents inherit the biases of their underlying LLMs and exhibit high search variance. Characterizing the agents’ boundaries and expanding them is an important direction for future work.
The solved problems:
On cost per problem:
Per-problem inference costs exhibit high variance due to the stochastic nature of our agents. The reported costs also do not capture the full cost of discovery: we applied the full-featured agent to all 353 Erdős problems in Formal Conjectures, and identifying tractable problems was itself a significant computational investment. AlphaProof cost approximately 27.5 TPU hours ($60 USD) per problem on v6e TPUs.
Figure 9: Box plots illustrating the distribution of a cost for successful proof for the Erdős problems. Four distinct experimental configurations are compared: basic@K=10 (blue), basic with AlphaProof @K=10 (orange), basic with evolution (green), and full (red). The solid horizontal line within each box denotes the median cost, while the upper and lower box boundaries represent the third and first quartiles, respectively. Whiskers extend to the rest of the distribution, and individual circles indicate outlier data points. Triangular markers denote instances with a single data point where only one proof attempt was successful.
Figure 10: Solve rate versus mean inference cost (USD) across nine Erdős problem instances. Seven system configurations shown: basic (blue circles), basic with AlphaProof (AP) (orange squares), basic with evolution (green diamonds), full (red triangles), full@1 (purple downward triangles), full@3 (brown pluses), and full@6 (pink x), where @S are the variants of full system with a given number of parallel LLM generation threads. Connected curves denote the number of independent attempts K. Error bars indicate one standard error interval. For the basic and basic with AlphaProof configurations, each curve traces the cost–performance Pareto frontier as K increases, revealing diminishing marginal returns at higher budgets. Note that basic with AlphaProof and full do not include the inference cost of AlphaProof.
Figure 11: Box plots illustrating the distribution of a wall-clock time for a successful proof for the Erdős problems. Three distinct experimental configurations are compared: basic@K=10 (blue), basic with AlphaProof @K=10 (orange), and full (red). The solid horizontal line within each box denotes the median time, while the upper and lower box boundaries represent the third and first quartiles, respectively. Whiskers extend to the rest of the distribution, and individual circles indicate outlier data points. Triangular markers denote instances with a single data point where only one proof attempt was successful. A cut-off of 48 hours applied to all experiments.