I have no opinion about whether formalizing proofs will be a hard problem in 2025, but I think you’re underestimating the difficulty of the task (“math proofs are math proofs” is very much a false statement for today’s LLMs, for example).
In any event, my issue is that formalizing proofs is very clearly not involved in the o1/o3 pipeline, since those models make so many formally incorrect arguments. The people behind FrontierMath have said that o3 solved many of the problems using heuristic algorithms with wrong reasoning behind them; that’s not something a model trained on formally verified proofs would do. I see the same thing with o1, which was evaluated on the Putnam and got the right answer with a wrong proof on nearly every question.
Right now, it seems to be important to not restrict the transcripts at all. This is a hard exploration problem, where most of the answers are useless, and it takes a lot of time for correct answers to finally emerge. Given that, you need to keep the criteria as relaxed as possible, as they are already on the verge of impossibility.
The R1 paper, the other labs, and OAers on Twitter now seem to emphasize that the obvious, appealing approach of rewarding tokens for predicted correctness or doing search on tokens just doesn’t work (right now). You need to ‘let the LLMs yap’ until they reach the final correct answer. This appears to be the reason for the bizarre non sequiturs or multi-lingual diversions in transcripts: that’s just the cost of rolling out solution attempts which can go anywhere and keeping the winners. They will do all sorts of things which are unnecessary (and conversely, omit tokens which are ‘necessary’). Think of it as the equivalent of how DRL agents will ‘jitter’ and take many unnecessary actions, because those actions don’t change the final reward more than epsilon, and the RL feedback just isn’t rich enough to say ‘you don’t need to bounce up and down randomly while waiting for the ball to bounce back, that doesn’t actually help or hurt you’ (and if you try to reward-shape away those wasteful movements, you may discover your DRL agent converges to a local optimum where it doesn’t do anything, ever, because the jitters served to explore the environment and find new tricks, and you made it too expensive to try useless-seeming tricks so it never found any payoffs or laddered its way up in capabilities).
So you wouldn’t want to impose constraints like ‘must be a 100% correct, valid Lean proof’, because it is hard enough to find a ‘correct’ transcript even when you don’t penalize the model for spending a while yapping in Japanese or pseudo-skipping easy steps by not writing them down. If you imposed constraints like that, then instead of rolling out 1000 episodes and getting 1 useful transcript and the bootstrap working, you’d get 0 useful transcripts and it’d go nowhere.
What you might do is impose a curriculum: solve it any way you can at first, then solve it the right way. Once you have your o1 bootstrap working and have seen large capability gains, you can go back and retrain on the easiest problems with stricter criteria, and work your way back up through the capability levels, but now in some superior way. (In the DRL agent context, you might train to convergence and only then impose a very, very small penalty on each movement, and gradually ramp it up until the performance degrades a little bit but it’s no longer jittering.) The same way you might be taught something informally, and then only much later, after you’ve worked with it a lot, do you go back and learn or prove it rigorously. You might impose a progressive shrinking constraint, for example, where the transcript has to be fewer tokens each time, in order to distill the knowledge into the forward passes to make it vastly cheaper to run (even cheaper, for hard problems, than simply training a small dumb model on the transcripts). You might try to iron out the irrelevancies and digressions by having a judge/critic LLM delete irrelevant parts. You might try to eliminate steganography by rewriting the entire transcript using a different model. Or you might simply prompt it to write a proof in Lean, and score it by whether the final answer validates.
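For concreteness, here is a minimal Python sketch of that two-phase idea under loose assumptions; the helpers sample_transcript, final_answer, and finetune are hypothetical stand-ins, not any real pipeline’s API:

```python
# Hypothetical helpers (illustration only): sample_transcript(model, problem, max_tokens)
# rolls out one chain-of-thought attempt, final_answer(t) extracts the answer at the end,
# and finetune(model, data) updates the model on the kept (problem, transcript) pairs.

def bootstrap_phase(model, problems, rollouts=1000):
    """Phase 1: the only criterion is whether the final answer checks out."""
    kept = []
    for problem in problems:
        for _ in range(rollouts):
            t = sample_transcript(model, problem, max_tokens=None)
            if final_answer(t) == problem.known_answer:  # no penalty for yapping
                kept.append((problem, t))
                break  # 1 useful transcript out of ~1000 rollouts is enough
    return finetune(model, kept)

def shrinking_phase(model, easy_problems, shrink=0.9, rounds=5):
    """Phase 2: re-solve already-mastered problems under a ratcheting token budget."""
    budgets = {p: len(sample_transcript(model, p, max_tokens=None)) for p in easy_problems}
    for _ in range(rounds):
        kept = []
        for p in easy_problems:
            t = sample_transcript(model, p, max_tokens=int(budgets[p] * shrink))
            if final_answer(t) == p.known_answer:
                kept.append((p, t))
                budgets[p] = len(t)  # tighten the budget only after a success
        model = finetune(model, kept)
    return model
```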
There’s still my original question of where the feedback comes from. You say keep the transcripts where the final answer is correct, but how do you know the final answer? And how do you come up with the question?
What seems to be going on is that these models are actually quite supervised, despite everyone’s insistence on calling them unsupervised RL. The questions and answers appear to be high-quality human annotation instead of being machine generated. Let me know if I’m wrong about this.
If I’m right, it has implications for scaling. You need human annotators to scale, and you need to annotate increasingly hard problems. You don’t get to RL your way to infinite skill like AlphaZero; if, say, the Riemann hypothesis turns out to be ~3 OOMs of difficulty beyond what humans can currently annotate, then this type of training will never solve it, no matter how you scale.
So how could I have thought that faster might actually be a sensible training trick for reasoning models.

In FBAI’s COCONUT they use a curriculum to teach the model to think shorter and differently, and it works. They are teaching it to think in fewer steps, compressing the steps into latent vectors instead of tokens (a rough sketch in code follows the list):
first it thinks with tokens
then they replace one thinking step with a latent <thought> token
then 2
...
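As I understand the paper, the curriculum amounts to something like the data construction below (a toy Python sketch only; real COCONUT feeds the model’s last hidden state back in as the next input embedding rather than a literal placeholder token, and how many latent thoughts replace each step is a hyperparameter):

```python
# Toy illustration of a COCONUT-style curriculum (not the paper's code). At stage k,
# the first k chain-of-thought steps are replaced by latent <thought> slots that get
# no token-level supervision; only the remaining steps and the answer are trained as text.

LATENT = "<thought>"  # stand-in; in COCONUT this is a continuous hidden state, not a token

def build_stage(question: str, cot_steps: list[str], answer: str, k: int):
    """Return (sequence, supervise_mask) for curriculum stage k."""
    latent_part = [LATENT] * k              # k latent slots, no text target
    text_part = cot_steps[k:] + [answer]    # remaining steps + answer, supervised as text
    sequence = [question] + latent_part + text_part
    supervise_mask = [False] + [False] * k + [True] * len(text_part)
    return sequence, supervise_mask

# Stage 0 is ordinary chain-of-thought training; each later stage hides one more step.
question = "2 + 3 * 4 = ?"
steps = ["3 * 4 = 12", "2 + 12 = 14"]
for k in range(len(steps) + 1):
    print(build_stage(question, steps, "14", k))
```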
It’s not RL, but what is RL any more? It’s becoming blurry. They don’t reward or punish it for anything in the thought token. So it learns thoughts that are helpful in outputting the correct answer.

There’s another relevant paper, “Compressed Chain of Thought: Efficient Reasoning through Dense Representations”, which used teacher forcing, although I haven’t read the whole thing yet.
It’s not RL, but what is RL any more? It’s becoming blurry. They don’t reward or punish it for anything in the thought token. So it learns thoughts that are helpful in outputting the correct answer.
That’s definitely RL (and what I was explaining was simply the obvious basic approach anyone in DRL would think of in this context, and so of course there is research trying things like it). It’s being rewarded for a non-differentiable global loss where the correct alternative or answer or label is not provided (not even information of the existence of a better decision), and so standard supervised learning is impossible, requiring exploration. Conceptually, this is little different from, say, training a humanoid robot NN to reach a distant point in fewer actions: it can be a hard exploration problem (most sequences of joint torques or actions simply result in a robot having a seizure while lying on the ground going nowhere), where you want to eventually reach the minimal sequence (to minimize energy / wear-and-tear / time) and you start by solving the problem in any way possible, rewarding solely on the final success, and then reward-shape into a desirable answer, which in effect breaks up the hard original problem into two more feasible problems in a curriculum: ‘reach the target ever’, followed by ‘improve a target-reaching sequence of actions to be shorter’.
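As a toy rendering of that two-problem decomposition (my own illustration, not from any particular paper): reward is pure task success at first, and only once the policy succeeds reliably does a slowly ramped per-action penalty start favoring shorter sequences:

```python
def shaped_reward(reached_target: bool, num_actions: int, step_penalty: float) -> float:
    # Phase 1 uses step_penalty = 0.0: the only thing that matters is reaching the target.
    # Phase 2 ramps step_penalty up slowly, so among successful trajectories the shorter
    # ones score higher, without ever making "do nothing" beat an actual success.
    return (1.0 if reached_target else 0.0) - step_penalty * num_actions

def penalty_schedule(round_idx: int, start: int = 100, ramp: float = 1e-4,
                     cap: float = 1e-2) -> float:
    # Zero until the success-only phase has converged, then a slow linear ramp with a cap
    # kept well below 1 / (typical episode length), so success always dominates the penalty.
    return min(cap, max(0, round_idx - start) * ramp)
```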
Two months later, I tried actually implementing a nontrivial conversion of a natural-language mathematical argument into a fully formalized Lean proof, in order to check whether I was indeed underestimating the difficulty (TBH, I had never used a proof assistant before).
So I took a difficult integral from a recent MathSE question that I couldn’t solve analytically myself, had Gemini 2.5 Pro solve it 0-shot,[1] verified the result numerically, set up a Lean environment in Google Colab, and then asked another instance of Gemini 2.5 whether it could convert the solution into a proof. It told me that this is indeed hard:
This is not a trivial task. Here’s why:
Informal vs. Formal: My natural language explanation, while aiming for clarity, likely skipped steps, relied on implicit assumptions (like function continuity, differentiability, domain constraints), or used intuitive leaps that Lean demands be made explicit and rigorously justified using defined theorems and axioms.
Library Navigation: Finding the exact theorems in mathlib4 that correspond to each step (e.g., the correct version of integration by parts, substitution, limit theorems, properties of specific functions) requires familiarity with the library.
Side Conditions: Every theorem in Lean (like the chain rule or integration by parts) has precise side conditions (e.g., f is differentiable, g’ is continuous, the function is integrable). The natural language proof might not have explicitly stated or verified all of these, but the Lean proof must.
Calculations: Even seemingly simple algebraic manipulations or derivative calculations need to be carried out using Lean’s tactics (ring, linarith, simp, rw, etc.) or proven step-by-step.
Proof Structure & Tactics: Structuring the proof correctly in Lean and using the appropriate tactics (apply, exact, calc, by, etc.) to guide the prover is a skill in itself.
My Limitations: While I can generate code snippets, generating a complete, correct, and non-trivial formal proof interactively is currently beyond my capabilities. It often requires a human expert to guide the process, debug errors, and find the right lemmas.
<...>
It is highly unlikely that I can produce a complete, automatically verifiable Lean proof for a “tricky” integral directly from a natural language description. However, if you provide the details, I can attempt to sketch out the Lean concepts involved, which could be a starting point for someone (perhaps you in Colab, with time and learning) to build the actual proof.
Gemini and I weren’t able to set up mathlib4 in Lean 4 and I gave up on the task, but just by looking at the solution Gemini listed the following problems[2] (I put it here as a screen capture instead of a proper collapsible section because I couldn’t figure out how to copy-paste the formulas correctly):
To sum up: yes, I did underestimate the difficulty of the task; it is certainly beyond the reach of current SOTA LLMs.
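To give a flavor of what the formal side demands, here is a minimal sketch of the kind of statement involved, assuming a working mathlib4 install; the lemma name integral_id and the tactic call are my best guesses and may well need adjusting:

```lean
import Mathlib

open intervalIntegral

-- Even ∫ x dx over [0, 1] is not "obvious" to Lean: it has to be rewritten with a
-- library lemma (integral_id : ∫ x in a..b, x = (b ^ 2 - a ^ 2) / 2) and the leftover
-- arithmetic closed by a tactic. A genuinely tricky integral multiplies this by every
-- substitution, side condition, and convergence argument in the informal solution.
example : ∫ x in (0:ℝ)..1, x = 1 / 2 := by
  norm_num [integral_id]
```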
However, since this type of task is verifiable in silico and very convenient for synthetic training-data generation, I believe the Google folks behind AlphaGeometry will probably solve this problem within a year or two.
The fact that an LLM solved it 0-shot is notable in its own right, BTW. Generally, I’d estimate that Gemini 2.5 and o3-mini are able to solve most of the definite integrals posted in MathSE questions. It was very different at the beginning of this year!
I haven’t checked the accuracy of all the generated details, due to lack of competence and time, but I generally expect the outline to be broadly correct.