I agree that it’s a promising direction.
I did actually try a bit of that back in the o1 days. What I've found is that getting LLMs to output formal Lean proofs is pretty difficult: they really don't want to do that. Even when they're not making outright mistakes, they use informal language as connective tissue between Lean snippets, they put in "sorry"s (a placeholder that makes Lean accept a lemma as proven without an actual proof), and otherwise try to weasel out of it.
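To illustrate the `sorry` issue (a toy example, not from any actual model transcript): a file like this type-checks, with `sorry` producing only a warning rather than an error, so the "proof" looks superficially complete:

```lean
-- Toy example: `sorry` closes any goal, so Lean accepts this
-- "theorem" with just a warning, not an error.
theorem looks_proven (a b : Nat) : a + b = b + a := by
  sorry
```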
This should be solvable by fine-tuning, but at the time, there weren't any decent publicly available models fine-tuned for that.
We do have DeepSeek-Prover-V2 now, though. I should look into it at some point. But I'm not optimistic; it sounds like it's doing the same stuff, just more cleverly.
Relevant: Terence Tao does find them helpful for some Lean-related applications.
Yeah, it's less that I'd bet it works now; it's just that whenever it DOES start working, it seems likely it'd be through this mechanism.
⚖ If Thane Ruthenis thinks there are AI tools that can meaningfully help with Math by this point, did they first have a noticeable period (> 1 month) where it was easier to get work out of them via working in lean-or-similar? (Raymond Arnold: 25% & 60%)
(I had a bit of an epistemic rollercoaster making this prediction. I updated to "by the time someone makes an actually worthwhile Math AI, even if Lean was an important part of its training process, it's probably not that hard to do additional fine-tuning that gets it to output stuff in a more standard mathy format." But, then, it seemed like it was still going to be important to quickly check the output wasn't blatantly broken as part of the process.)
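For what it's worth, one cheap version of that "check it isn't blatantly broken" step exists already: Lean's `#print axioms` command lists everything a proof depends on, and a smuggled-in `sorry` shows up as a dependence on the `sorryAx` axiom. A minimal sketch (hypothetical theorem name):

```lean
theorem suspicious_claim (a b : Nat) : a + b = b + a := by
  sorry

-- Ask Lean what the proof actually rests on; any `sorry` in the
-- proof surfaces as a dependence on the `sorryAx` axiom.
#print axioms suspicious_claim
-- output: 'suspicious_claim' depends on axioms: [sorryAx]
```

That only catches the `sorry` failure mode, of course, not informal-prose connective tissue, but it's the kind of quick automated sanity check the prediction was gesturing at.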