# Progress on automated mathematical theorem proving?

In a recent comment thread I expressed skepticism as to whether there’s been meaningful progress on general artificial intelligence.

I hedged because of my lack of subject matter knowledge, but thinking it over, I realized that I do have relevant subject matter knowledge, coming from my background in pure math.

In a blog post from April 2013, Fields Medalist Timothy Gowers wrote:

Over the last three years, I have been collaborating with Mohan Ganesalingam, a computer scientist, linguist and mathematician (and amazingly good at all three) on a project to write programs that can solve mathematical problems. We have recently produced our first program. It is rather rudimentary: the only problems it can solve are ones that mathematicians would describe as very routine and not requiring any ideas, and even within that class of problems there are considerable restrictions on what it can do; we plan to write more sophisticated programs in the future.

**I don’t know of any computer programs that have been able to prove theorems outside of the class “very routine and not requiring any ideas,” without human assistance (and without being heavily specialized to an individual theorem).** I think that if such projects existed, Gowers would be aware of them and would likely have commented on them within his post.

It’s easy to give an algorithm that generates a proof of a mathematical theorem that’s provable: choose a formal language with definitions and axioms, and for successive values of n, enumerate all sequences of mathematical deductions of length n, halting if the final line of a sequence is the statement of the the desired theorem. But the running time of this algorithm is exponential in the length of the proof, and the algorithm is infeasible to implement except for theorems with very short proofs.

**It appears that the situation is not “there are computer programs that are able to prove mathematical theorems, just not as yet as efficiently as humans,” but rather “computer programs are unable to prove any nontrivial theorems.”**

I’ll highlight the Sylow theorems from group theory as examples of nontrivial theorems. Their statements are simple, and their proofs are not very long, but they’re ingenious, and involve substantive ideas. If somebody was able to write a program that’s able to find proofs of the Sylow theorems, I would consider that to be strong evidence that there’s been meaningful progress on general artificial intelligence. In absence of such examples, I have a strong prior against there having been meaningful progress on general artificial intelligence.

I will update my views if I learn of the existence of such programs, or of programs that are capable of comparably impressive original research in domains outside of math.

- 3 Jul 2013 18:40 UTC; 1 point) 's comment on Tiling Agents for Self-Modifying AI (OPFAI #2) by (
- 3 Jul 2013 18:41 UTC; 0 points) 's comment on Tiling Agents for Self-Modifying AI (OPFAI #2) by (

There’s apparently an annual automated theorem proving competition, looking at the kinds of problems there might be useful.

Thanks for the reference.

The space of theorems that are “very routine and not requiring any ideas” is actually pretty interesting. In particular, compilers do quite a lot of theorem-proving of the “these functions are equivalent” variety, and it’s

waymore elaborate than most people imagine.I acknowledge this.

I would

loveto read more about this. Examples? Links?I haven’t studied the field in awhile, but back when I did, I got a lot from Advanced Compiler Design & Implementation by Steven Muchnick.

There is an unfortunate tendency, when teaching about compilers, to teach about components in the order they would be built in a project course, which puts a lot of undue emphasis on the boring solved problem of parsing. The layman’s view of a compiler is that it translates from high-level languages to machine code; the reality is that it translates from a high-level language through a series of intermediate representations, each of which brings out some property that makes it amenable to optimizations; and these optimizations and representations are what make up the bulk of a compiler. A good litmus-test for understanding, is ability to translate a function into static single assignment (SSA) form.

Muchnick’s book is excellent, and we wrote the open source GLSL compiler using it. I wish it was a little more opinionated on how to do things right and avoid wasting your time (Go SSA! Right away! Even if you feel your problem space is special and SSA might not help you!) as opposed to just reporting on all the variations that exist in the wild, but it’s hard to fault it for that when I wish software theory was more grounded in reality in general.

And, yeah, I’m proud to say I still don’t know how to write a lexer or parser. I’ve got flex/bison for that.

For what it’s worth, my compilers professor shared this complaint, and was vocal about it on the first day. He structured his class so that he gave us a (nearly) working compiler for a simple arithmetic language (with lex/yacc). Throughout the semester, we gradually added features to the language it compiled until it looked more like C. By giving us the BNF for the grammar in each assignment, expanding the given parser was trivial. It did mean that we were guided towards maintaining his pipeline stages instead of devising our own, but they changed more and more throughout the various projects. This also meant that poor design early on lead to a great deal of refactoring in later projects.

|A kind of counter-example to your claim is the following: http://www.math.rutgers.edu/~zeilberg/GT.html It is an automated reasoning system for Euclidean geometry. Starting from literally nothing, it derived all of the geometric propositions in Euclid’s Elements in a matter of seconds. Then it proceeded to produce a number of geometric theorems of human interest that were never noticed by any previous Euclidean geometers, classical or modern.

This is simply to point out that there are some fields of math that are classically very hard but computers find trivial. Another example is the verification of random algebraic identities by brute force.

On the other hand, large fields of mathematics have not yet come to be dominated by computers. For those I think this paper is a good introduction to some state-of-the-art, machine-learning based techniques: http://arxiv.org/abs/1108.3446 One can see from the paper that there is plenty of room for machine learning techniques to be ported from fields like speech and vision.

Progress in machine learning in vision and speech has recently been driven by the existence of huge training data-sets. It is only within the last few years that truly large databases of human-made proofs in things like set theory or group theory have been formalized. I think that future progress will come as these databases continue to grow.

This is interesting.

It’s hard for me to assess it from the outside. In particular, I don’t have a good sense for the number of sequences of logical derivations one has to consider in order to arrive at the theorems that were proved if one were to proceed by brute force. I find it more interesting that it honed in on classical theorems on its own than that it was able to prove them (one can use coordinate geometry to reduce proofs to solving polynomial equations).

I think that it’s significant that Euclidean geometry fell out of fashion a long time ago: the fraction of modern mathematicians who think about Euclidean geometry is negligible, and this may reflect an accurate assessment of the field as mathematically shallow. I didn’t appreciate geometry until I learned about discrete groups of isometries acting on homogeneous Riemannian manifolds.

Thanks for the reference

Howmuch future progress? :-)Where do you get these detailed claims about the program? I don’t see anything like them in the book or here or here

Just to point out the obvious, your first link certainly does say that

Oh, yeah, I missed that Zeilberger used the phrase “from scratch,” so that explains why JH says “from literally nothing” (though I disagree), but why did you quote the rest of the passage? Do you think it suggests any other of JH’s claims?

I quoted the rest so that it was clear that the key phrase ‘from scratch all the statements’ was referring to a computational, implemented, software package for Euclidean / plane geometry in particular.

Yes this approach is valid, but modern theorem provers more commonly reduce theorem proving to a sequence of SAT problems. Very roughly, for a first order sentence P, the idea is to search for counterexamples alternately to P and ~P in models of size 1,2,… . SAT solvers

haveimproved rapidly over the past decade (http://www.satcompetition.org/), though they are still not good enough to facilitate theorem provers that can solve interesting math problems.I highly recommend the concise and excellent “Handbook of Practical Logic and Automated Reasoning” http://www.cl.cam.ac.uk/~jrh13/atp/

The only novel theorem I can recall being solved by a general artificial theorem prover is the Robbins conjecture. I’m sometimes surprised this area doesn’t attract more attention, since proving theorems is very similar to writing programs and automated program writers would be extremely useful.

I disagree. You have to specify the program that you want to write. Essentially, it means that you can use constraint programming instead of whatever other programming paradigm you’d use. I’m not convinced that that’s substantially easier.

Constraint programming would be extremely powerful. They aren’t a silver bullet, but they would make a ton of hard problems a lot easier. Particularly when it comes to proving whole-program correctness properties like “does not use more than X memory”.

For example, when a human programmer is designing a public key cryptosystem, they’re trying to satisfy properties like “there is no tractable Turing Machine that takes a public key as input and outputs the corresponding private key with non-negligible probability”. Doing that is

really, really hardcompared to just writing down what we want to satisfy.But that’s going to take an insanely powerful theorem-prover. Also, this is more the theorem-proving end than automated program end.

It’s true that it would need to be insanely powerful. Even clever humans haven’t managed to prove that one-way hash functions exist (though we know a function that is one-way if and only if they exist...).

Note theorem proving is isomorphic to programming. Even if it wasn’t quite so equivalent, you could just ask for a constructive proof of a program satisfying whatever and the proof would contain the program.

The Ganesalingam-Gowers project is not about the cutting edge of automated theorem proving and there is no reason to expect that blog post to talk about the cutting edge. The contribution of that project is to make the input theorem and output proof be human-readable.

This is a fair point. But what

isthe cutting edge? Are there counterexamples to my central claim?Current theorem provers don’t have a “sense of direction”.

From the description of Polya’s Mathematics and Plausible Reasoning: Vol. II: Patterns of Plausible Inference:

Current theorem provers search for a proof, and even contain some ad hoc heuristics for exploring their search trees. (For example, Paulson’s book ML for the working programmer ends with a chapter on theorem proving and offers “We have a general heuristic: never apply universal:left or existential:right to a goal if a different rule can usefully be applied.) However, Polya’s notion of being a good guesser has been formalised as Bayesian reasoning. If a theorem prover, facing a fork in its search tree, uses Bayes Theorem to pick the most likely branch first, it may find much deeper results.

I don’t think there is currently much overlap between those working on theorem proving and those working on Bayesian statistics. There is perhaps even a clash of temperaments between those who seek absolute certainty and those who seek an edge in the messy process of muddling through. Nevertheless I foresee great possibilities of using Bayesian reasoning to guide the internal search of theorem provers, thus giving them a sense of direction.

Reasoning “X hasn’t been progressing recently, therefore X is unlikely to progress in the near future.” is good reasoning if that is really all that we know. But it is also weak reasoning which we should be quick to discard if we see a relevant change looming. The next ten or twenty years might see the incorporation of Bayesian reasoning as a guiding principle for tree search heuristics inside theorem provers and lead to large advances.

What I question is whether the Bayesian reasoning algorithms are at all computationally feasible to implement to solve nontrivial problems. See the final two paragraphs of my comment here about computational complexity. Do you have evidence in the other direction?

No. I think one typically has to come up with a brutally truncated approximation to actually Bayesian reasoning. For example, if you have

npropositions, instead of considering all 2^nbasic conjunctions, ones first idea is to assume that they are all independent. Typically that is a total failure; the independence assumption abolishes the very interactions that were of interest. So one might let propositionndepend on propositionn-1 and reinvent Markov models.I don’t see much hope of being able to anticipate which, if any, crude approximations to Bayesian reason are going to work well enough. One just has to try it and see. I don’t think that my comment goes any deeper than saying that there are lots of close to practical things due to be tried soon, so I expect one or two pleasant surprises.

Ok, thanks for the clarification

Bayesian probability theory tells you how to process evidence as well as possible. If you know what’s evidence of a correct path, you can make it into an ad-hoc hueristic more easily than a part of a Bayesian update. Seems like the real insight required is to figure out what’s evidence of a correct path.

You’ve put your finger on a weakness of my optimistic vision. If the guesses are calling it 90% of the time, they significantly extend the feasible depth of search. But 60:40? Meh! There is a lot of room for the insights to fail to be sharp enough, which turns the Bayesian stuff into CPU-cycle wasting overhead.

That strikes me as like saying, “Until this huge task is within a hair’s breadth of being complete, I will doubt that any progress has been made.” Surely there are more intermediate milestones, no?

It’s both unclear to me that there are intermediate milestones and that getting to the point of a computer being a generate proofs of the Sylow theorems brings you within a hair’s breadth of being complete.

I’m way out of my field here. Can you name some intermediate milestones?

Well, I have no knowledge of the Sylow theorems, but it seems likely that if any system can efficiently generate short, ingenious, idea-based proofs, it must have something analogous to a mathematician’s

understanding.Or at least have a mechanism for formulating and relating

concepts, which to me (admittedly a layman), sounds like the main challenge for AGI.I suppose the key scenario I can imagine right now where an AI is able to conduct arbitrary human-level mathematical reasoning, but can not be called a fully general intelligence, is if there is some major persistent difficulty in transferring the ability to reason about the purely conceptual world of mathematics to the domain of the physically real world one is embedded within. In particular (inspired by Eliezer’s comments on AIXI), I could imagine there being difficulty with the problem of locating oneself within that world and distinguishing oneself from the surroundings.

“If people do not believe that mathematics is simple, it is only because they do not realize how complicated life is.”—John von Neumann.

See also the quotation in this comment.

Ha, I’ve seen that quote before, good point! (and likewise in the linked comment) I suppose one reason to think that mathematical reasoning is close to AGI is that it seems similar to programming. And if an AI can program, that seems significant.

Maybe a case could be made that the key difficulty in programming will turn out to be in formulating what program to write. I’m not sure what the analogue is in mathematics. Generally it’s pretty easy to formally state a theorem to prove, even if you have no idea how to prove it, right?

If so, that might lend support for the argument that automated general mathematical reasoning is still a ways off from AGI.

The mathematical counterpart may of the “recognizing important concepts and asking good questions” variety. A friend of my has an idea of how to formalize the notion of an “important concept” in a mathematical field, and possible relevance to AI, but at the moment it’s all very vague speculation :-).

It’s pretty easy for a human of significantly above average intelligence. That doesn’t imply easy for an average human or an AI.

As for intermediate milestones, I think all the progress that has been made in AI (and neuroscience!) for the last half-century should count. We now know a whole lot more about AI and brains than we used to.

EDIT: To be more specific, I think such feats as beating humans at chess and Jeopardy! or being able to passably translate texts and drive cars are significant.

Future milestones might include:

beating the Go world champion

significantly better machine translations

much more efficient theorem provers

strong general game players

I could see all of those happening without being able to replicate all human-generated mathematical proofs. But it would still seem like significant progress has been made.

I agree that there exist intermediate milestone. The question is how far the ones that have been surpassed are from the end goal. The relevant thing isn’t how much we understand relative to what we used to know, but how much we understand relative to what’s necessary to build a general artificial intelligence. The latter can be small even if the former is large.

That’s fair.

Given that I think that capturing the ingenuity of human mathematicians is one of the last things I’d expect to see before seeing human-level-or-smarter AI, the poor state of automated theorem proving doesn’t seem to me like much evidence on how far away GAI might be.

Why is the ingenuity of human mathematicians one of the last things that you’d expect to see before seeing human-level-or-smarter AI? My intuition is that it’s one of the earlier things that you’d expect to see. Mikhail Gromov wrote:

That’s a different meaning of the term “mathematical ability”. In this context, you should read it as “calculating ability”, and computers are pretty good at that—although still not as good as our brains.

It was not intended to imply that low-level brainware is any good at abstract mathematics.

Where did you get your interpretation from, and why do you think that yours is more accurate than mine? :-)

I believe that he was referring to the brain’s

pattern recognition abilityrather thancalculating ability. Existing supercomputers have a lot of calculating ability, but they can’t recognize five-fold symmetry regardless of a particular size, shape or color of an object.Are you sure? This sounds possible.

Possible in principle, but my understanding of the current state of AI is that computer programs are nowhere near being able to do this.

Are you saying we can’t make programs that would identify portions of an image that are highly fold fold symmetric? This seems really unlikely to me.

Some looking turns up a paper on Skewed Rotation Symmetry Group Detection which appears to do this.

This is interesting to me, and I had not known that such research has been done.

I’ve heard that there’s a consistent problem in machine learning of people overtraining their algorithms to particular data sets. The diversity of examples in the paper appears to be impressive, but it could be that the algorithm would break if given images that would appear to us to be qualitatively similar to the ones displayed.

I think that Gromov may not have expressed himself very clearly, and his remarks may not have been intended to be taken literally. Consider the many starfish in this picture. By looking at the photo, one can infer that any given star-fish has five-fold symmetry with high probability, even though some of the ones in the distance wouldn’t look like they had five-fold symmetry (or even look like star-fish at all) if they were viewed in isolation. I don’t think that existing AI has the capacity to make these sorts of inferences at a high level of generality.

I think #3 is the real issue. Most of the starfishes in that picture aren’t 5-fold symmetric, but a person who had never seen starfish before would first notice “those all look like variations of a general form” and then “that general form is 5-fold symmetric”. I don’t know of any learning algorithms that do this, but I also don’t know what to search for.

So you’re probably right that it’s an issue of “pattern recognition ability”, but it’s not as bad as you originally said.

Also note that the Sylow theorems are, in some sense, a pretty low bar. Compare with the Feit-Thompson Theorem, which has a proof running 250 pages, and which builds on previously known results. I have a friend who’s a finite group theorist who said that in many cases he struggled to verify the truth of individual lines of the proof.

Why do you think so?

Verifying proofs and writing programs are isomorphic. A smarter-than-human theorem prover is likely to also be a better-than-human programmer, and you can ask it to do things like find good theorem provers...

More or less what Strilanc said, and not what gjm said.

The difference between chess and math is that a superhuman-at-theorem-proving AI is a lot more useful for speeding up further AI research than a superhuman-at-chess-playing AI would be. (Not to mention the chance that being better-than-human at mathematical intuition already unlocks enough munchkinning of physics to take over the world directly.)

Because doing mathematics well is something that takes really exceptional brainpower and that even most very intelligent people aren’t capable of.

Just like chess.

I’d like to make explicit the connection of this idea to hard takeoff, since it’s something I’ve thought about before but isn’t stated explicitly very often. Namely, this provides some reason to think that by the time an AGI is human-level in the things humans have evolved to do, it will be very superhuman in things that humans have more difficulty with, like math and engineering.

I’m against sarcasm as a way of making an argument; it makes it less pleasant to discuss things here, and it can hide a bad argument.

First of all: I hereby apologize to

orthonormalif s/he felt insulted or belittled or mocked by what I wrote. That was in no way my intention.I’m not sure “sarcasm” is quite the right term, though. I really do think that, among human beings, doing mathematics well takes exceptional brainpower, and even most very intelligent people can’t do it; I really do think that playing chess well has the same feature; and I really do think that that feature is why

orthonormalexpects doing mathematics well to be one of the last achievements of AI before general superhumanness.If my last paragraph had said “I don’t think that’s very good evidence, though; you could say the same about playing chess well, and we all know what computers have done to

that” then it would have been clearer and that might have been worth the loss of brevity. But I’m not suresarcasmis the failure mode (if failure it be) here.Following up on JoshuaZ’s comment, humans brains aren’t at all optimized for doing higher math — it’s a mystery that humans can do higher math at all. Humans optimized for things like visual processing and adapting to the cultures that they grow up in. So I would expect human-level math AI to be easier than human-level visual processing AI.

What do you mean by “really exceptional brainpower”? And what do you mean by “doing mathematics well” ?

What I meant was: (1) empirically it seems that very few human beings are good at proving theorems (or even following other people’s proofs), (2) being good at proving theorems seems to correlate somewhat with other things we think of as cleverness, (3) these facts are probably part of why it seems to

orthonormal(and, I bet, to lots of others) as if skill in theorem-proving would be one of the hardest things for AI to achieve, but (4) #1 and #2 hold to some extent for other things, like being good at playing chess, that also used to be thought of as particularly impressive human achievements but that seem to beeasierto make computers do than all sorts of things that initially seem straightforward.All of which is rather cumbersome, which is why I put it in the elliptical way I did.

Heh, I missed the irony :-)

I didn’t, but realized someone could, hence why my other comment started off with the words “more explicitly” to essentially unpack gjm’s remark.

More explicitly, there are a lot of things that humans can do easily that seem to be extremely difficult for AI (vision and language are the most obvious), and there are many things that humans have trouble with that we can easily make AI do a decent job. At some level, this shouldn’t be that surprising because a lot of the things humans find really easy are things we have millions of years of evolution optimizing those procedures.

The Prover company is working on the safety of train signalling software. Basically, they seek to prove that a given program is “safe” along a number of formal criteria. It involves the translation of the program in some (boolean based) standard form, which is then analysed.

The formal criteria are chosen manually, but the proofs are found completely automatically.

Despite the sizeable length of the proofs, combinatorial explosion is generally avoided, because programs written by humans (and therefore their standard form translation) tend to have shapes that lend them amenable to massive cuts in the search tree.

It doesn’t always work: first, the criteria are simple and bounded. Second, combinatorial explosion sometimes does occur, in which case human-devised tweaks are needed.

Oh, and it’s all proprietary. Maybe there’s some related academic papers, though.

I don’t think this is what Jonah is talking about. This is just one of thousands of examples of formal verification of safety-critical software.

The key part is that some of those formal verification processes involve

automatedproof generation. This is exactly what Jonah is talking about:Those who make (semi-)automated proof for a living have a vested interest in making such things as useful as possible. Among other things, this means as automated as possible, and as general as possible. They’re not there yet, but they’re definitely working on it.

If you expand your net outside of pure symbolic manipulation, there has been a lot of progress in general artificial intelligence. There is no evidence that a practical mathematical theorem prover, if ever created, will work using pure symbolic manipulation approaches.

EDIT: I’m curious to know why the person who down-voted me did so.

Nitpick: This only works if the chosen formal system only contains rules that produce longer theorems.

Edit: Oops, looks like I misread the quote. This method will enumerate all theorems, but not (as I first interpreted it) decide if a given string is a theorem in finite time.