Results from MIRI’s December workshop

Last week (Dec. 14-20), MIRI ran its 6th research workshop on logic, probability, and reflection. Writing up mathematical results takes time, and in the past, it’s taken quite a while for results from these workshops to become available even in draft form. Because of this, at the December workshop, we tried something new: taking time during and in the days immediately after the workshop to write up results in quick and somewhat dirty form, while they still feel fresh and exciting.

In total, there are seven short writeups. Here’s a list, with short descriptions of each. Before you get started on these writeups, you may want to read John Baez’s blog post about the workshop, which gives an introduction to the two main themes of the workshop.

Theme 1: Scientific induction in mathematics

One of the main themes was using Bayesian probability to represent uncertainty about mathematical statements. Like human mathematicians, an AI will be able to outright prove or disprove many mathematical statements, but there will also be many that it will be uncertain about, and the obvious thing to try to get a handle on dealing with such uncertainty is to assign a probability to each such statement. This would mean choosing some sort of prior, and then updating on evidence: For example, if you’re not sure whether the twin prime conjecture is true, then each time you discover a new twin prime larger than all that you have seen before, you should ever so slightly increase the probability you assign to the conjecture.

But what sort of prior should we choose? That’s the problem that people at the December workshop tried to make some progress on. Here is an interesting problem which the best previous proposal, due to Abram Demski, fails on: Suppose that Q(x) is a predicate symbol, and suppose that you find evidence that Q(x) is true of exactly 90% of all numbers between x=1 and x=10^100. Then we would expect that if you plug in some arbitrary number in this range, say n = floor(7^7^7^7 * pi) mod 10^100, then the posterior probability of Q(n), after conditioning on your evidence, would be 0.9. But it turns out that in Demski’s proposal, the probability of Q(n) would be approximately 0.5.

To learn more about this, see “Scientific Induction in Probabilistic Mathematics”, written up by Jeremy Hahn. Jeremy writes: “At the workshop, after pinning down the above example of undesired behaiviour we turned to other proposals for priors. None of the ideas presented are in a polished enough form to be considered a complete proposal, but we are very happy with the progress that was made.”

Theme 2: The “procrastination paradox”

The other main theme of the workshop was the Löbian obstacle to self-modifying AI, and more specifically, something we’re calling the “procrastination paradox”. One approach for constructing a self-modifying AI based on mathematical logic would be to require the AI to prove, whenever it wants to take an action—be it an external action or making some modification to its own source code—that this action is “safe”. Then you would naively expect the AI to be able to reason as follows: “The following simple self-modification X is safe, because the modified AI will still only take actions that it has proven to be safe, so it will only take safe actions.” But by Löb’s theorem, a consistent system T which is as strong as Peano Arithmetic cannot go from “whatever the modified AI does will have been proven (in T) to be safe” to “whatever the modified AI does will in fact be safe”. (See here for discussion of why it seems reasonable to tackle self-modifying AI from the perspective of mathematical logic, even though a realistic AI will certainly need to use probability to reason about the physical world.) In particular, we cannot have a consistent system T as strong as PA that proves “If S is provable in T, then S” for all sentences S.

If we can’t have a system T that trusts itself in this way, maybe we can have an infinite sequence of consistent systems T_n such that each T_n proves “If S is provable in T_(n+1), then S”, for all sentences S? Marcello (Herreshoff) has shown that in fact there is such a sequence, explained in his paper with Eliezer on the Löbian obstacle. However, although it is consistent (meaning that it doesn’t prove everything; in particular, it doesn’t prove 0=1), this system is unsound (meaning that it proves some things which aren’t actually true). We now have an intuition for why this is so, and a corresponding formal proof that any sequence like this must be unsound: Imagine a sequence of agents A_n, using the systems T_n, whose goal it is that a particular button gets pressed—they don’t care when it gets pressed, just that it does get pressed, eventually. Each of these agents will try to prove that one of its successors presses the button; if it can’t find such a proof, then it will press the button itself. Now A_n reasons as follows (in T_n):

Either my successor, A_(n+1), will press the button, or it won’t. If it doesn’t press the button, then it must have found a proof that the button gets pressed at some later time. This proof is in T_(n+1), and I know that if something is provable in T_(n+1), then it is true. Thus, either way, the button gets pressed—by my successor, or at some later time.

So A_n has found a proof that the button gets pressed, and therefore doesn’t press the button itself. But since this is true for every n, the button never gets pressed—and therefore, all of the systems T_n are unsound, since they prove a statement which isn’t true. Metaphorically, every time the AI thinks about pressing the button, it convinces itself that there is some future time when it will stop procrastinating and actually press the button, so today it is surely fine to procrastinate—and in this way, the AI procrastinates forever. For more, including formal proofs, see “The Procrastination Paradox”, written up by Eliezer.

However, on the positive side, it turns out that if you modify Marcello’s construction so that T_n merely proves that T_(n+1) is consistent, rather than proving that everything T_(n+1) proves is true, then Marcello’s proof not only shows that the T_n are all consistent, but also that they are sound. (This is really easy to see: T_n is PA plus the axiom “psi(n) → Con(T_(n+1))”, for some formula psi(n). Marcello’s proof shows that all of these systems T_n are consistent. But then, since T_(n+1) is consistent, “psi(n) → Con(T_(n+1))” is true, so all axioms of T_n are true and T_n is sound.) Moreover, although there are things that T_n proves that T_(n+1) doesn’t—namely, “Con(T_(n+1))”, since if T_(n+1) proved that, it would be inconsistent—you can construct these T_n so that they all have the same proof-theoretic ordinal, so to the extent that you think that proof-theoretic ordinals are a good measure of the “mathematical strength” of a theory, all of the theories T_n can be said to have the same amount of mathematical strength. Finally, although T_n doesn’t prove the Löb schema “If T_(n+1) proves S, then S is true” for all sentences S, it does in fact prove it for a particular class of sentences (the Pi_1 sentences). For details, see “An infinitely descending sequence of sound theories each proving the next consistent”, written up by Benja Fallenstein, a.k.a. myself.

Is it enough to have the Löb schema for Pi_1 sentences? (These are sentences of the form “forall x. phi(x)”, where phi(x) is such that you can write a computer program that takes x as input and outputs whether phi(x) is true or false.) Actually, there’s an argument to be made that it might be exactly right for tiling agents: It seems to be as much as you can get without running into the procrastination paradox, and it’s enough to formalize goals like “The world is not destroyed before time n” (if the world is computable). Moreover, there’s a way to combine this with ideas from a different solution to the tiling agents problem, parametric polymorphism (one version of which is described in the tiling agents paper), in order to also handle goals that don’t as obviously fit into the Pi_1 mold; we call this hybrid of ideas “Fallenstein’s monster”, written up by Nate Soares a.k.a. So8res, because it looks rather grafted together. (In fact, it was even uglier when we came up with it, because back then we didn’t even have the sound version of Marcello’s construction yet. I’m hopeful that we will find prettier solutions to the remaining monstrous aspects as well, so that I will soon be able to ask my trusted research assistant, Igor, to dispose of the monster.)

Finally, we usually talk of a set of equations like “T_n = PA + psi(n) → Con(T_(n+1))” as if they define one particular sequence of theories T_n, but it’s not obvious that different ways of making this formal lead to logically equivalent theories. In fact, I’d have guessed that in general they don’t, but Will Sawin had the opposite intuition and was right; any recursively enumerable set of equations of this type has a unique solution (in the sense that if PA proves that two r.e. lists of theories U_i and V_i are both solutions, then PA proves that they are equivalent). For details, see “Recursively-defined logical theories are well-defined”, written up by Nisan Stiennon. Interestingly, this turns out to be an equivalent formulation of Löb’s theorem!

Odds and ends

In addition to the two main themes, there were some odds and ends: First, it turns out that the 5-and-10 problem, which has been discussed on LW quite a bit, is almost a problem for the formalism in the tiling agents paper; the exact system discussed in the paper narrowly avoids the problem, but a simple and natural variant runs into it. For details, see “The 5-and-10 problem and the tiling agents formalism”, written up by me.

And both last and least, we considered and disposed of a conjecture that a particular version of parametric polymorphism might be able to avoid the problem of losing mathematical strength when an agent using it rewrites itself. This would have been an interesting result if it had been true, so it seems worth recording, but since it turned out to be false, it’s not particularly exciting. See “Decreasing mathematical strength in one formalization of parametric polymorphism”, also written up by me.