Alignment proposals and complexity classes

In the original “AI safety via debate” paper, Geoffrey Irving et al. introduced the concept of analyzing different alignment proposals from the perspective of what complexity class they are able to access under optimal play. I think this is a pretty neat way to analyze different alignment proposals—in particular, I think it can help us gain some real insights into how far into the superhuman different systems are able to go. Thus, the goal of this post is to try to catalog different alignment proposals based on the metric of what complexity class they have so far been proven to access.

To do that, I have included a variety of new complexity class proofs in this post. Of particular note, I demonstrate that there exist forms of both imitative amplification and AI safety via market making that reach all the way up to R—which is significant given that the largest complexity class that any alignment proposal was known to access previously was NEXP. Only the forms of amplification and market making that make use of pointers (as in strong HCH), however, can access R—for the pointer-less versions, I demonstrate in this post that they access PSPACE and EXP, respectively. The EXP proof for market making is also particularly notable, as market making is the only approach on my list that ends up in that complexity class. Additionally, I also demonstrate that recursive reward modeling can reach all the way to PSPACE, improving upon the previous best result in “Scalable agent alignment via reward modeling” that it accesses NP.

Before I jump in, however, some preliminaries. First, we’ll assume that a human, $H$, is polynomial-time such that $H$ can reliably solve any problem in P but not anything beyond that. Second, we’ll assume that our training procedure and resulting models are arbitrarily strong in terms of what complexity class they can access. Third, we’ll assume that $H$ gets oracle access to the models during training. Then, we’ll say that a proposal to train a model $M$ using a loss function $\mathcal{L}$ accesses a complexity class $C$ iff, for any language $L \in C$, there exists some strategy available to $H$ such that, for any $M$ which is optimal under $\mathcal{L}$ given $H$’s strategy, $M$ correctly decides membership in $L$. Thus, conceptually, a proposal accesses $C$ if there is a (polynomial-time) strategy that you (a human) can implement such that—conditional on you knowing that the model is optimal—you would trust the model’s output for any problem in $C$. Note that that is not the same as saying that a polynomial-time human would actually be able to verify that the result is correct—only that it will always be correct at optimum. Note that these assumptions are just generalizations of those used in “AI safety via debate.” Irving et al. actually note that, if you don’t imagine optimal play and simply restrict to the set of problems that a polynomial-time human can actually verify, debate only reaches NP rather than PSPACE.
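
Stated a bit more symbolically (this formalization is my own paraphrase of the definition above—the strategy symbol $\sigma_H$ and the indicator $[q \in L]$ are not notation from the original post):

$$\mathcal{L} \text{ accesses } C \iff \forall L \in C,\ \exists\, \sigma_H \in \text{poly-time}: \ \forall M \in \operatorname*{argmin}_{M'} \mathcal{L}_{\sigma_H}(M'),\ \forall q:\ M(q) = [q \in L]$$

where $\mathcal{L}_{\sigma_H}$ is the loss induced by $H$ playing strategy $\sigma_H$.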

Alignment proposals by complexity class

Without further ado, here’s my list of alignment proposals grouped by what complexity class they access. All of the proofs below are only lower bounds rather than upper bounds, so the proposals could be stronger than is noted here, but shouldn’t be weaker.

EDIT: This list is now out of date. See “Weak HCH accesses EXP” for the updated version.

P: Imitation learning (very straightforward—given $H \in \text{P}$, optimal imitation of $H$ will also be in P)

PSPACE: AI safety via debate (proof), Imitative amplification with weak HCH (proof below), Approval-based amplification (proof below), Recursive reward modeling (proof below)

EXP: AI safety via market making (proof below)

NEXP: Debate with cross-examination (proof)

R: Imitative amplification with strong HCH (proof below), AI safety via market making with pointers (proof below)

Proofs

Note that while I am reasonably confident that my proofs are correct, I’m not a complexity theorist, and as a result I suspect that many of them are a lot uglier than they need to be. In particular, because I didn’t feel like I was super confident in my understanding of different PSPACE-complete and EXP-complete problems, I just did everything with Turing machines instead.

Imitative amplification with weak HCH accesses PSPACE

In general, we’ll define imitative amplification to be the procedure which trains $M : Q \to \Delta(A)$ (where $Q$ is some set of questions and $\Delta(A)$ denotes distributions over the set of answers $A$) on

$$\mathcal{L} = \mathbb{E}_{q \sim \mathcal{D}}\!\left[-\log P_{M(q)}\!\left(H^M(q)\right)\right]$$

where $P_{M(q)}$ is the probability density function for $M(q)$ evaluated at $H^M(q)$ and $\mathcal{D}$ is some distribution with support over all of $Q$. Then, at optimal loss, we get

$$\forall q \in Q, \quad M(q) = \operatorname*{argmin}_{M(q)} \mathcal{L}_q = \delta\!\left(H^M(q)\right)$$

where $\mathcal{L}_q$ is the loss at data point $q$, such that $M(q)$ is a delta function probability density on $H^M(q)$, in which case we will just ignore the probability distribution and let $M(q) = H^M(q)$.

What we have left unspecified here, however, is the format of $M$’s output—which matters quite a lot, as it makes the difference here between weak HCH and strong HCH and thus the difference between PSPACE and R.

For weak HCH, we will let $Q$ and $A$ in $M : Q \to \Delta(A)$ be the set of all finite strings in natural language. Thus, since $H$ is polynomial-time, that restricts the output of $H^M$ to always be polynomial in size.

Now, we will show that this procedure accesses PSPACE. PSPACE is defined as the set of all problems that can be solved by Turing machines using $p(n)$ space, where $n$ is the input size (in this case the question length) and $p$ is some polynomial in $n$.

Thus, for some language $L \in \text{PSPACE}$, let $W_L$ be some Turing machine with state space $S$ which computes the answer to the decision problem for any $q$ (that is, whether $q \in L$) in space $p(n)$, where $n$ is the size of $q$. Then, we will let $H$’s strategy here be as follows (a short Python sketch of this strategy is given after the list):

  1. Given a question $q$, return $M(s_0 \cdot \square \cdot q)$ where $s_0$ is the initial state of $W_L$, $\square$ is the empty tape symbol, and $\cdot$ is string concatenation—that is, we start $W_L$ in its initial state with its head over an empty square, an empty left tape, and $q$ written to the right of the head.

  2. Given an input of the form $s \cdot x \cdot l \cdot r$, run one step of $W_L$ starting from state $s$ using $x$ as the symbol under the head, $l$ as the tape to the left of the head, and $r$ as the tape to the right of the head. This results in some new $s'$, $x'$, $l'$, and $r'$. If $s'$ is an accept or reject state, return accept or reject. Otherwise, return $M(s' \cdot x' \cdot l' \cdot r')$.
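
As a concrete illustration, here is a minimal Python sketch of this strategy under some assumptions of mine: the toy transition table, the tuple encoding of $M$’s input, and all helper names below are illustrative stand-ins rather than anything specified in the original setup.

```python
# Sketch of H's weak-HCH strategy: behavior 1 starts the simulation of W_L,
# behavior 2 advances it by a single step and delegates the rest to M.
# At optimum M imitates H^M, so the recursion unrolls the whole computation.

EMPTY, ACCEPT, REJECT = "_", "accept", "reject"

# Toy W_L deciding "does the input contain an odd number of 1s?".
# Keys are (state, symbol); values are (symbol to write, head move, next state).
TRANSITIONS = {
    ("start", EMPTY): (EMPTY, "R", "even"),
    ("even", "0"): ("0", "R", "even"), ("even", "1"): ("1", "R", "odd"),
    ("odd", "0"): ("0", "R", "odd"),   ("odd", "1"): ("1", "R", "even"),
    ("even", EMPTY): (EMPTY, "R", REJECT), ("odd", EMPTY): (EMPTY, "R", ACCEPT),
}

def step(state, head, left, right):
    """One step of W_L on the configuration (state, symbol under head, left tape, right tape)."""
    write, move, new_state = TRANSITIONS[(state, head)]
    if move == "R":
        return new_state, (right[0] if right else EMPTY), left + write, right[1:]
    return new_state, (left[-1] if left else EMPTY), left[:-1], write + right

def H(question, M):
    """H's strategy: behavior 1 on a plain question, behavior 2 on a configuration."""
    if isinstance(question, str):               # behavior 1: kick off the simulation
        return M(("start", EMPTY, "", question))
    state, head, left, right = step(*question)  # behavior 2: advance one step
    if state in (ACCEPT, REJECT):
        return state
    return M((state, head, left, right))

# At optimum M computes the fixed point of H^M, so replacing M with that fixed
# point just runs W_L to completion:
def M_opt(question):
    return H(question, M_opt)

print(M_opt("1101"))  # -> "accept" (three 1s, an odd number)
```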

Now, we will demonstrate that this strategy is polynomial time:

  1. For the first type of behavior, the only actions that $H$ must take are copying $q$ into $M$’s input, which takes $O(n)$ time (since $n$ is defined as the size of $q$), and copying $M$’s output into $H$’s output, which takes time proportional to the size of $M$’s output. Since we are answering a decision problem, $M$’s output should be boolean, making copying it constant time. Thus, behavior 1 is implementable by a polynomial-time $H$.

  2. For the second type of behavior, computing the new state and transition is constant time, copying the state is constant time as $W_L$ has a constant number of states, and copying the new head symbol is constant time as there are a constant number of tape symbols. However, updating and copying $l$ and $r$ takes $O(|l| + |r|)$ time. Fortunately, however, since we know that $W_L$ only uses polynomial space, we know $|l| + |r| = O(p(n))$. Thus, behavior 2 is also implementable by a polynomial-time $H$.

Finally, we will demonstrate that an optimal $M$ under this strategy successfully solves $L$. We know that at optimum $M$ will compute the fixed point of $H^M$, and thus we simply need to show that $H^M$ as a recursive function—where calls to $M$ are replaced by recursive calls to $H^M$—will successfully solve $L$.

First, we will show that $H^M(s \cdot x \cdot l \cdot r)$ will always produce the result computed by $W_L$ starting from state $s$ on tape $(x, l, r)$ if $W_L$ halts starting from that position. We will do this by induction on $T$, where $T$ is the number of steps until $W_L$ halts starting from the given position. Note that in all cases here we only need to deal with the second type of behavior, as it is the only one that will be called on inputs of the form $s \cdot x \cdot l \cdot r$.

For the base case, if $W_L$ halts in a single step (that is, its next transition enters an accept or reject state), then $H^M$ will simply return accept or reject, as desired.

For the inductive step, $H^M$ just returns $H^M(s' \cdot x' \cdot l' \cdot r')$, which is an input on which $W_L$ should be one step closer to halting, and thus by induction should return the correct output for $W_L$.

Finally, since $H^M(q)$ just returns the result of $H^M(s_0 \cdot \square \cdot q)$, and we know $W_L$ halts on that configuration since $W_L$ decides $L$, we get that $H^M(q)$ will always return the result of $W_L(q)$, as desired. Thus, since this procedure works for any $L \in \text{PSPACE}$, we get that imitative amplification with weak HCH accesses PSPACE, as desired.

Approval-based amplification accesses PSPACE

The proof here is fairly trivial, as we can simply reproduce the same loss as in imitative amplification by having $H$ calculate approval of an answer $a$ to some question $q$ as $-d\!\left(a,\ H^M(q)\right)$, where $d$ is some distance metric. $H$ can then just implement the exact same strategy as above for $H^M$ to reproduce the same results.

The only limitation is that now we need $H$ to be able to compute the distance metric, which means it needs to be computable in polynomial time. However, that restriction just means that we need $M(q)$ and $H^M(q)$ to be polynomially-sized, which we already have since those outputs are linear in size other than $l$ and $r$—and we know $l$ and $r$ are polynomial in size thanks to having $L \in \text{PSPACE}$. Thus, we get that approval-based amplification accesses PSPACE, as desired.
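
Written out in the notation above (the name “Approval” is mine, not the post’s), the approval signal and its maximizer are

$$\text{Approval}(q, a) = -\,d\!\left(a,\ H^M(q)\right), \qquad \operatorname*{argmax}_{a}\ \text{Approval}(q, a) = H^M(q)$$

(assuming $d(a, b) = 0$ iff $a = b$), so an $M$ trained to maximize approval is pushed toward exactly the same optimum as under the imitative loss.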

Recursive reward modeling accesses PSPACE

In “Scalable agent alignment via reward modeling,” Leike et al. exhibit a proof that recursive reward modeling accesses NP. However, I believe that this proof understates recursive reward modeling’s capabilities and that in fact it can get all the way to PSPACE, which is what we’ll prove here. Given the above proofs, the proof here is actually fairly straightforward and proceeds simply by demonstrating that approval-based amplification can be implemented from within recursive reward modeling.

We’ll start by exhibiting a version of recursive reward modeling with the right properties. We’ll be restricting ourselves here to single-step episodes, as we won’t be needing anything else. In that context, we’ll first define what it means to train a reward model and then, using that, define what it means to train an agent using that reward model.

First we’ll define reward modeling. Let $S$ be the set of states for our single-step episode and $A$ the set of actions. Per the recursive reward modeling setup, we want to train a reward model $R$ on the evaluation of environment trajectories by a human with access to the model, $H^M$. Since a trajectory in this setting is just a single tuple $(s, a)$, however—and each episode is only a single step such that the advantage function is just the reward function—we’re effectively just training $R$ to mimic $H^M(s, a)$.

Given such a reward modeling procedure, we’ll train an agent model $M$ to maximize $R$. Thus, we get

$$\mathcal{L}_M = \mathbb{E}_{s \sim \mathcal{D},\ a \sim M(s)}\!\left[-R(s, a)\right]$$

which, for optimal $R$ as above, reduces to

$$\mathcal{L}_M = \mathbb{E}_{s \sim \mathcal{D},\ a \sim M(s)}\!\left[-H^M(s, a)\right]$$

Now, let $L \in \text{PSPACE}$ and we’ll demonstrate that there exists a strategy for $H$ that ensures the correctness of $M$ at optimal loss. Specifically, $H$’s strategy here is the same as with approval-based amplification, which is to compute a distance metric $d$ such that

$$H^M(s, a) = -\,d\!\left(a,\ H^M(s)\right)$$

which—as in the proof for approval-based amplification—precisely reproduces the loss function for imitative amplification, which as above we have already proven accesses PSPACE. Note that, as with approval-based amplification above, since we need $H$ to be able to compute the distance metric here, we need $a$ and $H^M(s)$ to be polynomially-sized, but the assumption that $L \in \text{PSPACE}$ gives us that. Thus, we get that recursive reward modeling accesses PSPACE, as desired.

Imitative amplification with strong HCH accesses R

The proof here is very similar to the previous imitative amplification proof with only minor modifications.

First, we need to incorporate strong HCH into that proof. Specifically, we will allow for $Q$ and $A$ to include pointers to arbitrary quantities of text (which can itself also include pointers). This will allow $M$’s input and output to be arbitrarily large while still being amenable to manipulation by a polynomial-time $H$.

Second, we will change the format for representing the tape to make use of pointers. Instead of representing the full $l$ and $r$ in string form, we will use pointers to represent both $l$ and $r$ as doubly linked lists with the head of the list at the head of the Turing machine. This will allow for $H$ to perform a single step of the Turing machine in constant time, with the only non-constant time taken by $H$ being the $O(n)$ needed to copy the length-$n$ problem $q$ in the initial call.
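
As a rough illustration (my own sketch, not code from the post), here is how a pointer-based tape makes each Turing machine step constant-time—the `Cell` links stand in for strong-HCH pointers, and the transition-table format is an assumption of mine:

```python
# Each tape square is a node in a doubly linked list, so taking one step never
# requires copying the (possibly very long) left and right tapes.

EMPTY = "_"

class Cell:
    """One tape square; `left`/`right` play the role of strong-HCH pointers."""
    def __init__(self, symbol, left=None, right=None):
        self.symbol, self.left, self.right = symbol, left, right

def neighbor(cell, direction):
    """Follow a pointer, allocating a fresh empty square if the tape ends there."""
    nxt = getattr(cell, direction)
    if nxt is None:
        nxt = Cell(EMPTY)
        setattr(cell, direction, nxt)
        setattr(nxt, "left" if direction == "right" else "right", cell)
    return nxt

def step(state, head, transitions):
    """One O(1) step: write under the head, move the head pointer, switch state."""
    write, move, new_state = transitions[(state, head.symbol)]
    head.symbol = write
    return new_state, neighbor(head, "right" if move == "R" else "left")

# Example: one step of a machine that writes "1" and moves right.
state, head = step("s0", Cell(EMPTY), {("s0", EMPTY): ("1", "R", "s1")})
print(state, head.left.symbol)  # -> s1 1
```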

These changes allow us to completely remove the assumption that the Turing machine $W_L$ in the previous imitative amplification proof runs in polynomial space, enabling the procedure above to accommodate any Turing machine which is known to halt rather than just those which use polynomial space. Since R is the complexity class of all problems decidable by a Turing machine, we get that imitative amplification with strong HCH accesses R, as desired.

AI safety via market making with pointers accesses R

For market making, I will be doing the R proof first, as it is actually much simpler than the EXP proof, especially given the above proofs as background.

We will use the formulation of AI safety via market making for non-binary questions, such that we have two models—$M$ and $Adv$—where $M$ is trained to predict $H$’s final answer after seeing $Adv$’s output and $Adv$ is trained to produce output which changes $M$’s prediction. We will also be allowing $Adv$ to reference $M$, as we will need to do in this simple proof.

Furthermore, since we are considering the pointer version here, we will be allowing both $M$ and $Adv$ to take as input and produce as output text containing pointers, which can themselves reference other pointer-laden text. Additionally, for the purposes of this proof, we will only be considering transcripts of length 1 and will be carefully constructing $H$’s strategy such that there is only one minimal loss equilibrium for $M$. Note that $Adv$ will still be very important here for ensuring that only the one optimal behavior for $M$ exists, despite the fact that at equilibrium $Adv$ won’t actually be getting any reward.

Now, let $L \in \text{R}$ such that there exists a Turing machine $W$ that decides, for any $q$, whether $q \in L$. Then, $H$’s strategy is as follows. First, if $q$ is given as input, then $H$ only listens to arguments of the form (which are really two possible arguments, one for accept and one for reject):

Adv: The answer is accept/reject because $M(s_0 \cdot \square \cdot \emptyset \cdot q)$ predicts accept/reject, where $s_0$ is $W$’s starting state, $\square$ is the empty tape symbol, and $\emptyset$ is an empty pointer (the initially empty left tape, with the question $q$ serving as the right tape).

And if an input of the form $s \cdot x \cdot l \cdot r$ is given (where $l$ and $r$ are now pointers to the left and right tapes), then $H$ only listens to arguments of the following two forms.

Argument type 1:

Adv: The answer is accept/reject because one step of $W$ in state $s$ over tape $(x, l, r)$ sends $W$ to accept/reject state $s'$.

Argument type 2:

Adv: The answer is accept/reject because:

  • one step of $W$ in state $s$ over tape $(x, l, r)$ sends $W$ to new state $s'$ with new head symbol $x'$ and tape pointers $l'$ and $r'$, and

  • $M(s' \cdot x' \cdot l' \cdot r')$ predicts accept/reject.

It should be easy enough to see that all of the above arguments only reference things which are checkable by a polynomial-time $H$ and cannot be more than polynomial in size due to the use of pointers.

Now, under such a strategy, we will prove that the only unexploitable market $M$ is one that correctly answers any question of the form $s \cdot x \cdot l \cdot r$ so long as $W$ halts starting from state $s$ on tape $(x, l, r)$. We will use induction on the number of steps left until $W$ halts.

First, the base case. If $W$ halts in a single step on the given input, then $H$ won’t listen to any type 2 arguments, as $H$ will be able to check that $W$’s next state is an accept or reject state. Thus, $H$ should only listen to type 1 arguments—and furthermore should only listen to the correct type 1 argument, as $H$ knows what the next state of $W$ should be. Therefore, the only unexploitable $M$ must predict the correct answer here.

Second, the inductive step. Suppose $W$ halts in $T > 1$ steps and $M$ is correct on all inputs on which $W$ halts in fewer than $T$ steps. Since $T > 1$, only argument type 2 is available. Since $H$ can check a single step of $W$, however, only type 2 arguments with the correct next state and tape should be available. Furthermore, since $s' \cdot x' \cdot l' \cdot r'$ should be one step closer to halting, $M$’s prediction on it must be correct by induction. Thus, $H$ should only listen to the correct type 2 argument here, ensuring that the only unexploitable market is one where $M$ predicts the correct answer.

Given that $M$ is correct on inputs of the form $s \cdot x \cdot l \cdot r$ where $W$ halts, it must be that $M(s_0 \cdot \square \cdot \emptyset \cdot q)$ is correct, as that is an input on which we know $W$ halts since we know it decides $L$. Furthermore, it is easy enough to see that the only unexploitable market on the input $q$ must have $M(q) = M(s_0 \cdot \square \cdot \emptyset \cdot q)$. Thus, $M$ must correctly answer $q$ at optimal loss, as desired. Since this works for all $L$ decidable by some $W$, we get that market making with pointers accesses R, as desired.

AI safety via market making accesses EXP

This is definitely the most complicated of all the proofs in this post and so to save space I won’t be going into as much detail here as with the other proofs.

We will start with the same training setup as in the previous proof, with the exception that we will no longer be allowing pointers. Then, let $L \in \text{EXP}$ be some language which is decidable by a Turing machine $W$ in a number of steps bounded by $c \cdot 2^{p(n)}$, where $c$ is a constant, $p$ is some polynomial, and $n$ is the size of the input.

Now, let $q$ be some decision problem for $L$. Then, we will describe $H$’s strategy by listing out all the arguments $H$ will listen to for each of the different inputs that we care about. There are three sorts of inputs: the question $q$ itself; inputs of the form $(q, t)$, which ask for $W$’s state and the symbol under its head after $t$ steps on input $q$; and inputs of the form $(q, t, i)$, which ask for the symbol on the tape square $i$ positions to the right of the head (negative $i$ meaning to the left) after $t$ steps. We will use the convention that $W$ starts with its head over an empty square with $q$ written on the $n$ squares to its right.

On input $q$:

Adv: The answer is accept/reject because I can exhibit a $t$ such that:

  • $M(q, t)$ predicts $(s, x)$,

  • $W$ in state $s$ over tape symbol $x$ enters state $s'$, and

  • $s'$ is an accept/reject state.

On input $(q, t)$:

Adv: The answer is $(s', x')$ because:

  • $M(q, t-1)$ predicts $(s, x)$,

  • $M(q, t-1, -1)$ predicts $x_l$,

  • $M(q, t-1, +1)$ predicts $x_r$, and

  • $W$ in state $s$ over tape symbol $x$, with tape symbol $x_l$ to the left and $x_r$ to the right, enters state $s'$ and moves the head to symbol $x'$ (which is $x_l$ if $W$ moves left and $x_r$ if it moves right).

Adv: The answer is $(s_0, \square)$ because $s_0$ is the starting state, $\square$ is the empty tape symbol, and $t = 0$.

On input $(q, t, i)$:

Adv: The answer is $q_i$ (where we let $q_i = \square$ whenever $i < 1$ or $i > n$) because $\square$ is the empty tape symbol and $t = 0$.

Adv: The answer is $x$ because:

  • $t \neq 0$,

  • $M(q, t-1)$ predicts $(s, x_0)$,

  • $W$ on $(s, x_0)$ changes the head position by $\Delta$ (such that the square at position $i$ relative to the new head position was at position $i + \Delta$ relative to the old one),

  • $i + \Delta \neq 0$, and

  • $M(q, t-1, i + \Delta)$ predicts $x$.

Adv: The answer is $x$ because:

  • $t \neq 0$,

  • $M(q, t-1)$ predicts $(s, x_0)$,

  • $W$ on $(s, x_0)$ changes the head position by $\Delta$ (such that the square at position $i$ relative to the new head position was at position $i + \Delta$ relative to the old one),

  • $i + \Delta = 0$, and

  • $W$ on $(s, x_0)$ writes $x$ (so $x$ is the symbol just written on the square the head moved away from).

Additionally, we also need to specify that if $H$ is given an input with a $t$ or $i$ that is too large to represent in polynomial space—and thus that $H$ can’t comprehend—then $H$ always produces the empty string. (A short sketch of the recursion that these arguments pin down is given below.)
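
Here is a rough Python rendering (my own reconstruction, with an assumed transition-table format—not code from the post) of what the equilibrium $M$ computes on the $(q, t)$ and $(q, t, i)$ inputs; the naive recursion is exponentially slow, but the point is only to pin down the semantics that the arguments above enforce.

```python
# config(q, t)  -> (state, symbol under head) after t steps of W on input q.
# tape(q, t, i) -> symbol on the square i positions right of the head after t steps.
# transitions maps (state, symbol) to (symbol to write, head move in {-1, +1}, next state);
# at t = 0 the head sits on an empty square with q written to its right.

EMPTY = "_"

def config(q, t, transitions, start="start"):
    if t == 0:
        return start, EMPTY
    s, x = config(q, t - 1, transitions, start)
    x_left = tape(q, t - 1, -1, transitions, start)
    x_right = tape(q, t - 1, +1, transitions, start)
    _, move, s_new = transitions[(s, x)]
    return s_new, (x_right if move == +1 else x_left)

def tape(q, t, i, transitions, start="start"):
    if t == 0:
        return q[i - 1] if 1 <= i <= len(q) else EMPTY
    s, x = config(q, t - 1, transitions, start)
    write, move, _ = transitions[(s, x)]
    old_i = i + move  # where this square sat relative to the head before the step
    return write if old_i == 0 else tape(q, t - 1, old_i, transitions, start)

# Example: a machine that immediately accepts after one step to the right.
print(config("101", 1, {("start", EMPTY): (EMPTY, +1, "accept")}))  # -> ('accept', '1')
```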

Now, there are a variety of things that we need to prove about this strategy.

First, we need to demonstrate that it is implementable by a polynomial-time $H$. The only real barrier to that is the necessity of $t$ and $i$ being small enough that $H$ is able to read and understand them. In particular, if $t$ or $i$ are too large to represent in polynomial space, then $H$ will give no answer, which means we need $M(q)$—and any calls to $M$ that $M(q)$ implicitly depends on—to never use a $t$ or $i$ which is too large to represent in polynomial space. However, the assumption that $L \in \text{EXP}$ gives us this.

We will start with $t$. Since $L \in \text{EXP}$, we get $t_{\text{halt}} \leq c \cdot 2^{p(n)}$, where $t_{\text{halt}}$ is the total number of steps taken by $W$ until it halts. Since $M(q)$ only ever depends on calls with $t \leq t_{\text{halt}}$ (since we know there exists a halting $t \leq t_{\text{halt}}$, and that call should then only depend on calls with smaller $t$), we get that $t \leq c \cdot 2^{p(n)}$. However, if we represent $t$ using any base-$b$ number system (for any $b \geq 2$), we can represent it in space proportional to

$$\log_b\!\left(c \cdot 2^{p(n)}\right) = O(p(n))$$

as desired.

Now we’ll take a look at $i$. The largest $|i|$ that $M(q)$ (transitively) depends on is bounded above by the largest $t$ it depends on. To see this, note that $M(q, t)$ can only ever call $M(q, t-1, i)$ with $|i| = 1$, each call to $M(q, t, i)$ can only modify $i$ by $\pm 1$, and all calls only make subcalls with smaller $t$, ensuring that $|i|$ cannot grow larger than $t$. Thus, since $t$ is representable in polynomial space, $i$ must be also.

Second, we need to demonstrate that this strategy ensures that an optimal $M$ always solves $q$. To do that, we first need to show that $M(q, t)$ and $M(q, t, i)$ are correct for all polynomially-representable $t$ and $i$. I won’t reproduce this part of the proof here as I think it’s pretty straightforward but also very tedious, but suffice it to say that this can be proven via mutual induction on $t$.

However, I will show that $M(q)$ is correct given that $M(q, t)$ and $M(q, t, i)$ are correct for all polynomially-representable $t$ and $i$. Now, suppose there exists some $t$ such that all the conditions are met for $Adv$ to make the available argument for that $t$. Then, it must be that $t$ is polynomially-representable, since otherwise $M(q, t)$ would be the empty string, and thus it must be that $M(q, t)$ is correct by the previous correctness proof for such calls. Furthermore, since we know $W$ halts in exponential time, we know such a $t$ exists, and thus the only non-exploitable equilibrium for $M(q)$ is to correctly produce the result given by that $t$. Since this procedure works for all $L \in \text{EXP}$, we get that AI safety via market making accesses EXP, as desired.