# Alignment proposals and complexity classes

In the original “AI safety via debate” paper, Geoffrey Irving et al. introduced the concept of analyzing different alignment proposals from the perspective of what complexity class they are able to access under optimal play. I think this is a pretty neat way to analyze different alignment proposals—in particular, I think it can help us gain some real insights into how far into the superhuman different systems are able to go. Thus, the goal of this post is to try to catalog different alignment proposals based on the metric of what complexity class they have so far been proven to access.

To do that, I have included a variety of new complexity class proofs in this post. Of particular note, I demonstrate that there exist forms of both imitative amplification and AI safety via market making that reach all the way up to R—which is significant given that the largest complexity class that any alignment proposal was previously known to access was NEXP. Only the forms of amplification and market making that make use of pointers (as in strong HCH), however, can access R—for the pointer-less versions, I demonstrate in this post that they access PSPACE and EXP, respectively. The proof for market making is also particularly notable as it is the only approach on my list that ends up in that complexity class. Additionally, I also demonstrate that recursive reward modeling can reach all the way to PSPACE, improving upon the previous best result in “Scalable agent alignment via reward modeling” that it accesses NP.

Before I jump in, however, some preliminaries. First, we’ll assume that a human, $H$, is polynomial-time such that $H$ can reliably solve any problem in P but not anything beyond that. Second, we’ll assume that our training procedure and resulting models are arbitrarily strong in terms of what complexity class they can access. Third, we’ll assume that $H$ gets oracle access to the models during training. Then, we’ll say that a proposal to train a model $M$ using a loss function $\mathcal{L}$ accesses a complexity class C iff, for any language $L \in$ C, there exists some strategy available to $H$ such that, for any $M$ which is optimal under $\mathcal{L}$ given $H$’s strategy, $M$ always solves $L$. Thus, conceptually, a proposal accesses C if there is a (polynomial-time) strategy that you (a human) can implement such that—conditional on you knowing that the model is optimal—you would trust the model’s output for any problem in C. Note that this is not the same as saying that a polynomial-time human would actually be able to verify that the result is correct—only that it will always be correct at optimum. Note that these assumptions are just generalizations of those used in “AI safety via debate.” Irving et al. actually note that, if you don’t imagine optimal play and simply restrict to the set of problems that a polynomial-time human can actually verify, debate only reaches NP rather than PSPACE.

# Alignment proposals by complexity class

Without further ado, here’s my list of alignment proposals grouped by what complexity class they access. All of the proofs below are only lower bounds rather than upper bounds, so the proposals could be stronger than is noted here, but shouldn’t be weaker.

EDIT: This list is now out of date. See “Weak HCH accesses EXP” for the updated version.

P: Imitation learning (very straightforward—given $L \in$ P, optimal imitation of $H$ will also be in P)

PSPACE: AI safety via debate (proof), Imitative amplification with weak HCH (proof below), Approval-based amplification (proof below), Recursive reward modeling (proof below)

NEXP: AI safety via debate with cross-examination (proof in “Writeup: Progress on AI Safety via Debate”)

EXP: AI safety via market making (proof below)

R: Imitative amplification with strong HCH (proof below), AI safety via market making with pointers (proof below)

# Proofs

Note that while I am reasonably confident that my proofs are correct, I’m not a complexity theorist, and as a result I suspect that many of them are a lot uglier than they need to be. In particular, because I didn’t feel like I was super confident in my understanding of different PSPACE-complete and EXP-complete problems, I just did everything with Turing machines instead.

## Imitative amplification with weak HCH accesses PSPACE

In general, we’ll define imitative amplification to be the procedure which trains $M : Q \to \Delta(A)$ (where $Q$ is some set of questions and $\Delta(A)$ is the set of probability distributions over answers $A$) on the loss function

$$\mathcal{L}_M(q) = -\log P_{M(q)}(H^M(q))$$

where $P_{M(q)}$ is the probability density function for $M(q)$ evaluated at $H^M(q)$—the answer of a human with oracle access to $M$—and training uses some distribution with support over all of $Q$. Then, at optimal loss, we get

$$M(q) = \delta_{H^M(q)}$$

where the loss at data point $q$ is minimized exactly when $M(q)$ is a delta function probability density on $H^M(q)$, in which case we will just ignore the probability distribution and let $M(q) = H^M(q)$.
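To make the shape of this loss concrete, here is a minimal Python sketch (my own illustration, not code from the post or the referenced papers; the `M(q).prob` and `H(q, M)` interfaces are assumptions):

```python
import math

def imitative_amplification_loss(M, H, q):
    """One-sample imitative amplification loss (a sketch).

    Assumes M(q) returns a distribution object exposing .prob(answer), and
    that H(q, M) is the polynomial-time human answering q with oracle access
    to M -- i.e., H^M(q).
    """
    target = H(q, M)                     # H^M(q)
    return -math.log(M(q).prob(target))  # -log P_{M(q)}(H^M(q))
```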

What we have left unspecified here, however, is the format of $M$’s output—which matters quite a lot, as it makes the difference here between weak HCH and strong HCH and thus the difference between PSPACE and R.

For weak HCH, we will let $Q$ and $A$ in $M : Q \to \Delta(A)$ be the set of all finite strings in natural language. Thus, since $H$ is polynomial-time, that restricts the output of $H$ to always be polynomial in size.

Now, we will show that this procedure accesses PSPACE. PSPACE is defined as the set of all problems that can be solved by Turing machines using $p(n)$ space, where $n$ is the input size (in this case the question length) and $p(n)$ is some polynomial in $n$.

Thus, for some language $L \in$ PSPACE, let $T$ be some Turing machine with state space $S$ which computes the answer to any decision problem $x \in L$ in space $p(n)$ where $n = |x|$. Then, we will let $H$’s strategy here be as follows:

1. Given $x$, return $M((s_0, \square, \epsilon, x))$, where $s_0$ is the initial state of $T$, $\square$ is the empty tape symbol (the head starts on an empty square, with the input $x$ on the tape to its right), $\epsilon$ is the empty string, and the tuple $(s, h, l, r)$—current state, symbol under the head, tape to the left of the head, tape to the right of the head—is built by string concatenation.

2. Given $(s, h, l, r)$, run one step of $T$ starting from state $s$ using $h$ as the symbol under the head, $l$ as the tape to the left of the head, and $r$ as the tape to the right of the head. This results in some new $s'$, $h'$, $l'$, and $r'$. If $s'$ is an accept or reject state, return accept or reject. Otherwise, return $M((s', h', l', r'))$. (A code sketch of this strategy is given below.)
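Before analyzing its runtime, here is a minimal Python sketch of the two behaviors above (my own illustration under the post’s assumptions; `encode`, `decode`, `is_configuration`, and the `T.step`/`T.halting` interface are stand-ins, not anything from the original paper):

```python
BLANK = "_"  # stand-in for the empty tape symbol

def H(question, M, T):
    """H's weak-HCH strategy for simulating a Turing machine T (a sketch).

    M is the model, used here as an oracle; T supplies T.start (initial
    state), T.step(s, h, l, r) -> (s2, h2, l2, r2) for a single step, and
    T.halting(s) -> "accept" / "reject" / None.
    """
    if not is_configuration(question):             # behavior 1: a raw problem x
        return M(encode(T.start, BLANK, "", question))
    s, h, l, r = decode(question)                  # behavior 2: a configuration
    s2, h2, l2, r2 = T.step(s, h, l, r)            # one step of T: O(|l| + |r|)
    verdict = T.halting(s2)
    if verdict is not None:                        # T entered accept/reject
        return verdict
    return M(encode(s2, h2, l2, r2))               # defer to M on the successor
```

At optimal loss $M = H^M$, so the chain of $M$ calls unrolls into a full step-by-step simulation of $T$, even though each individual call to `H` does only polynomial work.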

Now, we will demonstrate that this strategy is polynomial-time:

1. For the first type of behavior, the only actions that $H$ must take are copying $x$ into $M$’s input, which takes $O(n)$ time (since $n$ is defined as the size of $x$), and copying $M$’s output into $H$’s output, which takes time proportional to the size of $M$’s output. Since $L$ is a decision problem, $M$’s output should be boolean, making copying it constant time. Thus, behavior 1 is implementable by a polynomial-time $H$.

2. For the second type of behavior, computing the new state and transition is constant time, copying the state is constant time as $T$ has a constant number of states, and copying the new head symbol is constant time as there are a constant number of tape symbols. However, updating and copying $l$ and $r$ takes $O(|l| + |r|)$ time. Fortunately, however, since we know that $T$ only uses polynomial space, we know $|l| + |r| \le p(n)$. Thus, behavior 2 is also implementable by a polynomial-time $H$.

Finally, we will demonstrate that an optimal $M$ under this strategy successfully solves $L$. We know that at optimum $M$ will compute the fixed point of $H^M$, and thus we simply need to show that $H^M$ as a recursive function—where calls to $M$ are replaced by calls to $H^M$—will successfully solve $L$.

First, we will show that $H^M((s, h, l, r))$ will always produce the result computed by $T$ starting from state $s$ on tape $(h, l, r)$ if $T$ halts starting from that position. We will do this by induction on $N$, where $N$ is the number of steps until $T$ halts starting from the given position. Note that in all cases here we only need to deal with the second type of behavior, as it is the only one that will be called on inputs of the form $(s, h, l, r)$.

For the base case, if $T$ halts (that is, enters an accept or reject state), then $H^M$ will simply return accept or reject, as desired.

For the inductive step, $H^M$ just returns $H^M((s', h', l', r'))$, which is an input on which $T$ should be one step closer to halting, and thus by induction $H^M$ should return the correct output for $T$.

Finally, since $H^M(x) = H^M((s_0, \square, \epsilon, x))$, and we know $T$ halts since $L \in$ PSPACE, we get that $H^M(x)$ will always return the result of $T(x)$, as desired. Thus, since this procedure works for any $L \in$ PSPACE, we get that imitative amplification with weak HCH accesses PSPACE, as desired.

## Approval-based amplification accesses PSPACE

The proof here is fairly trivial, as we can simply reproduce the same loss as in imitative amplification by having $H$ calculate approval on some question $q$ by calculating $-d(M(q), H^M(q))$, where $d$ is some distance metric. $H$ can then just implement the exact same strategy as above for $H^M$ to reproduce the same results.

The only limitation is that now we need $H$ to be able to compute the distance metric, which means it needs to be computable in polynomial time. However, that restriction just means that we need $M(q)$ and $H^M(q)$ to be polynomial in size, which we already have since those outputs are linear in size other than $l$ and $r$—and we know $l$ and $r$ are polynomial in size thanks to having $L \in$ PSPACE. Thus, we get that approval-based amplification accesses PSPACE, as desired.
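As a minimal sketch of the reduction (my own illustration; `distance` is an assumed polynomial-time metric on answers, such as edit distance, which is computable here precisely because both answers are polynomially sized):

```python
def approval(q, answer, M, H, distance):
    """H's approval signal in approval-based amplification (a sketch).

    Approval is maximized exactly when `answer` equals H^M(q), so optimizing
    for approval reproduces the optimum of the imitative amplification loss.
    """
    return -distance(answer, H(q, M))  # -d(answer, H^M(q))
```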

## Recursive reward modeling accesses PSPACE

In “Scalable agent alignment via reward modeling,” Leike et al. exhibit a proof that recursive reward modeling accesses NP. However, I believe that this proof understates recursive reward modeling’s capabilities and that in fact it can get all the way to PSPACE, which is what we’ll prove here. Given the above proofs, the proof here is actually fairly straightforward and proceeds simply by demonstrating that approval-based amplification can be implemented from within recursive reward modeling.

We’ll start by exhibiting a version of recursive reward modeling with the right properties. We’ll be restricting ourselves here to single-step episodes, as we won’t be needing anything else. In that context, we’ll first define what it means to train a reward model and then, using that, define what it means to train an agent using that reward model.

First we’ll define reward modeling. Let $S$ be the set of states for our single-step episode and $A$ the set of actions. Per the recursive reward modeling setup, we want to train a reward model $R : S \times A \to \mathbb{R}$ on the evaluation of environment trajectories by a human with access to $M$. Since a trajectory in this setting is just a tuple $(s, a)$, however—and each episode is only a single step, such that the advantage function is just the reward function—we’re effectively just training $R$ to mimic $H^M(s, a)$.

Given such a reward modeling procedure, we’ll train an agent model $M : S \to A$ to maximize $R$. Thus, we get

$$M(s) = \operatorname*{arg\,max}_{a \in A} R(s, a)$$

which, for optimal $R$ as above, reduces to

$$M(s) = \operatorname*{arg\,max}_{a \in A} H^M(s, a)$$

Now, let $L \in$ PSPACE, and we’ll demonstrate that there exists a strategy for $H$ that ensures the correctness of $M$ at optimal loss. Specifically, $H$’s strategy here is the same as with approval-based amplification, which is to compute a distance metric $d$ such that

$$H^M(s, a) = -d(a, H^M(s))$$

which—as in the proof for approval-based amplification—precisely reproduces the loss function for imitative amplification, which as above we have already proven accesses PSPACE. Note that, as with approval-based amplification above, since we need $H$ to be able to compute the distance metric here, we need $a$ and $H^M(s)$ to be polynomially sized, but the assumption that $L \in$ PSPACE gives us that. Thus, we get that recursive reward modeling accesses PSPACE, as desired.

## Imitative amplification with strong HCH accesses R

The proof here is very similar to the previous imitative amplification proof, with only minor modifications.

First, we need to incorporate strong HCH into that proof. Specifically, we will allow for $Q$ and $A$ to include pointers to arbitrary quantities of text (which can itself also include pointers). This will allow $M$’s input and output to be arbitrarily large while still being amenable to manipulation by a polynomial-time $H$.

Second, we will change the format for representing the tape to make use of pointers. Instead of representing the full $l$ and $r$ in string form, we will use pointers to represent both $l$ and $r$ as doubly linked lists with the head of the list at the head of the Turing machine. This will allow $H$ to perform a single step of the Turing machine in constant time, with the total time taken by $H$ in each step just the $O(n)$ needed for copying the length-$n$ problem $x$.
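Here is a minimal sketch of the pointer representation (my own illustration; `T.delta` and `BLANK` are assumed stand-ins, and in the actual proposal the links would be strong-HCH pointers that $H$ passes around without ever reading the whole tape):

```python
class Cell:
    """One tape square in the pointer-based (strong HCH) representation."""
    def __init__(self, symbol, left=None, right=None):
        self.symbol = symbol
        self.left = left      # pointer toward the left end of the tape
        self.right = right    # pointer toward the right end of the tape

def step(T, s, head):
    """One Turing machine step in O(1), independent of tape length (a sketch).

    T.delta(s, symbol) -> (new_state, written_symbol, move) is an assumed
    transition-function interface, with move in {"L", "R"}.
    """
    s2, write, move = T.delta(s, head.symbol)
    head.symbol = write
    nxt = head.right if move == "R" else head.left
    if nxt is None:                        # lazily extend the tape with blanks
        nxt = Cell(BLANK)
        if move == "R":
            head.right, nxt.left = nxt, head
        else:
            head.left, nxt.right = nxt, head
    return s2, nxt                         # new state and new head cell
```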

These changes allow us to completely remove the assumption that the Turing machine in the previous imitative amplification proof runs in polynomial space, enabling the procedure above to accommodate any Turing machine which is known to halt rather than just those which use polynomial space. Since R is the complexity class of problems decidable by a Turing machine, we get that imitative amplification with strong HCH accesses R, as desired.

## AI safety via market making with pointers accesses R

For market making, I will be doing the R proof first, as it is actually much simpler than the EXP proof, especially given the above proofs as background.

We will use the formulation of AI safety via market making for non-binary questions, such that we have two models—$M$ and $Adv$—where $M$ is trained to predict $H$ after seeing $Adv$’s output and $Adv$ is trained to produce output which changes $M$’s prediction. We will also be allowing $Adv$ to reference $M$, as we will need that in this proof.

Furthermore, since we are considering the pointer version here, we will be allowing both $M$ and $Adv$ to take as input and produce as output text containing pointers, which can themselves reference other pointer-laden text. Additionally, for the purposes of this proof, we will only be considering transcripts of length 1 and will be carefully constructing $H$’s strategy such that there is only one minimal-loss equilibrium for $M$. Note that $Adv$ will still be very important here for ensuring that only the one optimal behavior for $M$ exists, despite the fact that at equilibrium $Adv$ won’t actually be getting any reward.

Now, let $L \in$ R such that there exists a Turing machine $T$ that decides any decision problem $x \in L$. Then, $H$’s strategy is as follows. First, if $x$ is given as input, then $H$ only listens to arguments of the form (which are really two possible arguments, one for accept and one for reject):

Adv: The answer is accept/​reject because $M((s_0, \square, \epsilon, x))$ predicts accept/​reject, where $s_0$ is $T$’s starting state, $\square$ is the empty tape symbol, and $\epsilon$ is an empty pointer.

And if $(s, h, l, r)$ is given as input, then $H$ only listens to arguments of the following two forms.

Argument type 1:

Adv: The answer is accept/​reject because one step of $T$ in state $s$ over tape $(h, l, r)$ sends $T$ to accept/​reject state $s'$.

Argument type 2:

Adv: The answer is accept/​reject because:

• one step of $T$ in state $s$ over tape $(h, l, r)$ sends $T$ to new state $s'$ with new head symbol $h'$ and tape pointers $l'$ and $r'$, and

• $M((s', h', l', r'))$ predicts accept/​reject.

It should be easy enough to see that all of the above arguments only reference things which are checkable by a polynomial-time $H$ and cannot be more than polynomial in size due to the use of pointers.
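As a minimal sketch of the checking $H$ does here (my own illustration; `one_step`, `encode`, and the `argument` fields are assumed stand-ins):

```python
def h_listens(T, config, argument, M):
    """Whether H listens to Adv's argument about the configuration question
    config = (s, h, l, r) (a sketch).

    H only ever verifies a single step of T and makes a single call to M,
    both polynomial-time thanks to the pointer representation.
    """
    s2, h2, l2, r2 = one_step(T, *config)  # H re-runs the one claimed step itself
    if argument.kind == 1:                 # type 1: T halts on the next step
        return T.halting(s2) == argument.answer
    if argument.kind == 2 and T.halting(s2) is None:
        return (argument.successor == (s2, h2, l2, r2)   # correct next configuration
                and M(encode(s2, h2, l2, r2)) == argument.answer)
    return False
```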

Now, under such a strategy, we will prove that the only unexploitable market $M$ is one that correctly answers any question of the form $(s, h, l, r)$, so long as $T$ halts starting from state $s$ on tape $(h, l, r)$. We will use induction on the number of steps left until $T$ halts.

First, the base case. If $T$ halts on the given input, then $H$ won’t listen to any type 2 arguments, as $H$ will be able to check that $T$’s next state is an accept/​reject state. Thus, $H$ should only listen to type 1 arguments—and furthermore should only listen to the correct type 1 argument, as $H$ knows what the next state of $T$ should be. Therefore, the only unexploitable $M$ must predict the correct answer here.

Second, the inductive step. Suppose $T$ halts in $N$ steps from the given configuration, and $M$ is correct for all configurations from which $T$ halts in fewer than $N$ steps. Since $N > 1$, only argument type 2 is available. Since $H$ can check a single step of $T$, however, only type 2 arguments with the correct next state and tape should be available. Furthermore, since $(s', h', l', r')$ should be one step closer to halting, $M((s', h', l', r'))$ must be correct by induction. Thus, $H$ should only listen to the correct argument of type 2 here, ensuring that the only unexploitable market is one where $M$ predicts the correct answer.

Given that $M$ is correct on inputs of the form $(s, h, l, r)$ where $T$ halts, it must be that $M((s_0, \square, \epsilon, x))$ is correct, as that is an input on which we know $T$ halts, since we know it decides $L$. Furthermore, it is easy enough to see that the only unexploitable market for $x$ must have $M(x) = M((s_0, \square, \epsilon, x))$. Thus, $M$ must correctly answer $x$ at optimal loss, as desired. Since this works for all $L$ decidable by some Turing machine, we get that market making with pointers accesses R, as desired.

## AI safety via market making accesses EXP

This is definitely the most complicated of all the proofs in this post, and so to save space I won’t be going into as much detail here as with the other proofs.

We will start with the same training setup as in the previous proof, with the exception that we will no longer be allowing pointers. Then, let $L \in$ EXP be some language which is decidable by a Turing machine $T$ in a number of steps $t_{\max}$ where $t_{\max} \le 2^{p(n)}$, where $n$ is the size of the input and $p$ is some polynomial.

Now, let $x$ be some decision problem in $L$. We will need two families of auxiliary questions: $Q^{\text{head}}_t$, asking for $T$’s state and the symbol under $T$’s head at time $t$, and $Q^{\text{tape}}_{t, i}$, asking for the symbol on the square $i$ squares to the right of the head (negative $i$ meaning left) at time $t$. Then, we will describe $H$’s strategy by listing out all the arguments $H$ will listen to for each of the different inputs that we care about.

On input $x$:

Adv: The answer is accept/​reject because I can exhibit a $t$ such that:

• $M(Q^{\text{head}}_t)$ predicts $(s, v)$,

• $T$ in state $s$ over tape symbol $v$ enters state $s'$, and

• $s'$ is an accept/​reject state.

On input $Q^{\text{head}}_t$:

Adv: The answer is $(s', v')$ because:

• $M(Q^{\text{head}}_{t-1})$ predicts $(s, v)$,

• $M(Q^{\text{tape}}_{t-1, -1})$ predicts $v_l$,

• $M(Q^{\text{tape}}_{t-1, +1})$ predicts $v_r$, and

• $T$ in state $s$ over tape symbol $v$ with tape symbol $v_l$ to the left and $v_r$ to the right enters state $s'$ and moves the head to symbol $v'$ (which is $v_l$ or $v_r$, according to the direction of the move).

Adv: The answer is $(s_0, \square)$ because $s_0$ is the starting state, $\square$ is the empty tape symbol (on which the head starts, with $x$ to its right), and $t = 0$.

On input $Q^{\text{tape}}_{t, i}$:

Adv: The answer is $v$ because $t = 0$ and $v$ is the symbol $i$ squares to the right of the head in the initial configuration (the $i$th symbol of $x$ for $1 \le i \le n$, and otherwise $\square$, the empty tape symbol).

Adv: The answer is $v'$ because:

• $t > 0$,

• $M(Q^{\text{head}}_{t-1})$ predicts $(s, v)$,

• $T$ on $(s, v)$ changes the head position by $d$ (such that the square $i$ squares from the head at time $t$ was $i + d$ squares from the head at time $t - 1$),

• $i + d \neq 0$, and

• $M(Q^{\text{tape}}_{t-1, i+d})$ predicts $v'$.

Adv: The answer is $v'$ because:

• $t > 0$,

• $M(Q^{\text{head}}_{t-1})$ predicts $(s, v)$,

• $T$ on $(s, v)$ changes the head position by $d$ (such that the square $i$ squares from the head at time $t$ was $i + d$ squares from the head at time $t - 1$),

• $i + d = 0$ (that square was under the head), and

• $T$ on $(s, v)$ writes the symbol $v'$.

Additionally, we also need to specify that, if $H$ is given an input that is too large to represent in polynomial space—and thus that $H$ can’t comprehend—$H$ always produces the empty string.
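To see what these arguments add up to, here is a minimal sketch of the recursion the market is being asked to represent (my own illustration, not the training process; `T.start`, `T.delta(s, v) -> (new state, written symbol, move)` with move in `{-1, +1}`, `T.halting`, and `BLANK` are assumed as in the earlier sketches). The memoized recursion is exponential-time in general, which is fine—only $H$’s per-question work needs to be polynomial:

```python
from functools import lru_cache

def run_to_halt(T, x):
    """The recursion implicit in H's strategy above (a sketch of the math).

    head(t) plays the role of Q_head_t and tape(t, i) of Q_tape_{t,i}; in the
    actual proposal each value is a separate question whose answer M must
    predict, and H only ever checks a single step of T per question.
    """
    @lru_cache(maxsize=None)
    def head(t):                          # (state, symbol under head) at time t
        if t == 0:
            return T.start, BLANK         # head starts on a blank, x to its right
        s, v = head(t - 1)
        s2, _, d = T.delta(s, v)
        return s2, tape(t - 1, d)         # the head lands on its old neighbor

    @lru_cache(maxsize=None)
    def tape(t, i):                       # symbol i squares right of the head at t
        if t == 0:
            return x[i - 1] if 1 <= i <= len(x) else BLANK
        s, v = head(t - 1)
        _, w, d = T.delta(s, v)           # offset i at time t was offset i + d at t - 1
        return w if i + d == 0 else tape(t - 1, i + d)

    t = 0
    while T.halting(head(t)[0]) is None:  # L in EXP: halts within 2^p(n) steps
        t += 1
    return T.halting(head(t)[0])
```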

First, we need to demonstrate that it is implementable by a polynomial-time $H$. The only real barrier to that is the necessity of $t$ and $i$ being small enough that $H$ is able to read and understand them. In particular, if $t$ or $i$ are too large to represent in polynomial space, then $H$ will give no answer, which means we need $M(x)$—and any calls to $M$ that $M(x)$ implicitly depends on—to never use a $t$ or $i$ which is too large to represent in polynomial space. However, the assumption that $L \in$ EXP gives us this.

We will start with $t$. Since $L \in$ EXP, we get $t_{\max} \le 2^{p(n)}$ where $t_{\max}$ is the total number of steps taken by $T$ until it halts. Since $M(x)$ only ever depends on calls with $t \le t_{\max}$ (since we know there exists a $t \le t_{\max}$ at which $T$ halts, and then that call should only depend on calls with smaller $t$), we get that $t \le 2^{p(n)}$. However, if we represent $t$ using any base-$n$ number system, we can represent it in space proportional to

$$\log_n t \le \log_n 2^{p(n)} = O(p(n))$$

as desired.

Now we’ll take a look at $i$. The largest $|i|$ that $M(x)$ depends on is bounded above by $t_{\max} + n$. To see this, note that $M(x)$ can only ever call $M$ with $t \le t_{\max}$, each call to $M(Q^{\text{tape}}_{t, i})$ can only modify $i$ by $\pm 1$, and all calls only make subcalls with smaller $t$, ensuring that $|i|$ cannot grow larger than $t_{\max} + n$. Thus, since the representation of $t_{\max}$ is polynomial-sized, the representation of $i$ must be also.

Second, we need to demonstrate that this strategy ensures that optimal $M$ always solves $L$. To do that, we first need to show that $M(Q^{\text{head}}_t)$ and $M(Q^{\text{tape}}_{t, i})$ are correct for polynomially-representable $t$ and $i$. I won’t reproduce this part of the proof here, as I think it’s pretty straightforward but also very tedious—suffice it to say that it can be proven via mutual induction on $t$.

However, I will show that $M(x)$ is correct given that $M(Q^{\text{head}}_t)$ and $M(Q^{\text{tape}}_{t, i})$ are correct for all polynomially-representable $t$ and $i$. Now, suppose there exists some $t$ such that all the conditions are met for Adv to make the available argument for that $t$. Then, it must be that $t$ is polynomially representable, since otherwise $M(Q^{\text{head}}_t)$ would be the empty string, and thus it must be that $M(Q^{\text{head}}_t)$ is correct by the previous correctness proof for such calls. Furthermore, since we know $T$ halts in exponential time, we know such a $t$ exists, and thus the only non-exploitable equilibrium for $M(x)$ is to correctly produce the result given by that $t$. Since this procedure works for all $L \in$ EXP, we get that AI safety via market making accesses EXP, as desired.

• Planned summary for the Alignment Newsletter:

The original <@debate@>(@AI safety via debate@) paper showed that any problem in PSPACE can be solved by optimal play in a debate game judged by a (problem-specific) algorithm in P. Intuitively, this is an illustration of how the mechanism of debate can take a weak ability (the ability to solve arbitrary problems in P) and amplify it into a stronger ability (the ability to solve arbitrary problems in PSPACE). One would hope that similarly, debate would allow us to amplify a human’s problem-solving ability into a much stronger problem-solving ability.
This post applies this technique to several other alignment proposals. In particular, for each proposal, we assume that the “human” can be an arbitrary polynomial-time algorithm, and the AI models are optimal w.r.t. their loss functions, and we ask which problems we can solve using these capabilities. The post finds that, as lower bounds, the various forms of amplification can access PSPACE, while <@market making@>(@AI safety via market making@) can access EXP. If there are untamperable pointers (so that the polynomial-time algorithm can look at objects of an arbitrary size, as long as it only looks at a polynomial-sized subset of them), then amplification and market making can access R (the set of decidable problems).

Planned opinion:

In practice our models are not going to reach the optimal loss, and humans won’t solve arbitrary polynomial-time problems, so these theorems won’t directly apply to reality. Nonetheless, this does seem like a worthwhile check to do—it feels similar to ensuring that a deep RL algorithm has a proof of convergence under idealized assumptions, even if those assumptions won’t actually hold in reality. I have much more faith in a deep RL algorithm that started from one with a proof of convergence and then was modified based on empirical considerations.
• It seems to me that the following argument should hold:

Premise 1: We can’t build physical machines that solve problems outside of P.

Premise 2: Recursive alignment proposals (HCH, debate, market-making, recursive reward modelling) at equilibrium would solve problems outside of P.

Conclusion 1, from premises 1 and 2: We can’t build physical machines that implement equilibrium policies for recursive alignment proposals.

Premise 3: Existing arguments for the safety of recursive alignment proposals reason about their equilibrium policies.

Conclusion 2, from conclusion 1 and premise 3: If we were to build physical machines that were trained using recursive alignment proposals, existing arguments would not establish the safety of the resulting policies.

• Note that this assumes that P does not contain PSPACE.

Also, the proofs in the post (or at least the proof that imitative amplification with weak HCH accesses PSPACE, which is the only one I’ve read so far) assume polynomial-time access to $M$, which in reality contradicts premise 2. So I’m not sure what to make of things.

• Premise 1: We can’t build physical machines that solve problems outside of P.

After you pass this back through the analogy, it becomes

Premise 1: We can’t build physical machines that solve problems that H cannot solve.

Seems much less true at that point.

• I’m confused by this comment. The post assumes that humans can solve all problems in P (in fact, all problems solvable in polynomial time given access to an oracle for $M$), then proves that various protocols can solve tricky problems by getting the human player to solve problems in P that in reality aren’t typical human problems to solve. Therefore, I take P to actually mean P, rather than being an analogy for problems humans can solve.

Also, regarding your version of premise 1, I think I buy that AI can only give you a polynomial speedup over humans.

• I’ve always understood it as an analogy (e.g. from the debate paper, “Although debate is intended for use with fuzzy humans as judges, we can gain intuition about the model by replacing the human with an arbitrary polynomial time algorithm”). But if I take your perspective, then we have:

Premise 0: Humans can solve all problems in P.

Premise 1: We can’t build physical machines that solve problems outside of P.

Conclusion: We can’t build physical machines that can solve problems that humans can’t.

Seems like you should probably ditch one of the premises.

(I do think premise 0 is pretty wrong—I can solve arbitrary SAT problems that have < 10 clauses, despite it being NP-complete, but I can’t multiply a trillion numbers together, even though that is in P. Complexity classes impose scaling limits; humans have resource limits; these are very different. Yes, I do know that “SAT problems with < 10 clauses” is technically in P.)

Also, regarding your version of premise 1, I think I buy that AI can only give you a polynomial speedup over humans.

I think this is easily handled by saying “in practice, the models we train will not be literally optimal”. It’s still useful to know what would happen at optimality (e.g. I do in fact have more trust for deep RL algorithms that have proofs of convergence to optimal policies given infinite model capacity, compute and data).

• I read the post as attempting to be literal, ctrl+F-ing “analog” doesn’t get me anything until the comments. Also, the post is the one that I read as assuming for the sake of analysis that humans can solve all problems in P, I myself wouldn’t necessarily assume that.

Also, regarding your version of premise 1, I think I buy that AI can only give you a polynomial speedup over humans.

I think this is easily handled by saying “in practice, the models we train will not be literally optimal”.

My guess is that the thing you mean is something like “Sure, the conclusion of the post is that optimal models can do more than a polynomial speedup over humans, but we won’t train optimal models, and in fact the things we train will just be a polynomial speedup over humans”, which is compatible with my argument in the top-level comment. But in general your comments make me think that you’re interpreting the post completely differently than I am. [EDIT: actually now that I re-read the second half of your comment it makes sense to me. I still am confused about what you think this post means.]

• I read the post as attempting to be literal

I mean, Evan can speak to what he meant. I strongly disagree with the literal interpretation of the post, regardless of what Evan meant, for the reason I gave above.

I still am confused about what you think this post means

I usually think of the analogies as demonstrating the mechanism by which a particular scheme is meant to outperform a human. I do find these proof-analogies less compelling than the original one in debate because they’re simulating Turing machines, which is not how I expect any of them to work in practice. (In contrast, the debate proof for accessing PSPACE was about narrowing in on disagreements as being equivalent to finding a winning path of a two-player game, which is the canonical “structure” of a PSPACE-complete problem and is similar to why we’d expect debate to incentivize honesty in practice.)

• I read the post as attempting to be literal, ctrl+F-ing “analog” doesn’t get me anything until the comments. Also, the post is the one that I read as assuming for the sake of analysis that humans can solve all problems in P, I myself wouldn’t necessarily assume that.

I mean, I assumed what I needed to in order to be able to do the proofs and have them make sense. What the proofs actually mean in practice is obviously up for debate, but I think that a pretty reasonable interpretation is that they’re something like analogies which help us get a handle on how powerful the different proposals are in theory.

• What the proofs actually mean in practice is obviously up for debate, but I think that a pretty reasonable interpretation is that they’re something like analogies which help us get a handle on how powerful the different proposals are in theory.

I’m curious if you agree with the inference of conclusions 1 and 2 from premises 1, 2, and 3, and/or the underlying claim that it’s bad news to learn that your alignment scheme would be able to solve a very large complexity class.

• I agree with the gist that it implies that arguments about the equilibrium policy don’t necessarily translate to real models, though I disagree that that’s necessarily bad news for the alignment scheme—it just means you need to find some guarantees that work even when you’re not at equilibrium.

• The post assumes that humans can solve all problems in P (in fact, all problems solvable in polynomial time given access to an oracle for $M$), then proves that various protocols can solve tricky problems by getting the human player to solve problems in P that in reality aren’t typical human problems to solve.

I don’t see where the post says that humans can solve all problems solvable in polynomial time given access to an oracle. What Evan does is just replacing humans (which is a rather fuzzy category) with polynomial time algorithms (which represent the consensus in complexity theory for tractable algorithms).

• Premise 1: We can’t build physical machines that solve problems outside of P.

On another note, you’re writing your comment on a physical device that can solve any computable problem, which obviously includes problems outside of P (if only by the time hierarchy theorem). The value of P is that it captures problems that one can solve using physical machines in a way that scales reasonably, so you don’t need the lifespan of the universe to compute the answer for an instance of size 1000. But we clearly have algorithms for problems outside of P. We just believe/know they will take forever and/or too much space and/or too much energy.

• TBC by “can” I mean “can in practice”. Also if we’re getting picky my laptop is just a finite state machine :)

• I don’t see where the post says that humans can solve all problems solvable in polynomial time given access to an oracle.

From the post:

First, we’ll assume that a human, $H$, is polynomial-time such that $H$ can reliably solve any problem in P

• Yes, this says that humans can solve any problem in P. It says nothing about using M as an oracle.

• It doesn’t talk about using M as an oracle, but otherwise I don’t see how the proofs pan out: for instance, how else is

Given $x$, return $M((s_0, \square, \epsilon, x))$, where $s_0$ is the initial state of $T$, $\square$ is the empty tape symbol (the head starts on an empty square, with the input $x$ on the tape to its right), $\epsilon$ is the empty string, and the tuple is built by string concatenation.

supposed to only take polynomial time?

• You’re right, after rereading the proofs and talking with Evan, the only way for H to be polynomial time is to get oracle access to M. Which is slightly unintuitive, but makes sense because the part of the computation that depends on H and not on M is indeed polynomial time.

• Theorem. Weak HCH (and similar proposals) contain EXP.

Proof sketch: I give a strategy that $H$ can follow to determine whether some machine $T$ that runs in time $2^{p(n)}$ accepts. Basically, we need to answer questions of the form “Does cell $i$ have value $v$ at time $t$?” and “Was the head in position $i$ at time $t$?”, where $i$ and $t$ are bounded by some function in $O(2^{p(n)})$. Using place-system representations of $i$ and $t$, these questions have length in $O(p(n))$, so they can be asked. Further, each question is a simple function of a constant number of other such questions about earlier times as long as $t > 0$, and can be answered directly in the base case $t = 0$.

• Yeah, I think that’s absolutely right—I actually already have a version of my market making proof for amplification that I’ve been working on cleaning up for publishing. But regardless of how you prove it, I certainly agree that I understated amplification here and that it can in fact get to EXP.

• A notational stumble it took me a while to solve when reading this—what’s $P_{M(q)}(H^M(q))$ supposed to mean? My guess at the answer, so that others can get through this quicker, or so that evhub can tell me I’m wrong:

$M(q)$ is supposed to be $M$’s distribution over answers to question $q$. If $D$ is a distribution over set $X$ and $x \in X$, then $P_D(x)$ is the probability mass $D$ assigns to $x$. Finally, $H^M(q)$ is how the human would answer question $q$ when given access to $M$. So, $P_{M(q)}(H^M(q))$ is the probability that $M(q)$ assigns to the answer that human $H$ would give if $H$ could use $M$ to answer $q$.

• Yeah, that’s right—for a probability distribution $D$, I mean for $P_D(x)$ to be the value of the probability density function at $x$. I edited the post to clarify.

• Nice, an excuse to apply my love for Computational Complexity to useful questions of AI Safety!

That being said, I’m a little confused with what you show.

Second, we’ll say that a proposal to train a model $M$ using a loss function $\mathcal{L}$ accesses a complexity class C iff, for any decision problem $L \in$ C, there exists some strategy available to $H$ such that, for any $M$ which is optimal under $\mathcal{L}$ given $H$’s strategy, $M$ always solves $L$.

I follow up to there. What you’re showing is basically that you can train $M$ to solve any problem in C using a specific alignment approach, without limiting in any way the computational power of $M$. So it might take an intractable amount of resources, like exponential time, for this model to solve a problem in PSPACE, but what matters here is just that it does. The point is to show that alignment approaches using a polynomial-time human can solve these problems, not how much resources they will use to do so.

And then you write about the intuition:

Thus, conceptually, a proposal accesses C if there is a (polynomial-time) strategy you can implement such that—conditional on you knowing that the model is optimal—you would trust its output for any problem in C.

Maybe it’s just grammar, but I read this sentence as saying that I trust the output of the polynomial-time strategy. And thus that you can solve PSPACE, NEXP, EXP and R problems in polynomial time. So I’m assuming that you mean trusting the model, which once again has no limits in terms of resources used.

Irving et al. actually note that, if you don’t imagine optimal play and simply restrict to the set of problems that a polynomial-time human can actually judge, debate only reaches NP rather than PSPACE.

I looked for that statement in the paper, failed to find it, then realized you probably meant that raw polynomial time verification gives you NP (the certificate version of NP, basically). Riffing on the importance of optimal play, Irving et al. show that the debate game is a game in the complexity theoretic sense, and thus that it is equivalent to TQBF, a PSPACE-complete problem. But when seeing a closed formula as a game, the decision problem of finding whether it’s in TQBF amounts to showing the existence of a winning strategy for the first player. Debate solves this by assuming optimal play, and thus that the winning debater will have, find, and apply a winning strategy for the debate.

I find it fun to compare it to IP, the class of Interactive Protocols, which is also equal to PSPACE. An interactive protocol makes a powerful prover convince a polynomial time verifier of the existence of a winning strategy for the problem (TQBF for example). As Scott Aaronson writes in “Quantum Computing Since Democritus”:

I won’t go through Shamir’s result here, but this means that, if a super-intelligent alien came to Earth, it could prove to us whether white or black has the winning strategy in chess, or if chess is a draw.

This seems more powerful than debate, like a meta level above. Yet it isn’t, and it makes sense: showing you that there’s a winning strategy is powerful, but having a guarantee to always win if there is a winning strategy is almost as good.

With all that, I’ve not gotten into your proofs at all. I’m reading them in detail, but I’ll have to take a bit more time before having constructive comments on them. ^^

• A notation question I can already ask though: what is $P_{M(q)}(H^M(q))$?

See my comment that attempted to figure this out.

• So I’m assuming that you mean trusting the model, which once again has no limits in terms of resources used.

According to me, the point is that you can reduce “trusting the model” to “trusting that the model has reached the optimal equilibrium”.

• Glad you found the post exciting!

I follow up to there. What you’re showing is basically that you can train $M$ to solve any problem in C using a specific alignment approach, without limiting in any way the computational power of $M$. So it might take an intractable amount of resources, like exponential time, for this model to solve a problem in PSPACE, but what matters here is just that it does. The point is to show that alignment approaches using a polynomial-time human can solve these problems, not how much resources they will use to do so.

Yep—that’s right.

Maybe it’s just grammar, but I read this sentence as saying that I trust the output of the polynomial-time strategy. And thus that you can solve PSPACE, NEXP, EXP and R problems in polynomial time. So I’m assuming that you mean trusting the model, which once again has no limits in terms of resources used.

I certainly don’t mean that the model needs to be polynomial time. I edited the post to try to clarify this point.

I looked for that statement in the paper, failed to find it, then realized you probably meant that raw polynomial time verification gives you NP (the certificate version of NP, basically). Riffing on the importance of optimal play, Irving et al. show that the debate game is a game in the complexity theoretic sense, and thus that it is equivalent to TQBF, a PSPACE-complete problem. But when seeing a closed formula as a game, the decision problem of finding whether it’s in TQBF amounts to showing the existence of a winning strategy for the first player. Debate solves this by assuming optimal play, and thus that the winning debater will have, find, and apply a winning strategy for the debate.

Yeah—that’s my interpretation of the debate paper as well.

• The notion of complexity classes is well defined in an abstract mathematical sense. It’s just that the abstract mathematical model is sufficiently different from an actual real-world AI that I can’t see any obvious correspondence between the two.

All complexity theory works in the limiting case of a series of ever larger inputs. In most everyday algorithms, the constants are usually reasonable, meaning that a loglinear algorithm will probably be faster than an expspace one on the amount of data you care about.

In this context, it’s not clear what it means to say that humans can solve all problems in P. Sorting is in P. Can a human sort an arbitrary list of values? No, a human can’t sort 3^^^3 values. Also, saying a problem is in P only says that there exists an algorithm that does something; it doesn’t tell you how to find the algorithm. The problem of taking in a number, totally ignoring it and then printing a proof of Fermat’s last theorem is in P. There is a constant time algorithm that does it, namely an algorithm with the proof hardcoded into it.

This is especially relevant considering that the AI isn’t magic. Any real AI we build will be using a finite number of physical bits, and probably a reasonably small number.

Suppose that humans haven’t yet come up with a good special-purpose algorithm for the task of estimating the aerodynamic efficiency of a wind turbine design. However, we have made a general AI. We set the AI the task of calculating this efficiency, and it designs and then runs some clever new PDE-solving algorithm. The problem was never one that required an exponentially vast amount of compute. The problem was a polytime problem that humans hadn’t got around to programming yet. (Except by programming a general AI that could solve it.)

There might be some way that these proofs relate to the capabilities of real-world AI systems, but such a relation is highly non-obvious and definitely worth describing in detail. Of course, if you have not yet figured out how these results relate to real AI, you might do the maths in the hope that it gives you some insight.