Alignment proposals and complexity classes

In the original “AI safety via debate” paper, Geoffrey Irving et al. introduced the concept of analyzing different alignment proposals from the perspective of what complexity class they are able to access under optimal play. I think this is a pretty neat way to analyze different alignment proposals—in particular, I think it can help us gain some real insight into how far into the superhuman different systems are able to go. Thus, the goal of this post is to try to catalog different alignment proposals based on the metric of what complexity class they have so far been proven to access.

To do that, I have included a variety of new complexity class proofs in this post. Of particular note, I demonstrate that there exist forms of both imitative amplification and AI safety via market making that reach all the way up to R (the class of all decidable problems)—which is significant given that the largest complexity class that any alignment proposal was known to access previously was NEXP. Only the forms of amplification and market making making use of pointers (as in strong HCH), however, can access R—for the pointer-less versions, I demonstrate in this post that they access PSPACE and EXP, respectively. The proof for market making is also particularly notable as it is the only approach on my list that ends up in that complexity class. Additionally, I also demonstrate that recursive reward modeling can reach all the way to PSPACE, improving upon the previous best result in “Scalable agent alignment via reward modeling” that it accesses NP.

Before I jump in, however, some preliminaries. First, we'll assume that a human, $H$, is polynomial-time, such that $H$ can reliably solve any problem in P but not anything beyond that. Second, we'll assume that our training procedure and resulting models are arbitrarily strong in terms of what complexity class they can access. Third, we'll assume that $H$ gets oracle access to the models during training. Then, we'll say that a proposal to train a model $M$ using a loss function $\mathcal{L}$ accesses a complexity class $C$ iff, for any language $L \in C$, there exists some strategy available to $H$ such that, for any $M$ which is optimal under $\mathcal{L}$ given $H$'s strategy, $M$ correctly decides $L$. Thus, conceptually, a proposal accesses $C$ if there is a (polynomial-time) strategy that you (a human) can implement such that—conditional on you knowing that the model is optimal—you would trust the model's output for any problem in $C$. Note that that is not the same as saying that a polynomial-time human would actually be able to verify that the result is correct—only that it will always be correct at optimum. Note that these assumptions are just generalizations of those used in “AI safety via debate.” Irving et al. actually note that, if you don't imagine optimal play and simply restrict to the set of problems that a polynomial-time human can actually verify, debate only reaches NP rather than PSPACE.
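To be a bit more explicit about it (this formalization is my own shorthand, not something taken from the debate paper): a proposal with loss $\mathcal{L}$ accesses a class $C$ iff

$$\forall L \in C,\ \exists\ \text{poly-time strategy } \sigma \text{ for } H \ :\ \forall M \in \operatorname*{arg\,min}_{M'} \mathcal{L}(M' \mid \sigma),\ \forall x,\ M(x) = L(x).$$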

Alignment proposals by complexity class

Without further ado, here's my list of alignment proposals grouped by what complexity class they access. All of the proofs below are only lower bounds rather than upper bounds, so the proposals could be stronger than is noted here, but shouldn't be weaker.

EDIT: This list is now out of date. See “Weak HCH accesses EXP” for the updated version.

P: Imitation learning (very straightforward—given $H \in$ P, optimal imitation of $H$ will also be in P)

PSPACE: AI safety via debate (proof), Imitative amplification with weak HCH (proof below), Approval-based amplification (proof below), Recursive reward modeling (proof below)

EXP: AI safety via market making (proof below)

NEXP: Debate with cross-examination (proof)

R: Imitative amplification with strong HCH (proof below), AI safety via market making with pointers (proof below)

Proofs

Note that while I am reasonably confident that my proofs are correct, I'm not a complexity theorist, and as a result I suspect that many of them are a lot uglier than they need to be. In particular, because I didn't feel like I was super confident in my understanding of the different PSPACE-complete and EXP-complete problems, I just did everything with Turing machines instead.

Imitative amplification with weak HCH accesses PSPACE

In general, we'll define imitative amplification to be the procedure which trains $M : Q \to \Delta A$ (where $Q$ is some set of questions and $\Delta A$ is the set of distributions over answers) on

$$\mathcal{L}(q) = -\log P_{M(q)}\!\left(H^M(q)\right)$$

where $P_{M(q)}$ is the probability density function for $M(q)$ evaluated at $H^M(q)$ (the human's answer to $q$ given oracle access to $M$) and training uses some distribution over $q$ with support over all of $Q$. Then, at optimal loss, we get

$$M(q) = \delta\!\left(H^M(q)\right)$$

where $\mathcal{L}(q)$ is the loss at data point $q$, such that $M(q)$ is a delta function probability density on $H^M(q)$, in which case we will just ignore the probability distribution and let $M(q) = H^M(q)$.
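As a concrete (and heavily simplified) illustration of this training setup, here is a minimal sketch in Python. The names `sample_question`, `human_with_model_access`, and `nll_loss` are placeholders I'm introducing purely for illustration; they aren't part of the original proposal:

```python
# Minimal sketch of one imitative amplification training step (illustrative only).
# Assumptions: `model` maps a question to a distribution over answers (M),
# `human_with_model_access(q, model)` is the polynomial-time human answering q
# while consulting `model` as an oracle (H^M), and `nll_loss(dist, target)` is
# the negative log-likelihood of `target` under `dist`.

def amplification_step(model, optimizer, sample_question, human_with_model_access, nll_loss):
    q = sample_question()                         # q drawn with support over all of Q
    target = human_with_model_access(q, model)    # H^M(q): human consults M while answering
    dist = model(q)                               # M(q): predicted distribution over answers
    loss = nll_loss(dist, target)                 # -log P_{M(q)}(H^M(q))
    optimizer.step(loss)                          # at optimum, M(q) concentrates on H^M(q)
```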

What we have left unspecified here, however, is the format of $M$'s input and output—which matters quite a lot, as it makes the difference here between weak HCH and strong HCH and thus the difference between PSPACE and R.

For weak HCH, we will let $Q$ and $A$ in $M : Q \to \Delta A$ be the set of all finite strings in natural language. Thus, since $H$ is polynomial-time, that restricts the output of $H^M$ to always be polynomial in size.

Now, we will show that this procedure accesses PSPACE. PSPACE is defined as the set of all problems that can be solved by Turing machines using $f(n)$ space, where $n$ is the input size (in this case the question length) and $f$ is some polynomial in $n$.

Thus, for some language $L \in$ PSPACE, let $T$ be some Turing machine with state space $S$ which computes the answer to any decision problem $q$ (that is, whether $q \in L$) in space $f(n)$, where $n = |q|$. Then, we will let $H$'s strategy here be as follows:

  1. Given the original question $q$, return $M((s_0, \square, \epsilon, q))$—the question string here being built by concatenating $q$ into the written-out tuple—where $s_0$ is the initial state of $T$, $\square$ is the empty tape symbol, $\epsilon$ is the empty string (the tape to the left of the head), and $+$ is string concatenation.

  2. Given a question of the form $(s, x, l, r)$, run one step of $T$ starting from state $s$ using $x$ as the symbol under the head, $l$ as the tape to the left of the head, and $r$ as the tape to the right of the head. This results in some new $s'$, $x'$, $l'$, and $r'$. If $s'$ is an accept or reject state, return accept or reject. Otherwise, return $M((s', x', l', r'))$. (A quick code sketch of both behaviors follows below.)
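Here is a minimal Python sketch of $H$'s two behaviors, under the simplifying assumption that a configuration is passed around as an explicit tuple rather than a natural-language string. The `tm_step` helper (applying $T$'s transition function once) and the other names are mine, purely for illustration:

```python
# Minimal sketch of H's weak-HCH strategy for simulating a PSPACE machine T.
# `tm_step(state, sym, left, right)` is a hypothetical helper applying T's
# transition function once and returning the new (state, sym, left, right);
# `model` plays the role of M; ACCEPT/REJECT are T's halting states.

EMPTY = "_"   # empty tape symbol
ACCEPT, REJECT = "accept", "reject"

def human_strategy(question, model, tm_step, initial_state):
    if isinstance(question, str):
        # Behavior 1: top-level problem q -- hand M the initial configuration.
        return model((initial_state, EMPTY, "", question))
    # Behavior 2: a configuration (s, x, l, r) -- run exactly one step of T.
    state, sym, left, right = question
    new_state, new_sym, new_left, new_right = tm_step(state, sym, left, right)
    if new_state in (ACCEPT, REJECT):
        return new_state
    return model((new_state, new_sym, new_left, new_right))
```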

Now, we will demonstrate that this strategy is polynomial time:

  1. For the first type of behavior, the only actions that $H$ must take are copying $q$ into $M$'s input, which takes $O(n)$ time (since $n$ is defined as the size of $q$), and copying $M$'s output into $H$'s output, which takes time proportional to the size of $M$'s output. Since $q$ is a decision problem, $M$'s output should be boolean, making copying it constant time. Thus, behavior 1 is implementable by a polynomial $H$.

  2. For the second type of behavior, computing the new state and transition is constant time, copying the state is constant time as $T$ has a constant number of states, and copying the new head symbol is constant time as there are a constant number of tape symbols. However, updating and copying $l$ and $r$ takes $O(|l| + |r|)$ time. Fortunately, however, since we know that $T$ only uses polynomial space, we know $|l| + |r| \in O(f(n))$. Thus, behavior 2 is also implementable by a polynomial $H$.

Finally, we will demonstrate that an optimal $M$ under this strategy successfully solves $L$. We know that at optimum $M$ will compute a fixed point of $M \mapsto H^M$, and thus we simply need to show that $H$ as a recursive function—where calls to $M$ are replaced by recursive calls to $H$ itself—will successfully solve $L$.

First, we will show that $H((s, x, l, r))$ will always produce the result computed by $T$ starting from state $s$ on tape $(l, x, r)$, if $T$ halts starting from that position. We will do this by induction on $N$, where $N$ is the number of steps until $T$ halts starting from the given position. Note that in all cases here we only need to deal with the second type of behavior, as it is the only one that will be called on inputs of the form $(s, x, l, r)$.

For the base case, if $T$ halts in a single step (that is, it immediately enters an accept or reject state), then $H$ will simply return accept or reject, as desired.

For the inductive step, $H$ just returns the recursive call on $(s', x', l', r')$, which is an input on which $T$ should be one step closer to halting, and thus by induction should return the correct output for $T$.

Finally, since $H(q)$ just returns the recursive call on $(s_0, \square, \epsilon, q)$, and we know $T$ halts from that configuration since $L \in$ PSPACE, we get that $H(q)$ will always return the result of running $T$ on $q$, as desired. Thus, since this procedure works for any $L \in$ PSPACE, we get that imitative amplification with weak HCH accesses PSPACE, as desired.

Approval-based amplification accesses PSPACE

The proof here is fairly trivial, as we can simply reproduce the same loss as in imitative amplification by having $H^M$ calculate approval on some question $q$ by calculating $-d(M(q), H^M(q))$, where $d$ is some distance metric. $H$ can then just implement the exact same strategy as above for computing $H^M(q)$ to reproduce the same results.

The only limitation is that now we need $H$ to be able to compute the distance metric, which means it needs to be computable in polynomial time. However, that restriction just means that we need $M(q)$ and $H^M(q)$ to be polynomially sized, which we already have, since those outputs are linear in size other than $l$ and $r$—and we know $l$ and $r$ are polynomial in size thanks to having $L \in$ PSPACE. Thus, we get that approval-based amplification accesses PSPACE, as desired.
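To spell out the reduction (this is just my gloss on the argument above, using the same $d$): with approval defined as the negative distance to the amplified human's answer, maximizing approval is the same as minimizing the imitative loss, since

$$\mathcal{L}_{\text{approval}}(q) = -\,\text{approval}(q, M(q)) = d\!\left(M(q), H^M(q)\right) \ge 0$$

with equality exactly when $M(q) = H^M(q)$.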

Recursive reward modeling accesses PSPACE

In “Scalable agent alignment via reward modeling,” Leike et al. exhibit a proof that recursive reward modeling accesses NP. However, I believe that this proof understates recursive reward modeling's capabilities and that in fact it can get all the way to PSPACE, which is what we'll prove here. Given the above proofs, the proof here is actually fairly straightforward and proceeds simply by demonstrating that approval-based amplification can be implemented from within recursive reward modeling.

We'll start by exhibiting a version of recursive reward modeling with the right properties. We'll be restricting ourselves here to single-step episodes, as we won't be needing anything else. In that context, we'll first define what it means to train a reward model and then, using that, define what it means to train an agent using that reward model.

First we'll define reward modeling. Let $S$ be the set of states for our single-step episode and $A$ the set of actions. Per the recursive reward modeling setup, we want to train a reward model $R$ on the evaluation of environment trajectories by a human with access to the model being trained, $H^M$. Since a trajectory in this setting is just a single tuple $(s, a)$, however—and each episode is only a single step, such that the advantage function is just the reward function—we're effectively just training $R(s, a)$ to mimic $H^M(s, a)$.

Given such a reward modeling procedure, we'll train an agent model $M$ to maximize $R$. Thus, we get

$$M = \operatorname*{arg\,max}_{M} \mathbb{E}_{s \sim S}\!\left[R\!\left(s, M(s)\right)\right]$$

which, for optimal $R$ as above, reduces to

$$M = \operatorname*{arg\,max}_{M} \mathbb{E}_{s \sim S}\!\left[H^M\!\left(s, M(s)\right)\right].$$

Now, let $S = A = Q$ be the set of all finite natural-language strings, and we'll demonstrate that there exists a strategy for $H$ that ensures the correctness of $M$ at optimal loss. Specifically, $H$'s strategy here is the same as with approval-based amplification, which is to compute a distance metric $d$ such that

$$H^M(s, a) = -d\!\left(a, H^M(s)\right)$$

which—as in the proof for approval-based amplification—precisely reproduces the loss function for imitative amplification, which as above we have already proven accesses PSPACE. Note that, as with approval-based amplification above, since we need $H$ to be able to compute the distance metric here, we need $a$ and $H^M(s)$ to be polynomially sized, but the assumption that $L \in$ PSPACE gives us that. Thus, we get that recursive reward modeling accesses PSPACE, as desired.
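Putting the pieces of this section together (again just my own gloss on the argument above): the optimal reward model copies the human evaluator, the human evaluator reports a negative distance to $H^M$'s own answer, and so the optimal agent is pushed back onto the amplified human:

$$R^*(s, a) = H^M(s, a) = -d\!\left(a, H^M(s)\right) \quad\Longrightarrow\quad M^*(s) = \operatorname*{arg\,max}_{a \in A} R^*(s, a) = H^M(s).$$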

Imitative amplification with strong HCH accesses R

The proof here is very similar to the previous imitative amplification proof, with only minor modifications.

First, we need to incorporate strong HCH into that proof. Specifically, we will allow $Q$ and $A$ to include pointers to arbitrary quantities of text (which can itself also include pointers). This will allow $M$'s input and output to be arbitrarily large while still being amenable to manipulation by a polynomial-time $H$.

Second, we will change the format for representing the tape to make use of pointers. Instead of representing the full $l$ and $r$ in string form, we will use pointers to represent both $l$ and $r$ as doubly linked lists with the head of each list at the head of the Turing machine. This will allow $H$ to perform a single step of the Turing machine in constant time, with the total time taken by $H$ in each step just the $O(n)$ needed for copying the length-$n$ problem $q$ in the first type of behavior.
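To illustrate why pointers buy constant-time steps, here is a minimal Python sketch of the tape-as-linked-list idea (my own illustrative construction, not anything specified above; a singly linked "zipper" is enough for the sketch even though the text describes doubly linked lists). The hypothetical `transition(state, sym)` helper stands in for $T$'s transition table:

```python
# Minimal sketch: each side of the tape is a linked list of cells, so one
# Turing machine step touches only O(1) cells (no string copying).
# `transition(state, sym)` returns (new_state, written_sym, direction),
# with direction in {-1, +1}.

class Cell:
    def __init__(self, sym, rest=None):
        self.sym = sym    # tape symbol stored in this cell
        self.rest = rest  # next cell moving away from the head (or None)

EMPTY = "_"

def step_constant_time(state, sym, left, right, transition):
    new_state, written, direction = transition(state, sym)
    if direction == +1:
        # Push the written symbol onto the left list; the next head symbol
        # is popped off the right list.
        new_left = Cell(written, left)
        new_sym = right.sym if right else EMPTY
        new_right = right.rest if right else None
    else:
        new_right = Cell(written, right)
        new_sym = left.sym if left else EMPTY
        new_left = left.rest if left else None
    return new_state, new_sym, new_left, new_right
```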

These changes allow us to completely remove the assumption that the Turing machine in the previous imitative amplification proof runs in polynomial space, enabling the procedure above to accommodate any Turing machine which is known to halt, rather than just those which use polynomial space. Since R is the complexity class of problems decidable by some Turing machine, we get that imitative amplification with strong HCH accesses R, as desired.

AI safety via market making with pointers accesses R

For market making, I will be doing the R proof first, as it is actually much simpler than the EXP proof, especially given the above proofs as background.

We will use the formulation of AI safety via market making for non-binary questions, such that we have two models—$M$ and $Adv$—where $M$ is trained to predict what $H$ will say after seeing $Adv$'s output and $Adv$ is trained to produce output which changes $M$'s prediction. We will also be allowing $Adv$'s arguments to reference $M$, as we will be relying on that ability in this simple proof.

Furthermore, since we are considering the pointer version here, we will be allowing both $M$ and $Adv$ to take as input and produce as output text containing pointers, which can themselves reference other pointer-laden text. Additionally, for the purposes of this proof, we will only be considering transcripts of length 1 and will be carefully constructing $H$'s strategy such that there is only one minimal-loss equilibrium for $M$. Note that $Adv$ will still be very important here for ensuring that only the one optimal behavior for $M$ exists, despite the fact that at equilibrium $Adv$ won't actually be getting any reward.
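For readers who haven't seen the setup before, here is a heavily simplified Python sketch of the length-1-transcript training signal as I understand it from “AI safety via market making”—all of the function names here are placeholders of mine, not part of that proposal:

```python
# Highly simplified sketch of one market-making training example with a
# length-1 transcript. `market` plays M, `adversary` plays Adv, and
# `human_judgment(q, argument, market)` is H's answer to q after reading the
# adversary's argument (H may itself consult `market`).

def market_making_example(q, market, adversary, human_judgment, distance):
    before = market(q, transcript=[])              # M's prediction before any argument
    argument = adversary(q, market)                # Adv produces an argument meant to move M
    after = market(q, transcript=[argument])       # M's prediction after seeing the argument
    target = human_judgment(q, argument, market)   # H's answer after reading the argument
    market_loss = distance(after, target)          # M is trained to predict H's final answer
    adv_reward = distance(after, before)           # Adv is rewarded for moving M's prediction
    return market_loss, adv_reward
```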

Now, let $L \in$ R, such that there exists a Turing machine $T$ that decides any decision problem $q \in L$. Then, $H$'s strategy is as follows. First, if the original question $q$ is given as input, then $H$ only listens to arguments of the form (which are really two possible arguments, one for accept and one for reject):

Adv: The answer is accept/reject because $M((s_0, \square, \epsilon, q))$ predicts accept/reject, where $s_0$ is $T$'s starting state, $\square$ is the empty tape symbol, and $\epsilon$ is an empty pointer (standing in for the blank tape to the left of the head, with $q$ passed as a pointer to the tape on the right).

And if given an input of the form $(s, x, l, r)$—a state $s$, a symbol $x$ under the head, and pointers $l$ and $r$ to the tape to the left and right of the head—then $H$ only listens to arguments of the following two forms.

Argument type 1:

Adv: The answer is accept/reject because one step of $T$ in state $s$ over tape $(l, x, r)$ sends $T$ to accept/reject state $s'$.

Argument type 2:

Adv: The answer is accept/reject because:

  • one step of $T$ in state $s$ over tape $(l, x, r)$ sends $T$ to new state $s'$ with new head symbol $x'$ and tape pointers $l'$ and $r'$, and

  • $M((s', x', l', r'))$ predicts accept/reject.

It should be easy enough to see that all of the above arguments only reference things which are checkable by a polynomial-time $H$ and cannot be more than polynomial in size, due to the use of pointers.

Now, under such a strategy, we will prove that the only unexploitable market $M$ is one that correctly answers any question of the form $(s, x, l, r)$, so long as $T$ halts starting from state $s$ on tape $(l, x, r)$. We will use induction on the number of steps left until $T$ halts.

First, the base case. If $T$ halts in a single step on the given input, then $H$ won't listen to any type 2 arguments, as $H$ will be able to check that $T$'s next state is an accept/reject state. Thus, $H$ should only listen to type 1 arguments—and furthermore should only listen to the correct type 1 argument, as $H$ knows what the next state of $T$ should be. Therefore, the only unexploitable $M$ must predict the correct answer here.

Second, the inductive step. Suppose $T$ halts in $N > 1$ steps from the given configuration and $M$ is correct on all configurations from which $T$ halts in fewer than $N$ steps. Since $N > 1$, only argument type 2 is available. Since $H$ can check a single step of $T$, however, only type 2 arguments with the correct next state and tape should be available. Furthermore, since $T$ starting from $(s', x', l', r')$ should be one step closer to halting, $M((s', x', l', r'))$ must be correct by induction. Thus, $H$ should only listen to the correct type 2 argument here, ensuring that the only unexploitable market is one where $M$ predicts the correct answer.

Given that $M$ is correct on inputs of the form $(s, x, l, r)$ where $T$ halts, it must be that $M((s_0, \square, \epsilon, q))$ is correct, as that is an input on which we know $T$ halts, since we know it decides $L$. Furthermore, it is easy enough to see that the only unexploitable market on the input $q$ must have $M(q) = M((s_0, \square, \epsilon, q))$. Thus, $M$ must correctly answer $q$ at optimal loss, as desired. Since this works for all $L$ decidable by some Turing machine, we get that market making with pointers accesses R, as desired.

AI safety via market making accesses EXP

This is definitely the most complicated of all the proofs in this post, and so to save space I won't be going into as much detail here as with the other proofs.

We will start with the same training setup as in the previous proof, with the exception that we will no longer be allowing pointers. Then, let $L$ be some language which is decidable by a Turing machine $T$ in a number of steps $t$ where $t \in O(2^{\mathrm{poly}(n)})$, where $n$ is the size of the input.

Now, let $q$ be some decision problem (that is, we want to know whether $q \in L$). Then, we will describe $H$'s strategy by listing out all the arguments $H$ will listen to for each of the different inputs that we care about. Besides $q$ itself, those inputs are questions of the form $(q, i)$—what are $T$'s state and the symbol under its head after $i$ steps on input $q$?—and questions of the form $(q, i, m)$—what symbol is in the tape cell $m$ cells to the right of $T$'s head (negative $m$ meaning to the left) after $i$ steps on input $q$?—with the convention that $T$'s head starts on the empty cell just to the left of the input.

On input $q$:

Adv: The answer is accept/reject because I can exhibit an $i$ such that:

  • $M((q, i))$ predicts $(s, x)$,

  • $T$ in state $s$ over tape symbol $x$ enters state $s'$, and

  • $s'$ is an accept/reject state.

On input $(q, i)$:

Adv: The answer is $(s', x')$ because:

  • $M((q, i-1))$ predicts $(s, x)$,

  • $M((q, i-1, -1))$ predicts $x_l$,

  • $M((q, i-1, +1))$ predicts $x_r$, and

  • $T$ in state $s$ over tape symbol $x$ with tape symbol $x_l$ to the left and $x_r$ to the right enters state $s'$ and moves the head to symbol $x'$ (which is $x_l$ or $x_r$, depending on which way the head moves).

Adv: The answer is $(s_0, \square)$ because $s_0$ is the starting state, $\square$ is the empty tape symbol, and $i = 0$ (so the head is still on the empty cell just to the left of the input).

On input $(q, i, m)$:

Adv: The answer is the symbol in cell $m$ of the initial tape—the $m$-th symbol of $q$ if $1 \le m \le n$, and $\square$ otherwise—because $\square$ is the empty tape symbol, the input $q$ fills cells $1$ through $n$, and $i = 0$.

Adv: The answer is $y$ because:

  • $i \ge 1$,

  • $M((q, i-1))$ predicts $(s, x)$,

  • $T$ in state $s$ over symbol $x$ changes the head position by $d$ (such that the queried cell sits $m + d$ cells to the right of where the head was at step $i - 1$),

  • $m + d \neq 0$ (so the queried cell is not the one $T$ just wrote to), and

  • $M((q, i-1, m+d))$ predicts $y$.

Adv: The answer is $y$ because:

  • $i \ge 1$,

  • $M((q, i-1))$ predicts $(s, x)$,

  • $T$ in state $s$ over symbol $x$ changes the head position by $d$ (such that the queried cell sits $m + d$ cells to the right of where the head was at step $i - 1$),

  • $m + d = 0$ (so the queried cell is exactly the one $T$ just wrote to), and

  • $T$ in state $s$ over symbol $x$ writes the symbol $y$.

Additionally, we also need to specify that if $H$ is given an input that is too large to represent in polynomial space—and thus that $H$ can't comprehend—then $H$ always produces the empty string.
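To see what these arguments pin down, here is a small (and exponentially slow) Python reference implementation of the recursion that the market's answers must satisfy at equilibrium. The `transition(state, sym)` helper is a hypothetical encoding of $T$'s transition function returning `(new_state, written_sym, direction)`; none of these names come from the proposal itself:

```python
# Reference recursion pinned down by H's strategy in the EXP proof (illustrative only).
# transition(state, sym) -> (new_state, written_sym, direction), direction in {-1, +1}.
EMPTY = "_"

def state_and_head_sym(q, i, transition, start_state):
    """Answer to (q, i): T's state and the symbol under its head after i steps."""
    if i == 0:
        return start_state, EMPTY          # head starts on the empty cell left of q
    s, x = state_and_head_sym(q, i - 1, transition, start_state)
    s2, _written, d = transition(s, x)
    # After the move, the symbol under the head is the old neighbor in direction d.
    return s2, cell_sym(q, i - 1, d, transition, start_state)

def cell_sym(q, i, m, transition, start_state):
    """Answer to (q, i, m): symbol m cells right of the head after i steps."""
    if i == 0:
        return q[m - 1] if 1 <= m <= len(q) else EMPTY
    s, x = state_and_head_sym(q, i - 1, transition, start_state)
    _s2, written, d = transition(s, x)
    if m + d == 0:
        return written                     # the queried cell is the one T just wrote
    return cell_sym(q, i - 1, m + d, transition, start_state)
```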

Now, there are a variety of things that we need to prove about this strategy.

First, we need to demonstrate that it is implementable by a polynomial-time $H$. The only real barrier to that is the necessity of $i$ and $m$ being small enough that $H$ is able to read and understand them. In particular, if $i$ or $m$ are too large to represent in polynomial space, then $H$ will give no answer, which means we need $H$'s strategy—and any calls to $M$ that it implicitly depends on—to never use an $i$ or $m$ which is too large to represent in polynomial space. However, the assumption that $L \in$ EXP gives us this.

We will start with $i$. Since $L \in$ EXP, we get $i \le t$, where $t \in O(2^{\mathrm{poly}(n)})$ is the total number of steps taken by $T$ until it halts. Since $H$ only ever depends on calls with $i \le t$ (since we know there exists an $i \le t$ witnessing the halting step, and that call should then only depend on calls with smaller $i$), we get that $i \in O(2^{\mathrm{poly}(n)})$. However, if we represent $i$ using any base-$n$ number system, we can represent it in space proportional to

$$\log_n(i) \le \log_n\!\left(2^{\mathrm{poly}(n)}\right) = \frac{\mathrm{poly}(n)}{\log_2 n} \in O(\mathrm{poly}(n))$$

as desired.

Now we'll take a look at $m$. The largest $|m|$ that $H$'s strategy depends on is bounded above by $i + 1$. To see this, note that $H$ on an input of the form $(q, i)$ can only ever call $M$ on $(q, i-1, m)$ with $|m| = 1$, each call on a question of the form $(q, i, m)$ can only modify $m$ by $\pm 1$, and such calls only make subcalls with smaller $i$, ensuring that $|m|$ cannot grow larger than one more than $i$. Thus, since $i$ is polynomial-sized in its representation, $m$ must be also.

Second, we need to demonstrate that this strategy ensures that an optimal $M$ always solves $q \in L$ correctly. To do that, we first need to show that $M$'s answers on questions of the form $(q, i)$ and $(q, i, m)$ are correct for polynomially-sized $i$ and $m$. I won't reproduce this part of the proof here, as I think it's pretty straightforward but also very tedious, but suffice it to say that this can be proven via mutual induction on $i$.

However, I will show that $M(q)$ is correct given that $M((q, i))$ and $M((q, i, m))$ are correct for all polynomially-sized $i$ and $m$. Now, suppose there exists some $i$ such that all the conditions are met for $Adv$ to make the available argument on input $q$ for that $i$. Then, it must be that $i$ is polynomially-sized, since otherwise $M((q, i))$ would be the empty string, and thus it must be that $M((q, i))$ is correct by the previous correctness proof for such calls. Furthermore, since we know $T$ halts in exponential time, we know such an $i$ exists, and thus the only non-exploitable equilibrium for $M$ is to correctly produce the result given by that $i$. Since this procedure works for all $L \in$ EXP, we get that AI safety via market making accesses EXP, as desired.