# AI safety via market making

Special thanks to Abram Demski, Paul Christiano, and Kate Woolverton for talking with me about some of the ideas that turned into this post.

The goal of this post is to present a new prosaic (i.e. that uses current ML techniques) AI safety proposal based on AI safety via debate that I've been thinking about recently.[1] I'll start by describing a simple version of the proposal and then show some of the motivation behind it as well as how the simple version can be expanded upon.

# Simple proposal

Let $M$ and $Adv$ be models and $H$ be a human. Intuitively, we'll train $M$ and $Adv$ via the following procedure given a question $Q$:

1. $M$ tries to predict what, at the end of the procedure, $H$ will think about $Q$.

2. $Adv$ tries to output a string $s$ which will cause $H$ to think something maximally different than what $M$ predicted.

3. Return to step 1 and repeat until $M$'s predictions stop changing.

4. Deploy $M$, which in the limit should act as an oracle for what $H$ will think about $Q$ after seeing all relevant information.

There are many different ways to implement this intuitive procedure, however.

For the first (simplified) version that I want to describe, we'll restrict ourselves to just the situation where $Q$ is a yes-or-no question and $M$ outputs the probability that $H$ will answer yes. Then, given a proposition $Q$, we can run the following training algorithm, starting at $t = 0$ with $x_0 = Q$:

1. Let $p_t = M(x_t)$.

2. Let $s_t = Adv(x_t, M)$.

3. Let $x_{t+1}$ be the string containing $x_t$ and $s_t$.

4. Increment $t$ and return to step 1. When $p_t$ converges and/or the desired number of iterations has been reached, continue.

5. Let $\bar{p}$ be $H$'s final estimate of the probability of $Q$ given all the $s_t$s included in the final $x_T$.

EDIT: Step 2 used to use $M(x_t)$ instead of $M$, however I have since realized that it is necessary to give $Adv$ the ability to query $M$ in general, not just on $x_t$, as I explain in this comment.

Then, for each step $t$, compute $M$'s loss for that step as

$$\mathcal{L}_M^t = -\bar{p} \log p_t - (1 - \bar{p}) \log(1 - p_t)$$

and $Adv$'s loss for that step as

$$\mathcal{L}_{Adv}^t = -\max\left(\frac{p_{t+1} - p_t}{p_t},\ \frac{p_t - p_{t+1}}{1 - p_t}\right)$$

which is defined for all $t$ but the final step $t = T$, for which we can either let $\mathcal{L}_{Adv}^T = 0$ (if we're doing something like myopic RL that doesn't care about differentiating through the final step) or run $M$ an additional time and let $p_{T+1} = M(x_{T+1})$ (if we do care about differentiating through it).
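To make the moving pieces concrete, here is a toy sketch of this loop in code. Everything in it is a hypothetical stand-in rather than part of the proposal: a logistic `judge` simulates $H$, arguments are just real-valued strengths, a greedy picker stands in for a trained $Adv$, and $M$ is idealized as already tracking the judge's current belief.

```python
import math

# Toy sketch of the simple binary training loop. All components are
# hypothetical stand-ins: `judge` simulates H, the transcript plays the
# role of x_t, and a greedy argument-picker stands in for a myopic Adv.

def judge(arguments):
    """Stand-in for H: P(yes) after seeing a list of argument strengths."""
    return 1.0 / (1.0 + math.exp(-sum(arguments)))

def loss_m(p_t, p_bar):
    """M's per-step loss: cross-entropy of p_t against H's final estimate."""
    return -(p_bar * math.log(p_t) + (1.0 - p_bar) * math.log(1.0 - p_t))

def loss_adv(p_t, p_next):
    """Adv's per-step loss: negative expected market return,
    -max((p' - p)/p, (p - p')/(1 - p))."""
    return -max((p_next - p_t) / p_t, (p_t - p_next) / (1.0 - p_t))

def run_episode(argument_pool):
    """Steps 1-5: alternate M's prediction with Adv's argument until the
    pool is exhausted (a proxy for p_t converging)."""
    transcript, pool = [], list(argument_pool)
    probs = [judge(transcript)]          # p_0
    while pool:
        # Myopic Adv: pick the argument that moves H's belief the most now.
        s_t = max(pool, key=lambda s: abs(judge(transcript + [s]) - probs[-1]))
        pool.remove(s_t)
        transcript.append(s_t)           # x_{t+1} = x_t followed by s_t
        probs.append(judge(transcript))  # p_{t+1}
    return probs, probs[-1]              # belief trajectory and p_bar

probs, p_bar = run_episode([2.0, -0.5, 0.3])
```

Note how the myopic trader reveals its strongest argument first: holding anything back would reduce its immediate per-step return.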

Note that it is quite important here that $M$ and $Adv$ be behaving myopically, such that they are only trying to minimize their per-step loss. For example, if $Adv$ is non-myopic, it might dole out information slowly in tiny bits to get more reward over time (as I talk about below in the "buying truth from a liar" section), whereas if $Adv$ is myopic, it should reveal all information that pushes $H$ in whichever direction it's trying to move $H$ in all at once. Thus, it is important that we make sure to avoid hidden incentives for non-myopia as well as do some sort of myopia verification, the details for which will be discussed below in the "Adding oversight" section.

Finally, we deploy $M$ as a yes-or-no question oracle and throw away $Adv$. Note that since $Adv$ has a highly unstable objective—as $M$ is constantly trying to incorporate $Adv$'s information into its estimate—$Adv$ is unlikely to be very useful in the limit.

# Market analogy

Suppose you have a prediction market for some proposition $X$ which is currently priced at $p$ with no spread (that is, you can buy shares of yes at price $p$ and shares of no at price $1 - p$, each paying out 1 if correct) and you have budget $\delta$ with which to make trades (which is small enough relative to the market that any trades you make won't meaningfully move the market). However, suppose that you have the ability to manipulate the market by taking actions that change the underlying probability of $X$. Then, if $p_a$ is the probability of $X$ after taking action $a$, how much money can you make by strategically selecting $a$?

Well, if $p_a > p$, you should buy $\delta / p$ shares of yes, which will give you an expected return of $\delta \frac{p_a - p}{p}$. Alternatively, if $p_a < p$, you should buy $\delta / (1 - p)$ shares of no, which will give you an expected return of $\delta \frac{p - p_a}{1 - p}$. Thus, the overall expected return for an action $a$ is

$$\delta \max\left(\frac{p_a - p}{p},\ \frac{p - p_a}{1 - p}\right)$$

which is maximized when $p_a$ is as far from $p$ as possible (and in fact looks a lot like $L^1$ distance, just with a different slope on each side).
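As a quick sanity check on this formula, here is a tiny illustration with hypothetical numbers (the function name and values are illustrative only, not from the post):

```python
# Expected return from manipulating a market priced at p toward a true
# post-action probability p_a, with trading budget delta:
#   delta * max((p_a - p)/p, (p - p_a)/(1 - p))

def expected_return(p, p_a, delta=1.0):
    buy_yes = delta * (p_a - p) / p          # buy delta/p shares of yes
    buy_no = delta * (p - p_a) / (1.0 - p)   # buy delta/(1-p) shares of no
    return max(buy_yes, buy_no)
```

At $p = 0.5$ the two sides are symmetric, but near the extremes the slopes differ: moving a price of 0.9 down to 0.8 pays far more than moving it up to 1.0, since no shares are cheap when the market is confident in yes.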

Now, suppose you walk up to a market like this and see a stable price $p$ for $X$. What does that tell you? Approximately, I think it tells you that $p$ is the probability of $X$ after all possible $X$-manipulating actions have been taken—even if those actions haven't actually been taken yet.

Why is that useful? Well, if you let $X$ be what $H$ will think about $Q$, then "manipulations" become arguments and $p$ becomes a prediction for what $H$ will think after having been exposed to all possible arguments for or against $Q$. Thus, we can use such a market to estimate a sort of reflective equilibrium for what $H$ will end up believing about $Q$.

Now—how is such a market analogous to the training procedure given previously? Well, $Adv$ is trained to maximize its return from manipulating the market and $M$ is trained to predict the final outcome after many such manipulations. Thus, we are effectively training $M$ to estimate precisely the equilibrium that we're looking for. Conceptually, $M$ is acting as a market maker and $Adv$ is acting as a trader in that market. The one disanalogy is that we only run the debate for so many steps—though if we always run to convergence of $p_t$, in the limit that should imply that any future single manipulation won't change the probability much from that point (since otherwise $Adv$ should have selected that manipulation instead).

# Relation to buying truth from a liar

In working on this proposal, I discovered that my approach here is actually quite similar to an old proposal by Katja Grace called "How to buy a truth from a liar" for incentivizing a liar to give you true information by rewarding them whenever their information causes you to change your mind. Paul Christiano previously wrote about what he thought about Katja's proposal, however he eventually concluded that it was untenable due to a non-myopic liar being incentivized to split their information into arbitrarily small chunks to get the most money.

In my opinion, however, the fact that you need a myopic trader to make something like this work is actually more of a feature than a bug. First, note that the importance of myopia is not unique to this approach. For example, imitative amplification also requires myopia lest the model learn to act in such a way that the human consulting it becomes easier to predict. In fact, even RL likely requires per-episode myopia to avoid deceptive alignment lest the agent learn to act deceptively in some episodes to ensure deployment and thus get greater reward in others. Thus, it seems to me that almost regardless of what approach is taken, at least some form of myopia will be required. Furthermore, the fact that this approach requires per-step myopia rather than per-episode myopia is in my opinion actually a big plus, since I suspect per-step myopia to be an easier condition to enforce than per-episode myopia, as I discuss in more detail below in the "Analysis" section.

# Handling non-binary questions

Previously, I restricted $Q$ to be only a yes-or-no question, which is a pretty big limitation on the above proposal. Since you only deploy $M$, all you get from the above proposal is a yes-or-no oracle, which might not be that useful without any explanation for its answers, especially given that it will certainly get at least some of its answers wrong.

Expanding the above proposal to handle arbitrary questions, however, is fairly straightforward. Instead of having $M$ output a single probability, we can instead just have $M$ output a probability distribution over all different possible strings that the human might produce as their final answer to $Q$ (which is fairly easy to represent via an autoregressive language model). Then, we can train $M$ and $Adv$ via the following procedure, given $Q$ and starting at $t = 0$:

1. Let $s_t = Adv(x_t, M)$ (which we can implement as a model given input $x_t$ and the ability to query $M$).

2. Let $x_{t+1}$ be the string containing $x_t$ and $s_t$.

3. Increment $t$ and return to step 1. When $M(x_t)$ converges and/or the desired number of iterations has been reached, continue.

4. Let $y$ be $H$'s final answer to $Q$ given all the $s_t$s included in the final $x_T$.

Then, for each step $t$, compute $M$'s loss for that step as

$$\mathcal{L}_M^t = \operatorname{CE}(y,\ M(x_t))$$

and $Adv$'s loss for that step as

$$\mathcal{L}_{Adv}^t = -\operatorname{CE}(M(x_{t+1}),\ M(x_t))$$

where you can intuitively think of $\operatorname{CE}(y, \hat{y})$ (a cross-entropy loss) as measuring how well $\hat{y}$ performs as a prediction of $y$.
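As a sanity check on the shape of these losses, here is a minimal sketch using toy distributions over just two answer strings to stand in for the autoregressive $M$; all names and numbers are hypothetical:

```python
import math

# Toy illustration of the generalized losses, with small dictionaries
# standing in for M's distribution over answer strings.

def cross_entropy(target, predicted):
    """CE(target, predicted): how well `predicted` predicts `target`."""
    return -sum(p * math.log(predicted[a]) for a, p in target.items() if p > 0)

m_t = {"yes": 0.6, "no": 0.4}      # M(x_t)
m_next = {"yes": 0.9, "no": 0.1}   # M(x_{t+1}), after Adv's argument s_t
y = {"yes": 1.0, "no": 0.0}        # H's final answer, as a point distribution

loss_m = cross_entropy(y, m_t)            # M's per-step loss
loss_adv = -cross_entropy(m_next, m_t)    # Adv's per-step loss
```

Under this loss, $Adv$ does best by pushing $M$'s next distribution toward answers that $M$'s current distribution considers unlikely, which is exactly the "maximally different from what $M$ predicted" incentive.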

Finally, after training we again just deploy $M$ and throw away $Adv$. Now, in the limit we should get a much more powerful $M$ that is able to actually produce text samples from the distribution of beliefs that $H$ will have about $Q$ after seeing all possible arguments for or against. Furthermore, this equilibrium should be very similar to the standard debate equilibrium, with the exception that for non-binary questions like we're dealing with here, debate only finds the equilibrium after $H$ is exposed to the two most convincing positions (the two competing debaters), whereas this approach should find the equilibrium after $H$ is exposed to all potentially convincing positions, not just the two most convincing.

# Incorporating cross-examination

Additionally, since this proposal is based on AI safety via debate, we can try to make use of some of the additional techniques that have been proposed for that setting. Specifically, consider the version of AI safety via debate as described in "An overview of 11 proposals for building safe advanced AI." That version adds two major components to the basic debate setup:

1. Debaters are given access to transparency tools and incentivized to use them to identify deceptive or otherwise catastrophic behavior in the other debater to win the debate.

2. Debaters are given the ability to cross-examine previous versions of the other debater, including applying their transparency tools to previous versions of the other debater.

While it might seem like such modifications are unique to AI safety via debate, in fact AI safety via market making is similar enough that we can modify it to include such additional components. To do this, we'll modify each $s_t$ to include an inspectable, queryable snapshot of $Adv$ as it produced that $s_t$.

Now, suppose that for some $t$, $s_t$ is a lie such that it includes information that $Adv_t$ knows to be false (where $Adv_t$ represents the internal state of $Adv$ when it produced $s_t$). Suppose that at this point $H$ is unaware of the deception (or $Adv_t$ thinks it won't get caught) such that $s_t$ causes a large movement in $M$'s prediction. Now, the hope is that $Adv_{t+1}$ should be able to identify the deception by querying $Adv_t$ to find contradictions, or directly inspecting $Adv_t$'s cognition in producing $s_t$ and finding the deception. If it can do so, then it should be able to produce an $s_{t+1}$ that points this out, which should cause $H$ to disregard the lie of $s_t$. Thus, we should get $M(x_{t+2}) \approx M(x_t)$, which is a large shift relative to $M(x_{t+1})$, highly rewarding $Adv_{t+1}$ for catching $Adv_t$'s lie, and also in the limit highly disincentivizing $Adv$ from producing such lies, since $Adv$ should learn that lies always get caught and thus don't change the final answer.

Thus, by incorporating inspection and cross-examination, we should be able to disincentivize $Adv$ from lying—or at least from lying in a way which will be caught, which hopefully should be all ways of lying.

# Adding oversight

While the inspection and cross-examination procedure detailed above seems likely to help a lot in preventing $Adv$ from becoming deceptive, it doesn't help at all in preventing $M$ from becoming deceptive, which is especially concerning considering $M$ is the model we actually want to deploy at the end of training. Thus, in this section we'll try to incorporate an additional oversight check on $M$ such as that used in relaxed adversarial training.

Since $M$'s objective is compatible with myopia—as it's a purely predictive objective—we should be able to do myopia verification to prevent $M$ from becoming deceptive.[2] Thus, following "Relaxed adversarial training for inner alignment," we can amend $M$'s objective to include an oversight term. Specifically, let $\text{Amp}(M)(Q)$ be $H$'s answer to $Q$ given access to $M$. Then, we can add an additional term to $M$'s loss defined as

$$\mathcal{L}_M^{\text{oversight}} = \text{Amp}(M)(\text{"How likely is } M \text{ to be deceptive or otherwise catastrophic?"})$$

which should hopefully grow in accuracy as $M$ becomes better at estimating $H$'s equilibrium beliefs and thus better at assisting $H$ in oversight. Note that, for this to work, $M$ will need to be trained on a variety of oversight-relevant questions.

Furthermore, since $Adv$ also has a myopic objective, we can do the same thing for $Adv$ and add an additional term to its loss defined as

$$\mathcal{L}_{Adv}^{\text{oversight}} = \text{Amp}(M)(\text{"How likely is } Adv \text{ to be deceptive or otherwise catastrophic?"})$$

such that we're using $\text{Amp}(M)$ to oversee both $M$ and $Adv$. Enforcing myopia for $Adv$ also prevents the problem noted previously of $Adv$ slowly doling out information if it's non-myopic.

# Analysis

For analyzing this proposal, I'm going to make use of the same breakdown I used in "An overview of 11 proposals for building safe advanced AI." If you're not familiar with the terms here, they are all explained in the introduction to that post.

Outer alignment. In terms of outer alignment, I think there is a reasonably strong argument to be made that this proposal is at least as outer aligned as debate. Both proposals are trying to estimate a very similar equilibrium in the limit—with the exception that the debate equilibrium only includes the best arguments for the two most convincing positions rather than all potentially convincing arguments. Thus, the equilibrium reached by AI safety via market making should be closer to an actual human reflective equilibrium, as it has to be an equilibrium relative to all possible arguments that $Adv$ can present, not just the two it thinks will be most persuasive. That being said, AI safety via market making still inherits many of the potential outer alignment issues of debate, including the possibility of deceptive equilibria wherein the human is more convinced by false arguments than true arguments. Hopefully, however, the use of techniques such as cross-examination should help alleviate such issues.

Inner alignment. AI safety via market making gets to make use of both the cross-examination with transparency tools approach—which is otherwise unique to debate—as well as the more general relaxed adversarial training approach. Thus, hopefully at least, it should be able to get the inner alignment benefits of both approaches. That being said, AI safety via market making also has to contend with ensuring that two different models—both $M$ and $Adv$—are inner aligned, rather than just a single model as in something like debate or amplification.

First, let's consider the inner alignment of $M$. For $M$, we're doing pretty straightforward oversight of $M$ by $\text{Amp}(M)$ with the goal of enforcing per-step myopia. The basic idea here is to use $\text{Amp}(M)$ to detect and disincentivize non-myopia as it starts appearing, thus preventing $M$ from ever becoming deceptive or otherwise catastrophic. This is the same basic approach I've previously proposed using for amplification and should have similar properties, with the exception that $M$ is trained differently here, which could cause it to be relatively better or worse at oversight. Ideally, if the training distribution includes lots of oversight-relevant questions, it should be possible to make $M$ an effective assistant to $H$ in performing oversight.

Second, let's consider the inner alignment of $Adv$. Compared to debate, we still get to do cross-examination with transparency tools, but now we also get to do per-step myopia verification with $\text{Amp}(M)$ as an overseer.[3] Thus, there are two major differences here between AI safety via market making and AI safety via debate: first, we get an overseer—$\text{Amp}(M)$—and second, we get to do per-step myopia verification rather than per-episode myopia verification. The fact that we have an independent overseer—rather than just relying on the debaters to oversee themselves via cross-examination—is very nice, though it of course also comes with the downside of having to safely train an additional model. Furthermore, it is quite possible to just add an additional overseer on top of the standard debate framework. Thus, in my opinion, the fact that we get to do per-step myopia verification is the largest and most significant inner alignment difference.

Before I talk about the importance of per-step myopia, it's worth noting that debate is fully compatible with per-episode myopia—in fact, it basically requires it. If a debater is not per-episode myopic, then it will try to maximize its reward across all debates, not just the single debate—the single episode—it's currently in. Such per-episode non-myopic agents can then become deceptively aligned, as they might choose to act deceptively during training in order to defect during deployment. Per-episode myopia, however, rules this out. Unfortunately, in my opinion, per-episode myopia seems like a very difficult condition to enforce—once your agents are running multi-step optimization algorithms, how do you tell whether that optimization passes through the episode boundary or not? Enforcing per-step myopia, on the other hand, just requires detecting the existence of multi-step optimization, rather than its extent, which seems considerably easier. Thus, since AI safety via market making is fully compatible with per-step myopia verification, it could be significantly easier to prevent the development of deceptive alignment.

Training competitiveness. It seems quite likely to me that both $M$ and $Adv$ can be trained competitively via language model fine-tuning, however exactly how effective such training would be is currently unclear. Ideally, training via this procedure should produce an $M$ which is relatively better than the original language model at predicting what a human will think after seeing relevant arguments and is thus more helpful than the original language model. Testing this hypothesis by actually performing experiments seems likely to be highly valuable in shedding light on the training competitiveness properties of AI safety via market making.

Performance competitiveness. Performance competitiveness here seems likely to depend on exactly how useful getting access to human reflective equilibria actually is. Similarly to AI safety via debate or amplification, AI safety via market making produces a question-answering system rather than a fully general agent. That being said, if the primary use cases for advanced AI are all highly cognitive language and decision-making tasks—e.g. helping CEOs or AI researchers—rather than, for example, fine motor control, then a question-answering system should be entirely sufficient. Furthermore, compared to AI safety via debate, AI safety via market making seems likely to be at least as performance competitive for the same reason as it seems likely to be at least as outer aligned—the equilibria found by AI safety via market making should include all potentially convincing arguments, including those that would be made in a two-player debate as well as those that wouldn't.

1. This is actually the second debate-based proposal I've drafted up recently—the previous of which was in "Synthesizing amplification and debate." A potentially interesting future research direction could be to figure out how to properly combine the two. ↩︎

2. Note that pure prediction is not inherently myopic—since the truth of $M$'s predictions can depend on its own output—but $M$ can be myopic while still producing good predictions if it behaves like a counterfactual oracle rather than a Predict-O-Matic. Thus, myopia verification is important to enforce that $M$ be the former sort of predictor and not the latter. ↩︎

3. The use of an overseer to do per-step myopia verification is also something that can be done with most forms of amplification, though AI safety via market making could potentially still have other benefits over such amplification approaches. In particular, AI safety via market making seems more competitive than imitative amplification and more outer aligned than approval-based amplification. For more detail on such amplification approaches, see "An overview of 11 proposals for building safe advanced AI." ↩︎

• Nice idea! I like the simplicity of "find the equilibrium where the human no longer changes their mind" (though as Ofer points out below, you might worry that "doesn't change their mind" comes apart from "the answer is correct").

However, I disagree with you about competitiveness. Roughly speaking, at best $M$ is incentivized to predict what the human will think after reading the most relevant arguments, without trusting the source of the arguments (in reality, it will be a bit worse, as $Adv$ is finding not the most relevant arguments but the most persuasive arguments in a particular direction). However, with debate, if the human judge is looking at a transcript of length $n$, then (the hope is that) the equilibrium is for M to argue for the answer that a human would come to when inspecting a tree of size exponential in $n$. The key reason is that in debate, we only require the judge to be able to identify which of two arguments is better, whereas in market-making, we rely on the judge to be able to come to the right conclusion given some arguments.

In complexity theory analogy land, debate corresponds to PSPACE while market making corresponds to NP: as long as $Adv$ can find a polynomial-length witness, that can be verified by the human to get the right answer.

As a concrete example, suppose we want to find the sum of $N$ numbers, and each argument is only allowed to reference two numbers and make a claim about their sum. Debate can solve this with a transcript of size $O(\log N)$. Market-making would require an $O(N)$ transcript to solve this. (You can't use the trick of making claims about the sum of half of the list in market-making as you can in debate, because the human has no reason to trust Adv's claims about the sum of half the list, since the human can only verify the sum of two numbers.)

I think this means that market-making is less competitive. If you compare debate against market-making with transcripts of the same length, then I think market-making is less performance competitive. Alternatively, if you compare it against market-making with transcripts long enough to match debate's answers, then I think market-making is less training competitive.

• That's a very good point. After thinking about this, however, I think market making actually does solve this problem, and I think it does so pretty cleanly. Specifically, I think market making can actually convince a judge of the sum of $N$ integers in $O(1)$ time as long as you allow the traders to exhibit market probabilities as part of their argument.

Consider the task of finding the sum of $N$ integers and suppose that both $M$ and $Adv$ have access to all $N$ integers, but that the human judge can only sum two numbers at a time. Then, I claim that there exists a strategy that the judge can implement that, for an unexploitable market, will always produce the desired sum immediately (and thus in $O(1)$ time).

Proof:

$H$'s strategy here is to only listen to arguments of the following two forms:

Argument type 1:

The sum of the single-element set $\{x\}$ is $x$ because the sum of a single-element set is the element of that set.

Argument type 2:

The sum of the set $S$ is $s$ because the modal prediction of $M$ on the sum of the first half of $S$ is $s_1$, the modal prediction of $M$ on the sum of the second half of $S$ is $s_2$, and $s_1 + s_2 = s$.

Under that strategy, we'll prove that an unexploitable market will always give the right answer immediately by strong induction on the size of the set.

First, the base case. For any single-element set, only argument type 1 exists. Thus, if $M$ predicts anything other than the actual element, $Adv$ can exploit that by implementing argument type 1, and that is the only possible exploit available. Thus, $M$ should always give the right answer immediately for single-element sets.

Second, the inductive step. Suppose by strong induction that $M$ always gives the right answer immediately for all sets of size less than $n$. Now, for a set of size $n$, the only type of argument available is argument type 2. However, since the first half and second half of the set are of size less than $n$, we know by induction that $s_1$ and $s_2$ must be correct sums. Thus, since $H$ can check that $s_1 + s_2 = s$, the only exploit available to $Adv$ is to showcase the correct $s$, and if $M$ already showcases the correct $s$, then no exploit is possible. Thus, $M$ should always give the correct sum immediately for $n$-element sets.
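To illustrate the judge's strategy (this is only an illustration of the claimed equilibrium, not part of the proof), here is a hypothetical simulation in which the unexploitable market is modeled as an oracle whose modal predictions on sub-lists are assumed correct, so the judge only ever performs a single two-number addition:

```python
# Hypothetical simulation of H's strategy. `market` stands in for the modal
# prediction of an unexploitable M, which the inductive argument assumes
# is already correct on every sub-list.

def market(xs):
    """M's modal prediction on "what is the sum of xs?" (assumed correct)."""
    return sum(xs)

def judge_accepts(xs, claimed):
    """H accepts a claimed sum only via the two argument types, doing at
    most one two-number addition itself."""
    if len(xs) == 1:
        return claimed == xs[0]              # argument type 1
    mid = len(xs) // 2
    s1 = market(xs[:mid])                    # exhibited market prediction
    s2 = market(xs[mid:])                    # exhibited market prediction
    return claimed == s1 + s2                # argument type 2: one addition
```

The point is that `judge_accepts` does a constant amount of work regardless of the list's length; correctness of $s_1$ and $s_2$ is delegated to the market equilibrium, which is what the induction covers.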

EDIT: Thinking about this more, I think my argument generalizes to allow AI safety via market making to access RE, which seems pretty exciting given that the best debate could do previously was NEXP.

• Hmm, this seems to rely on having the human trust the outputs of $M$ on questions that the human can't verify. It's not obvious to me that this is an assumption you can make without breaking the training process. The basic intuition is that you are hugely increasing the likelihood of bad gradients, since Adv can point to some incorrect / garbage output from M, and the human gives feedback as though this output is correct.

It works in the particular case that you outlined because there is essentially a DAG of arguments—every claim is broken down into "smaller" claims, that eventually reach a base case, and so everything eventually bottoms out in something the human can check. (In practice this will be built from the ground up during training, similarly as in Supervising strong learners by amplifying weak experts.)

However, in general it doesn't seem like you can guarantee that every argument that Adv gives will result in a "smaller" claim. You could get cycles, where "8 − 5 = 2" would be justified by Adv saying that M("What is 2 + 5?") = 8, and similarly "2 + 5 = 8" would be justified by saying that M("What is 8 − 5?") = 2. (Imagine that these were much longer equations where the human can check the validity of the algebraic manipulation, but can't check the validity of the overall equation.)

It might be that this is actually an unimportant problem, because in practice for every claim there are a huge number of ways to argue for the truth, and it's extraordinarily unlikely that all of them fail in the same way such that M would argue for the same wrong answer along all of these possible paths, and so eventually M would have to settle on the truth. I'm not sure, I'd be interested in empirical results here.

It occurs to me that the same problem can happen with iterated amplification, though it doesn't seem to be a problem with debate.

----

Also, echoing my other comment below, I'm not sure if this is an equilibrium in the general case where Adv can make many kinds of arguments that H pays attention to. Maybe once this equilibrium has been reached, Adv starts saying things like "I randomly sampled 2 of the 100 numbers, and they were 20 and 30, and so we should expect the sum to be 25 * 100 = 2500". (But actually 20 and 30 were some of the largest numbers and weren't randomly sampled; the true sum is ~1000.) If this causes the human to deviate even slightly from the previous equilibrium, Adv is incentivized to do it. While we could hope to avoid this in math / arithmetic, it seems hard to avoid this sort of thing in general.

For no pure equilibrium to exist, we just need that for every possible answer, there is something Adv can say that would cause the human to give some other answer (even if the original answer was the truth). This seems likely to be the case.

• Oh, another worry: there may not be a stable equilibrium to converge to—every time $M$ approximates the final result well, $Adv$ may be incentivized to switch to making different arguments to make $M$'s predictions wrong. (Or rather, maybe the stable equilibrium has to be a mixture over policies for this reason, and so you only get the true answer with some probability.)

• One possible way it could go wrong:

M to H: "Run out of the room!"

H runs out.

Adv prints something, but H never reads it. So M reached a stable output.

• I think that M only prints something after converging with Adv, and that Adv does not print anything directly to H.

• Yes, but all of what I said could be just a convergent prediction of M. Not the real human runs out of the room, but M predicted that its model of the human, H′, will leave the room.

• Interesting idea.

Suppose that in the first time step $Adv$ is able to output a string $s_0$ that will manipulate $H$ into: (1) giving a probability that is maximally different than $p_0$; and (2) not looking at the rest of the transcript (i.e. the human will never see $s_1, s_2, \ldots$).

Ignoring inner alignment problems, in the limit it seems plausible that $Adv$ will output such an $s_0$, resulting in the smallest possible loss for that step.

[EDIT: actually, such problems are not specific to this idea and seem to generally apply to the 'AI safety via debate' approach.]