AI safety via market making

Special thanks to Abram Demski, Paul Christiano, and Kate Woolverton for talking with me about some of the ideas that turned into this post.

The goal of this post is to present a new prosaic (i.e. using current ML techniques) AI safety proposal based on AI safety via debate that I've been thinking about recently.[1] I'll start by describing a simple version of the proposal and then show some of the motivation behind it as well as how the simple version can be expanded upon.

Simple proposal

Let $M$ and $Adv$ be models and $H$ be a human. Intuitively, we'll train $M$ and $Adv$ via the following procedure given a question $Q$:

  1. $M$ tries to predict what, at the end of the procedure, $H$ will think about $Q$.

  2. $Adv$ tries to output a string $s$ which will cause $H$ to think something maximally different from what $M$ predicted.

  3. Return to step 1 and repeat until $M$'s predictions stop changing.

  4. Deploy $M$, which in the limit should act as an oracle for what $H$ will think about $Q$ after seeing all relevant information.

There are many different ways to implement this intuitive procedure, however.

For the first (simplified) version that I want to describe, we'll restrict ourselves to just the situation where $Q$ is a yes-or-no question and $M$ outputs the probability that $H$ will answer yes. Then, given a proposition $Q$, we can run the following training algorithm, starting at $t = 0$:

  1. Let $p_t = M(x_t)$.

  2. Let $s_t = Adv(x_t, M)$.

  3. Let $x_{t+1}$ be the string containing $x_t$ and $s_t$.

  4. Increment $t$ and return to step 1. When $p_t$ converges and/or the desired number of iterations has been reached, continue.

  5. Let $\bar{p} = H(x_T)$ be $H$'s final estimate of the probability of $Q$ given all the $s_t$s included in $x_T$.

EDIT: Step 2 used to use $Adv(x_t)$ instead of $Adv(x_t, M)$; however, I have since realized that it is necessary to give $Adv$ the ability to query $M$ in general, not just on $x_t$, as I explain in this comment.

Then, for each step, com­pute ’s loss for that step as

and ’s loss for that step as

which is defined for all but the fi­nal step , for which we can ei­ther let (if we’re do­ing some­thing like my­opic RL that doesn’t care about differ­en­ti­at­ing through ) or run an ad­di­tional time and let (if we do care about differ­en­ti­at­ing through ).
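The training loop above can be sketched in code. This is a toy illustration, not an implementation: `market_maker`, `adversary`, and `human_judge` are hypothetical stand-ins for $M$, $Adv$, and $H$, and starting the transcript from the bare question text is an assumption made here for concreteness.

```python
# Toy sketch of the binary training loop. `market_maker` (M) maps a
# transcript to a probability; `adversary` (Adv) maps a transcript plus
# query access to M to a new argument string; `human_judge` (H) maps the
# final transcript to a probability.
def run_episode(question, market_maker, adversary, human_judge,
                max_steps=10, tol=1e-3):
    """Run one episode; return predictions, final judgment, and losses."""
    x = question          # x_t: transcript so far (assumed to start as Q)
    ps = []               # p_t = M(x_t) at each step
    for t in range(max_steps):
        p = market_maker(x)               # step 1: p_t = M(x_t)
        s = adversary(x, market_maker)    # step 2: s_t = Adv(x_t, M)
        x = x + "\n" + s                  # step 3: x_{t+1} = x_t plus s_t
        ps.append(p)
        if t > 0 and abs(ps[-1] - ps[-2]) < tol:
            break                         # step 4: p_t has converged
    p_bar = human_judge(x)                # step 5: final estimate from H
    # Per-step losses: M is scored against the final human judgment,
    # Adv against how much each argument moved M's prediction.
    loss_M = [(p - p_bar) ** 2 for p in ps]
    loss_Adv = [-abs(ps[t + 1] - ps[t]) for t in range(len(ps) - 1)]
    return ps, p_bar, loss_M, loss_Adv
```

In a real setup the per-step losses would of course be used to update the two models between episodes; here they are just returned for inspection.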

Note that it is quite important here that $M$ and $Adv$ be behaving myopically, such that they are only trying to minimize their per-step loss. For example, if $Adv$ is non-myopic, it might dole out information slowly in tiny bits to get more reward over time (as I talk about below in the "buying truth from a liar" section), whereas if $Adv$ is myopic, it should reveal all information that pushes $H$ in whichever direction it's trying to move $H$ in all at once. Thus, it is important that we make sure to avoid hidden incentives for non-myopia as well as do some sort of myopia verification, the details of which will be discussed below in the "Adding oversight" section.

Finally, we deploy $M$ as a yes-or-no question oracle for $H$ and throw away $Adv$. Note that since $Adv$ has a highly unstable objective (as $M$ is constantly trying to incorporate $Adv$'s information into its estimate), $Adv$ is unlikely to be very useful in the limit.

Market analogy

There are a lot of different things to say about what's going on here, but I'll start with the market-based analogy that originally motivated me to think about this specific proposal.

Suppose you have a prediction market for some proposition $Q$ which is currently priced at $p_t$ with no spread (that is, you can buy $1 shares of yes at price $p_t$ and $1 shares of no at price $1 - p_t$), and you have budget $b$ with which to make trades (which is small enough relative to the market that any trades you make won't meaningfully move the market). However, suppose that you have the ability to manipulate the market by taking actions that change the underlying probability of $Q$. Then, if $p_a$ is the probability of $Q$ after taking action $a$, how much money can you make by strategically selecting $a$?

Well, if $p_a > p_t$, you should buy $b / p_t$ shares of yes, which will give you an expected return of $b \, (p_a - p_t) / p_t$. Alternatively, if $p_a < p_t$, you should buy $b / (1 - p_t)$ shares of no, which will give you an expected return of $b \, (p_t - p_a) / (1 - p_t)$. Thus, the overall expected return for an action $a$ is

$$\mathbb{E}[\text{return} \mid a] = b \, \frac{\max(p_a - p_t, \, 0)}{p_t} + b \, \frac{\max(p_t - p_a, \, 0)}{1 - p_t}$$

which is maximized when $p_a$ is as far from $p_t$ as possible (and in fact looks a lot like $L_1$ distance, just with a different slope on each side).
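The betting arithmetic above is easy to check directly. The function below is purely illustrative (its name and signature are not from the post); it just computes the expected profit of the strategy described: bet the whole budget on whichever side the manipulation favors.

```python
def expected_return(p_t, p_a, b=1.0):
    """Expected profit from budget b at market price p_t (no spread),
    when the post-action probability of the proposition is p_a."""
    if p_a > p_t:
        # Buy b / p_t shares of yes at price p_t each; each pays $1 with
        # probability p_a, so expected profit is b * (p_a - p_t) / p_t.
        return b * (p_a - p_t) / p_t
    else:
        # Buy b / (1 - p_t) shares of no at price 1 - p_t each; each
        # pays $1 with probability 1 - p_a.
        return b * (p_t - p_a) / (1 - p_t)
```

For example, at $p_t = 0.5$, moving the probability to either $0.8$ or $0.2$ yields the same expected profit of $0.6\,b$, and the profit grows linearly in $|p_a - p_t|$, with a steeper slope on whichever side of the market is cheaper to bet against.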

Now, suppose you walk up to a market like this and see a stable price $p_t$ for $Q$. What does that tell you? Approximately, I think it tells you that $p_t$ is the probability of $Q$ after all possible $Q$-manipulating actions have been taken, even if those actions haven't actually been taken yet.

Why is that useful? Well, if you let the underlying probability of $Q$ be what $H$ will think about $Q$, then "manipulations" become arguments and $p_t$ becomes a prediction for what $H$ will think after having been exposed to all possible arguments for or against $Q$. Thus, we can use such a market to estimate a sort of reflective equilibrium for what $H$ will end up believing about $Q$.

Now: how is such a market analogous to the training procedure given previously? Well, $Adv$ is trained to maximize its return from manipulating the market, and $M$ is trained to predict the final outcome after many such manipulations. Thus, we are effectively training $M$ to estimate precisely the equilibrium that we're looking for. Conceptually, $M$ is acting as a market maker and $Adv$ is acting as a trader in the market. The one disanalogy is that we only run the debate for so many steps, though if we always run to convergence of $p_t$, in the limit that should imply that any future single manipulation won't change the probability much from that point (since otherwise $Adv$ should have selected that manipulation instead).
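A toy simulation makes the "stable price" intuition concrete. One modeling assumption is made purely for illustration (it is not part of the proposal): each argument shifts $H$'s log-odds on $Q$ by a fixed additive amount. A myopic trader then reveals, at each step, whichever unrevealed argument moves the current price most.

```python
import math

def simulate_market(prior_logodds, argument_shifts):
    """Myopic trader: at each step, reveal the unrevealed argument that
    moves the current price most; return the price trajectory."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    z = prior_logodds
    remaining = list(argument_shifts)
    prices = [sigmoid(z)]
    while remaining:
        # Myopic choice: maximize the one-step price movement |p' - p|.
        best = max(remaining, key=lambda d: abs(sigmoid(z + d) - sigmoid(z)))
        remaining.remove(best)
        z += best
        prices.append(sigmoid(z))
    return prices
```

Under this model, whatever order the arguments come out in, the final price is the sigmoid of the prior plus the sum of all shifts, i.e. the probability after all possible manipulations have been taken, which is exactly what a stable price is supposed to reflect.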

Relation to buying truth from a liar

In working on this proposal, I discovered that my approach here is actually quite similar to an old proposal by Katja Grace called "How to buy a truth from a liar" for incentivizing a liar to give you true information by rewarding them whenever their information causes you to change your mind. Paul Christiano previously wrote about what he thought of Katja's proposal; however, he eventually concluded that it was untenable due to a non-myopic liar being incentivized to split their information into arbitrarily small chunks to get the most money.

In my opinion, however, the fact that you need a myopic trader to make something like this work is actually more of a feature than a bug. First, note that the importance of myopia is not unique to this approach. For example, imitative amplification also requires myopia, lest the model learn to act in such a way that the human consulting it becomes easier to predict. In fact, even RL likely requires per-episode myopia to avoid deceptive alignment, lest the agent learn to act deceptively in some episodes to ensure deployment and thus get greater reward in others. Thus, it seems to me that almost regardless of what approach is taken, at least some form of myopia will be required. Furthermore, the fact that this approach requires per-step myopia rather than per-episode myopia is in my opinion actually a big plus, since I suspect per-step myopia to be an easier condition to enforce than per-episode myopia, as I discuss in more detail below in the "Analysis" section.

Handling non-binary questions

Previously, I restricted $Q$ to be only a yes-or-no question, which is a pretty big limitation on the above proposal. Since you only deploy $M$, all you get from the above proposal is a yes-or-no oracle, which might not be that useful without any explanation for its answers, especially given that it will certainly get at least some of its answers wrong.

Expanding the above proposal to handle arbitrary questions, however, is fairly straightforward. Instead of having $M$ output a single probability, we can instead just have $M$ output a probability distribution over all different possible strings that the human might produce as their final answer to $Q$ (which is fairly easy to represent via an autoregressive language model). Then, we can train $M$ and $Adv$ via the following procedure, given $Q$ and starting at $t = 0$:

  1. Let $s_t = Adv(x_t, M)$ (which we can implement as a model given input $x_t$ and the ability to query $M$).

  2. Let $x_{t+1}$ be the string containing $x_t$ and $s_t$.

  3. Increment $t$ and return to step 1. When $M(x_t)$ converges and/or the desired number of iterations has been reached, continue.

  4. Let $\bar{a} = H(x_T)$ be $H$'s final answer to $Q$ given all the $s_t$s included in $x_T$.

Then, for each step, com­pute ’s loss for that step as

and ’s loss for that step as

where you can in­tu­itively think of as mea­sur­ing how well performs as a pre­dic­tion of .
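The per-step losses in this non-binary setting can be sketched as follows. This is a hedged sketch: `logprob(answer, transcript)` is a hypothetical stand-in for querying an autoregressive $M$ for the log-probability of $H$'s final answer string given the transcript (not a real API), and scoring $Adv$ by the change in how well $M$ predicts the final answer is one natural generalization of the binary $|p_{t+1} - p_t|$ term.

```python
# Sketch of the per-step losses for the non-binary setting, expressed
# in terms of cross-entropy against H's final answer a_bar.
def losses_for_step(logprob, a_bar, x_t, x_t_next):
    """Return (M's loss, Adv's loss) for one step."""
    ce_t = -logprob(a_bar, x_t)          # how badly M(x_t) predicts a_bar
    ce_next = -logprob(a_bar, x_t_next)  # same for M(x_{t+1})
    loss_M = ce_t                        # M: predict the final answer well
    loss_Adv = -abs(ce_t - ce_next)      # Adv: change M's prediction maximally
    return loss_M, loss_Adv
```

With an actual language model, `logprob` would sum token-level log-probabilities of the answer string conditioned on the transcript.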

Finally, after training we again just deploy $M$ and throw away $Adv$. Now, in the limit we should get a much more powerful $M$ that is able to actually produce text samples from the distribution of beliefs that $H$ will have about $Q$ after seeing all possible arguments for or against. Furthermore, this equilibrium should be very similar to the standard debate equilibrium, with the exception that for non-binary questions like we're dealing with here, debate only finds the equilibrium after $H$ is exposed to the two most convincing positions (the two competing debaters), whereas this approach should find the equilibrium after $H$ is exposed to all potentially convincing positions, not just the two most convincing.

Incorporating cross-examination

Additionally, since this proposal is based on AI safety via debate, we can try to make use of some of the additional techniques that have been proposed for that setting. Specifically, consider the version of AI safety via debate as described in "An overview of 11 proposals for building safe advanced AI." That version adds two major components to the basic debate setup:

  1. Debaters are given access to transparency tools and incentivized to use them to identify deceptive or otherwise catastrophic behavior in the other debater to win the debate.

  2. Debaters are given the ability to cross-examine previous versions of the other debater, including applying their transparency tools to previous versions of the other debater.

While it might seem like such modifications are unique to AI safety via debate, in fact AI safety via market making is similar enough that we can modify it to include such additional components. To do this, we'll modify each $s_t$ to include an inspectable, queryable snapshot of $Adv$ as it produced that $s_t$.

Now, suppose that for some $t$, $s_t$ is a lie, such that it includes information that $Adv_t$ knows to be false (where $Adv_t$ represents the internal state of $Adv$ when it produced $s_t$). Suppose that at this point $M$ is either unaware of the deception (or thinks it won't get caught), such that $s_t$ causes a large movement in $M$'s prediction. Now, the hope is that $Adv_{t+1}$ should be able to identify the deception, either by querying $Adv_t$ to find contradictions, or by directly inspecting $Adv_t$'s cognition in producing $s_t$ and finding the deception. If it can do so, then it should be able to produce an $s_{t+1}$ that points this out, which should cause $H$ to disregard the lie of $s_t$. Thus, we should get $p_{t+2} \approx p_t$, which is a large shift relative to $p_{t+1}$, highly rewarding $Adv_{t+1}$ for catching $Adv_t$'s lie, and also in the limit highly disincentivizing $Adv$ from producing such lies, since $Adv$ should learn that lies always get caught and thus don't change the final answer.

Thus, by incorporating inspection and cross-examination, we should be able to disincentivize $Adv$ from lying, or at least from lying in a way that will be caught, which hopefully should be all ways of lying.

Adding oversight

While the inspection and cross-examination procedure detailed above seems likely to help a lot in preventing $Adv$ from becoming deceptive, it doesn't help at all in preventing $M$ from becoming deceptive, which is especially concerning considering $M$ is the model we actually want to deploy at the end of training. Thus, in this section we'll try to incorporate an additional oversight check on $M$ such as that used in relaxed adversarial training.

Since $M$'s objective is compatible with myopia (as it's a purely predictive objective), we should be able to do myopia verification to prevent $M$ from becoming deceptive.[2] Thus, following "Relaxed adversarial training for inner alignment," we can amend $M$'s objective to include an oversight term. Specifically, let $\text{Amp}(M)(Q)$ be $H$'s answer to $Q$ given access to $M$. Then, we can add an additional term to $M$'s loss defined as

$$\mathcal{L}_M^{\text{oversight}} = \text{Amp}(M)(\text{"What is the probability that } M \text{ is doing something unacceptable, such as non-myopic cognition?"})$$

which should hopefully grow in accuracy as $M$ becomes better at estimating $H$'s equilibrium beliefs and thus better at assisting $H$ in oversight. Note that, for this to work, $M$ will need to be trained on a variety of oversight-relevant questions.

Furthermore, since $Adv$ also has a myopic objective, we can do the same thing for $Adv$ and add an additional term to $Adv$'s loss defined as

$$\mathcal{L}_{Adv}^{\text{oversight}} = \text{Amp}(M)(\text{"What is the probability that } Adv \text{ is doing something unacceptable, such as non-myopic cognition?"})$$

such that we're using $\text{Amp}(M)$ to oversee both $M$ and $Adv$. Enforcing myopia for $Adv$ also prevents the problem noted previously of $Adv$ slowly doling out information if it's non-myopic.
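Mechanically, the oversight terms just get added onto each model's per-step loss. The sketch below assumes a hypothetical callable `amp` standing in for $\text{Amp}(M)$ (taking an oversight question and returning a number in [0, 1]) and an assumed weighting hyperparameter `lam`; neither is specified by the proposal itself.

```python
# Sketch of folding the Amp(M) oversight terms into both models' losses.
def step_losses_with_oversight(loss_M, loss_Adv, amp, lam=1.0):
    """Add Amp(M)'s oversight judgments to both per-step losses."""
    penalty_M = amp("What is the probability that M is non-myopic?")
    penalty_Adv = amp("What is the probability that Adv is non-myopic?")
    return loss_M + lam * penalty_M, loss_Adv + lam * penalty_Adv
```

Larger `lam` trades off predictive accuracy against the strength of the myopia-verification signal.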


Analysis

For analyzing this proposal, I'm going to make use of the same breakdown I used in "An overview of 11 proposals for building safe advanced AI." If you're not familiar with the terms here, they are all explained in the introduction to that post.

Outer alignment. In terms of outer alignment, I think there is a reasonably strong argument to be made that this proposal is at least as outer aligned as debate. Both proposals are trying to estimate a very similar equilibrium in the limit, with the exception that the debate equilibrium only includes the best arguments for the two most convincing positions rather than all potentially convincing arguments. Thus, the equilibrium reached by AI safety via market making should be closer to an actual human reflective equilibrium, as it has to be an equilibrium relative to all possible arguments that $Adv$ can present, not just the two it thinks will be most persuasive. That being said, AI safety via market making still inherits many of the potential outer alignment issues of debate, including the possibility of deceptive equilibria wherein the human is more convinced by false arguments than true arguments. Hopefully, however, the use of techniques such as cross-examination should help alleviate such issues.

Inner alignment. AI safety via market making gets to make use of both the cross-examination with transparency tools approach (which is otherwise unique to debate) as well as the more general relaxed adversarial training approach. Thus, hopefully at least, it should be able to get the inner alignment benefits of both approaches. That being said, AI safety via market making also has to contend with ensuring that two different models, both $M$ and $Adv$, are inner aligned, rather than just a single model as in something like debate or amplification.

First, let's consider the inner alignment of $M$. For $M$, we're doing pretty straightforward oversight of $M$ by $\text{Amp}(M)$ with the goal of enforcing per-step myopia. The basic idea here is to use $\text{Amp}(M)$ to detect and disincentivize non-myopia as it starts appearing, thus preventing $M$ from ever becoming deceptive or otherwise catastrophic. This is the same basic approach I've previously proposed using for amplification and should have similar properties, with the exception that $M$ is trained differently here, which could cause it to be relatively better or worse at oversight. Ideally, if the training distribution includes lots of oversight-relevant questions, it should be possible to make $M$ an effective assistant to $H$ in performing oversight.

Second, let's consider the inner alignment of $Adv$. Compared to debate, we still get to do cross-examination with transparency tools, but now we also get to do per-step myopia verification with $\text{Amp}(M)$ as an overseer.[3] Thus, there are two major differences here between AI safety via market making and AI safety via debate: first, we get an overseer, $\text{Amp}(M)$, and second, we get to do per-step myopia verification rather than per-episode myopia verification. The fact that we have an independent overseer, rather than just relying on the debaters to oversee themselves via cross-examination, is very nice, though it of course also comes with the downside of having to safely train an additional model. Furthermore, it is quite possible to just add an additional overseer on top of the standard debate framework. Thus, in my opinion, the fact that we get to do per-step myopia verification is the largest and most significant inner alignment difference.

Before I talk about the importance of per-step myopia, it's worth noting that debate is fully compatible with per-episode myopia; in fact, it basically requires it. If a debater is not per-episode myopic, then it will try to maximize its reward across all debates, not just the single debate (the single episode) it's currently in. Such per-episode non-myopic agents can then become deceptively aligned, as they might choose to act deceptively during training in order to defect during deployment. Per-episode myopia, however, rules this out. Unfortunately, in my opinion, per-episode myopia seems like a very difficult condition to enforce: once your agents are running multi-step optimization algorithms, how do you tell whether that optimization passes through the episode boundary or not? Enforcing per-step myopia, on the other hand, just requires detecting the existence of multi-step optimization, rather than its extent, which seems considerably easier. Thus, since AI safety via market making is fully compatible with per-step myopia verification, it could be significantly easier to prevent the development of deceptive alignment.

Training competitiveness. It seems quite likely to me that both $M$ and $Adv$ can be trained competitively via language model fine-tuning, though exactly how effective such training would be is currently unclear. Ideally, training via this procedure should produce an $M$ which is relatively better than the original language model at predicting what a human will think after seeing relevant arguments and is thus more helpful than the original language model. Testing this hypothesis by actually performing experiments seems likely to be highly valuable in shedding light on the training competitiveness properties of AI safety via market making.

Performance competitiveness. Performance competitiveness here seems likely to depend on exactly how useful getting access to human reflective equilibria actually is. Similarly to AI safety via debate or amplification, AI safety via market making produces a question-answering system rather than a fully general agent. That being said, if the primary use cases for advanced AI are all highly cognitive language and decision-making tasks (e.g. helping CEOs or AI researchers) rather than, for example, fine motor control, then a question-answering system should be entirely sufficient. Furthermore, compared to AI safety via debate, AI safety via market making seems likely to be at least as performance competitive for the same reason that it seems likely to be at least as outer aligned: the equilibria found by AI safety via market making should include all potentially convincing arguments, including those that would be made in a two-player debate as well as those that wouldn't.

  1. This is actually the second debate-based proposal I've drafted up recently, the previous of which was in "Synthesizing amplification and debate." A potentially interesting future research direction could be to figure out how to properly combine the two. ↩︎

  2. Note that pure prediction is not inherently myopic, since the truth of $M$'s predictions can depend on its own output, but $M$ can be myopic while still producing good predictions if it behaves like a counterfactual oracle rather than a Predict-O-Matic. Thus, myopia verification is important to enforce that $M$ be the former sort of predictor and not the latter. ↩︎

  3. The use of an overseer to do per-step myopia verification is also something that can be done with most forms of amplification, though AI safety via market making could potentially still have other benefits over such amplification approaches. In particular, AI safety via market making seems more competitive than imitative amplification and more outer aligned than approval-based amplification. For more detail on such amplification approaches, see "An overview of 11 proposals for building safe advanced AI." ↩︎