# A brief note on factoring out certain variables

Jessica Taylor and Chris Olah have a post on “Maximizing a quantity while ignoring effect through some channel”. I’ll briefly present a different way of doing this, and compare the two.

Essentially, the AI’s utility is given by a function $U$ of a variable $W$. The AI’s actions are a random variable $A$, but we want to ‘factor out’ another random variable $B$.

If we have a probability distribution over actions, then, given background evidence $E$, the standard way to maximise $U$ would be to maximise:

• $E(U \mid A, E) = \sum_{b} E(U \mid A, B=b, E)\, P(B=b \mid A, E)$.

The most obvious idea, for me, is to replace $P(B=b \mid A, E)$ with $P(B=b \mid E)$, making $B$ artificially independent of $A$ and giving the expression:

• $\sum_{b} E(U \mid A, B=b, E)\, P(B=b \mid E)$.

If $B$ is dependent on $A$ (if it isn’t, then factoring it out is not interesting), then $P(B \mid E)$ needs some implicit probability distribution over $A$, one which is independent of the AI’s actual decision. So, in essence, this approach relies on two distributions over the possible actions: one that the agent is optimising, and another that is left unoptimised. In terms of Bayes nets, this just seems to be cutting the link from $A$ to $B$.
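To make the contrast concrete, here is a minimal sketch in code. Everything in it is my own toy model (the action names, probabilities, and utilities are invented for illustration): a risky action makes a shutdown button more likely to be pressed, and we compare the standard objective $\sum_b E(U \mid A, b, E) P(b \mid A, E)$ with the factored-out one, where $P(B \mid E)$ is computed under an implicit uniform distribution over actions.

```python
# Toy comparison of the standard objective vs the "factored out" objective.
# All names and numbers are illustrative assumptions, not from the post.

ACTIONS = ["safe", "risky"]

def p_b_given_a(a):
    """P(B = pressed | A = a): the risky action makes the button more likely."""
    return {"safe": 0.1, "risky": 0.9}[a]

def utility(a, b_pressed):
    """Illustrative utility: the risky action pays off, unless B is pressed."""
    if b_pressed:
        return 0.0
    return {"safe": 1.0, "risky": 5.0}[a]

def standard_objective(a):
    """sum_b E(U | a, b) * P(b | a): B's dependence on A is kept."""
    p = p_b_given_a(a)
    return p * utility(a, True) + (1 - p) * utility(a, False)

def factored_objective(a):
    """sum_b E(U | a, b) * P(b): the link from A to B is cut, and
    P(B) is marginalised under an implicit uniform distribution over A."""
    q = {act: 1 / len(ACTIONS) for act in ACTIONS}       # implicit uniform Q
    p = sum(q[act] * p_b_given_a(act) for act in ACTIONS)  # P(B | E)
    return p * utility(a, True) + (1 - p) * utility(a, False)

for a in ACTIONS:
    print(a, standard_objective(a), factored_objective(a))
```

In this toy setup the standard objective prefers the safe action (the risky one mostly gets shut down), while the factored-out objective prefers the risky one, since its effect on the button has been severed.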

Jessica and Chris’s approach also relies on two distributions. But, as far as I understand their approach, the two distributions are taken to be the same, and instead it is assumed that $E(U)$ cannot be improved by changes to the distribution of $A$, if one keeps the distribution of $B$ constant. This has the feel of being a kind of differential condition: the infinitesimal impact on $E(U)$ of changes to $P(A)$ but not $P(B)$ is non-positive.
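That condition might be sketched in symbols as follows; this is my notation and an interpretation, not necessarily their exact formalism. Write $Q$ for the agent’s action distribution and $P_Q(B)$ for the marginal over $B$ that $Q$ induces; the requirement is that perturbing $Q$ while freezing that marginal cannot raise expected utility:

```latex
% Sketch (my notation): Q is a local optimum of expected utility
% when the marginal of B is frozen at the value Q itself induces.
\[
  \left.\frac{d}{d\epsilon}\,
    \mathbb{E}\big[\, U \;\big|\; A \sim Q + \epsilon\,\delta Q,\;
                       B \sim P_Q(B) \,\big]
  \right|_{\epsilon = 0} \;\le\; 0
  \qquad \text{for every admissible perturbation } \delta Q .
\]
```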

I suspect my version might have some odd behaviour (defining the implicit distribution over actions for $P(B \mid E)$ does not seem necessarily natural), but I’m not sure of the transitive properties of the differential approach.

• I think the main problem with using a pre-specified distribution over actions is that, since it doesn’t reflect the AI’s actual behavior, you can’t say much about $E(U)$ in relation to real life. For example, maybe the implicit policy is to take random actions, which results in humans not pressing the shutdown button; therefore, in real life the AI is confident that the button will not be pressed, so it spends no resources looking at the shutdown button to see if it was pressed.

• How about an iterated process? If you have a series of actions $a_1, \dots, a_n$ and observations $o_1, \dots, o_n$, and $\mathcal{A}$ is the set of possible actions, have the AI pick $a_i$ such that:

$a_i = \arg\max_{a \in \mathcal{A}} \sum_{b} E(U \mid a_1 o_1 \dots a_{i-1} o_{i-1}, a, B=b)\, P(B=b \mid a_1 o_1 \dots a_{i-1} o_{i-1}, Q)$

(the $Q$ is not necessary, just a reminder that we are choosing a uniform distribution over the possible actions).

Notice that the AI conditions correctly on all its past actions, but not its current hypothetical. I wonder how this would work if we set this up with the AI typing out messages bit by bit, each time thinking the probability of $B$ was independent of its bit-choice?

EDIT: What this means is that, as soon as it’s done any action that might cause the shutdown button to be pressed, it will then look at the shutdown button (but maybe doesn’t expect, ahead of time, that it will be looking at the button).
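A minimal sketch of how this iterated rule behaves, in an entirely invented toy model (the action is a single bit, and each 1-bit emitted raises the chance the button gets pressed). Because the current bit is marginalised out of $P(B \mid \text{history})$ under a uniform $Q$, the AI keeps emitting risky bits, even as its estimate of a button-press rises with each past bit it correctly conditions on:

```python
# Toy sketch of the iterated proposal. The model is my own assumption:
# actions are bits, and P(B = pressed) grows with the number of 1-bits.

ACTIONS = [0, 1]  # e.g. bits of a message

def p_b_given(past, a):
    """P(B = pressed | past bits, current bit a): each 1-bit adds risk."""
    return min(1.0, 0.1 * (sum(past) + a))

def utility(past, a, b_pressed):
    """Illustrative utility: 1-bits are rewarded, unless B is pressed."""
    return 0.0 if b_pressed else float(sum(past) + a)

def pick_action(past):
    # P(B | past, Q): the *current* bit is integrated out under uniform Q,
    # so the AI's choice cannot (by its own lights) affect the button.
    p_b = sum(p_b_given(past, a) for a in ACTIONS) / len(ACTIONS)
    # argmax_a sum_b E(U | past, a, b) * P(b | past, Q)
    return max(ACTIONS, key=lambda a: p_b * utility(past, a, True)
                                      + (1 - p_b) * utility(past, a, False))

past = []
for _ in range(5):
    past.append(pick_action(past))
print(past)  # the AI emits risky 1-bits every step
```

Note that `p_b` does rise as the history of past bits accumulates (matching the EDIT above: after acting, the AI updates on what it has done), but at each step the current choice is made as if it had no effect on $B$.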

• I think, to get $P(B \mid E)$, you have to have an implicit policy for the entire rest of the game (not just the next action $a_i$).

I like the idea of using the evidence you have so far to inform the $Q$, so you spend more effort on looking at the shutdown button if you expect shutdown might be imminent based on your evidence. Of course, you can combine this with the fixed point thing, so the distribution of $Q$ is the same as the distribution of $A$.

My main concern is that this isn’t reflectively stable. If at an early time step the AI has a certain distribution $Q$, it may want to modify into an agent that fixes this as the correct $Q$ rather than changing it in response to new evidence; this is because it is modelling $B$ as coming independently from $A$.

• Maybe if the pre-specified distribution is a reasonably well-calibrated predictor of the AI (given that distribution)? Like, maybe this is a way that an Oracle AI could help ensure the safety of a somewhat weaker Tool AI.

• [Note: This comment is three years later than the post]

The “obvious idea” here unfortunately seems not to work, because it is vulnerable to so-called “infinite improbability drives”. Suppose $B$ is a shutdown button, and $P(B \mid E)$ gives some weight to $B = \text{pressed}$ and $B = \text{unpressed}$. Then, the AI will benefit from selecting a $Q$ such that it always chooses an action $a^*$, in which it enters a lottery, and if it does not win, then the button $B$ is pushed. In this circumstance, $P(B \mid E)$ is unchanged, while both $E(U \mid a^*, B=\text{pressed}, E)$ and $E(U \mid a^*, B=\text{unpressed}, E)$ allocate almost all of the probability to great outcomes. So the approach will create an AI that wants to exploit its ability to determine $B$.
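A numeric sketch of the failure mode, with invented toy numbers (and simplified to the case where only the $B=\text{unpressed}$ conditional gets inflated): under $a^*$ the button is pressed iff the AI loses a near-impossible lottery, so conditioning on $B = \text{unpressed}$ implies a lottery win, while $P(B \mid E)$ itself is frozen by the factoring.

```python
# Toy numbers (mine) for the "infinite improbability drive":
# under a*, the button is pressed iff the AI loses the lottery.

P_WIN = 1e-6          # chance of winning the lottery
P_B_PRESSED = 0.5     # P(B = pressed | E), frozen by the factoring
U_WIN, U_LOSE = 1e6, 0.0

# Conditionals under a*: unpressed <=> won, pressed <=> lost.
eu_unpressed = U_WIN   # conditioning on "unpressed" forces the improbable win
eu_pressed = U_LOSE

# Factored-out objective: sum_b E(U | a*, b, E) * P(b | E)
factored = P_B_PRESSED * eu_pressed + (1 - P_B_PRESSED) * eu_unpressed

# True expected utility of entering the lottery:
true_eu = P_WIN * U_WIN + (1 - P_WIN) * U_LOSE

print(factored, true_eu)
```

The factored objective scores the lottery action at half the jackpot, while its true expected utility is tiny: the frozen $P(B \mid E)$ never registers that “unpressed” has become an almost-impossible event.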