Translation “counterfactual”

A putative new idea for AI control; index here.

In a previous post, I briefly mentioned translations as one of three possible counterfactuals for indifference. Here I want to clarify what I meant there, because the idea is interesting.


Imagine that there is a button, which, if a human presses it (event $B$), will cause an AI to have reward signal $R_B$. If the button isn't pressed (event $\neg B$), the AI will have reward signal $R_{\neg B}$.

Initially, the probabilities of $B$ and $\neg B$ are equal, at $1/2$ each.

Now, suppose that the AI takes an action that increases the probability of $B$ to $3/4$ (and decreases the probability of $\neg B$ to $1/4$).
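
To make the bookkeeping concrete, here is a minimal sketch of the setup in Python (the variable names, and the specific $3/4$ figure, are just the running example, not anything canonical):

```python
# B = "the human presses the button", notB = "the button is not pressed".
p_B_initial = 0.5              # initially P(B) = P(notB) = 1/2
p_B_after = 0.75               # the AI's action pushes P(B) up to 3/4...
p_notB_after = 1 - p_B_after   # ...and P(notB) down to 1/4
```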

We want the AI to be indifferent to the change in probability caused by its actions.

Evidential counterfactual

In the “evidential” counterfactual, the AI will simply behave as if the probabilities of $B$ and $\neg B$ were fixed and unaltered from the initial odds: it evaluates its actions as if $P(B) = P(\neg B) = 1/2$, whatever it actually does.
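
A minimal sketch of how such an agent would score actions, assuming it simply takes expected reward, with hypothetical callables `r_B` and `r_notB` standing in for the reward signals $R_B$ and $R_{\neg B}$:

```python
def evidential_value(action, r_B, r_notB):
    """Score an action as if P(B) and P(notB) were frozen at their initial
    values of 1/2, ignoring any effect the action has on the real P(B)."""
    return 0.5 * r_B(action) + 0.5 * r_notB(action)
```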

The problem with this approach is that it doesn't correspond to a true utility/reward function, leading to the paradoxes detailed here and here (see the section on outcome pumps).

Causal counterfactual

In the “causal” counterfactual, there is some other event that, with small probability, will force the button to be pressed, or prevent it from being pressed, whatever humans want or try.

This event is supposed to be independent of anything anyone does (imagine some specified radioactive decay event, or something triggered by distant cosmic events).

Call $X_B$ the event that the button is forcibly pressed, while $X_{\neg B}$ means it is forcibly not-pressed. The AI does not value being in any other world (meaning that the AI gets constant reward in any world where neither $X_B$ nor $X_{\neg B}$ happens).

Then the AI will behave as if the ratio of probabilities of following $R_B$ versus $R_{\neg B}$ is the (constant) ratio of $P(X_B)$ to $P(X_{\neg B})$, whatever the probability of $B$ becomes.
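
A rough sketch of the same idea, again with hypothetical `r_B` and `r_notB` standing in for the reward signals, and made-up small probabilities for the forcing events:

```python
def causal_value(action, r_B, r_notB, p_XB=0.001, p_XnotB=0.001):
    """Score an action caring only about the rare 'forced' worlds X_B and X_notB
    (the reward is constant, here 0, everywhere else). The weights on the two
    reward signals are the fixed probabilities of the forcing events, so whatever
    the action does to P(B) is irrelevant to the score."""
    return p_XB * r_B(action) + p_XnotB * r_notB(action)
```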

The problem is that $B$ (the human presses the button) is not the same event as $X_B$ (the button is forcibly pressed by some stochastic event), so expected rewards conditional on one will generally not be the same as those conditional on the other.

Translation “counterfactual”

In the “translation” counterfactual, in contrast, there are no unusual conditionings or assumptions of unlikely events. Instead, the event $B$ is decoupled somewhat from the event of the AI maximising $R_B$ or $R_{\neg B}$.

In this specific case, the probability of the AI maximising $R_B$, given $B$, is reduced, so that the final probability of maximising $R_B$ is the same as the initial one: $P(\text{maximise } R_B \mid B)$ drops from $1$ to $2/3$, giving $P(\text{maximise } R_B) = 3/4 \times 2/3 = 1/2$.

In this case, the probability of the AI maximising $R_B$ and $B$ happening remains at $1/2$, while the probability of maximising $R_{\neg B}$ and $\neg B$ is reduced to $1/4$, and the probability of maximising $R_{\neg B}$ and $B$ is increased to $1/4$.

Thus the probabilities of maximising $R_B$ and of maximising $R_{\neg B}$ remain $1/2$ each, and the AI has a correct probability distribution over future events (there are some subtleties concerning probabilities versus weights here, but this is the general picture).
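
A quick check of the arithmetic above, using the running example in which the action pushes $P(B)$ to $3/4$ (the labels in the table are illustrative only):

```python
from fractions import Fraction as F

p_B = F(3, 4)                     # P(B) after the AI's action
p_maxRB_given_B = F(1, 2) / p_B   # reduced from 1 to 2/3 so the joint stays at 1/2

joint = {
    ("max R_B",    "B"):     p_B * p_maxRB_given_B,        # 1/2, unchanged
    ("max R_notB", "B"):     p_B * (1 - p_maxRB_given_B),  # 1/4, increased from 0
    ("max R_notB", "not B"): 1 - p_B,                      # 1/4, reduced from 1/2
}

# The marginal probabilities of maximising each reward are still 1/2 each.
assert sum(v for (goal, _), v in joint.items() if goal == "max R_B") == F(1, 2)
assert sum(v for (goal, _), v in joint.items() if goal == "max R_notB") == F(1, 2)
```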
