All the indifference designs

A putative new idea for AI control; index here.

The aim of this post is simply to collect all the different “indifference” proposals that I know of. Let me know if I’ve missed any!

The basic idea is that the agent is following policy $\pi_1$ with reward $R_1$, and there is a button that, if pressed (event $B$), will change the agent to following policy $\pi_2$ with reward $R_2$.
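This setup can be made concrete with a toy simulation; all the names here ($\pi_1$, $R_1$, and so on) are illustrative stand-ins, not from any library:

```python
def run_agent(button_pressed_at, horizon, pi_1, pi_2):
    """Toy model of the setup: the agent runs pi_1 until the button
    event B occurs at time `button_pressed_at`, then runs pi_2.
    Returns the sequence of (policy_name, action) pairs."""
    trace = []
    for t in range(horizon):
        if t < button_pressed_at:       # event B has not happened yet
            trace.append(("pi_1", pi_1(t)))
        else:                           # after B: agent switches to pi_2
            trace.append(("pi_2", pi_2(t)))
    return trace

# A button press at t=2 switches the agent's behaviour mid-episode:
trace = run_agent(button_pressed_at=2, horizon=4,
                  pi_1=lambda t: "work", pi_2=lambda t: "shutdown")
```

The whole indifference problem is about making the agent neither seek nor avoid that switch.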

Policy versus reward

My initial proposal considered only change of utility functions—similar to reward functions—and assumed the policy would always be to maximise that utility.

Conversely, the Safely Interruptible Agents paper just looked at policy changes, while sticking to the same reward function.

In general, policy and reward changes can be treated similarly—what matters is the change between $V(\pi_1, R_1)$ (the expected value of $R_1$ under $\pi_1$, as estimated by the agent) and $V(\pi_2, R_2)$.

Indeed, you could just treat it as a policy change, with the added caveat that $\pi_2$ also ‘chooses’ to claim its reward from $R_2$.

Method: Bias penalty or indifference

If we have an initial estimate $p$ for the probability of $B$ (hence of the agent following $\pi_2$), then if the agent takes an action $a$, this probability changes to $p_a$.

One way of fighting this is to penalise the agent for large changes $|p_a - p|$. In the terms of this post, the agent is taking biased actions.

And to combat large biases, the agent could get an extra bias penalty, as explained in the Towards Interactive Inverse Reinforcement Learning poster.
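A minimal sketch of such a penalty, assuming a scalar default probability $p$ and an action-dependent $p_a$ (the function name and penalty form are illustrative assumptions, not the poster’s exact construction):

```python
def biased_value(expected_reward, p_a, p_default, penalty_coeff=1.0):
    """Value of an action minus a penalty for shifting the button probability.

    The agent can still act, but actions that move P(B) away from its
    default estimate are taxed -- a bias *penalty*, not indifference."""
    return expected_reward - penalty_coeff * abs(p_a - p_default)

# An action that gains reward by disabling the button (driving p_a to 0)
# loses out once the penalty is counted:
manipulative = biased_value(1.0, p_a=0.0, p_default=0.5, penalty_coeff=3.0)
honest       = biased_value(0.6, p_a=0.5, p_default=0.5, penalty_coeff=3.0)
```

The penalty coefficient trades off task performance against bias; indifference approaches instead aim to remove the incentive entirely.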

But the rest of this post will focus on bias-indifference, rather than bias penalty.

Counterfactual: evidential, causal, translation

When comparing $V(\pi_1, R_1)$ and $V(\pi_2, R_2)$, it’s important to note that this is a counterfactual comparison: it’s comparing what would have happened under $\pi_1$ versus $\pi_2$, but only one of those policies is actually going to happen, depending on whether the button press $B$ happens.

Evidential counterfactual

The first idea is to have some sort of fixed probability $q$ for $B$ versus $\neg B$. And then the agent acts as if this probability cannot be changed by anything the agent does.

Call this the “evidential” counterfactual. The problem with it is that it causes the agent to want to manipulate conditional probabilities (similarly to how evidential decision theory can desire to manipulate the news). Basically, if the agent can decrease the probability of $B$ while making the reward for following $\pi_2$ much larger, it will do so (as it doesn’t track the reduction of probability, by construction).
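The failure mode can be seen in a small numeric sketch, assuming a fixed mixing weight $q$ in the agent’s evaluation (the function and values are hypothetical illustrations):

```python
def evidential_value(q_fixed, value_pi1, value_pi2):
    """The agent's evaluation under the 'evidential' counterfactual:
    the mixing probability q is held fixed, whatever the agent does."""
    return (1 - q_fixed) * value_pi1 + q_fixed * value_pi2

# Suppose an action in fact drives P(B) to near zero while inflating the
# post-press value V(pi_2, R_2). The agent's evaluation still credits the
# inflated value at the fixed weight q, so manipulation looks attractive:
honest      = evidential_value(q_fixed=0.5, value_pi1=1.0, value_pi2=1.0)
manipulated = evidential_value(q_fixed=0.5, value_pi1=1.0, value_pi2=10.0)
```

Because $q$ never updates on the agent’s own actions, the evaluation rewards inflating $V(\pi_2, R_2)$ even in worlds where $B$ has been made nearly impossible.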

The factoring out variables approach seems to use this evidential counterfactual, and thus suffers from this problem.

Causal counterfactual

My original idea was that there was some low-probability event $X$ such that if $B$ happened AND $X$ happened, then the utility/policy would not change.

The problem with this is that, as Eliezer and Jessica clarified, this contrasts $B \wedge \neg X$ with $B \wedge X$, whereas we really want to contrast $B$ with $\neg B$.

Now, if $X$ is of low probability, $B \wedge \neg X \approx B$, except for bets that turn on the value of $X$. However, the difference between $\neg B$ and $B \wedge X$ is more problematic. In the first case, humans have chosen not to press the button; in the second, humans have tried to press the button but failed to change the agent’s utility. Obviously human behaviour can be very different in the two cases, translating to a very different value estimation.

Weight Translation

This method relies on some sort of initial ‘default’ probability of $B$; call this $p$. Then the agent considers an action $a$, and this probability changes to $p_a$.

Thus after $a$, the agent expects to be maximising reward $R_2$ with probability $p_a$, and maximising $R_1$ with probability $1 - p_a$.

Let $w_1$ and $w_2$ be the ultimate expected weights of the rewards $R_1$ and $R_2$. By ultimate expected weights, we mean the expected weights after $B$ or $\neg B$; what the weights themselves are will be defined in later sections. The default expected weights are $1-p$ and $p$, while the expected weights given $a$ are $1-p_a$ and $p_a$.

Then the translation approach wants to map the weights back to the default. Let $w_1'$ and $w_2'$ be the adjusted weights; then:

  • $w_1' = w_1 + (p_a - p)$ and $w_2' = w_2 + (p - p_a)$.
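A minimal numeric sketch of this translation, assuming the adjustment adds $(p_a - p)$ to $R_1$’s weight and $(p - p_a)$ to $R_2$’s weight (the function name is hypothetical):

```python
def translated_weights(w1, w2, p_default, p_a):
    """Map the expected weights back to the default after action a:
    add (p_a - p) to R_1's weight and (p - p_a) to R_2's weight."""
    return w1 + (p_a - p_default), w2 + (p_default - p_a)

# If an action shifts P(B) from 0.5 to 0.75, the raw expected weights
# become (0.25, 0.75); the translation restores the default (0.5, 0.5):
w1p, w2p = translated_weights(w1=0.25, w2=0.75, p_default=0.5, p_a=0.75)
```

In expectation the agent’s reward mix is then the same whatever it does to $P(B)$, which is the indifference property being sought.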

The rest of this post will focus mainly on the weight translation “counterfactual”.

Update defaults: none, or observations only

The bias penalty, the evidential counterfactual, and the translation approach all depend on some default assessment of the probability of $B$. One could either set some fixed probability at the beginning and never update it, or allow it to be updated only by observations, and not by the agent’s own actions.

Translation type: simple, mean, or specific

Given a prior history $h$, let $p$ be the default probability of $B$ and $p_a$ be the probability of $B$ given that the agent takes action $a$. Let the current weights of $R_1$ and $R_2$ be $w_1$ and $w_2$.

The translation approach aims to correct the expected weights of $R_1$ and $R_2$ from $(1-p_a, p_a)$ back to $(1-p, p)$, by adding a translation vector $T_o$ to the weights after each observation. By definition we know that, if $o$ is the next observation and $p_o$ is the probability of $o$ given $h$ and $a$:

  • $\sum_o p_o T_o = (p_a - p, p - p_a)$ (the mean equality).

A simple translation is one where there is a single vector $T$ such that all the $T_o$ are equal to $T$.

A mean translation is one where the mean equality is the key requirement; the $T_o$ may be constrained in various ways, but only the mean equality is essential.

A specific translation is one that has specific values for the $T_o$, where the mean equality is a consequence of those specific values. The counterfactual approach can be seen as a specific translation.
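The three types can be illustrated by checking the mean equality directly, here taken to be that the probability-weighted sum of the per-observation translation vectors equals $(p_a - p, p - p_a)$ (an assumed formalisation; all names are hypothetical):

```python
def mean_equality_holds(obs_probs, translations, p_default, p_a, tol=1e-9):
    """Check that the probability-weighted mean of the per-observation
    translation vectors T_o equals (p_a - p, p - p_a)."""
    mean = [sum(p * t[i] for p, t in zip(obs_probs, translations))
            for i in (0, 1)]
    target = (p_a - p_default, p_default - p_a)
    return all(abs(m - g) < tol for m, g in zip(mean, target))

# A *simple* translation uses the same vector for every observation:
simple_ok = mean_equality_holds([0.5, 0.5],
                                [(0.25, -0.25), (0.25, -0.25)], 0.5, 0.75)
# A *mean* translation may vary per observation, as long as the mean matches:
mean_ok = mean_equality_holds([0.5, 0.5],
                              [(0.0, 0.0), (0.5, -0.5)], 0.5, 0.75)
```

A specific translation would fix particular values for each $T_o$ (from some other construction), with the mean equality holding as a consequence.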

Weights: probabilities or extra rewards

Finally, we have to sort out what we mean by these ‘weights’.

The simplest is that these weights are probabilities. So, just before the choice between $B$ and $\neg B$, the agent could have weight $w_1$ for $R_1$ and $w_2$ for $R_2$. Then if $B$ happens, the agent has reward $R_2$ with probability $w_2$; and if $\neg B$ happens, it has $R_1$ with probability $w_1$.

If the AI didn’t take any actions at all, then $w_1 = 1-p$ and $w_2 = p$, the same as the original values.

The problem with that approach is that we must ensure the weights are constrained between $0$ and $1$.
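One crude way to see the constraint is a clip-and-renormalise guard; this is purely an illustration of why probability-weights need extra conditions, not a method proposed here:

```python
def clamp_weights(w1, w2):
    """Probability-weights must lie in [0, 1]; a naive fix is to clip
    each weight and renormalise the pair to sum to one."""
    w1 = max(0.0, min(1.0, w1))
    w2 = max(0.0, min(1.0, w2))
    total = w1 + w2
    if total == 0:
        return (0.5, 0.5)        # degenerate fallback, arbitrary choice
    return (w1 / total, w2 / total)

# A translation that overshoots past [0, 1] gets forced back in range:
clamped = clamp_weights(1.25, -0.25)
```

Any such clipping distorts the translation, which is why keeping the weights bounded requires the more complicated conditions mentioned later in the post.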

Alternatively, changes in weights can be seen instead as adding extra rewards to the final rewards, rather than changing their probabilities. Thus if the translation adds $c_1$ to the weight of $R_1$ and $c_2$ to that of $R_2$, and $B$ happens, the agent’s reward function is $R_2 + c_1 R_1 + c_2 R_2$, and if $\neg B$, the agent’s reward function is $R_1 + c_1 R_1 + c_2 R_2$.

This has the advantage that the weights may be negative or higher than one, but the disadvantage that it may result in unusual mixed reward functions.
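A sketch of this variant, under the assumption that the weight changes $c_1$, $c_2$ are simply added as scaled copies of the rewards (toy reward functions, hypothetical names):

```python
def mixed_reward(base, R1, R2, c1, c2):
    """Implement weight changes as extra reward terms: the agent's final
    reward is its base reward plus c1*R_1 + c2*R_2."""
    return lambda s: base(s) + c1 * R1(s) + c2 * R2(s)

R1 = lambda s: s       # illustrative rewards on a toy scalar state
R2 = lambda s: -s

# After the press (B), the base reward is R_2, with extra terms bolted on.
# Weights may leave [0, 1] with no renormalisation needed, at the price of
# an unusual mixture of R_1 and R_2:
after_press = mixed_reward(R2, R1, R2, c1=0.25, c2=-0.25)
```

Nothing here forces the resulting mixture to be a reward any designer would have chosen, which is the stated disadvantage.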


Given these terms, the indifference approach I described as the best is Method: indifference, Counterfactual: translation, Update defaults: observations only, Translation type: simple for small translations, mean for large ones, and Weights: probabilities.

One could imagine slightly tweaking that approach, by using extra rewards for weights, and dropping the complicated conditions needed to keep the weights bounded between $0$ and $1$, allowing simple translations always. This would result in: Method: indifference, Counterfactual: translation, Update defaults: observations only, Translation type: simple, and Weights: extra rewards.

Finally, the counterfactual approach can be seen as: Method: indifference, Counterfactual: translation, Update defaults: observations only, Translation type: specific, and Weights: probabilities.