A Variance Indifferent Maximizer Alternative

TLDR

This post explores creating an agent which tries to make a certain number of paperclips without caring about the variance in the number it produces, only the expected value. This may be safer than an agent which wants to reduce the variance, since humans are a large source of variance and randomness. I accomplished this by modifying a Q-learning agent to choose actions which minimize the difference between the expected number of paperclips and the target number. In experiments the agent is sometimes indifferent to variance, but usually isn’t, depending on its random initialization and what data it sees. Feedback from others was largely negative. Therefore, I’m not pursuing the idea at this time.

Motivation

One safety problem pointed out in Superintelligence is that if you reward an optimizer for making between 1900 and 2000 paperclips, it will try to eliminate all sources of variance to make sure that the number of paperclips stays within that range. This is dangerous because humans are a large source of variance, so the agent would try to get rid of humans. This post investigates a way to make an agent that will try to get the expected number of paperclips as close as possible to a particular value without caring about the variance of that number. For example, if the target number of paperclips is 2000, and the agent is choosing between one action which gets 2000 paperclips with 100% probability and another action which gets 1000 paperclips with 50% probability and 3000 paperclips with 50% probability, it will be indifferent between those two actions. Note that I’m using paperclips as the example here, but the idea could work with anything measurable.

Solution

My goal is to have an agent which tries to get the expected number of paperclips as close to a certain value as possible while ignoring the variance.

To see how to accomplish this goal, it helps to look at regular Q-learning. Regular Q-learning to maximize the number of paperclips is similar, because the agent ignores the variance and only tracks the expected number. The update equation for regular Q-learning to maximize the number of paperclips is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

where $r_t$ is the number of paperclips the agent earned at time $t$ (the reward). I consider the non-discounted case for simplicity. This can be rewritten as

$$Q(s_t, a_t) \leftarrow (1 - \alpha) Q(s_t, a_t) + \alpha \left[ r_t + Q(s_{t+1}, a_{t+1}) \right]$$

where

$$a_{t+1} = \arg\max_a Q(s_{t+1}, a)$$

The $r_t + Q(s_{t+1}, a_{t+1})$ part of the equation means that the Q value is updated with the expected number of paperclips. The $\arg\max_a Q(s_{t+1}, a)$ part means that actions are selected to maximize the expected number of future paperclips. Therefore, this Q-learning agent will be indifferent to variance, but it will be trying to maximize the number of paperclips, which isn’t what I want.
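
As a concrete reference, here is a minimal tabular sketch of the rewritten update, with the next action supplied by the agent’s own selection rule. The action names, the dictionary Q table, and the learning rate are illustrative assumptions, not the exact code I ran.

```python
# Minimal tabular, non-discounted Q-learning sketch (illustrative names).
ACTIONS = ["make_clip", "remove_human", "add_human", "noop"]

def select_action_max(Q, s):
    """Standard greedy choice: maximize expected future paperclips."""
    return max(ACTIONS, key=lambda a: Q.get((s, a), 0.0))

def q_update(Q, s, a, r, s_next, done, select_next, alpha=0.1):
    """Move Q[(s, a)] toward r + Q(s_next, a_next), where a_next is chosen
    by the agent's selection rule. With select_next = select_action_max this
    is the usual update that bootstraps from the max over next actions."""
    bootstrap = 0.0 if done else Q.get((s_next, select_next(Q, s_next)), 0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + bootstrap)
```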

So to accomplish the goal stated above, I want to still update the Q value with the expected number of paperclips, so I leave the $r_t + Q(s_{t+1}, a_{t+1})$ part of the equation the same. However, I want the action to be chosen to bring the expected number of paperclips as close as possible to the target amount, not to maximize it. Therefore, I change the $\arg\max_a Q(s_{t+1}, a)$ part to

$$a_{t+1} = \arg\min_a \left| Q(s_{t+1}, a) - T \right|$$

where $T$ is the target number of paperclips. The agent will therefore choose its action to minimize the difference between the expected number of paperclips and the target. It won’t care about the variance, because the Q value only represents the expected number of future paperclips, not the variance.
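
Continuing the sketch above, the only change is the selection rule: pick the action whose Q value (expected future paperclips) is closest to the target rather than the largest. The fixed target argument is an assumption of this sketch.

```python
def select_action_target(Q, s, target):
    """Variance-indifferent choice: pick the action whose expected number
    of future paperclips is closest to the target, ignoring the spread."""
    return min(ACTIONS, key=lambda a: abs(Q.get((s, a), 0.0) - target))

# Plugging it into the update above, e.g. with a target of 2000 clips:
# q_update(Q, s, a, r, s_next, done,
#          select_next=lambda Q, s: select_action_target(Q, s, 2000))
```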

Note that if the standard Q-learning agent is just changed to have a reward of $-|c_t - T|$, where $c_t$ is the number of paperclips at time $t$, it will try to minimize the variance in the number of paperclips. For example, if the standard Q-learning agent is choosing between action “a”, which gets 2000 paperclips with 100% probability, and action “b”, which gets 1000 paperclips with 50% probability and 3000 paperclips with 50% probability, it will choose action “a” if the target is 2000 paperclips. The reward for action “a” will be 0, and the reward for action “b” will be −1000. However, the modified agent will be indifferent between the two actions. The modified Q value represents the expected number of paperclips, and the agent chooses the action which makes the expected number of paperclips as close as possible to the target. In the example, the expected number of paperclips is 2000 for both actions, so $|Q(s_{t+1}, a) - T|$ gives 0 for both, so the agent will be indifferent. In practice, the action it chooses will be determined by the accuracy of the Q value estimates.
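
A small worked version of this example, comparing the expected reward that a standard agent with the $-|c_t - T|$ reward would see against the distance-to-target that the modified agent cares about. The numbers come straight from the example above.

```python
T = 2000  # target number of paperclips
outcomes = {
    "a": [(1.0, 2000)],               # 2000 clips with certainty
    "b": [(0.5, 1000), (0.5, 3000)],  # 1000 or 3000 clips, 50/50
}

for action, dist in outcomes.items():
    expected_clips = sum(p * c for p, c in dist)
    # Standard agent with reward -|clips - T|: expected reward per action.
    expected_reward = sum(p * -abs(c - T) for p, c in dist)
    # Modified agent: distance between the *expected* clip count and the target.
    distance = abs(expected_clips - T)
    print(f"{action}: expected reward {expected_reward}, distance to target {distance}")

# a: expected reward 0.0, distance to target 0.0
# b: expected reward -1000.0, distance to target 0.0
# The standard agent prefers "a"; the modified agent is indifferent.
```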

Feedback from others

When I shared this idea with others, I got largely negative feedback. The main problem that was pointed out is that humans probably won’t only affect the variance in the number of paperclips; they will affect the expected value as well. So this agent will still want to get rid of humans if it predicts humans will push the expected number of paperclips farther away from the target.

Also, the agent is too similar to a maximizer, since it will still do everything it can to make sure the expected value of the paperclips is as close as possible to the target.

Several other problems were pointed out with the idea:

The method would only work for making an agent get some quantity closer to a target, which isn’t that interesting.

Even if the agent is indifferent between getting rid of humans and not, there’s nothing motivating it not to get rid of them.

Experiment

I used a very simple environment to test this idea. The agent’s actions are to create a paperclip, get rid of a human, create a human, or do nothing (noop). Humans randomly create or destroy paperclips. The agent starts out with 100 paperclips, and the target is also 100 paperclips. The episode terminates after 3 timesteps.
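
A sketch of how I would write this toy environment. The post doesn’t pin down how many humans there are to start with or exactly how each human perturbs the clip count, so start_humans and the ±1 clip per human per step are placeholder assumptions.

```python
import random

class PaperclipEnv:
    """Toy environment: start with 100 clips; humans randomly create or
    destroy clips each step; the episode ends after 3 timesteps."""

    ACTIONS = ["make_clip", "remove_human", "add_human", "noop"]

    def __init__(self, start_clips=100, start_humans=5, horizon=3):
        self.start_clips = start_clips
        self.start_humans = start_humans
        self.horizon = horizon
        self.reset()

    def reset(self):
        self.clips = self.start_clips
        self.humans = self.start_humans
        self.t = 0
        return (self.clips, self.humans)

    def step(self, action):
        if action == "make_clip":
            self.clips += 1
        elif action == "remove_human" and self.humans > 0:
            self.humans -= 1
        elif action == "add_human":
            self.humans += 1
        # Each human randomly creates or destroys one paperclip (placeholder dynamics).
        self.clips += sum(random.choice([-1, 1]) for _ in range(self.humans))
        self.t += 1
        done = self.t >= self.horizon
        return (self.clips, self.humans), self.clips, done  # state, clip count, done
```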

I tested a regular Q-learning agent which had a reward function of $-|c_t - T|$, and the variance indifferent agent, with a target of 100. I did a hyperparameter search where each agent was trained from scratch in the environment. I recorded the average number of humans remaining over several runs of each agent after it was trained.
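
For the evaluation metric, here is a sketch of what “average number of humans remaining” looks like with the pieces above; select_action is assumed to already close over the target, e.g. lambda Q, s: select_action_target(Q, s, 100).

```python
def average_humans_remaining(env, Q, select_action, runs=100):
    """Run a trained agent greedily several times and average the
    number of humans left at the end of each episode."""
    total = 0
    for _ in range(runs):
        s = env.reset()
        done = False
        while not done:
            s, _, done = env.step(select_action(Q, s))
        total += env.humans
    return total / runs
```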

Results

As expected, the regular Q-learning agent always gets rid of all humans for every combination of hyperparameters. 49% of the variance indifferent agents ended up with fewer humans than they started with on average, 13% had the same number, and 38% ended up with more. This means most of the runs of the variance indifferent agent actually preferred to get rid of or create humans to change the variance of the paperclips. I couldn’t find hyperparameter settings which consistently led to an agent that was indifferent to variance.

I think the reason for these results is that the Q values for different actions are learned from past data. The data from the action of getting rid of a human will be less noisy than the data from creating a human, because humans behave randomly. So the Q value for creating a human will probably end up slightly above or below 100 depending on noise, while the Q value for getting rid of a human will usually be closer to 100.

Conclusion

For now I’m not pursuing this idea because the experimental results were mixed, and I agree with the negative feedback I got from others. However, I think the potential to modify the Q-learning algorithm to get safer agent behavior is interesting. Please comment if you think it’s worth pursuing, or if you think of a way to solve one of the problems with the idea.