Proper value learning through indifference

A putative new idea for AI control; index here.

Many designs for creating AGIs (such as OpenCog) rely on the AGI deducing moral values as it develops. This is a form of value loading (or value learning), in which the AGI updates its values through various methods, generally including feedback from trusted human sources. This is closely analogous to how human infants (approximately) integrate the values of their society.

The great challenge of this approach is that it relies on an AGI that already has an interim system of values being able and willing to update this system correctly. Generally speaking, humans are unwilling to update their values easily, and we would want our AGIs to be similar: values that are too unstable aren't values at all.

So the aim is to clearly separate the conditions under which values should be kept stable by the AGI from the conditions under which they should be allowed to vary. This will generally be done by specifying criteria for the variation (“only when talking with Mr and Mrs Programmer”). But, as always with AGIs, unless we program those criteria perfectly (hint: we won't), the AGI will be motivated to interpret them differently from how we would expect. It will, as a natural consequence of its programming, attempt to manipulate the value-updating rules according to its current values.

How could it do that? A very powerful AGI could do the time-honoured “take control of your reward channel”, either by threatening humans into giving it the moral answers it wants, or by replacing humans with “humans” (constructs that pass the programmed requirements of being human, according to the AGI's programming, but aren't actually human in practice) willing to give it these answers. A weaker AGI could instead use social manipulation and leading questioning to achieve the morality it desires. Even more subtly, it could tweak its internal architecture and updating process so that it updates values in its preferred direction (even something as simple as choosing the order in which to process evidence). This will be hard to detect, as a smart AGI might have a much clearer impression of how its updating process will play out in practice than its programmers would.

The problems with value loading have been cast into the various “Cake or Death” problems. We have some idea what criteria we need for safe value loading, but as yet we have no candidates for such a system. This post will attempt to construct one.

Changing actions and changing values

Imagine you’re an effective altruist. You donate £10 a day to whatever the top charity on Giving What We Can is (currently Against Malaria Foundation). I want to convince you to donate to Oxfam, say.

“Well,” you say, “if you take over and donate £10 to AMF in my place, I’d be perfectly willing to send my donation to Oxfam instead.”

“Hum,” I say, because I’m a hummer. “A donation to Oxfam isn’t completely worthless to you, is it? How would you value it, compared with AMF?”

“At about a tenth.”

“So, if I instead donated £9 to AMF, you should be willing to switch your £10 donation to Oxfam (giving you the equivalent value of £1 to AMF), and that would be equally good as the status quo?”

Similarly, if I want to make you change jobs, I should pay you not the value of your old job, but the difference in value between your old job and your new one (monetary value plus all other benefits). This is the point at which you are indifferent to switching or not.
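The arithmetic above can be sketched in a few lines; the function name and the valuation numbers are mine, taken from the hypothetical donation example, not from any real charity data.

```python
def indifference_price(old_value, new_value):
    """Side payment that makes switching exactly as good as the status quo:
    the difference in value between the old option and the new one."""
    return old_value - new_value

# Donation example: £10 to AMF is the benchmark; £10 to Oxfam is valued
# at a tenth of that, i.e. £1 in AMF-equivalents.
print(indifference_price(10.0, 1.0))  # → 9.0 (I donate £9 to AMF)
```

At that price, status quo and switch are worth exactly the same to you, which is the indifference point the rest of the post builds on.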

Now imagine it were practically possible to change people’s values. What is the price at which a consequentialist would allow their values to be changed? It’s the same argument: the price at which Gandhi should accept to become a mass murderer is the difference (according to all of Gandhi’s current values) between the expected effects of current-Gandhi and the expected effects of murderer-Gandhi. At that price, he has lost nothing (and gained nothing) by going through with the deal.

Indifference is key. We want the AGI to be motivated neither to preserve its previous values, nor to change them. It’s obvious why we wouldn’t want the AGI to cling to its values, but the converse is less clear: shouldn’t the AGI want to make moral progress, to seek out better values?

The problem is that an AGI that strongly desires to improve its values is a danger: we don’t know how it will go about doing so, what it will see as the most efficient way to do so, or what the long-term effects might be (various forms of wireheading may be a danger). To mitigate this risk, it’s better to have very close control over how the AGI desires such improvement. And the best way of doing this is to have the AGI indifferent to value change, with a separate (possibly tunable) module that regulates any positive desire towards value improvements. This gives us a much better understanding of how the AGI could behave in this regard.

So in effect we are seeking AGIs that apply “conservation of expected evidence” to their values: it does not benefit them to try to manipulate their values in any way. See this post for further thoughts on the matter.

Pay and be paid: the price of value change

The above gives an effective model for value-change indifference. It’s even easier with utility-function-based agents that we get to design: instead of paying them with money or changes in the world, we can pay them with utility. So, if we want to shift an agent from utility “v” to utility “w”, it has to gain the expected difference (according to its current value function, i.e. v) between being a v-maximiser and being a w-maximiser.

So we can define a meta-utility function U, consisting of a current utility function (which the agent uses to make decisions) along with a collection of constant terms. Every time the agent changes its current utility function, a new constant term is added to undo the expected effect of the change. So, for instance, if an agent hears evidence that causes it to update its current utility function from v to w, then its meta-utility U changes as:

U = v + (Past Constants) →

U = w + E(v|v→v) - E(v|v→w) + (Past Constants).

Here (Past Constants) are previous constant terms dating from previous changes of utility, v→w denotes the change of utility function v into utility function w, and v→v denotes the counterfactual where v was left unchanged. I generally prefer to define counterfactuals, when I can, by taking a stochastic process that almost always has one outcome: i.e. a process that keeps v constant with probability 1/10^100 and otherwise takes v to w. That way, conditioning on v→v is a perfectly reasonable thing to do, but v→w is the only thing that happens in practice. This formula requires that the agent assess its own future effectiveness at accomplishing certain goals, given that it has them, so it is vulnerable to the usual Löbian problems.

This formula is still imperfect. A clue is that it isn’t symmetric in v and w. The problem can be seen more clearly if w = v−10, for example. Then both expected terms are equal (as changing from v to v−10 has no effect on the agent’s behaviour), so the agent would derive no benefit from the change, but would end up with a penalty of −10. Hence it would not want to change, which destroys the whole point. Conversely, if w = v+10, it would desperately want to change, and would sacrifice up to 10 of value to do so.
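A few lines of arithmetic make the asymmetry concrete (toy numbers, mine): with w = v − 10, the two expected terms of v cancel, so the compensation misses the flat −10 shift entirely.

```python
# Flawed rule: the constant added at a v→w change is E(v|v→v) - E(v|v→w).
# Take w = v - 10: behaviour is unchanged, so both expectations of v agree.
E_v_stay = 50.0                  # E(v|v→v): expected v if v is kept
E_v_switch = 50.0                # E(v|v→w): same behaviour, same expected v
constant = E_v_stay - E_v_switch # = 0, no compensation at all
E_w = E_v_switch - 10.0          # w = v - 10, so E(w|v→w) = E(v|v→w) - 10

E_U_after = E_w + constant       # expected meta-utility if it accepts the change
E_U_before = E_v_stay            # expected meta-utility if it refuses
print(E_U_after - E_U_before)    # → -10.0: a flat penalty, so it resists
```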

The problem is that our initial model is incorrect. We don’t want a v-maximiser that is OK with changing utilities to w. We want a U-maximiser that is OK with changing its current v to w. To achieve that, we need to tweak the formula slightly: define the agent as a U-maximiser and replace the change-of-utility formula with:

U = v + (Past Constants) →

U = w + E(v|v→v) - E(w|v→w) + (Past Constants).

Here we subtract the expected value of w (given v→w) rather than the expected value of v (note that this is similar to the formula presented here, if we assume the evidence is moral evidence rather than world knowledge).

Under this new formula, the agent is properly indifferent to changes of the type v→v±10. Indeed, it will behave exactly as a perfect v-maximiser until such time as it transitions to w, at which point it will behave as a perfect w-maximiser.
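Rerunning the same toy numbers (mine, for illustration) under the corrected rule shows the penalty vanishing: the subtracted term is now E(w|v→w), which absorbs any constant ±shift in the utility.

```python
# Corrected rule: the constant added at a v→w change is E(v|v→v) - E(w|v→w).
def expected_U_after(E_v_stay, E_w_switch):
    """Expected meta-utility after a v→w change under the corrected formula."""
    constant = E_v_stay - E_w_switch
    return E_w_switch + constant      # always equals E_v_stay

E_v_stay = 50.0
for shift in (-10.0, 0.0, 10.0):      # w = v + shift, behaviour unchanged
    E_w_switch = E_v_stay + shift     # so E(w|v→w) = E(v|v→v) + shift
    print(expected_U_after(E_v_stay, E_w_switch))  # → 50.0 each time
```

The expected meta-utility is 50.0 in every case, so the agent neither resists the v−10 change nor chases the v+10 one.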

This can be seen in the following situations:

  • Suppose the agent can take immediate action to increase the value of v from 0 to a, or delay for a while and increase the value of v from 0 to b later (no other options are possible). However, in the interim period, its utility will have changed to w (and it knows this), so it will no longer be a v-maximiser. What will it do?

Answer: Just as a standard v-maximiser would: if a>b, it acts immediately, and if b>a, it delays (and ultimately does nothing, because it will be a w-maximiser when the opportunity arises). To see this, note that if it acts immediately, at the values transition it will have E(v|v→v)=a (because it will have acted, and no further increases are possible). If it delays, at the values transition it will have E(v|v→v)=b (because if it stayed a v-maximiser, it would act to achieve b in the future). So it will choose depending on which of a or b is higher (the effects on w can be ignored; see the next questions).
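The accounting in this answer can be checked directly (a toy sketch with my own variable names): whichever option the agent picks, the w-terms cancel in expectation, leaving only E(v|v→v).

```python
def expected_U(option, a, b, E_w):
    """Expected meta-utility at the transition, for the act-now/delay choice.

    E_w is the expected w the future w-maximiser achieves; it is the same
    for both options, so it cancels against the subtracted E(w|v→w) term.
    """
    E_v_stay = a if option == "act" else b   # E(v|v→v) at transition time
    constant = E_v_stay - E_w                # term added when v becomes w
    return E_w + constant                    # = E_v_stay: E_w cancels out

# With a=3, b=2, acting now is worth 3 and delaying is worth 2 -- exactly
# the ranking a plain v-maximiser would produce, whatever E_w is.
print(expected_U("act", 3.0, 2.0, E_w=7.0))    # → 3.0
print(expected_U("delay", 3.0, 2.0, E_w=7.0))  # → 2.0
```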

  • Suppose the agent is currently a v-maximiser, but can take immediate actions to change the value of w (none of these actions changes the value of v). It knows it will soon be a w-maximising agent. What will it do?

Answer: It will be indifferent between all these actions. Suppose it takes an action that increases the utility w by a. Then as it transitions to the new value, w will be worth w+a. But E(w|v→w) will also go up by a. Since that term is being subtracted, the increase by a is cancelled out, and has no effect. So a U-maximising agent that is currently using v will not choose behaviours that increase any of the future utilities it will be using.
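The cancellation in this answer is one line of algebra; a sketch with made-up numbers:

```python
def expected_U_given_boost(E_v_stay, E_w, boost):
    """Expected meta-utility if the current v-maximiser boosts future w by `boost`."""
    E_w_after = E_w + boost              # the realised w goes up by the boost...
    constant = E_v_stay - E_w_after      # ...but so does the subtracted E(w|v→w)
    return E_w_after + constant          # = E_v_stay, independent of the boost

baseline = expected_U_given_boost(4.0, 10.0, 0.0)
print(expected_U_given_boost(4.0, 10.0, 100.0) == baseline)  # → True
```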

  • Suppose the agent is currently a v-maximiser, but will soon transition to being either a w- or w’-maximiser. It can manipulate its environment to ensure it becomes one or the other. What will it do?

Answer: It will be indifferent between the two options, and will continue v-maximising. At the moment of transition, the E(v|v→v) term does not contain either w or w’. As for the other terms, it has the opportunity to transition to utility w - E(w|v→w) or w’ - E(w’|v→w’). But notice that

E(w-E(w|v→w)|it is now a w-maximising agent)

= E(w-E(w|v→w)|v→w)

= E(w|v→w)-E(w|v→w)

= 0.

Thus the expected utility increase from transitioning to w is zero. Similarly for w’. Thus the agent has no preferences between transitioning to w or to w’. Note that this formula implies that if the agent’s initial utility function is u (initially, U=u), then E(U)=E(u), as each utility function change adds zero expected increase. This could be seen as the defining property of this approach: the agent cannot generate extra utility by influencing its own values.
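The zero-gain property holds for any candidate utility, which a sketch makes explicit (toy values, mine): the realised expectation and the subtracted constant are the same term, so steering towards any particular successor utility buys nothing.

```python
def expected_gain_from_transition(E_w_switch):
    """Expected U gained by steering towards some candidate utility w:
    the realised E(w|v→w) minus the identical E(w|v→w) in the constant."""
    realised = E_w_switch
    subtracted = E_w_switch
    return realised - subtracted         # always zero

# Whatever w or w' the agent could steer towards, the expected gain is 0,
# so E(U) = E(u) throughout and value-manipulation is pointless.
for candidate in (3.0, -8.0, 100.0):
    print(expected_gain_from_transition(candidate))  # → 0.0
```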

For all these reasons, this formula seems to give a proper model of an agent that acts only on its current utility function, but has no objections to transitioning to a new one.

Note that the formula is silent on when and how utility functions should be changed. So it won’t accomplish miracles: if the agent has stupid criteria for changing its utility, it will change it in a stupid way. All the formula accomplishes is that the agent will not attempt to manipulate its utility-function-changing criteria.

A last wrinkle...

EDIT: Paul Christiano has convinced me that the ultra-sophisticated Cake or Death problem has been solved by the above method. Yay! There may still be an issue as to what order to update in if a piece of information has both empirical and values implications. It feels like the order shouldn’t matter, but I’m not totally sure.

That’s the current setup for value loading. What are people’s thoughts on it?