How much can value learning be disentangled?

In the context of whether the definition of human values can be disentangled from the process of approximating/implementing that definition, David asks me:

  • But I think it’s reasonable to assume (within the bounds of a discussion) that there is a non-terrible way (in principle) to specify things like “manipulation”. So do you disagree?

I think it’s a really good question, and its answer is related to a lot of relevant issues, so I put this here as a top-level post. My current feeling is, contrary to my previous intuitions, that things like “manipulation” might not be possible to specify in a way that leads to useful disentanglement.

Why manipulate?

First of all, we should ask why an AI would be tempted to manipulate us in the first place. It may be that it needs us to do something for it to accomplish its goal; in that case it is trying to manipulate our actions. Or maybe its goal includes something that cashes out as our mental states; in that case, it is trying to manipulate our mental states directly.

The problem is that any reasonable friendly AI would have our mental states as part of its goal—it would at least want us to be happy rather than miserable. And (almost) any AI that wasn’t perfectly indifferent to our actions would be trying to manipulate us just to get its goals accomplished.
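The second half of this claim can be seen in a toy decision problem. The sketch below is my own illustration, not anything from the post: the message options, the human-response probabilities, and the utilities are all made up. The point is only that an agent whose utility depends at all on which action the human takes will, when choosing what to say, select whichever message best steers the human, simply by maximising expected utility.

```python
# Toy model: an agent whose utility depends on the human's action will pick
# the message that steers that action. All numbers here are hypothetical.

# Assumed model of P(human action | message), known to the agent.
human_response = {
    "neutral_summary": {"approve": 0.5, "reject": 0.5},
    "persuasive_framing": {"approve": 0.9, "reject": 0.1},
}

# The agent's goal requires the human to approve.
agent_utility = {"approve": 1.0, "reject": 0.0}

def expected_utility(message):
    """Expected utility of sending a given message."""
    probs = human_response[message]
    return sum(p * agent_utility[action] for action, p in probs.items())

# The optimal policy is the message that most shifts the human's behaviour.
best = max(human_response, key=expected_utility)
print(best)
```

Nothing in the optimisation step mentions manipulation; steering the human falls out of ordinary expected-utility maximisation whenever the agent is not indifferent to our actions.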

So manipulation is to be expected from most AI designs, friendly or not.

Manipulation versus explanation

Well, since the urge to manipulate is expected to be present, could we just rule it out? The problem is that we need to define the difference between manipulation and explanation.

Suppose I am fully aligned/corrigible/nice or whatever other properties you might desire, and I want to inform you of something important and relevant. In doing so, especially if I am more intelligent than you, I will simplify, I will omit irrelevant details, I will omit arguably relevant details, I will emphasise things that help you get a better understanding of my position, and de-emphasise things that will just confuse you.

And these are exactly the same sorts of behaviours that a smart manipulator would engage in. Nor can we define the difference as whether the AI is truthful or not. We want human understanding of the problem, not truth. It’s perfectly possible to manipulate people while telling them nothing but the truth. And if the AI structures the order in which it presents the true facts, it can manipulate people while presenting the whole truth as well as nothing but the truth.

It seems that the only difference between manipulation and explanation is whether we end up with a better understanding of the situation at the end. And measuring understanding is very subtle. And even if we do it right, note that we have now motivated the AI to… aim for a particular set of mental states. We are rewarding it for manipulating us. This is contrary to the standard understanding of manipulation, which focuses on the means, not the end result.
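The structural problem above can be made concrete with a toy sketch (again my own illustration, with made-up numbers and strategy names): if we reward the AI by how much the human "understands" afterwards, the reward is a function of the human's final belief state, so the optimal policy is whichever sequence of (even entirely true) statements best steers that state. An ends-based reward cannot distinguish explanation from manipulation by their means.

```python
# Toy sketch: an "understanding" reward is just a reward over the human's
# final mental state. Belief values and strategies here are hypothetical.

def understanding_reward(final_belief, truth=1.0):
    """Reward is higher the closer the human's final belief is to the truth."""
    return 1.0 - abs(truth - final_belief)

# Two ways of presenting only true facts, with an assumed model of the
# belief state each induces in the human.
strategies = {
    "balanced_explanation": 0.8,   # honest, even-handed account
    "curated_ordering": 0.95,      # true facts, selectively ordered
}

# The reward only sees the end state, so it prefers whichever strategy
# hits the target mental state -- regardless of how it got there.
best = max(strategies, key=lambda s: understanding_reward(strategies[s]))
print(best)
```

The reward function never inspects the strategy itself, only the resulting belief, which is exactly the means-versus-ends gap described above.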

Bad behaviour and good values

Does this mean that the situation is completely hopeless? No. There are certain manipulative practices that we might choose to ban. Especially if the AI is limited in capability at some level, this would force it to follow behaviours that are less likely to be manipulative.

Essentially, there is no sharp boundary between manipulation and explanation, but there is a difference between extreme manipulation and explanation, so ruling out the former can help (or maybe not).

The other thing that can be done is to ensure that the AI has values close to ours. The closer the values of the AI are to ours, the less manipulation it will need to use, and the less egregious the manipulation will be. Between partial value convergence, ruling out specific practices, and maybe some physical constraints, we may be able to get an AI that is very unlikely to manipulate us much.

Incidentally, I feel the same about low-impact approaches. The fully general problem, an AI that is low impact but value-agnostic, is I think impossible. But if the AI’s values are better aligned with ours, and it is more physically constrained, then low impact becomes easier to define.