Learning “known” information when the information is not actually known

Methods like cooperative inverse reinforcement learning assume that the human knows their “true” reward function, and that the human and the robot then cooperate to figure out and maximise this reward.

This is fine as far as the model goes, and can allow us to design many useful systems. But it has a problem: the assumption is not true, and, moreover, its falsity can have major detrimental effects.

Contrast two situations:

  1. The human knows the true reward function.

  2. The human has a collection of partial models in which they have clearly defined preferences. As a bounded, limited agent whose internal symbols are only well-grounded in standard situations, their stated preferences will be a simplification of their mental model at the time. The true reward function is constructed by some process of synthesis (a toy sketch of this contrast follows below).
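The contrast can be made concrete with a minimal sketch. All names here (`StatedReward`, `PartialModel`, `synthesize`) are illustrative assumptions, not part of any actual CIRL implementation: under hypothesis 1 the stated reward is simply the object to maximise, while under hypothesis 2 the “true” reward has to be constructed from partial models that only apply in some situations.

```python
# Hypothetical sketch contrasting the two assumptions about human preferences.
# All class and function names are illustrative, not a real CIRL algorithm.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StatedReward:
    """Hypothesis 1: the human's statement *is* the true reward function."""
    reward: Callable[[dict], float]


@dataclass
class PartialModel:
    """Hypothesis 2: a partial model, valid only in the situations it covers."""
    applies_to: Callable[[dict], bool]   # which situations the model covers
    preference: Callable[[dict], float]  # preferences within those situations


def synthesize(models: List[PartialModel], situation: dict) -> float:
    """Toy synthesis: average the preferences of the models that apply.

    A real synthesis process would have to resolve conflicts between
    partial models and extrapolate beyond their domains of validity;
    this placeholder only illustrates that the "true" reward is
    constructed, not read off from a single statement.
    """
    relevant = [m for m in models if m.applies_to(situation)]
    if not relevant:
        return 0.0  # no grounded preference here: a cue to ask, not to optimise
    return sum(m.preference(situation) for m in relevant) / len(relevant)
```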

Now imagine the following conversation:

  • AI: What do you really want?

  • Human: Money.

  • AI: Are you sure?

  • Human: Yes.

Under most versions of hypothesis 1., this will end in a disaster. The human has expressed their preferences, and, when offered the opportunity for clarification, didn’t give any. The AI will become a money-maximiser, and things go pear-shaped.

Under hypothesis 2., however, the AI will attempt to get more details out of the human, suggesting hypothetical scenarios, checking what happens when money and other things in money’s web of connotations come apart—e.g. “What if you had a lot of money, but couldn’t buy anything, and everyone despised you?” The synthesis may fail, but, at the very least, the AI will investigate more.
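As a rough illustration of this probing behaviour (the scenario structure and the `ask_human` channel are assumptions of the sketch, not a proposed elicitation protocol), the AI can generate situations where the stated goal and its usual connotations come apart, and record the human’s judgements on each:

```python
# Hypothetical sketch of the hypothesis-2 behaviour: before optimising a stated
# goal, probe scenarios where the goal and its usual connotations come apart.
# The scenarios and the query mechanism are illustrative assumptions only.

from typing import Dict, List


def probing_scenarios(stated_goal: str) -> List[Dict[str, bool]]:
    """Scenarios that decouple the stated goal from things it usually brings."""
    return [
        {stated_goal: True, "purchasing_power": False, "respected": False},
        {stated_goal: True, "purchasing_power": True, "respected": False},
        {stated_goal: False, "purchasing_power": True, "respected": True},
    ]


def investigate(stated_goal: str, ask_human) -> List[Dict[str, object]]:
    """Collect the human's judgements on the decoupled scenarios.

    `ask_human` stands in for whatever elicitation channel is available
    (questions, hypotheticals, preference comparisons).
    """
    judgements = []
    for scenario in probing_scenarios(stated_goal):
        judgements.append({"scenario": scenario, "verdict": ask_human(scenario)})
    return judgements
```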

Thus, assuming that the AI will be learning a truth that humans already know is a harmless assumption in many circumstances, but it will result in disasters if pushed to the extreme.
