Figuring out what Alice wants, part I

This is a very preliminary two-part post sketching out the direction I’m taking my research now (second post here). I’m expecting and hoping that everything in here will get superseded quite quickly. This has obvious connections to classical machine intelligence research areas (such as interpretability). I’d be very grateful for any links to papers or people related to the ideas of this post.

The theory: model fragments

I’ve presented the theoretical argument for why we cannot deduce the preferences of an irrational agent, and a practical example of that difficulty. I’ll be building on that example to illustrate some algorithms that produce the same actions, but where we can nonetheless feel confident deducing different preferences.

I’ve mentioned a few ideas for “normative assumptions”: the assumptions that we, or an AI, could use to distinguish between different possible preferences even if they result in the same behaviour. I’ve mentioned things such as regret, humans stating their values with more or less truthfulness, human narratives, how we categorise our own emotions (those last three are in this post), or the structure of the human algorithm.

Those all seem rather ad hoc, but they are all trying to do the same thing: home in on human judgement about rationality and preferences. But what is this judgement? This judgement is defined to be the internal models that humans use to assess situations. These models, about ourselves and about other humans, often agree with each other from one human to the next (for instance, most people agree that you’re less rational when you’re drunk).

Calling them models might be a bit of an exaggeration, though. We often only get a fragmentary or momentary piece of a model: “he’s being silly”, “she’s angry”, “you won’t get a promotion with that attitude”. These are called to mind, thought upon, and then swiftly dismissed.

So what we want to access is the piece of the model that the human used to judge the situation. Now, these model fragments can often be contradictory, but we can deal with that problem later.

Then all the normative assumptions noted above are just ways of defining these model fragments, or of accessing them (via emotion, truthful description, or regret). Regret is a particularly useful emotion, as it indicates a divergence between what the model expected and what actually happened (similar to temporal difference learning).
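To make that analogy concrete, here is a minimal Python sketch, with entirely hypothetical names, of treating expressed regret as a prediction-mismatch signal in the style of a temporal-difference error: the gap between what the human’s internal model predicted and what actually happened is evidence about that model. This is just an illustration of the analogy, not a proposed method.

```python
# Minimal sketch: regret as a mismatch between model expectation and outcome,
# analogous to a temporal-difference error. All names are illustrative.

def td_error(predicted_value, reward, next_value, discount=0.9):
    """Standard TD error: how far off the value prediction was."""
    return reward + discount * next_value - predicted_value

def regret_signal(model_prediction, observed_outcome):
    """Regret/surprise as divergence between what the human's internal
    model expected and what actually happened."""
    return observed_outcome - model_prediction

# Example: the human's model expected a payoff of 10, the outcome was 4,
# giving a regret-like mismatch of -6.
print(regret_signal(model_prediction=10, observed_outcome=4))
```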

So I’ll broadly categorise methods of learning human model fragments into three categories (illustrated by the sketch after this list):

  • Direct access to the internal model.

  • Regret and surprise as showing mismatches between model expectations and outcomes.

  • Privileged output (e.g. certain human statements in certain circumstances are taken to be true-ish statements about the internal model).
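As a rough organising device, here is a small Python sketch (all names are my own, hypothetical ones) that tags incoming evidence about a human’s model fragments with which of the three channels it came through; the point made just below, that only the first channel violates algorithmic equivalence, is recorded as a flag.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Channel(Enum):
    """The three broad ways of learning human model fragments."""
    DIRECT_ACCESS = auto()       # reading the internal model itself
    REGRET_OR_SURPRISE = auto()  # mismatch between expectation and outcome
    PRIVILEGED_OUTPUT = auto()   # statements taken as true-ish reports

@dataclass
class Evidence:
    """A single observation about a human, tagged by how it was obtained."""
    description: str
    channel: Channel

def respects_algorithmic_equivalence(channel: Channel) -> bool:
    # Only direct access can distinguish algorithms with identical
    # input-output behaviour (see the discussion below).
    return channel is not Channel.DIRECT_ACCESS

evidence = Evidence("I'm irrational around them", Channel.PRIVILEGED_OUTPUT)
print(respects_algorithmic_equivalence(evidence.channel))  # True
```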

The first method violates algorithmic equivalence and extensionality: two algorithms with identical outputs can nevertheless use different models. The second two methods do respect algorithmic equivalence, once we have defined what behaviours correspond to regret/surprise, or what situations humans can be expected to respond truthfully in. In the process of defining those behaviours and situations, however, we are likely to use introspection and our own models: a sober, relaxed, rational human confiding confidentially in an impersonal computer is more likely to be truthful than a precariously employed worker on stage in front of their whole office.
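To illustrate why only direct access can tell such algorithms apart, here is a toy sketch under assumed, deliberately tiny conditions: a planner that chooses actions by consulting an explicit internal reward model, and a lookup table with exactly the same input-output behaviour. Any method that only looks at behaviour cannot recover the planner’s reward model.

```python
# Toy illustration of algorithmic equivalence / extensionality: two agents
# with identical input-output behaviour but different internals.

ACTIONS = ["eat_chocolate", "go_jogging"]

class Planner:
    """Chooses actions by consulting an explicit internal reward model."""
    def __init__(self):
        self.reward_model = {"eat_chocolate": 1.0, "go_jogging": 0.3}

    def act(self, situation: str) -> str:
        if situation == "on_a_diet":
            return min(ACTIONS, key=lambda a: self.reward_model[a])
        return max(ACTIONS, key=lambda a: self.reward_model[a])

class LookupTable:
    """Produces exactly the same behaviour, with no reward model at all."""
    TABLE = {"on_a_diet": "go_jogging", "default": "eat_chocolate"}

    def act(self, situation: str) -> str:
        return self.TABLE.get(situation, self.TABLE["default"])

# Identical behaviour on every situation we can test...
for s in ["on_a_diet", "default", "weekend"]:
    assert Planner().act(s) == LookupTable().act(s)
# ...so only direct access to the internals reveals the Planner's reward model.
```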

What model fragments look like

The second post will provide examples of the approach, but here I’ll just list the kinds of things that we can expect as model fragments (one possible representation is sketched at the end of this post):

  • Direct statements about rewards (“I want chocolate now”).

  • Direct statements about rationality (“I’m irrational around them”).

  • An action is deemed better than another (“you should start a paper trail, rather than just relying on oral instructions”).

  • An action is seen as good (or bad), compared with some implicit set of standard actions (“compliment your lover often”).

  • Similarly to actions, observations/outcomes can be treated as above (“the second prize is actually better”, “it was unlucky you broke your foot”).

  • An outcome is seen as surprising (“that was the greatest stock market crash in history”), or the action of another agent is seen as surprising (“I didn’t expect them to move to France”).

A human can think these things about themselves or about other agents; the most complicated variants involve assessing the actions of one agent from the perspective of another agent (“if she signed the check, he’d be in a good position”).

Finally, there are meta, meta-meta, etc. versions of these, as we model other agents modelling us. All of these give a partial indication of our models of rationality or reward, concerning ourselves and other humans.
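Pulling the list above together with the remarks on perspective and meta-levels, here is one possible way such fragments might be represented as data. This is a sketch under my own assumptions (every name here is hypothetical, not anything from the post): each fragment is a small, typed claim made by one agent about another, possibly nested to capture the meta-levels.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class FragmentKind(Enum):
    REWARD_STATEMENT = auto()       # "I want chocolate now"
    RATIONALITY_STATEMENT = auto()  # "I'm irrational around them"
    ACTION_COMPARISON = auto()      # "start a paper trail rather than rely on oral instructions"
    ACTION_EVALUATION = auto()      # "compliment your lover often"
    OUTCOME_EVALUATION = auto()     # "the second prize is actually better"
    SURPRISE = auto()               # "I didn't expect them to move to France"

@dataclass
class ModelFragment:
    kind: FragmentKind
    judge: str                      # who is making the judgement
    subject: str                    # whose actions/rewards are being judged
    content: str                    # informal statement of the fragment
    # Meta-levels: a fragment can itself be about another fragment,
    # as when we model other agents modelling us.
    about: Optional["ModelFragment"] = None

# Meta example: Alice modelling Bob's model of Alice's preferences.
inner = ModelFragment(FragmentKind.REWARD_STATEMENT, "Bob", "Alice",
                      "Alice wants chocolate now")
meta = ModelFragment(FragmentKind.RATIONALITY_STATEMENT, "Alice", "Bob",
                     "Bob thinks I'm being impulsive", about=inner)
print(meta.judge, "models", meta.about.judge, "modelling", meta.about.subject)
```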