# Figuring out what Alice wants, part I

This is a very preliminary two-part post sketching out the direction I'm taking my research now (second post here). I'm expecting and hoping that everything in here will get superseded quite quickly. This has obvious connections to classical machine intelligence research areas (such as interpretability). I'd be very grateful for any links to papers or people related to the ideas of this post.

## The theory: model fragments

I’ve presented the theoretical argument for why we cannot deduce the preferences of an irrational agent, and a practical example of that difficulty. I’ll be building on that example to illustrate some algorithms that produce the same actions, but where we nonetheless can feel confident deducing different preferences.

I’ve mentioned a few ideas for “normative assumptions”: the assumptions that we, or an AI, could use to distinguish between different possible preferences even if they result in the same behaviour. I’ve mentioned things such as regret, humans stating their values with more or less truthfulness, human narratives, how we categorise our own emotions (those last three are in this post), or the structure of the human algorithm.

Those all seem rather ad hoc, but they are all trying to do the same thing: home in on human judgement about rationality and preferences. But what is this judgement? This judgement is defined to be the internal models that humans use to assess situations. These models, about ourselves and about other humans, often agree with each other from one human to the next (for instance, most people agree that you’re less rational when you’re drunk).

Cal­ling them mod­els might be a bit of an ex­ag­ger­a­tion, though. We of­ten only get a frag­men­tary or mo­men­tary piece of a model—“he’s be­ing silly”, “she’s an­gry”, “you won’t get a pro­mo­tion with that at­ti­tude”. Th­ese are called to mind, thought upon, and then swiftly dis­missed.

So what we want to access is the piece of the model that the human used to judge the situation. Now, these model fragments can often be contradictory, but we can deal with that problem later.

Then all the normative assumptions noted above are just ways of defining these model fragments, or accessing them (via emotion, truthful description, or regret). Regret is a particularly useful emotion, as it indicates a divergence between what was expected in the model, and what actually happened (similarly to temporal difference learning).
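To make the analogy with temporal difference learning concrete, here is a minimal sketch of regret as a model/outcome mismatch. Everything in it (the function name, the example values) is my own illustration, not something from the post:

```python
def regret_signal(expected_value, actual_value):
    """Regret as the gap between what the agent's internal model
    predicted and what actually happened (positive = worse than
    expected), analogous to a (negated) TD error."""
    return expected_value - actual_value

# An observer who detects a large positive regret signal can infer that
# the human's internal model expected more than reality delivered.
print(regret_signal(expected_value=10.0, actual_value=4.0))  # prints 6.0
```

The point is that regret is informative precisely because it exposes a piece of the internal model (the expectation) that the bare action sequence does not.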

So I’ll broadly categorise methods of learning human model fragments into three categories:

• Direct access to the internal model.

• Regret and surprise as showing mismatches between model expectations and outcomes.

• Privileged output (e.g. certain human statements in certain circumstances are taken to be true-ish statements about the internal model).

The first method violates algorithmic equivalence and extensionality: two algorithms with identical outputs can nevertheless use different models. The latter two methods do respect algorithmic equivalence, once we have defined what behaviours correspond to regret/​surprise, or what situations humans can be expected to respond truthfully to. In the process of defining those behaviours and situations, however, we are likely to use introspection and our own models: a sober, relaxed, rational human confiding confidentially with an impersonal computer is more likely to be truthful than a precariously employed worker on stage in front of their whole office.
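A toy illustration of why direct access to the internal model violates algorithmic equivalence (the agents, their preferences, and the numbers here are all invented for the example): these two "agents" choose identical actions in every situation, yet one internally models a preference for money and the other a preference for status.

```python
def agent_money(offer):
    # Internally models the decision as maximising cash.
    return "accept" if offer["salary"] > 50 else "reject"

def agent_status(offer):
    # Internally models the decision as maximising prestige, which in
    # this toy world happens to track salary exactly.
    prestige = offer["salary"]
    return "accept" if prestige > 50 else "reject"

offers = [{"salary": s} for s in (30, 60, 90)]
# Behaviourally indistinguishable on every input...
assert all(agent_money(o) == agent_status(o) for o in offers)
# ...so only inspecting the algorithms themselves (direct access)
# reveals the differing internal models and hence preferences.
```

Any method that only observes input/output behaviour must treat these two agents as having the same preferences; only the first method above can tell them apart.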

## What model fragments look like

The second post will provide examples of the approach, but here I’ll just list the kinds of things that we can expect as model fragments:

• Direct statements about rewards (“I want chocolate now”).

• Direct statements about rationality (“I’m irrational around them”).

• An action is deemed better than another (“you should start a paper trail, rather than just relying on oral instructions”).

• An action is seen as good (or bad), compared with some implicit set of standard actions (“compliment your lover often”).

• Similarly to actions, observations/​outcomes can be treated as above (“the second prize is actually better”, “it was unlucky you broke your foot”).

• An outcome is seen as surprising (“that was the greatest stock market crash in history”), or the action of another agent is seen as that (“I didn’t expect them to move to France”).

A human can think these things about themselves or about other agents; the most complicated variants are assessing the actions of one agent from the perspective of another agent (“if she signed the check, he’d be in a good position”).

Finally, there are meta, meta-meta, etc. versions of these, as we model other agents modelling us. All of these give a partial indication of our models of the rationality or reward, about ourselves and about other humans.
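One speculative way to see the list above as a single structure is to represent each fragment as a small record with a kind, a subject, a perspective, and a meta-level. Every field name here is my own invention, purely to show that the fragment types share a common shape:

```python
from dataclasses import dataclass

@dataclass
class ModelFragment:
    kind: str         # e.g. "reward", "rationality", "action_comparison",
                      # "action_valence", "outcome_valence", "surprise"
    subject: str      # whose behaviour or outcome is being judged
    perspective: str  # who is doing the judging (may differ from subject)
    content: str      # the judgement itself
    meta_level: int = 0  # 0 = direct; 1 = a model of a model; and so on

# A direct statement about reward, judged from one's own perspective:
frag = ModelFragment(kind="reward", subject="self", perspective="self",
                     content="I want chocolate now")
```

The `perspective`/`subject` split covers the complicated cross-agent variants (“if she signed the check, he’d be in a good position”), and `meta_level` covers the meta and meta-meta versions.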

• Moved back to drafts, given that I am 70% confident that this is still a draft (or maybe it’s some kind of game where I am supposed to figure out what Alice wants based on the sentence fragments in this post, feel free to move it back in that case).

• Oops! Sorry, this is indeed a draft.

• Herein I’m thinking about this and the sequel post, and trying to understand why you might be interested in this, since it doesn’t feel to me like you spell it out.

It seems we might care about model fragments if we think we can’t build complete models of other agents/​things but can instead build partial models. The “we” building these models might be literally us, but also an AI or a composite agent like humanity. Having a theory of what to do with these model fragments is useful, then, if we want to address at least two questions that we might be worried about around these parts: how do we decide an AI is safe based on our fragmentary models of it, and how does an AI model humanity based on its fragmentary models of humans?

• I’m looking at how humans model each other based on their fragmentary models, and using this to get to their values.

• Thinking a bit more, it seems a big problem we may face in using model fragments is that they are fragments, and we will have to find a way to stitch them together so that they fill the gaps between the models, perhaps requiring something like model interpolation. Of course, maybe this isn’t necessary if we think of fragments as mostly overlapping (although probably inconsistent in the overlaps), or of new fragments to fill gaps as available on demand if we discover we need them and don’t have them.

• I suspect dealing adequately with contradictions will be significantly more complicated than you propose, but I haven’t written about that in depth yet. When I get around to addressing what I view as necessary in this area (practicing moral particularism that will be robust to false positives), I definitely look forward to talking with you more about it.

• I agree with you to some extent. That post is mainly a placeholder that tells me that the contradictions problem is not intrinsically unsolvable, so I can put it aside while I concentrate on this problem for the moment.