Figuring out what Alice wants, part I

This is a very preliminary two-part post sketching out the direction I’m taking my research now (second post here). I’m expecting and hoping that everything in here will get superseded quite quickly. This has obvious connections to classical machine intelligence research areas (such as interpretability). I’d be very grateful for any links to papers or people related to the ideas of this post.

The theory: model fragments

I’ve presented the theoretical argument for why we cannot deduce the preferences of an irrational agent, and a practical example of that difficulty. I’ll be building on that example to illustrate some algorithms that produce the same actions, but where we nonetheless can feel confident deducing different preferences.

I’ve mentioned a few ideas for “normative assumptions”: the assumptions that we, or an AI, could use to distinguish between different possible preferences even if they result in the same behaviour. I’ve mentioned things such as regret, humans stating their values with more or less truthfulness, human narratives, how we categorise our own emotions (those last three are in this post), or the structure of the human algorithm.

Those all seem rather ad hoc, but they are all trying to do the same thing: home in on human judgement about rationality and preferences. But what is this judgement? This judgement is defined to be the internal models that humans use to assess situations. These models, about ourselves and about other humans, often agree with each other from one human to the next (for instance, most people agree that you’re less rational when you’re drunk).

Calling them models might be a bit of an exaggeration, though. We often only get a fragmentary or momentary piece of a model: “he’s being silly”, “she’s angry”, “you won’t get a promotion with that attitude”. These are called to mind, thought upon, and then swiftly dismissed.

So what we want to access is the piece of the model that the human used to judge the situation. Now, these model fragments can often be contradictory, but we can deal with that problem later.

Then all the normative assumptions noted above are just ways of defining these model fragments, or accessing them (via emotion, truthful description, or regret). Regret is a particularly useful emotion, as it indicates a divergence between what was expected in the model and what actually happened (similarly to temporal difference learning).
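As a rough illustration of that analogy (my own sketch, not something from the post): a temporal-difference-style error measures exactly this divergence between the model’s prediction and what actually happened, and a large error is the shape of signal that regret or surprise would track. All numbers and names below are invented for illustration.

```python
# A minimal sketch of the temporal-difference analogy: the "surprise"/"regret"
# signal is the gap between what the internal model predicted and what the
# world actually delivered. Values here are purely illustrative.

def td_error(predicted_value, reward, next_value, discount=0.9):
    """Classic TD(0) error: observed return estimate minus the prior prediction."""
    return reward + discount * next_value - predicted_value

# The model expected the situation to be worth 5.0, but the reward plus the
# discounted value of what followed fell well short of that prediction:
surprise = td_error(predicted_value=5.0, reward=1.0, next_value=2.0)
print(surprise)  # roughly -2.2: a large negative error, the shape of regret
```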

So I’ll broadly sort methods of learning human model fragments into three categories:

  • Direct access to the internal model.

  • Regret and surprise as revealing mismatches between model expectations and outcomes.

  • Privileged output (eg certain human statements in certain circumstances are taken to be true-ish statements about the internal model).

The first method violates algorithmic equivalence and extensionality: two algorithms with identical outputs can nevertheless use different models. The latter two methods do respect algorithmic equivalence, once we have defined what behaviours correspond to regret/surprise, or what situations humans can be expected to respond truthfully to. In the process of defining those behaviours and situations, however, we are likely to use introspection and our own models: a sober, relaxed, rational human confiding confidentially in an impersonal computer is more likely to be truthful than a precariously employed worker on stage in front of their whole office.
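As a toy illustration of that first point (my own sketch; the “bar” scenario is just an invented stand-in): the two agents below output the same action in every situation, so any purely behavioural method treats them identically, yet their internal models encode opposite preferences.

```python
# A toy sketch (mine, not the post's) of why "direct access to the internal
# model" violates algorithmic equivalence: these two agents choose identically
# in every situation, yet the models -- and hence the preferences we would
# want to read off them -- differ.

def agent_who_wants_to_go_out(situation):
    # Internal model: going out is valued, and the agent acts on that value.
    values = {"go_to_bar": +1, "stay_home": -1}
    return max(values, key=values.get)

def agent_with_a_compulsion(situation):
    # Internal model: staying home is valued, but a bias/compulsion overrides
    # the preference and produces the same outward behaviour.
    values = {"go_to_bar": -1, "stay_home": +1}
    return "go_to_bar"

for situation in ["friday_evening", "rainy_monday"]:
    # Extensionally identical: no behavioural test separates the two agents.
    assert agent_who_wants_to_go_out(situation) == agent_with_a_compulsion(situation)
```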

What model fragments look like

The second post will provide examples of the approach, but here I’ll just list the kind of things that we can expect as model fragments:

  • Direct statements about rewards (“I want chocolate now”).

  • Direct statements about rationality (“I’m irrational around them”).

  • One action is deemed better than another (“you should start a paper trail, rather than just rely on oral instructions”).

  • An action is seen as good (or bad) compared with some implicit set of standard actions (“compliment your lover often”).

  • Similarly to actions, observations/outcomes can be treated as above (“the second prize is actually better”, “it was unlucky you broke your foot”).

  • An outcome is seen as surprising (“that was the greatest stock market crash in history”), or the action of another agent is seen as surprising (“I didn’t expect them to move to France”).

A human can think these things about themselves or about other agents; the most complicated variants assess the actions of one agent from the perspective of another agent (“if she signed the check, he’d be in a good position”).
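One way to picture these fragments as data (purely my own sketch: the field names and kind labels are assumptions layered on top of the list above, not anything defined in the post):

```python
# A rough data-structure sketch of the fragment kinds listed above: each
# fragment records what kind of judgement it is, who it is about, whose
# perspective it is made from, and its informal content.

from dataclasses import dataclass

FRAGMENT_KINDS = {
    "reward_statement",       # "I want chocolate now"
    "rationality_statement",  # "I'm irrational around them"
    "action_comparison",      # "start a paper trail rather than rely on oral instructions"
    "action_evaluation",      # "compliment your lover often"
    "outcome_evaluation",     # "the second prize is actually better"
    "surprise",               # "I didn't expect them to move to France"
}

@dataclass
class ModelFragment:
    kind: str         # one of FRAGMENT_KINDS
    about: str        # whose preferences/rationality are being judged
    perspective: str  # which agent is making the judgement
    content: str      # the informal judgement itself

# The "most complicated variant": one agent's actions judged from another's perspective.
fragment = ModelFragment(
    kind="action_comparison",
    about="her",
    perspective="him",
    content="if she signed the check, he'd be in a good position",
)
```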

Finally, there are meta, meta-meta, etc. versions of these, as we model other agents modelling us. All of these give a partial indication of our models of rationality or reward, about ourselves and about other humans.