Normative assumptions: regret

Crossposted at the Intelligent Agents Forum.

In a previous post, I presented a model of human rationality and reward as a pair (p, R), with p our (ir)rational planning algorithm (called a planner), and R our reward, with p(R) giving our actions/policy. I also showed that human behaviour is insufficient to establish R (or p) in any meaningful way whatsoever.

Yet humans seem to make judgements about each other’s rationality and true rewards all the time. And not only do these judgements often agree with each other (almost everyone agrees that the anchoring bias is a bias, or an out-of-context heuristic, i.e. a bias, and almost nobody argues that it’s actually a human value), but they often seem to have predictive ability. What’s going on?

Adding normative assumptions

To tackle this, we need to add normative assumptions. Normative assumptions are simply assumptions that distinguish between two pairs (p, R) and (p′, R′) that lead to the same policy: p(R) = p′(R′).

Since those pairs predict the same policy, they cannot be distinguished via observations. So a normative assumption is an extra piece of information that cannot itself be deduced from observation, and that distinguishes between planner-reward pairs.

Because normative assumptions cannot be deduced from observations, they are part of the definition of the human reward function. They are not abstract means of converging on this true reward; they are part of the procedure that defines this reward.

What are they?

There are two ways of seeing such normative assumptions. The first is as an extra update rule: upon seeing observation o, the probabilities of (p, R) and (p′, R′) would normally go to α and α′, but, with the normative assumption, there is an extra update to the relative probabilities of the two.

Equivalently, a normative assumption can be seen as an adjustment to the priors of these pairs. The two approaches are equivalent, but sometimes an extra update is computationally tractable whereas the equivalent prior would be intractable (and vice versa).
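To make the equivalence concrete, here is a minimal sketch; the pair names, likelihoods, and the normative factor are invented numbers, not anything derived from the model. It shows that applying the normative assumption as an extra multiplicative update gives the same posterior as folding that same factor into the prior.

```python
# Minimal sketch: a normative assumption as an extra update vs. an adjusted prior.
# The pairs, priors, likelihoods and the normative factor are all made up for illustration.

pairs = ["(p, R)", "(p', R')"]            # two planner-reward pairs with the same policy
prior = {"(p, R)": 0.5, "(p', R')": 0.5}

# Both pairs predict the observed behaviour equally well, so their likelihoods are equal.
likelihood = {"(p, R)": 0.8, "(p', R')": 0.8}

# The normative assumption: an extra factor favouring (p, R) that no observation supplies.
normative_factor = {"(p, R)": 1.0, "(p', R')": 0.2}

def normalise(d):
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

# View 1: ordinary Bayesian update, followed by the extra normative update.
posterior_view1 = normalise({k: prior[k] * likelihood[k] * normative_factor[k] for k in pairs})

# View 2: fold the normative factor into the prior, then update as usual.
adjusted_prior = normalise({k: prior[k] * normative_factor[k] for k in pairs})
posterior_view2 = normalise({k: adjusted_prior[k] * likelihood[k] for k in pairs})

print(posterior_view1)  # identical to posterior_view2
print(posterior_view2)
```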

Example: regret

So, if normative assumptions are definitions of rationality/reward, what are good candidates to pick? How do we, as humans, define our own rationality and reward?

Well, one strong way seems to be through our feeling of regret (human regret, not regret in the machine learning sense). We often feel and express regret when something turns out contrary to what we wanted. If we take “feelings and expressions of regret encode true reward information” as a normative assumption, then this restricts the number of (p, R) pairs that are compatible with such an assumption.

In a very simple example, someone has a 50% chance of receiving either a sword (s) or an iPhone (i). After that, they will either say h=“I’m happy with my gift”, or ~h=“I’m unhappy with my gift”.

The human’s reward is R(α, β, γ, δ) = αR(s-i) + βR(h-~h) + γR(h-~h|s) + δR(h-~h|i). The α, β, γ, and δ terms are constants (that can be negative). The R(s-i) expresses the extra reward the human gets from receiving the sword rather than the iPhone (if negative, it encodes the opposite reward preference). The R(h-~h) expresses the extra reward the human gets from saying h as opposed to ~h; here h and ~h are seen as pure “speech acts”, which don’t mean anything intrinsically.

The R(h-~h|s) expresses the same preference, conditional on the human having received the sword, and R(h-~h|i) expresses it conditional on the human having received the iPhone.

I’ll restrict to two planners: p, which is fully rational, and -p, which is fully anti-rational.

Assume the human will say h following s, and ~h following i (that’s the human “policy”). Then we are restricted to the following pairs:

  • (p, R(α, β, γ, δ) | γ≥-β≥δ) or (-p, R(α, β, γ, δ) | γ≤-β≤δ).

The conditions on the constants ensure the human follows the given policy: for the rational planner, saying h after s requires β + γ ≥ 0 (i.e. γ ≥ -β), and saying ~h after i requires β + δ ≤ 0 (i.e. -β ≥ δ); the anti-rational conditions are reversed. If we add the normative assumption about expressing regret, we can now say that the human values the sword more than the iPhone, and further restrict to:

  • (p, R(α, β, γ, δ) | γ≥-β≥δ, α≥0) or (-p, R(α, β, γ, δ) | γ≤-β≤δ, α≥0).

Interestingly, we haven’t ruled out the human being anti-rational! The anti-rational human is one who values the sword, but also wants to lie about it, and just spectacularly messes up the attempt at lying.
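As a small illustration of that survival, here is a sketch of the gift example; the grid of constant values is arbitrary and ties are broken towards h, so it is not the exact inequality conditions above, just the same idea. It enumerates planner-reward pairs, keeps those compatible with the observed policy, applies the regret assumption α ≥ 0, and both the rational and the anti-rational planner remain.

```python
from itertools import product

# Sketch of the sword/iPhone example. The grid of constants is arbitrary; the point is
# which (planner, reward) pairs survive each filter, not the specific numbers.

def speech_policy(planner, alpha, beta, gamma, delta):
    """Return (statement after the sword, statement after the iPhone)."""
    def choose(extra_reward_of_h):
        # rational planner 'p' maximises reward, anti-rational '-p' minimises it
        if planner == "p":
            return "h" if extra_reward_of_h >= 0 else "~h"
        else:
            return "h" if extra_reward_of_h <= 0 else "~h"
    # after s the extra reward of saying h is beta + gamma; after i it is beta + delta
    return choose(beta + gamma), choose(beta + delta)

observed_policy = ("h", "~h")      # says h after the sword, ~h after the iPhone
grid = [-1, 0, 1]                  # arbitrary values for alpha, beta, gamma, delta

survivors = []
for planner, alpha, beta, gamma, delta in product(["p", "-p"], grid, grid, grid, grid):
    if speech_policy(planner, alpha, beta, gamma, delta) != observed_policy:
        continue                   # incompatible with the observed behaviour
    if alpha < 0:
        continue                   # regret assumption: the human really prefers the sword
    survivors.append((planner, alpha, beta, gamma, delta))

# Both planners still appear among the survivors: behaviour plus the regret
# assumption does not rule out the anti-rational interpretation.
print({planner for planner, *_ in survivors})   # {'p', '-p'}
```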

Planning and multiple attempts

We could allow the human to take actions that increase or decrease the probability of getting the sword. Then, if they tried to get the sword and then claimed they wanted it, we would conclude they were rational, using p. If they tried to avoid the sword and then claimed they wanted it, we could conclude they were anti-rational, using -p.

Of course, that involves two decisions, and we need to add alternative p’s, which are differently rational/anti-rational with respect to “getting the sword” and “expressing preferences”.

But those p’s are more complex, and we can now finally start to make use of the simplicity prior, especially if we repeat variants of the experiment multiple times. Then if the human is taken to be always correct in their expression of regret (our assumption), and if their actions are consistently in line with their stated values (observation), we can conclude they are most likely rational. Victory?
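Here is a toy sketch of that argument; the two-decision planner model, the crude complexity scores, and the “honest vs. dishonest” speech rewards are all invented for illustration, not a real description-length measure. Given the regret assumption (the human genuinely prefers the sword) and consistent observed behaviour, the uniformly rational planner gets most of the weight, while the spectacularly failing liar survives only with a lower simplicity prior.

```python
# Toy sketch of the simplicity-prior argument. All quantities are illustrative.

# A planner is a pair of flags: how it treats "getting the sword" and "expressing regret".
planners = {
    "rational everywhere":           ("rational", "rational"),
    "anti-rational everywhere":      ("anti", "anti"),
    "rational getter, anti speaker": ("rational", "anti"),
    "anti getter, rational speaker": ("anti", "rational"),
}

# Crude complexity: uniform planners need one flag, mixed planners need two.
def complexity(flags):
    return 1 if flags[0] == flags[1] else 2

prior = {name: 2.0 ** -complexity(flags) for name, flags in planners.items()}

# Regret assumption: the human genuinely prefers the sword. The speech reward is free:
# either the human wants to state their true preference ("honest") or the opposite.
speech_rewards = ["honest", "dishonest"]

def predicted_behaviour(flags, speech_reward):
    get_flag, speak_flag = flags
    tries_for_sword = (get_flag == "rational")       # prefers the sword, so rational => pursue it
    wants_to_say_h = (speech_reward == "honest")     # h is the true statement here
    says_h = wants_to_say_h if speak_flag == "rational" else not wants_to_say_h
    return tries_for_sword, says_h

observed = (True, True)   # repeatedly tries to get the sword, and says h

posterior = {}
for name, flags in planners.items():
    for speech_reward in speech_rewards:
        if predicted_behaviour(flags, speech_reward) == observed:
            posterior[f"{name} / {speech_reward} speech reward"] = prior[name]

total = sum(posterior.values())
for key, weight in posterior.items():
    print(f"{key}: {weight / total:.2f}")
# 'rational everywhere / honest' gets about 2/3 of the weight; the failing liar gets the rest.
```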

Is the regret normative assumption correct?

Normative assumptions are not intrinsically correct or incorrect, as they cannot be deduced from observation. The question is instead whether they give a good definition of human reward, according to our judgement.

And the answer is… kinda?

First of all, it depends on context. Is the person speaking privately? Are they speaking to a beloved but formal relative who has just given them the present? Are they actually talking about that gift rather than another?

It’s clear that we cannot say for sure, just because the human expresses regret following an iPhone gift, that this is a correct description of their reward function. It depends, in a complicated way, on the context.

But “depending on the context” destroys the work the simplicity prior did in the earlier section. Unless we have a complicated description of when the regret assumption applies, we fall back into the difficulty of distinguishing complex noise and biases from actual values.

It’s not completely hopeless: the assumption that stated regret is often correct does do some work, combined with a simplicity prior. But to get the most out of it, we need a much more advanced understanding of regret.

We also need to balance this with other normative assumptions, lest we simply create a regret-minimiser.

Meta versus feeling

One way of figuring out how regret works is to ask people meta-questions: is the regret that you/they expressed in this situation likely to be genuine? How about that situation? If we make the normative assumption that the answers to these meta-questions are correct, we will get a more complicated understanding of regret, which will start reducing the space of (p, R) pairs that we need to consider.

Another approach might be to look at the brain and body correlates of genuine regret. I elided the distinction between these approaches by referring to “feelings and expressions of regret”, but I’ll return to this distinction in a later post.