Applying reinforcement learning theory to reduce felt temporal distance

(cross-posted from my blog)

It is a basic principle of reinforcement learning to distinguish between reward and value. The reward of a state is the immediate, intrinsic desirability of that state, whereas the value of a state depends on the rewards of the other states that you can reach from it: roughly, the total reward you can expect to collect from that point onward.

For example, suppose that I’m playing a competitive game of chess, and in addition to winning I happen to like capturing my opponent’s pieces, even when it doesn’t contribute to winning. I assign a reward of 10 points to winning, −10 to losing, 0 to a stalemate, and 1 point to each piece that I capture in the game. Now my opponent offers me a chance to capture one of his pawns, an action that would give me one point’s worth of reward. But when I look at the situation more closely, I see that it’s a trap: if I did capture the piece, I would be forced into a set of moves that would inevitably result in my defeat. So the value, or long-term reward, of that state is actually something close to −9: one point for the pawn, minus ten for the loss.

Once I realize this, I also realize that making that move is almost exactly equivalent to agreeing to resign in exchange for my opponent letting me capture one of his pieces. My defeat won’t be instant, but by making that move, I would nonetheless be choosing to lose.
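The arithmetic behind that −9 can be spelled out. The sketch below is only an illustration: it treats the trap as a fixed sequence of rewards (the pawn capture followed by the loss), uses the point values from the example, and assumes no other captures happen along the way. The name `state_value` is made up for this post.

```python
# Rewards from the chess example above.
REWARD_CAPTURE = 1
REWARD_LOSS = -10


def state_value(future_rewards):
    """Undiscounted value: the sum of every reward still to come."""
    return sum(future_rewards)


# Taking the bait: capture the pawn now, then lose by force.
# (Assumes no further captures happen along the way.)
take_pawn = [REWARD_CAPTURE, REWARD_LOSS]

immediate_reward = take_pawn[0]
value = state_value(take_pawn)

print(immediate_reward)  # 1
print(value)             # -9
```

The reward is what the move looks like in isolation; the value is what the move actually buys.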

Now consider a dilemma that I might be faced with when coming home late some evening. I have no food at home, but I’m feeling exhausted and don’t want to bother with going to the store, and I’ve already eaten today anyway. But I also know that if I wake up with no food in the house, then I will quickly end up with low energy, which makes it harder to go to the store, which means my energy levels will drop further, and so on until I finally get something to eat much later, after wasting a long time in an uncomfortable state.

Typically, temporal discounting means that I’m aware of this in the evening, but nonetheless skip the visit to the store. The penalty from not going feels remote, whereas the discomfort of going feels close, and that ends up dominating my decision-making. Besides, I can always hope that the next morning will be an exception, and I’ll actually get myself to go to the store right from the moment when I wake up!
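To make the effect of discounting concrete, here is a small sketch using the standard discounted-return formula, where a reward that arrives t steps in the future is weighted by gamma to the power t. The reward numbers, the two-step timeline, and the names `discounted_return` and `preferred` are all hypothetical, chosen only to show how a steep discount can flip the decision.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma ** (steps into the future)."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


# Hypothetical numbers: step 0 is this evening, step 1 is tomorrow morning.
go_to_store = [-2, 0]   # mild discomfort now, fine in the morning
skip_store = [0, -6]    # comfortable now, hungry and tired in the morning


def preferred(gamma):
    """Return whichever option has the higher discounted return."""
    go = discounted_return(go_to_store, gamma)
    skip = discounted_return(skip_store, gamma)
    return "go" if go > skip else "skip"


print(preferred(1.0))  # go: -2 beats -6 when the morning counts fully
print(preferred(0.2))  # skip: the morning penalty shrinks to about -1.2
```

Nothing about the morning itself changes between the two calls. Only the weight I give it does.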

And I haven’t tried this out for very long, but it feels like explicitly framing the different actions in terms of reward and value could be useful in reducing the impact of that experienced distance. I skip the visit to the store because being hungry in the morning is something that seems remote. But if I think that skipping the visit is exactly the same thing as choosing to be hungry in the morning, and that the value of skipping the visit is not the momentary relief of being home earlier but rather the inevitable consequence of the causal chain that it sets in motion – culminating in hours of hunger and low energy – then that feels very different.

And of course, I can propagate the consequences further back in time as well: if I think that I simply won’t have the energy to get food when I finally come home, then I should realize that I need to go buy the food before setting out on that trip. Otherwise I’ll again set in motion a causal chain whose end result is being hungry. So then not going shopping before I leave becomes exactly the same thing as being hungry next morning.
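This backward propagation can be sketched as a tiny value backup over a chain of states. Everything here is an assumption made for illustration: the chain is treated as fully deterministic, the state names and the −6 penalty are invented, and there is no discounting, so each state simply inherits the value of whatever it leads to.

```python
# Hypothetical deterministic chain: each state leads to exactly one next state.
next_state = {
    "leave without shopping": "come home exhausted",
    "come home exhausted": "skip the store",
    "skip the store": "hungry morning",
    "hungry morning": None,  # end of the chain
}

# Made-up immediate rewards: only the last state is actually unpleasant.
reward = {
    "leave without shopping": 0,
    "come home exhausted": 0,
    "skip the store": 0,
    "hungry morning": -6,
}


def value(state):
    """Reward of this state plus the value of whatever follows it."""
    if state is None:
        return 0
    return reward[state] + value(next_state[state])


for s in next_state:
    print(s, value(s))
```

Every state in the chain ends up with the same value as the hungry morning, which is the sense in which leaving without shopping "is" being hungry the next day.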

More examples of the same:

  • Slightly earlier I considered taking a shower, and realized that if I took a shower in my current state of mind I’d inevitably make it into a bath as well. So I wasn’t really just considering whether to take a shower, but whether to take a shower *and* a bath. That said, I wasn’t in a hurry to get anywhere and there didn’t seem to be much harm in also taking the bath, so I decided to go ahead with it.

  • While in the shower/bath, I started thinking about this post, and decided that I wanted to get it written. But I also wanted to enjoy my hot bath for a while longer. Considering it, I realized that staying in the bath for too long might cause me to lose my motivation for writing this, so there was a chance that staying in the bath would become the same thing as choosing not to get this written. I decided that the risk wasn’t worth it, and got up.

  • If I’m going somewhere and I choose a route that causes me to walk past a fast-food place selling something that I know I shouldn’t eat, and I know that the sight of that fast-food place is very likely to tempt me to eat there anyway, then choosing that particular route is the same thing as choosing to go eat something that I know I shouldn’t.
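The last two examples involve outcomes that are only likely rather than certain. In that case the value of a choice is a probability-weighted average of where it might lead. The sketch below uses invented numbers (a 90% chance of giving in, a value of −5 for eating there) and a made-up helper name, `expected_value`, purely for illustration.

```python
def expected_value(p_give_in, value_if_give_in, value_if_resist):
    """Probability-weighted average of the two possible outcomes."""
    return p_give_in * value_if_give_in + (1 - p_give_in) * value_if_resist


# Hypothetical numbers for the route past the fast-food place.
tempting_route = expected_value(0.9, -5, 0)
other_route = expected_value(0.0, -5, 0)

print(tempting_route)  # -4.5
print(other_route)
```

When the probability of giving in is high, the value of the tempting route is almost the same as the value of deciding to eat there outright.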

Related post: Applied cognitive science: learning from a faux pas.