# Why is the impact penalty time-inconsistent?

I showed in a previous post that impact penalties were time-inconsistent. But why is this? There are two obvious possibilities:

1. The impact penalty is inconsistent because it includes an optimisation process over the possible policies of the agent (eg when defining the $Q$-values in attainable utility preservation; see the sketch after this list).

2. The impact penalty is inconsistent because of how it’s defined at each step (eg because the stepwise inaction baseline is reset every turn).
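To make possibility 1 concrete, here is a minimal sketch (my own toy construction, not any published implementation) of an attainable-utility-style penalty on a three-state chain. All names and dynamics here are hypothetical; the point is just that the optimal $Q$-values contain a max over the agent’s action set, so the penalty built from them changes when that set changes:

```python
# Toy sketch of possibility 1: an AUP-style penalty contains optimal
# Q-values, and optimal Q-values contain a max over the agent's action
# set A, so the penalty depends on A. Everything here is hypothetical.
import itertools

STATES = [0, 1, 2]
NOOP, SOUTH = "noop", "south"

def step(s, a):
    # deterministic toy dynamics: "south" moves right, "noop" stays put
    return min(s + 1, 2) if a == SOUTH else s

def aux_reward(s):
    # auxiliary reward used inside the penalty: values state 2
    return 1.0 if s == 2 else 0.0

def q_values(actions, gamma=0.9, iters=200):
    # optimal Q for aux_reward, optimising ONLY over `actions`
    Q = {(s, a): 0.0 for s, a in itertools.product(STATES, actions)}
    for _ in range(iters):
        Q = {(s, a): aux_reward(step(s, a))
                     + gamma * max(Q[step(s, a), b] for b in actions)
             for s, a in itertools.product(STATES, actions)}
    return Q

def penalty(Q, s, a):
    # AUP-style penalty: shift in attainable auxiliary value vs noop
    return abs(Q[s, a] - Q[s, NOOP])

q_full = q_values([NOOP, SOUTH])
q_no_south = q_values([NOOP])  # the agent can no longer go south

print(q_full[0, NOOP])       # ~8.1: going south later is still attainable
print(q_no_south[0, NOOP])   # 0.0: shrinking A changes the Q-values...
print(penalty(q_full, 0, SOUTH))  # ~0.9: ...and hence the penalty
```

Shrinking the action set changes the attainable-utility $Q$-values, and hence the penalty; this is exactly the optimisation-over-policies dependence that possibility 1 points at.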

It turns out the first answer is the correct one. And indeed, we get:

- If the impact penalty is not defined in terms of optimising over the agent’s actions or policies, then it is kinda time-consistent.

What is the “kinda” doing there? Well, as we’ll see, there is a subtle semantics vs syntax issue going on.

## Time-consistent rewards

In attainable utility preservation, and other impact penalties, the reward is ultimately a function of the current state $s_t$ and a counterfactual state $s'_t$.

For the initial state and the initial state inaction baselines, the state $s'_t$ is determined independently of anything the agent has actually done. So these baselines are given by a function $f$:

- $s'_t = f(t, \mu, A)$.

Here, $\mu$ is the environment and $A$ is the set of actions available to the agent. Since $(\mu, A)$ is fixed, we can re-write this as:

- $s'_t = f_{\mu, A}(t)$.

Now, if the impact measure $IM$ is a function of $s_t$ and $s'_t$ only, then it is… a reward function, with $R(s_t, t) = IM(s_t, f_{\mu, A}(t))$. Thus, since this is just a reward function, the agent is time-consistent.
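As a sketch of this collapse (same toy chain as above, and the impact measure $IM$ here is a hypothetical stand-in), with a fixed baseline the counterfactual states can be precomputed once, after which the penalty is just an ordinary time-indexed reward:

```python
# Sketch: with a fixed initial-state inaction baseline, the impact
# penalty collapses into an ordinary reward function. Toy example.

NOOP, SOUTH = "noop", "south"

def step(s, a):
    return min(s + 1, 2) if a == SOUTH else s

def inaction_rollout(s0, horizon):
    # the counterfactual states s'_t = f_{mu,A}(t): fixed in advance,
    # independent of anything the agent actually does
    states, s = [], s0
    for _ in range(horizon):
        states.append(s)
        s = step(s, NOOP)
    return states

def impact_measure(s, s_prime):
    # hypothetical IM: just the distance between actual and baseline state
    return abs(s - s_prime)

baseline = inaction_rollout(s0=0, horizon=10)

def reward(s, t):
    # R(s_t, t) = -IM(s_t, f_{mu,A}(t)): an ordinary time-indexed reward,
    # so an agent maximising it is time-consistent
    return -impact_measure(s, baseline[t])

print(reward(2, 3))  # -2: penalised for drifting from the baseline
```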

Now let’s look at the stepwise inaction baseline. In this case, $s'_t$ is determined by an inaction rollout $g$ from the prior state $s_{t-1}$. So the impact measure is actually a function of:

- $IM(s_t, s'_t) = IM(s_t, g(s_{t-1}, \mu, A))$.

Again, if $IM$ is in fact independent of $A$, the set of the agent’s actions (including for the rollouts from $s_{t-1}$), then this is a reward function $R(s_{t-1}, s_t) = IM(s_t, g(s_{t-1}, \mu))$: one that depends on the previous state as well as the current state, but that’s quite common for reward functions.

So again, the agent has no interest in constraining its own future actions.
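A corresponding sketch, under the same toy assumptions: the stepwise penalty only ever looks one step back, so if $IM$ itself does not optimise over the agent’s action set, it is an ordinary reward function over state pairs:

```python
# Sketch: stepwise inaction baseline. If IM does not optimise over the
# agent's action set, the penalty is a reward function R(s_{t-1}, s_t).

NOOP, SOUTH = "noop", "south"

def step(s, a):
    return min(s + 1, 2) if a == SOUTH else s

def stepwise_baseline(s_prev):
    # g(s_{t-1}, mu): one noop step from the previous state
    return step(s_prev, NOOP)

def impact_measure(s, s_prime):
    # hypothetical IM, independent of the agent's action set
    return abs(s - s_prime)

def reward(s_prev, s):
    # R(s_{t-1}, s_t) = -IM(s_t, g(s_{t-1}, mu)): a two-argument reward,
    # so the agent has no reason to constrain its future actions
    return -impact_measure(s, stepwise_baseline(s_prev))

print(reward(1, 2))  # -1: moving away from the one-step baseline costs
```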

## Semantics vs syntax

Back to “kinda”. The problem is that we’ve been assuming that actions and states are very distinct objects. Suppose that, as in the previous post, an agent at time $t$ wants to prevent itself from taking action $a_S$ (go south) at time $t+1$. Let $A$ be the agent’s full set of actions, and $A_{-S}$ the same set without $a_S$.

So now the agent might be time-inconsistent, since it’s possible that:

- $IM(s_{t+1}, g(s_t, \mu, A)) \neq IM(s_{t+1}, g(s_t, \mu, A_{-S}))$.

But now, instead of denoting “can’t go south” by reducing the action set, we could instead denote it by expanding the state set. So define $\bar{s}_t$ as the same state as $s_t$, except that in $\bar{s}_t$, taking the action $a_S$ has the same effect as taking the noop action $a_\emptyset$. Everything is (technically) independent of $A_{-S}$, so the agent is “time-consistent”.

But, of course, the two setups, restricted action set or extended state set, are almost completely isomorphic. Yet, according to our result above, the agent would be time-consistent in the second case. It would be time-consistent in that it would not want to change the actions of its future self; instead, it would just put its future self in a state where some actions were, in practice, unavailable.
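Here is a sketch of that near-isomorphism in the same toy setting (the “broken legs” flag is my own hypothetical construction): instead of removing $a_S$ from the action set, we add a state flag under which $a_S$ behaves exactly like noop. Every penalty computation still sees the full action set, so it is syntactically time-consistent, even though the agent has semantically restricted its future actions:

```python
# Sketch: "restricted action set" re-encoded as "expanded state set".
# States are (position, broken) pairs; once broken, SOUTH acts like NOOP.

NOOP, SOUTH, BREAK = "noop", "south", "break_legs"

def step(state, a):
    pos, broken = state
    if a == BREAK:
        return (pos, True)                # expand the state: set the flag
    if a == SOUTH and not broken:
        return (min(pos + 1, 2), broken)  # south still works
    return (pos, broken)                  # noop, or south with broken legs

# Syntactically, every state offers the same full action set, so a
# penalty defined by optimising over it is "independent of A_{-S}":
FULL_ACTIONS = [NOOP, SOUTH, BREAK]

print(step((0, False), SOUTH))  # (1, False): south still works here
print({a: step((0, True), a) for a in FULL_ACTIONS})
# all three actions now lead to (0, True): south is a noop in disguise
```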

So it seems that, unfortunately, it’s not enough to be a reward-maximiser (or a utility maximiser) in order to be time-consistent in practice.

1. Is there such a thing as a free action, or an action where e.g. the agent breaks its own legs, when it is not accounted for in the action space of the underlying MDP? That feels like adding a new layer of misspecification (which no doubt is a possibility, and probably deserves deep investigation), orthogonal to reward function misspecification.

2. It seems as though this kind of circumvention of impact penalties depends on what calculates the penalty. If the environment (or something else external to the agent) is responsible for calculating the penalty and providing it to the agent, then it could do so as if the agent were not constrained (or committing to be constrained) at all. If the agent is internally responsible for computing this impact, it could probably find ways to hack this, similar to wireheading.

   So, assuming it’s some external entity E that computes the impact penalty, that entity must have the (dis?)ability to account for the agent’s modified action space when making this computation, for the agent to successfully reduce the penalty as in the earlier example.

   Something agent A does must signal to E that A’s action space has changed. If so, we might find ourselves in a situation where A is able to communicate an arbitrarily complex message to E, in particular something of the form: “I will never do anything that gives me more than the minimum penalty”, or in terms of actions: “I will deterministically follow the policy that gives me the minimum penalty while achieving my goals.” E, if it believed A, would not include high-penalty actions in the action space of its penalty computations, and thus A would avoid the impact penalty.