Why is the impact penalty time-inconsistent?

I showed in a previous post that impact penalties were time-inconsistent. But why is this? There are two obvious possibilities:

  1. The impact penalty is inconsistent because it includes an optimisation process over the possible policies of the agent (eg when defining the $Q$-values in attainable utility preservation).

  2. The impact penalty is inconsistent because of how it’s defined at each step (eg because the stepwise inaction baseline is reset every turn).

It turns out the first answer is the correct one. And indeed, we get:

  • If the impact penalty is not defined in terms of optimising over the agent’s actions or policies, then it is kinda time-consistent.

What is the “kinda” doing there? Well, as we’ll see, there is a subtle semantics vs syntax issue going on.

Time-consistent rewards

In attainable utility preservation, and other impact penalties, the reward is ultimately a function of the current state $s_t$ and a counterfactual state $s'_t$.

For the initial state baseline and the initial state inaction baseline, the state $s'_t$ is determined independently of anything the agent has actually done. So these baselines are given by a function $f$:

  • $s'_t = f(\mu, t, A)$.

Here, $\mu$ is the environment and $A$ is the set of actions available to the agent. Since $\mu$ is fixed, we can re-write this as:

  • $s'_t = f_\mu(t, A)$.

Now, if the impact measure $IM$ is a function of $s_t$ and $s'_t$ only, then it is… a reward function, with $R(s_t, t) = IM(s_t, f_\mu(t, A))$. Thus, since this is just a reward function, the agent is time-consistent.
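
To make this concrete, here is a minimal Python sketch. The `env` object (with its `step` method and `noop` action) and the `impact_measure` function are hypothetical stand-ins, and the impact measure is assumed not to optimise over the agent’s actions or policies. The baseline states can be computed once up front, and the penalty is then an ordinary time-indexed reward function.

```python
# Minimal sketch: with an initial-state inaction baseline, the penalty
# collapses into an ordinary (time-indexed) reward function.

def inaction_rollout(env, s0, horizon):
    """Counterfactual states: apply the noop action repeatedly from s0."""
    states = [s0]
    for _ in range(horizon):
        states.append(env.step(states[-1], env.noop))
    return states  # states[t] plays the role of f_mu(t, A)


def make_initial_baseline_reward(env, s0, horizon, impact_measure):
    """R(s_t, t) = IM(s_t, f_mu(t, A)): no optimisation over the agent's
    actions or policies, so this is just a reward function."""
    baseline = inaction_rollout(env, s0, horizon)

    def reward(s_t, t):
        return impact_measure(s_t, baseline[t])

    return reward
```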

Now let’s look at the stepwise inaction baseline. In this case, $s'_t$ is determined by an inaction rollout from the prior state $s_{t-1}$. So the impact measure is actually a function of:

  • $s_t$ and $s'_t = f_\mu(s_{t-1}, A)$.

Again, if the impact measure is in fact independent of $A$, the set of the agent’s actions (including for the rollouts from $s_{t-1}$), then this is a reward function $R(s_{t-1}, s_t)$—one that is a function of the previous state and the current state, but that’s quite common for reward functions.
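
The stepwise case looks much the same in sketch form (same hypothetical `env.step`, `env.noop` and action-independent `impact_measure` as above, with a one-step inaction rollout for simplicity): the counterfactual state is rolled out from the previous state, giving a reward function over transitions.

```python
def make_stepwise_baseline_reward(env, impact_measure):
    """R(s_{t-1}, s_t) = IM(s_t, f_mu(s_{t-1}, A)): a reward function of
    the previous and the current state."""

    def reward(s_prev, s_t):
        # Counterfactual: the state the agent would have reached by
        # doing nothing on the last step (a one-step inaction rollout).
        s_counterfactual = env.step(s_prev, env.noop)
        return impact_measure(s_t, s_counterfactual)

    return reward
```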

So again, the agent has no interest in constraining its own future actions.

Semantics vs syntax

Back to “kinda”. The problem is that we’ve been assuming that actions and states are very distinct objects. Suppose that, as in the previous post, an agent at time $t$ wants to prevent itself from taking action $a_S$ (go south) at time $t+1$. Let $A$ be the agent’s full set of actions, and $A^{-S}$ the same set without $a_S$.

So now the agent might be time-inconsistent, since it’s possible that:

  • $IM(s_t, f_\mu(s_{t-1}, A)) \neq IM(s_t, f_\mu(s_{t-1}, A^{-S}))$.

But now, instead of denoting “can’t go south” by reducing the action set, we could instead denote it by expanding the state set. So define $s^{-S}$ as the same state as $s$, except that taking the action $a_S$ is the same as taking the noop action $\varnothing$. Everything is (technically) independent of $A$, so the agent is “time-consistent”.

But, of course, the two setups, restricted action set or extended state set, are almost completely isomorphic—even though, according to our result above, the agent would be time-consistent in the second case. It would be time-consistent in that it would not want to change the actions of its future self—instead it would just put its future self in a state where some actions were in practice unobtainable.
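
To make the contrast concrete, here is a hypothetical sketch of the two encodings (the names `ACTIONS`, `blocked_south` and `env.step` are illustrative, not from any particular implementation). An impact measure that only reads states and transitions is technically independent of the action set in the second encoding, even though the agent’s effective options are exactly the same.

```python
# Encoding 1: restricted action set, A^{-S} = A without "south".
ACTIONS = ["north", "south", "east", "west", "noop"]
ACTIONS_NO_SOUTH = [a for a in ACTIONS if a != "south"]


# Encoding 2: extended state set. Keep the full action set, but in the
# modified state s^{-S} the action "south" behaves exactly like "noop".
def step_extended(env, state, action):
    if getattr(state, "blocked_south", False) and action == "south":
        action = "noop"
    return env.step(state, action)
```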

So it seems that, unfortunately, it’s not enough to be a reward-maximiser (or a utility maximiser) in order to be time-consistent in practice.