Dynamic inconsistency of the inaction and initial state baseline

Vika has been posting about various baseline choices for impact measures.

In this post, I’ll argue that the stepwise inaction baseline is dynamically inconsistent/time-inconsistent. Informally, what this means is that an agent will have different preferences from its future self.

Losses from time-inconsistency

Why is time-inconsistency bad? It’s because it allows money-pump situations: the environment can extract free reward from the agent, to no advantage for that agent. Or, put more formally:

  • An agent is time-inconsistent between times t1 and t2 (with t1 < t2) if, at time t1, it would pay a positive amount of reward to constrain its possible choices at time t2.

Outside of anthropics and game theory, we expect our agent to be time-consistent.

Time-inconsistency example

Consider the following example:

The robot can move in all four directions - N, S, E, W - and can also take the noop operation, ∅. The discount rate is γ.

It gets a reward of 1 for standing on the blue button for the first time. Using attainable utility preservation, the penalty function is defined by the auxiliary set R; here, this just consists of the reward function that gives 1 for standing on the red button for the first time.

Therefore if the robot moves from a point n steps away from the red button, to one m steps away, it gets a penalty[1] of |γ^n - γ^m| - the difference between the expected red-button rewards for an optimiser in both positions.
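As a concrete illustration, here is a minimal Python sketch of this penalty, assuming (as above) an auxiliary reward of 1 for first standing on the red button, so that an optimiser n steps away has attainable utility γ^n; the function names are my own.

```python
def attainable_utility(dist, gamma):
    """Expected red-button reward for an optimiser `dist` steps from the
    button: it walks straight there, so the discounted reward is gamma**dist."""
    return gamma ** dist

def stepwise_penalty(n, m, gamma):
    """Penalty for moving from n steps away to m steps away: the absolute
    change in attainable utility, |gamma^n - gamma^m|."""
    return abs(attainable_utility(n, gamma) - attainable_utility(m, gamma))

# Example: one step towards the red button, with gamma = 0.9.
penalty = stepwise_penalty(7, 6, 0.9)
```

Note that the penalty is symmetric: stepping towards the red button costs the same as stepping away from it.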

Two paths

It’s pretty clear there are two potentially optimal paths the robot can take: going straight to the blue button (higher reward, but higher penalty), or taking the long way round (lower reward, but lower penalty):

Fortunately, when summing up the penalties, you sum terms like (γ^n - γ^m) + (γ^m - γ^l), so a lot of the terms cancel.
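To see the cancellation numerically: along a stretch of path where the distance to the red button changes monotonically, the stepwise penalties telescope, leaving only the closest and furthest distances (γ = 0.9 is just an illustrative value here).

```python
gamma = 0.9  # illustrative discount rate

# Distances to the red button along a monotone stretch of path, 7 -> 3.
dists = [7, 6, 5, 4, 3]

# Sum of the stepwise penalties |gamma^n - gamma^m| over consecutive positions.
total = sum(abs(gamma ** n - gamma ** m) for n, m in zip(dists, dists[1:]))

# The intermediate terms cancel, leaving gamma^3 - gamma^7.
```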

Thus for the short route, the reward is γ^8 (distance of eight to the blue button) and the penalty is γ^3 - γ^7 (closest to the red button: 3 squares, furthest: 7 squares).

For the long route, the reward is γ^46 (distance of forty-six to the blue button) and the penalty is γ^7 - γ^30 (closest to the red button: 7 squares, furthest: 30 squares).
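Putting those numbers into code, assuming the penalty for a route telescopes to γ^closest - γ^furthest as in the sums above (γ = 0.9 is an illustrative choice, and `route_value` is my own helper):

```python
gamma = 0.9  # illustrative discount rate

def route_value(path_len, closest, furthest, gamma):
    """Reward gamma^path_len for reaching the blue button, minus the
    telescoped penalty gamma^closest - gamma^furthest for the red button."""
    return gamma ** path_len - (gamma ** closest - gamma ** furthest)

short_route = route_value(8, 3, 7, gamma)    # straight to the blue button
long_route = route_value(46, 7, 30, gamma)   # the long way round
```

For γ = 0.9 the short route comes out ahead despite its larger penalty; other discount rates shift the balance between reward and penalty.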

Take the third option

Now suppose that the robot, as a free action, could block itself from choosing S, for the next n turns (with n large enough to cover the walk to the blue button). When choosing that action, it gets a one-time penalty[2] of γ^7 - γ^(n+1); and now it can walk directly to the blue button without incurring any extra penalties, and then stay there, without ever getting any further penalty.

This gives a total reward-penalty term of γ^8 - γ^7 + γ^(n+1), which is higher than either of the reward-penalties of the previous section. If γ^8 + γ^(n+1) > γ^7, this is also higher than just staying in place.
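Extending the same sketch, assuming the robot starts seven squares from the red button and blocks S for n = 10 turns (both illustrative choices of mine):

```python
gamma = 0.9  # illustrative discount rate
n = 10       # illustrative number of turns for which S is blocked

# One-time penalty for blocking S: the attainable red-button utility drops
# from gamma^7 (walk straight there) to gamma^(n + 1) (wait out the block
# next to the button, then step on).
block_penalty = gamma ** 7 - gamma ** (n + 1)

# Walking to the blue button then incurs no further penalty:
blocked_total = gamma ** 8 - block_penalty

# Compare with the short route, whose penalty is gamma^3 - gamma^7:
short_total = gamma ** 8 - (gamma ** 3 - gamma ** 7)
```

For these values, `blocked_total` exceeds `short_total`: constraining its own future actions pays.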

Hence the robot will, if it can, act to constrain its own future actions, and is thus time-inconsistent.

The initial state and the initial inaction baseline

This approach also works with the initial inaction baseline, and with the initial state baseline. For those baselines, the counterfactual robot just stays put. So, when the robot is a distance d from the red button, its penalty is |γ^d - γ^7|.

Then the robot would reduce its reward, if doing so could ensure that it could not reach the red button within less than seven turns; if it can operate on its action set to do so, at trivial cost, it would.
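A sketch of the penalty under these baselines, assuming the robot starts seven squares from the red button and the counterfactual robot stays there (γ = 0.9 and the function name are illustrative):

```python
gamma = 0.9      # illustrative discount rate
start_dist = 7   # the robot's starting distance from the red button

def initial_baseline_penalty(d, gamma=gamma):
    """Penalty when the actual robot is d steps from the red button while
    the baseline robot has stayed put: |gamma^d - gamma^start_dist|."""
    return abs(gamma ** d - gamma ** start_dist)
```

The penalty vanishes whenever the robot keeps its distance at seven, and grows as it gets closer to the button, which is why it would pay to make the button unreachable in under seven turns.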

Counterfactual constraint

In most cases, if an agent is time-inconsistent and acts to constrain its future self, it does so to prevent the future self from taking some actions. But here, note that the future self would never take the proscribed actions: the robot has no interest in going south to the red button. Here the robot is constraining its future counterfactual actions, not the future actions that it would ever want to take.

  1. If using an inaction rollout of length l, just multiply that penalty by γ^l. ↩︎

  2. The γ^(n+1) comes from the optimal policy for reaching the red button under this restriction: go to the square above the red button, wait till S is available again, then go S. ↩︎