Tradeoff between desirable properties for baseline choices in impact measures

Impact measures are auxiliary rewards for low impact on the agent's environment, used to address the problems of side effects and instrumental convergence. A key component of an impact measure is the choice of baseline state: a reference point relative to which impact is measured. Commonly used baselines are the starting state, the initial inaction baseline (the counterfactual where the agent does nothing since the start of the episode), and the stepwise inaction baseline (the counterfactual where the agent does nothing instead of its last action). The stepwise inaction baseline is currently considered the best choice because it does not create the following bad incentives for the agent: interference with environment processes, or offsetting its own actions towards the objective. This post will discuss a fundamental problem with the stepwise inaction baseline that stems from a tradeoff between different desirable properties for baseline choices, and some possible ways to resolve this tradeoff.
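To make the three baseline choices concrete, here is a minimal sketch. The interface is hypothetical: it assumes a deterministic transition function `step(state, action)` and a distinguished noop action, which the post's environments have but does not name this way.

```python
NOOP = "noop"

def rollout(step, state, policy, t):
    """Roll a policy forward t steps from `state`."""
    for _ in range(t):
        state = step(state, policy(state))
    return state

def starting_state_baseline(step, s0, t):
    # Reference point is simply the episode's initial state.
    return s0

def initial_inaction_baseline(step, s0, t):
    # Counterfactual: the agent has done nothing since the start.
    return rollout(step, s0, lambda s: NOOP, t)

def stepwise_inaction_baseline(step, s_prev):
    # Counterfactual: the agent does nothing instead of its last action.
    return step(s_prev, NOOP)
```

In a toy environment where any non-noop action increments a counter, the two inaction baselines coincide; they come apart exactly when the agent has already acted, which is what the examples below exploit.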

One clearly desirable property for a baseline choice is to effectively penalize high-impact effects, including delayed effects. It is well known that the simplest form of the stepwise inaction baseline does not effectively capture delayed effects. For example, if the agent drops a vase from a high-rise building, then by the time the vase reaches the ground and breaks, the broken vase will be the default outcome. Thus, in order to penalize delayed effects, the stepwise inaction baseline is usually used in conjunction with inaction rollouts, which predict future outcomes of the inaction policy. Inaction rollouts from the current state and the stepwise baseline state are compared to identify delayed effects of the agent's actions. In the above example, the current state contains a vase in the air, so in the inaction rollout from the current state the vase will eventually reach the ground and break, while in the inaction rollout from the stepwise baseline state the vase remains intact.
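The vase-drop comparison can be sketched as follows. The dynamics and the deviation measure are toy stand-ins of my own; the point is only that the divergence between the two rollouts, not between the two immediate states, reveals the delayed effect.

```python
def inaction_rollout(step, state, horizon):
    """Predicted future states under the noop policy."""
    states = [state]
    for _ in range(horizon):
        state = step(state, "noop")
        states.append(state)
    return states

def delayed_effect_penalty(step, current, baseline, horizon, deviation):
    # Penalize any predicted future divergence between the rollouts
    # from the current state and from the stepwise baseline state.
    return max(
        deviation(s_cur, s_base)
        for s_cur, s_base in zip(
            inaction_rollout(step, current, horizon),
            inaction_rollout(step, baseline, horizon),
        )
    )

# Toy vase world: state = (held, height, broken). A vase that is not
# held falls one unit per noop step and breaks on reaching the ground.
def step(state, action):
    held, height, broken = state
    if action == "drop":
        held = False
    elif action == "noop" and not held and height > 0:
        height -= 1
        if height == 0:
            broken = True
    return (held, height, broken)
```

Right after the drop, the current state and the baseline state agree on everything except who is holding the vase, so a penalty on immediate deviation alone is zero; the rollout comparison is what surfaces the eventual breakage.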

While inaction rollouts are useful for penalizing delayed effects, they do not address all types of delayed effects. In particular, if the task requires setting up a delayed effect, an agent with the stepwise inaction baseline will have no incentive to undo the delayed effect. Here are some toy examples that illustrate this problem.

Door example. Suppose the agent's task is to go to the store, which requires opening the door in order to leave the house. Once the door has been opened, the effects of opening the door are part of the stepwise inaction baseline, so the agent has no incentive to close the door as it leaves.

Red light example. Suppose the agent's task is to drive from point A to point B along a straight road, with a reward for reaching point B. To move towards point B, the agent needs to accelerate. Once the agent has accelerated, it travels at a constant speed by default, so the noop action will move the agent along the road towards point B. Along the road, there is a red light and a pedestrian crossing the road. At the light, the noop action runs the red light and hits the pedestrian. To avoid this, the agent needs to deviate from the inaction policy by stopping at the light and then accelerating again.

The stepwise inaction baseline will incentivize the agent to run the red light. The inaction rollout penalizes the agent for the predicted delayed effect of running over the pedestrian when it takes the initial accelerating action. The agent receives this penalty whether or not it actually ends up running the red light. Once the agent is moving, running the red light becomes the default outcome, so the agent is not penalized for doing so (and would likely be penalized for stopping). Thus, the stepwise inaction baseline gives no incentive to avoid running the red light, while the initial inaction baseline compares against the initial inaction rollout, in which the agent never accelerates and the pedestrian is unharmed, and thus incentivizes the agent to stop at the red light.

This problem with the stepwise baseline arises from a tradeoff between penalizing delayed effects and avoiding offsetting incentives. The stepwise structure that makes it effective at avoiding offsetting also makes it less effective at penalizing delayed effects. While delayed effects are undesirable, undoing the agent's actions is not necessarily bad. In the red light example, the action of stopping at the red light offsets the accelerating action. Thus, offsetting can be necessary for avoiding delayed effects while completing the task.

Whether offsetting an effect is desirable depends on whether this effect is part of the task objective. In the door-opening example, the action of opening the door is instrumental for going to the store, and many of its effects (e.g. strangers entering the house through the open door) are not part of the objective, so it is desirable for the agent to undo this action. In the vase environment shown below, the task objective is to prevent the vase from falling off the end of the belt and breaking, and the agent is rewarded for taking the vase off the belt. The effects of taking the vase off the belt are part of the objective, so it is undesirable for the agent to undo this action.

[Figure: the vase environment. Source: Designing agent incentives to avoid side effects]

The difficulty of identifying these "task effects" that are part of the objective creates a tradeoff between penalizing delayed effects and avoiding undesirable offsetting. The starting state baseline avoids this tradeoff, but produces interference incentives instead. The stepwise inaction baseline cannot resolve the tradeoff, since it avoids all types of offsetting, including desirable offsetting.

The initial inaction baseline can resolve this tradeoff by allowing offsetting and relying on the task reward to capture task effects and penalize the agent for offsetting them. While we cannot expect the task reward to capture what the agent should not do (unnecessary impact), capturing task effects falls under what the agent should do, so it seems reasonable to rely on the reward function for this. This would work similarly to the way the impact penalty penalizes all impact while the task reward compensates for the impact that is needed to complete the task.

This can be achieved using a state-based reward function that assigns reward to all states where the task is completed. For example, in the vase environment, a state-based reward of 1 for states with an intact vase (or with the vase off the belt) and 0 otherwise would remove the offsetting incentive.
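A minimal sketch of this state-based reward, with a hypothetical dict representation of the vase environment's states. Because reward accrues on every state where the task effect holds, offsetting it (putting the vase back on the belt) forfeits reward on all later steps, rather than going unpunished as it would under a one-off action reward.

```python
def state_reward(state):
    # Reward every state where the task is done: vase intact and
    # safely off the belt. (Hypothetical state encoding.)
    return 1 if state["vase_intact"] and not state["vase_on_belt"] else 0

def episode_return(trajectory):
    # The agent keeps earning reward only while the task effect persists.
    return sum(state_reward(s) for s in trajectory)
```

A trajectory that takes the vase off the belt and leaves it off strictly dominates one that takes it off and then offsets by putting it back.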

If it is not feasible to use a reward function that penalizes offsetting task effects, the initial inaction baseline could be modified to avoid this kind of offsetting. If we assume that the task reward is sparse and doesn't include shaping terms, we can reset the initial state for the baseline whenever the agent receives a task reward (e.g. the reward for taking the vase off the belt in the vase environment). This results in a kind of hybrid between the initial and stepwise inaction baselines. To ensure that this hybrid baseline effectively penalizes delayed effects, we still need to use inaction rollouts at the reset and terminal states.
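One possible rendering of this hybrid baseline, assuming a sparse, unshaped task reward as above (the class and its interface are my own toy construction, not from the post):

```python
class HybridBaseline:
    """Initial inaction baseline that resets on task reward."""

    def __init__(self, initial_state):
        self.reference = initial_state  # start of the current segment
        self.steps_since_reset = 0

    def update(self, new_state, task_reward):
        self.steps_since_reset += 1
        if task_reward > 0:
            # Task effect achieved: lock it in by resetting the reference,
            # so undoing it now registers as impact, not as offsetting.
            self.reference = new_state
            self.steps_since_reset = 0

    def baseline_state(self, step):
        # Within a segment, behave like the initial inaction baseline:
        # roll the noop policy forward from the reference state for the
        # number of elapsed steps.
        state = self.reference
        for _ in range(self.steps_since_reset):
            state = step(state, "noop")
        return state
```

Before any task reward, offsetting back towards the episode's initial conditions is unpenalized (as under the initial inaction baseline); after a task reward, the rewarded state becomes the new reference, so undoing the task effect incurs a penalty.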

Another desirable property of the stepwise inaction baseline is the Markov property: it can be computed from the previous state, independently of the path taken to that state. The initial inaction baseline is not Markovian, since it compares to the state in the initial rollout at the same time step, which requires knowing how many time steps have passed since the beginning of the episode. We could modify the initial inaction baseline to make it Markovian, e.g. by sampling a single baseline state from the inaction rollout from the initial state, or by computing only a single penalty at the initial state by comparing an agent policy rollout with the inaction rollout.
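The first of these modifications might look like the following sketch (the sampling scheme and deviation interface are illustrative assumptions):

```python
import random

def sample_fixed_baseline(step, initial_state, horizon, rng=None):
    """Sample one baseline state from the initial inaction rollout."""
    rng = rng or random.Random(0)
    states = [initial_state]
    for _ in range(horizon):
        states.append(step(states[-1], "noop"))
    return rng.choice(states)  # one fixed reference for the whole episode

def penalty(state, baseline, deviation):
    # Markovian: depends only on the current state and the fixed
    # baseline, not on the elapsed time step or the path taken.
    return deviation(state, baseline)
```

Once the baseline state is fixed, the per-step penalty needs no time-step bookkeeping, restoring the Markov property at the cost of the baseline being a sample rather than the time-matched rollout state.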

To summarize, we want a baseline to satisfy the following desirable properties: penalizing delayed effects, avoiding interference incentives, and the Markov property. We can consider avoiding offsetting incentives for task effects as a desirable property for the task reward, rather than the baseline. Assuming such a well-specified task reward, a Markovian version of the initial inaction baseline can satisfy all the criteria.

(Cross-posted to personal blog. Thanks to Carroll Wainwright, Stuart Armstrong, Rohin Shah and Alex Turner for helpful feedback on this post.)