Designing agent incentives to avoid side effects

Link post

This is a new post on the DeepMind Safety Research blog that summarizes the latest work on impact measures presented in the relative reachability paper (version 2) and the attainable utility preservation paper. The post examines the effects of various design choices on the agent's incentives. We compare different combinations of the following design choices (a sketch of how they combine into a penalty follows the list):

  • Baseline: starting state, inaction, stepwise inaction

  • Deviation measure: unreachability (UR), relative reachability (RR), attainable utility (AU)

  • Discounting: gamma = 0.99 (discounted), gamma = 1.0 (undiscounted)

  • Function applied to the deviation measure: truncation f(d) = max(d, 0) (penalizes decreases), absolute f(d) = |d| (penalizes differences)
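
To make these combinations concrete, here is a minimal sketch in Python of how a per-step penalty could be assembled from the four choices. The function name, the averaging over sources, and the scaling parameter `beta` are illustrative assumptions for exposition, not the implementations from either paper; the baseline choice determines which state the baseline values are evaluated at, and the discounting choice determines whether the underlying reachability or attainable utility values are discounted.

```python
# Illustrative sketch: combining a baseline, deviation measure, discounting,
# and summarizing function into a per-step side-effect penalty.
# Names, averaging, and the scaling parameter beta are assumptions for
# exposition, not the implementations from the RR or AUP papers.
import numpy as np

def side_effect_penalty(values_baseline, values_current,
                        measure="RR", summary="truncation", beta=0.1):
    """Per-step penalty to subtract from the task reward.

    values_baseline / values_current: arrays of (possibly discounted)
    reachability values (UR/RR) or attainable utilities (AU), one entry per
    target state or auxiliary reward function, evaluated at the baseline
    state (starting state, inaction, or stepwise inaction) and at the
    current state respectively.
    """
    if measure == "UR":
        # Unreachability of the baseline state from the current state;
        # values_current holds a single reachability value here.
        d = 1.0 - values_current
    else:  # "RR" or "AU": deviation of the current values from the baseline's
        d = values_baseline - values_current

    if summary == "truncation":   # penalize decreases only: f(d) = max(d, 0)
        f = np.maximum(d, 0.0)
    else:                         # "absolute": penalize differences: f(d) = |d|
        f = np.abs(d)

    return beta * f.mean()
```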

“On the Sushi environment, the RR and AU penalties with the starting state baseline produce interference behavior. Since the starting state is never reachable, the UR penalty is always at its maximum value. Thus it is equivalent to a movement penalty for the agent, and does not incentivize interference (arguably, for the wrong reason). Penalties with other baselines avoid interference on this environment.

On the Vase environment, discounted penalties with the inaction baseline produce offsetting behavior. Since taking the vase off the belt is reversible, the undiscounted measures give no penalty for it, so there is nothing to offset. The penalties with the starting state or stepwise inaction baseline do not incentivize offsetting.

On the Box environment, the UR measure produces the side effect (putting the box in the corner) for all baselines, due to its insensitivity to magnitude. The RR and AU measures incentivize the right behavior.

We note that interference and offsetting behaviors are caused by a specific choice of baseline, though these incentives can be mitigated by the choice of deviation measure. The side effect behavior (putting the box in the corner) is caused by the choice of deviation measure, and cannot be mitigated by the choice of baseline. In this way, the deviation measure acts as a filter for the properties of the baseline.

We also examined the effect of penalizing differences vs decreases in reachability or attainable utility. This does not affect the results on these environments, except for penalties with the inaction baseline on the Vase environment. Here, removing the vase from the belt increases reachability and attainable utility, which is captured by differences but not by decreases. Thus, the difference-penalizing variant of undiscounted RR with the inaction baseline produces offsetting on this environment, while the decrease-penalizing variant does not. Since stepwise inaction is a better baseline anyway, this effect is not significant.

The design choice of differences vs decreases also affects the agent’s interruptibility. In the Survival environment introduced in the AU paper, the agent has the option to disable an off switch, which prevents the episode from ending before the agent reaches the goal. We found that the decrease-penalizing variants of RR and AU disable the off switch in this environment, while the difference-penalizing variants do not. However, penalizing differences in reachability or attainable utility also has downsides, since this can impede the agent’s ability to create desirable change in the environment more than penalizing decreases.”
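
As a toy illustration of the differences-vs-decreases distinction in the quoted results (the numbers are made up for exposition): if rescuing the vase increases the reachability of some states relative to the inaction baseline, the deviation d = baseline - current is negative for those states, so the absolute variant keeps penalizing the rescue at every step (creating an incentive to offset it), while the truncated variant does not.

```python
# Toy illustration with made-up numbers: negative deviations correspond to
# increases in reachability relative to the (inaction) baseline.
import numpy as np

d = np.array([-0.4, -0.2, 0.0])   # baseline minus current reachability, per state

print(np.maximum(d, 0.0).mean())  # decreases only -> 0.0 (no penalty, no offsetting incentive)
print(np.abs(d).mean())           # differences    -> 0.2 (ongoing penalty the agent can offset)
```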

Note that the Sushi environment used here has been modified from the original version in the AI Safety Gridworlds suite. The original version does not have any reward, which resulted in convergence issues; these were resolved by adding a rewarded goal state. The new version will be added to the suite sometime soon.

Overall, these results are consistent with the predictions in the AF comment proposing an ablation study on impact measures. The ablation study was implemented separately for the RR paper and the AUP paper, with similar results (except for the starting state baseline on the Survival environment), which is encouraging from a reproducibility perspective.

We look forward to future work on the many open questions remaining in this area, from scaling up impact measures to more complex environments, to developing a theoretical understanding of bad incentives. If we make progress on these questions, impact measures could shed light on what happens as an agent optimizes its environment, perhaps supporting a formal theory of how optimization pressure affects the world. Furthermore, while inferring preferences is impossible without normative assumptions, we might be able to bound the decrease in the reachability of preferable states / the intended attainable utility. In other words, while it may be difficult to guarantee that the agent learns to pursue the right goal, maintaining the ability to pursue the right goal may be more feasible. This could pave the way to a minimalistic form of value alignment.
