2-D Robustness

This is a short note on a framing that was developed in collaboration with Joar Skalse, Chris van Merwijk and Evan Hubinger while working on Risks from Learned Optimization, but which did not find a natural place in the report.

Mesa-optimisation is a kind of robustness problem, in the following sense:

Since the mesa-optimiser is selected based on performance on the base objective, we expect it (once trained) to have a good policy on the training distribution. That is, we can expect the mesa-optimiser to act in a way that results in outcomes that we want, and to do so competently.

The place where we expect trouble is off-distribution. When the mesa-optimiser is placed in a new situation, there are two distinct failure modes worth highlighting, each an outcome that scores poorly on the base objective:

  • The mesa-optimiser fails to generalise in any way, and simply breaks, scoring poorly on the base objective.

  • The mesa-optimiser robustly and competently achieves an objective that is different from the base objective, thereby scoring poorly on the base objective.

Both of these are failures of robustness, but there is an important distinction to be made between them. In the first failure mode, the agent’s capabilities fail to generalise. In the second, its capabilities generalise, but its objective does not. This second failure mode seems in general more dangerous: if an agent is sufficiently capable, it might, for example, hinder human attempts to shut it down (if its capabilities are robust enough to generalise to situations involving human attempts to shut it down). These failure modes map to what Paul Christiano calls benign and malign failures in Techniques for optimizing worst-case performance.

This distinction suggests a framing of robustness that we have found useful while writing our report: instead of treating robustness as a scalar quantity that measures the degree to which the system continues working off-distribution, we can view robustness as a 2-dimensional quantity. Its two axes are something like “capabilities” and “alignment”, and the failure modes at different points in the space look different.

[Figure 1: robustness as a 2-D space, with axes for capability robustness and alignment robustness]
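The quadrants of this 2-D space can be made concrete with a small sketch. This is purely illustrative (the function and its labels are not from the original note): it classifies off-distribution behaviour by whether capabilities and the objective each generalise.

```python
# Illustrative sketch of the 2-D robustness framing: capability
# generalisation and objective (alignment) generalisation are treated
# as separate axes, and each quadrant yields a different outcome.
# Names and labels here are hypothetical, chosen for this example.

def off_distribution_outcome(capabilities_generalise: bool,
                             objective_generalises: bool) -> str:
    """Classify a mesa-optimiser's behaviour in a new situation."""
    if not capabilities_generalise:
        # First failure mode: the system simply breaks.
        return "benign failure: breaks, scores poorly on the base objective"
    if not objective_generalises:
        # Second failure mode: competent pursuit of the wrong objective.
        return "malign failure: competently pursues the wrong objective"
    # Both axes generalise: the desired outcome.
    return "robust: competently pursues the base objective"
```

On this picture, the more dangerous quadrant is the one where capabilities generalise but the objective does not.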

Unlike the 1-d picture, the 2-d picture suggests that more robustness is not always a good thing. In particular, robustness in capabilities is only good insofar as it is matched by robust alignment between the mesa-objective and the base objective. It may be the case that for some systems, we’d rather the system get totally confused in new situations than remain competent while pursuing the wrong objective.

Of course, there is a reason why we usually think of robustness as a scalar: one can define clear metrics for how well the system generalises, in terms of the difference between performance on the base objective on- and off-distribution. In contrast, 2-d robustness does not yet have an obvious way to ground its two axes in measurable quantities. Nevertheless, as an intuitive framing I find it quite compelling, and invite you to also think in these terms.
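The scalar metric described above can be sketched as a simple performance gap. This is a minimal illustration with hypothetical names, and, as noted, no analogous grounding currently exists for the two separate axes:

```python
# Minimal sketch of the scalar (1-d) robustness metric: the drop in
# base-objective performance when moving off-distribution. A smaller
# gap means a more robust system. Names here are illustrative only.

def generalisation_gap(on_dist_performance: float,
                       off_dist_performance: float) -> float:
    """Difference in base-objective performance on- vs off-distribution."""
    return on_dist_performance - off_dist_performance

# Note that this single number cannot distinguish the two failure
# modes: a system that simply breaks off-distribution and one that
# competently pursues the wrong objective can show the same large gap.
```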