Attainable Utility Preservation: Concepts

Appendix: No free impact

What if we want the agent to single-handedly ensure the future is stable and aligned with our values? AUP probably won't allow policies which actually accomplish this goal – one needs power to e.g. nip unaligned superintelligences in the bud. AUP aims to prevent catastrophes by stopping bad agents from gaining power to do bad things, but it symmetrically impedes otherwise-good agents.

This doesn't mean we can't get useful work out of agents – there are important asymmetries provided by both the main reward function and AU landscape counterfactuals.

First, even though we can't specify an aligned reward function, the provided reward function still gives the agent useful information about what we want. If we need paperclips, then a paperclip-AUP agent prefers policies which make some paperclips. Simple.
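To make this asymmetry concrete, here is a minimal, hypothetical sketch of an AUP-style reward. The function name, the auxiliary-utility lists, and the specific numbers are all illustrative assumptions, not the post's formal definition: the effective reward is the task reward minus a penalty for shifting the agent's attainable utilities (its power) relative to an inaction baseline.

```python
def aup_reward(task_reward, attainable_utilities, baseline_utilities, lam=1.0):
    """Illustrative AUP-style reward (not the formal definition).

    task_reward: scalar reward for the main task (e.g., paperclips made).
    attainable_utilities: auxiliary-goal values attainable after acting.
    baseline_utilities: the same values had the agent done nothing.
    lam: penalty strength.
    """
    # Penalize any change in attainable utility relative to inaction.
    penalty = sum(abs(q - q0)
                  for q, q0 in zip(attainable_utilities, baseline_utilities))
    return task_reward - lam * penalty

# A policy that makes some paperclips without changing its power...
mild = aup_reward(task_reward=1.0,
                  attainable_utilities=[5.0, 3.0],
                  baseline_utilities=[5.0, 3.0])
# ...beats one that makes more paperclips by seizing resources.
grabby = aup_reward(task_reward=2.0,
                    attainable_utilities=[9.0, 8.0],
                    baseline_utilities=[5.0, 3.0])
assert mild > grabby
```

The point of the sketch is the asymmetry: the task reward still pulls the agent toward making paperclips, while the penalty term pushes against power-grabbing ways of doing so.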

Second, if we don't like what it's beginning to do, we can shut it off (because it hasn't gained power over us). Therefore, it has "approval incentives" which bias it towards AU landscapes in which its power hasn't decreased too much, either.

So we can hope to build a non-catastrophic AUP agent and get useful work out of it. We just can't directly ask it to solve all of our problems: it doesn't make much sense to speak of a "low-impact singleton".


  • To emphasize: when I say "AUP agents do [X]" in this post, I mean that AUP agents correctly implementing the concept of AUP tend to behave in a certain way.

  • As pointed out by Daniel Filan, AUP suggests that one might work better in groups by ensuring one's actions preserve teammates' AUs.