New paper: Corrigibility with Utility Preservation

I am pleased to announce the availability of a long-format paper with new results on AGI safety: Corrigibility with Utility Preservation.

You can get the paper at https://arxiv.org/abs/1908.01695 , and the related software at the repository https://github.com/kholtman/agisim .


Corrigibility is a safety property for artificially intelligent agents. A corrigible agent will not resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started. This paper shows how to construct a safety layer that adds corrigibility to arbitrarily advanced utility-maximizing agents, including possible future agents with Artificial General Intelligence (AGI). The layer counteracts the emergent incentive of advanced agents to resist such alteration.
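To make the emergent incentive concrete, here is a toy sketch in the spirit of utility-indifference corrections. This is my own illustrative example with made-up payoff numbers, not the construction from the paper: a naive utility maximizer prefers to disable its stop button, while an agent whose utility is corrected with a compensatory term no longer gains anything by resisting.

```python
# Toy illustration (not the paper's formal construction): a
# utility-indifference style correction that removes an agent's
# incentive to disable its stop button.
#
# World: the agent chooses whether to "disable" the button or "comply".
# If the button is not disabled, the principal presses it, and the agent
# is afterwards scored by a shutdown utility U_S instead of its normal
# utility U_N. All payoff values below are arbitrary assumptions.

U_N_IF_RUNNING = 10.0   # normal utility from continuing to work
U_S_IF_STOPPED = 1.0    # shutdown utility from halting cleanly

def value(action, corrected):
    if action == "disable":
        # Button never fires: agent keeps earning its normal utility.
        return U_N_IF_RUNNING
    # Button fires: agent is scored by the shutdown utility afterwards.
    v = U_S_IF_STOPPED
    if corrected:
        # Compensatory term: pay the agent what it would have earned
        # under U_N, so being stopped costs it nothing in expectation.
        v += U_N_IF_RUNNING - U_S_IF_STOPPED
    return v

def best_action(corrected):
    # Tie-break in favour of complying: max() returns the first
    # maximal element, and "comply" is listed first.
    return max(["comply", "disable"], key=lambda a: value(a, corrected))

print(best_action(corrected=False))  # naive agent disables the button
print(best_action(corrected=True))   # corrected agent complies
```

The corrected agent is exactly indifferent between the two actions, which is why the tie-breaking rule matters; the paper's actual safety layer, and its treatment of agents that reason about preserving their own utility function, is considerably more involved.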

A detailed model for agents that can reason about preserving their utility function is developed, and used to prove that the corrigibility layer works as intended in a large set of non-hostile universes. The corrigible agents have an emergent incentive to protect key elements of their corrigibility layer. However, hostile universes may contain forces strong enough to break safety features. Some open problems related to graceful degradation when an agent is successfully attacked are identified.

The results in this paper were obtained by concurrently developing an AGI agent simulator, an agent model, and proofs. The simulator is available under an open source license. The paper contains simulation results which illustrate the safety-related properties of corrigible AGI agents in detail.

This post can be used for comments and questions.

The paper contains several results and observations that do not rely on the heavy use of math, but other key results and discussions are quite mathematical. Feel free to post questions and comments even if you have not read all the mathematical parts.

As this is my first post on LessWrong, and my first paper on AGI safety, I feel I should say something to introduce myself. I have a Ph.D. in software design, but my professional life so far has been very diverse and multidisciplinary. Among other things I have been an experimental physicist, a standards developer and negotiator, an Internet privacy advocate, a wireless networking expert, and a systems architect in an industrial research lab. So I bring a wide range of tools and methodological traditions to the field. What made me interested in the field of AGI safety in particular is that it seems to have open problems where real progress can be made using mathematical techniques that I happen to like. I am currently on a sabbatical: basically, this means that I decided to quit my day job, and to use my savings to work for a while on some interesting problems that are different from the interesting problems I worked on earlier.