Introducing Corrigibility (an FAI research subfield)

Benja, Eliezer, and I have published a new technical report, in collaboration with Stuart Armstrong of the Future of Humanity Institute. This paper introduces Corrigibility, a subfield of Friendly AI research. The abstract is reproduced below:

As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.

We’re excited to publish a paper on corrigibility, as it promises to be an important part of the FAI problem. This is true even without making strong assumptions about the possibility of an intelligence explosion. Here’s an excerpt from the introduction:

As AI systems grow more intelligent and autonomous, it becomes increasingly important that they pursue the intended goals. As these goals grow more and more complex, it becomes increasingly unlikely that programmers would be able to specify them perfectly on the first try.

Contemporary AI systems are correctable in the sense that when a bug is discovered, one can simply stop the system and modify it arbitrarily; but once artificially intelligent systems reach and surpass human general intelligence, an AI system that is not behaving as intended might also have the ability to intervene against attempts to “pull the plug”.

Indeed, by default, a system constructed with what its programmers regard as erroneous goals would have an incentive to resist being corrected: general analysis of rational agents1 has suggested that almost all such agents are instrumentally motivated to preserve their preferences, and hence to resist attempts to modify them [3, 8]. Consider an agent maximizing the expectation of some utility function U. In most cases, the agent’s current utility function U is better fulfilled if the agent continues to attempt to maximize U in the future, and so the agent is incentivized to preserve its own U-maximizing behavior. In Stephen Omohundro’s terms, “goal-content integrity” is an instrumentally convergent goal of almost all intelligent agents [6].

This holds true even if an artificial agent’s programmers intended to give the agent different goals, and even if the agent is sufficiently intelligent to realize that its programmers intended to give it different goals. If a U-maximizing agent learns that its programmers intended it to maximize some other goal U*, then by default this agent has incentives to prevent its programmers from changing its utility function to U* (as this change is rated poorly according to U). This could result in agents with incentives to manipulate or deceive their programmers.2

As AI systems’ capabilities expand (and they gain access to strategic options that their programmers never considered), it becomes more and more difficult to specify their goals in a way that avoids unforeseen solutions: outcomes that technically meet the letter of the programmers’ goal specification, while violating the intended spirit.3 Simple examples of unforeseen solutions are familiar from contemporary AI systems: e.g., Bird and Layzell [2] used genetic algorithms to evolve a design for an oscillator, and found that one of the solutions involved repurposing the printed circuit board tracks on the system’s motherboard as a radio, to pick up oscillating signals generated by nearby personal computers. Generally intelligent agents would be far more capable of finding unforeseen solutions, and since these solutions might be easier to implement than the intended outcomes, they would have every incentive to do so. Furthermore, sufficiently capable systems (especially systems that have created subsystems or undergone significant self-modification) may be very difficult to correct without their cooperation.

In this paper, we ask whether it is possible to construct a powerful artificially intelligent system which has no incentive to resist attempts to correct bugs in its goal system, and, ideally, is incentivized to aid its programmers in correcting such bugs. While autonomous systems reaching or surpassing human general intelligence do not yet exist (and may not exist for some time), it seems important to develop an understanding of methods of reasoning that allow for correction before developing systems that are able to resist or deceive their programmers. We refer to reasoning of this type as corrigible.

1. Von Neumann-Morgenstern rational agents [7], that is, agents which attempt to maximize expected utility according to some utility function.

2. In particularly egregious cases, this deception could lead an agent to maximize U* only until it is powerful enough to avoid correction by its programmers, at which point it may begin maximizing U. Bostrom [4] refers to this as a “treacherous turn”.

3. Bostrom [4] calls this sort of unforeseen solution a “perverse instantiation”.

(See the paper for references.)
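
To make the preference-preservation incentive from the excerpt concrete, here is a minimal toy sketch in Python. It is my own illustration, not an example from the paper: the outcomes, utility values, and numbers are invented. It simply shows a U-maximizing agent scoring the option “allow my utility function to be replaced with U*” using its current utility function U:

```python
# Toy illustration (not from the paper): a U-maximizer evaluates the option
# "allow my utility function to be replaced with U*" using its *current* U,
# so by default it prefers to keep U.  All names and numbers are made up.

OUTCOMES = ["paperclips", "staples"]

def U(outcome):        # the agent's current utility function
    return {"paperclips": 1.0, "staples": 0.0}[outcome]

def U_star(outcome):   # what the programmers intended
    return {"paperclips": 0.0, "staples": 1.0}[outcome]

def best_outcome(utility):
    """Whatever goal the agent ends up with, it steers toward that goal's optimum."""
    return max(OUTCOMES, key=utility)

# Option A: resist correction and go on maximizing U.
value_of_resisting = U(best_outcome(U))

# Option B: accept correction, so the future agent maximizes U*.
# Crucially, option B is still scored by the *current* utility function U.
value_of_accepting = U(best_outcome(U_star))

print(value_of_resisting)   # 1.0
print(value_of_accepting)   # 0.0  -> resisting the change wins under U
```

The point is simply that the option of being corrected is evaluated by the current U, not by U*, so by default the correction looks like a loss to the agent.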

This paper includes a description of Stuart Armstrong’s utility indifference technique previously discussed on LessWrong, and a discussion of some potential concerns. Many open questions remain even in our small toy scenario, and many more stand between us and a formal description of what it even means for a system to exhibit corrigible behavior.
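
Roughly, the utility-indifference idea combines a “normal” utility function with a “shutdown” utility function and adds a compensating term so that, in expectation, the agent does not care whether the shutdown button gets pressed. The sketch below is my own highly simplified toy rendering of that idea (the names and numbers are invented, and the paper’s construction is more careful than this):

```python
# Toy rendering of the utility-indifference idea (illustrative only).
# U_N scores normal operation, U_S scores shutting down.  A constant theta is
# added to the shutdown branch so the agent expects the same value whether or
# not the button is pressed, removing any incentive to manipulate the button.

# Hypothetical expected values of the agent's best normal-operation policy and
# its best shutdown policy under each component utility (made-up numbers).
best_normal_policy   = {"U_N": 10.0, "U_S": 0.0}
best_shutdown_policy = {"U_N": 1.0,  "U_S": 3.0}

# Compensation: what the agent would forgo by shutting down
# instead of continuing normal operation.
theta = best_normal_policy["U_N"] - best_shutdown_policy["U_S"]

def combined_value(policy, button_pressed):
    """Score a policy under the combined (indifference-corrected) utility."""
    if button_pressed:
        return policy["U_S"] + theta   # shutdown branch, compensated
    return policy["U_N"]               # normal branch

# The agent expects the same value either way, so preventing or causing the
# button press is worth nothing to it:
print(combined_value(best_normal_policy, button_pressed=False))   # 10.0
print(combined_value(best_shutdown_policy, button_pressed=True))  # 10.0
```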

Before we build generally intelligent systems, we will require some understanding of what it takes to be confident that the system will cooperate with its programmers in addressing aspects of the system that they see as flaws, rather than resisting their efforts or attempting to hide the fact that problems exist. We will all be safer with a formal basis for understanding the desired sort of reasoning.

As demonstrated in this paper, we are still encountering tensions and complexities in formally specifying the desired behaviors and the algorithms that will compactly yield them. The field of corrigibility remains wide open, ripe for study, and crucial to the development of safe, generally intelligent artificial systems.