Tiling Agents for Self-Modifying AI (OPFAI #2)

An early draft of publication #2 in the Open Problems in Friendly AI series is now available: Tiling Agents for Self-Modifying AI, and the Löbian Obstacle. ~20,000 words, aimed at mathematicians or the highly mathematically literate. The research reported on was conducted by Yudkowsky and Herreshoff, substantially refined at the November 2012 MIRI Workshop with Mihaly Barasz and Paul Christiano, and refined further at the April 2013 MIRI Workshop.

Abstract:

We model self-modification in AI by introducing ‘tiling’ agents whose decision systems will approve the construction of highly similar agents, creating a repeating pattern (including similarity of the offspring’s goals). Constructing a formalism in the most straightforward way produces a Gödelian difficulty, the Löbian obstacle. By technical methods we demonstrate the possibility of avoiding this obstacle, but the underlying puzzles of rational coherence are thus only partially addressed. We extend the formalism to partially unknown deterministic environments, and show a very crude extension to probabilistic environments and expected utility; but the problem of finding a fundamental decision criterion for self-modifying probabilistic agents remains open.
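For readers meeting it for the first time, the Gödelian difficulty the abstract refers to is Löb's theorem; the statement below is the standard textbook form, not anything specific to the paper's formalism.

```latex
% Löb's theorem, for a theory T extending Peano Arithmetic with a
% standard provability predicate \Box_T: if T proves that provability
% of P implies P, then T already proves P outright.
\[
  T \vdash \bigl( \Box_T \ulcorner P \urcorner \rightarrow P \bigr)
  \quad \Longrightarrow \quad
  T \vdash P .
\]
% Consequently T cannot prove the soundness schema "whatever T proves
% is true" except for sentences it already proves:
\[
  T \nvdash \; \Box_T \ulcorner P \urcorner \rightarrow P
  \qquad \text{whenever } T \nvdash P .
\]
```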

Commenting here is the preferred venue for discussion of the paper. This is an early draft and has not been reviewed, so it may contain mathematical errors, and reporting of these will be much appreciated.

The overall agenda of the paper is to introduce the conceptual notion of a self-reproducing decision pattern which includes reproduction of the goal or utility function, by exposing a particular possible problem with a tiling logical decision pattern and coming up with some partial technical solutions. This then makes it conceptually much clearer to point out the even deeper problems with “We can’t yet describe a probabilistic way to do this because of non-monotonicity” and “We don’t have a good bounded way to do this because maximization is impossible, satisficing is too weak, and Schmidhuber’s swapping criterion is underspecified.” The paper uses first-order logic (FOL) because FOL has a lot of useful standard machinery for reflection which we can then invoke; in real life, FOL is of course a poor representational fit to most real-world environments outside a human-constructed computer chip with thermodynamically expensive crisp variable states.
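To give a rough sense of where the obstacle bites, here is a compressed, informal sketch of a tiling criterion of the kind the paper studies; the notation is illustrative and not the paper's exact formalism.

```latex
% Suppose the parent agent, reasoning in theory T, licenses an action a
% only when T proves that a leads to the goal G:
\[
  \mathrm{do}(a) \;\Rightarrow\; T \vdash \ulcorner a \rightarrow G \urcorner .
\]
% To approve building a successor that reasons in the very same theory
% T, the parent must pass from "the successor will have proved that its
% action b leads to G" to "b leads to G", i.e. it needs instances of
% the soundness schema
\[
  \Box_T \ulcorner b \rightarrow G \urcorner \;\rightarrow\; (b \rightarrow G),
\]
% which, by Löb's theorem above, T can supply only for sentences it
% already proves outright.  The naive escape is to hand each successor
% a strictly weaker theory than its parent, which cannot tile indefinitely.
```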

As further background, the idea that something-like-proof might be relevant to Friendly AI is not about achieving some chimera of absolute safety-feeling, but rather about the idea that the total probability of catastrophic failure should not have a significant conditionally independent component on each self-modification, and that self-modification will (at least in initial stages) take place within the highly deterministic environment of a computer chip. This means that statistical testing methods (e.g. an evolutionary algorithm’s evaluation of average fitness on a set of test problems) are not suitable for self-modifications which can potentially induce catastrophic failure (e.g. of parts of code that can affect the representation or interpretation of the goals). Mathematical proofs have the property that they are as strong as their axioms and have no significant conditionally independent per-step failure probability if their axioms are semantically true, which suggests that something like mathematical reasoning may be appropriate for certain particular types of self-modification during some developmental stages.
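To make the "no significant conditionally independent per-step failure" point concrete, here is the elementary arithmetic behind it; the numbers are illustrative and not from the paper.

```latex
% If each of N self-modifications carried an independent catastrophic
% failure probability p, the probability of surviving all N would be
\[
  (1 - p)^N \;\approx\; e^{-pN} \;\longrightarrow\; 0
  \quad \text{as } N \to \infty .
\]
% E.g. p = 10^{-6} per step across N = 10^{7} self-modifications gives
% survival probability roughly e^{-10} \approx 4.5 \times 10^{-5}.
% A valid proof from semantically true axioms contributes no such
% conditionally independent per-step failure term.
```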

Thus the content of the paper is very far off from how a realistic AI would work, but conversely, if you can’t even answer the kinds of simple problems posed within the paper (both those we partially solve and those we only pose) then you must be very far off from being able to build a stable self-modifying AI. Being able to say how to build a theoretical device that would play perfect chess given infinite computing power is very far off from the ability to build Deep Blue. However, if you can’t even say how to play perfect chess given infinite computing power, you are confused about the rules of chess or the structure of chess-playing computation in a way that would make it entirely hopeless for you to figure out how to build a bounded chess-player. Thus “In real life we’re always bounded” is no excuse for not being able to solve the much simpler unbounded form of the problem, and being able to describe the infinite chess-player would be substantial and useful conceptual progress compared to not being able to do that. We can’t be absolutely certain that an analogous situation holds between solving the challenges posed in the paper, and realistic self-modifying AIs with stable goal systems, but every line of investigation has to start somewhere.

Parts of the paper will be easier to understand if you’ve read Highly Advanced Epistemology 101 For Beginners, including the parts on correspondence theories of truth (relevant to section 6) and model-theoretic semantics of logic (relevant to sections 3, 4, and 6), and there are footnotes intended to make the paper somewhat more accessible than usual, but the paper is still essentially aimed at mathematically sophisticated readers.