Vingean Reflection: Reliable Reasoning for Self-Improving Agents

I’m pleased to announce a new paper from MIRI: Vingean Reflection: Reliable Reasoning for Self-Improving Agents.


Today, human-level machine intelligence is in the domain of futurism, but there is every reason to expect that it will be developed eventually. Once artificial agents become able to improve themselves further, they may far surpass human intelligence, making it vitally important to ensure that the result of an “intelligence explosion” is aligned with human interests. In this paper, we discuss one aspect of this challenge: ensuring that the initial agent’s reasoning about its future versions is reliable, even if these future versions are far more intelligent than the current reasoner. We refer to reasoning of this sort as Vingean Reflection.

A self-improving agent must reason about the behavior of its smarter successors in abstract terms, since if it could predict their actions in detail, it would already be as smart as them. This is called the Vingean principle, and we argue that theoretical work on Vingean reflection should focus on formal models that reflect this principle. However, the framework of expected utility maximization, commonly used to model rational agents, fails to do so. We review a body of work which instead investigates agents that use formal proofs to reason about their successors. While it is unlikely that real-world agents would base their behavior entirely on formal proofs, this appears to be the best currently available formal model of abstract reasoning, and work in this setting may lead to insights applicable to more realistic approaches to Vingean reflection.

This is the fourth in a series of six papers discussing various components of MIRI’s technical research agenda. It motivates the field of Vingean reflection, which studies methods by which agents can reason reliably about agents that are more intelligent than themselves. Toy models used to study this problem include the “tiling agent” models that have been discussed on LessWrong in the past. The introduction to the paper runs as follows:

In a 1965 article, I.J. Good introduced the concept of an “intelligence explosion” (Good 1965):

Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,’ and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make.

Almost fifty years later, a machine intelligence that is smart in the way humans are remains the subject of futurism and science fiction. But barring global catastrophe, there seems to be little reason to doubt that humanity will eventually create a smarter-than-human machine. Whether machine intelligence can really leave the intelligence of biological humans far behind is less obvious, but there is some reason to think that this may be the case (Bostrom 2014): First, the hardware of human brains is nowhere close to physical limits; and second, not much time has passed on an evolutionary timescale since humans developed language, suggesting that we possess the minimal amount of general intelligence necessary to develop a technological civilization, not the theoretical optimum.

It’s not hard to see that if building an artificial superintelligent agent will be possible at some point in the future, this could be both a great boon to humanity and a great danger if this agent does not work as intended (Bostrom 2014; Yudkowsky 2008). Imagine, for example, a system built to operate a robotic laboratory for finding a cure for cancer; if this is its only goal, and the system becomes far smarter than any human, then its best course of action (to maximize the probability of achieving its goal) may well be to convert all of Earth into more computers and robotic laboratories—and with sufficient intelligence, it may well find a way to do so. This argument generalizes, of course: While there is no reason to think that an artificial intelligence would be driven by human motivations like a lust for power, any goals that are not quite ours would place it at odds with our interests.

How, then, can we ensure that self-improving smarter-than-human machine intelligence, if and when it is developed, is beneficial to humanity?

Extensive testing may not be sufficient. A smarter-than-human agent would have an incentive to pretend during testing that its goals are aligned with ours, even if they are not, because we might otherwise attempt to modify it or shut it down (Bostrom 2014). Hence, testing would only give reliable information if the system is not yet sufficiently intelligent to deceive us. If, at this point, it is also not yet intelligent enough to realize that its goals are at odds with ours, a misaligned agent might pass even very extensive tests.

Moreover, the test environment may be very different from the environment in which the system will actually operate. It may be infeasible to set up a testing environment which allows a smarter-than-human system to be tested in the kinds of complex, unexpected situations that it might encounter in the real world as it gains knowledge and executes strategies that its programmers never conceived of.

For these reasons, it seems important to have a theoretical understanding of why the system is expected to work, so as to gain high confidence in a system that will face a wide range of unanticipated challenges (Soares and Fallenstein 2014a). By this we mean two things: (1) a formal specification of the problem faced by the system; and (2) a firm understanding of why the system (which must inevitably use practical heuristics) is expected to perform well on this problem.

It may seem odd to raise these questions today, with smarter-than-human machines still firmly in the domain of futurism; we can hardly verify that the heuristics employed by an artificial agent work as intended before we even know what these heuristics are. However, Soares and Fallenstein (2014a) argue that there is foundational research we can do today that can help us understand the operation of a smarter-than-human agent on an abstract level.

For example, although the expected utility maximization framework of neoclassical economics has serious shortcomings in describing the behavior of a realistic artificial agent, it is a useful starting point for asking whether it’s possible to avoid giving a misaligned agent incentives for manipulating its human operators (Soares 2015). Similarly, it allows us to ask what sorts of models of the environment would be able to deal with the complexities of the real world (Hutter 2000). Where this framework falls short, we can ask how to extend it to capture more aspects of reality, such as the fact that an agent is a part of its environment (Orseau 2012), and the fact that a real agent cannot be logically omniscient (Gaifman 2004; Soares and Fallenstein 2015). Moreover, even when more realistic models are available, simple models can clarify conceptual issues by idealizing away difficulties not relevant to a particular problem under consideration.

In this paper, we review work on one foundational issue that would be particularly relevant in the context of an intelligence explosion—that is, if humanity does not create a superintelligent agent directly, but instead creates an agent that attains superintelligence through a sequence of successive self-improvements. In this case, the resulting superintelligent system may be quite different from the initial verified system. The behavior of the final system would depend entirely upon the ability of the initial system to reason correctly about the construction of systems more intelligent than itself.

This is no trouble if the initial system is extremely reliable: if the reasoning of the initial agent were at least as good as that of a team of human AI researchers in all domains, then the system itself would be at least as safe as anything designed by a team of human researchers. However, if the system were only known to reason well in most cases, then it seems prudent to verify its reasoning specifically in the critical case where the agent reasons about self-modifications.

At least intuitively, reasoning about the behavior of an agent which is more intelligent than the reasoner seems qualitatively more difficult than reasoning about the behavior of a less intelligent system. Verifying that a military drone obeys certain rules of engagement is one thing; verifying that an artificial general would successfully run a war, identifying clever strategies never before conceived of and deploying brilliant plans as appropriate, seems like another thing entirely. It is certainly possible that this intuition will turn out to be wrong, but it seems as if we should at least check: if extremely high confidence must be placed on the ability of self-modifying systems to reason about agents which are smarter than the reasoner, then it seems prudent to develop a theoretical understanding of satisfactory reasoning about smarter agents. In honor of Vinge (1993), who emphasizes the difficulty of predicting the behavior of smarter-than-human agents with human intelligence, we refer to reasoning of this sort as Vingean reflection.

Vingean Reflection

The simplest and cleanest formal model of intelligent agents is the framework of expected utility maximization. Given that this framework has been a productive basis for theoretical work both in artificial intelligence in general, and on smarter-than-human agents in particular, it is natural to ask whether it can be used to model the reasoning of self-improving agents.

However, although it can be useful to consider models that idealize away part of the complexity of the real world, it is not difficult to see that in the case of self-improvement, expected utility maximization idealizes away too much. An agent that can literally maximize expected utility is already reasoning optimally; it may lack information about its environment, but it can only fix this problem by observing the external world, not by improving its own reasoning processes.

A particularly illustrative example of the mismatch between the classical theory and the problem of Vingean reflection is provided by the standard technique of backward induction, which finds the optimal policy of an agent facing a sequential decision problem by considering every node in the agent’s entire decision tree. Backward induction starts with the leaves, figuring out the action an optimal agent would take in the last timestep (for every possible history of what happened in the previous timesteps). It then proceeds to compute how an optimal agent would behave in the second-to-last timestep, given the behavior in the last timestep, and so on backward to the root of the decision tree.
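As a minimal sketch of the technique, the following solves a tiny deterministic decision tree by backward induction. The tree, action names, and utilities are invented for illustration; the point to notice is that the entire plan, for every future timestep, is computed before the first action is taken.

```python
def backward_induction(node):
    """Return (optimal_value, optimal_plan) for a decision tree in which a
    leaf is a number (the agent's utility) and an internal node is a dict
    mapping action names to subtrees."""
    if isinstance(node, (int, float)):  # leaf: nothing left to decide
        return node, []
    # Solve every subtree first (the "later timesteps"), then pick the
    # action whose subtree has the highest value.
    best_value, best_plan = float("-inf"), None
    for action, subtree in node.items():
        value, plan = backward_induction(subtree)
        if value > best_value:
            best_value, best_plan = value, [action] + plan
    return best_value, best_plan

# A two-timestep problem: first choose "explore" or "exploit", then act.
tree = {
    "explore": {"left": 3, "right": 10},
    "exploit": {"left": 6, "right": 5},
}
print(backward_induction(tree))  # (10, ['explore', 'right'])
```

Note that the recursion visits every node of the tree, which is exactly the exponential, fully-computed-in-advance behavior the text goes on to criticize.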

A self-improving agent is supposed to become more intelligent as time goes on. An agent using backward induction to choose its action, however, would have to compute its exact actions in every situation it might face in the future in the very first timestep—but if it is able to do that, its initial version could hardly be called less intelligent than the later ones!

Since we are interested in theoretical understanding, the reason we see this as a problem is not that backward induction is impractical as an implementation technique. For example, we may not actually be able to run an agent which uses backward induction (since this requires effort exponential in the number of timesteps), but it can still be useful to ask how such an agent would behave, say in a situation where it may have an incentive to manipulate its human operators (Soares 2015). Rather, the problem is that we are trying to understand conceptually how an agent can reason about the behavior of a more intelligent successor, and an “idealized” model that requires the original agent to already be as smart as its successors seems to idealize away the very issue we are trying to investigate.

The programmers of the famous chess program Deep Blue, for example, couldn’t have evaluated different heuristics by predicting, in their own heads, where each heuristic would make Deep Blue move in every possible situation; if they had been able to do so, they would have been able to play world-class chess themselves. But this does not imply that they knew nothing about Deep Blue’s operation: their abstract knowledge of the code allowed them to know that Deep Blue was trying to win the game rather than to lose it, for example.

Like Deep Blue’s programmers, any artificial agent reasoning about smarter successors will have to do so using abstract reasoning, rather than by computing what these successors would do in every possible situation. Yudkowsky and Herreshoff (2013) call this observation the Vingean principle, and it seems to us that progress on Vingean reflection will require formal models that implement this principle, instead of idealizing the problem away.

This is not to say that expected utility maximization has no role to play in the study of Vingean reflection. Intuitively, the reason the classical framework is unsuitable is that it demands logical omniscience: It assumes that although an agent may be uncertain about its environment, it must have perfect knowledge of all mathematical facts, such as which of two algorithms is more efficient on a given problem or which of two bets leads to a higher expected payoff under a certain computable (but intractable) probability distribution. Real agents, on the other hand, must deal with logical uncertainty (Soares and Fallenstein 2015). But many proposals for dealing with uncertainty about mathematical facts involve assigning probabilities to them, which might make it possible to maximize expected utility with respect to the resulting probability distribution.

However, while there is some existing work on formal models of logical uncertainty (see Soares and Fallenstein [2015] for an overview), none of the approaches the authors are aware of are models of abstract reasoning. It is clear that any agent performing Vingean reflection will need to have some way of dealing with logical uncertainty, since it will have to reason about the behavior of computer programs it cannot run (in particular, future versions of itself). At present, however, formal models of logical uncertainty do not yet seem up to the task of studying abstract reasoning about more intelligent successors.

In this paper, we review a body of work which instead considers agents that use formal proofs to reason about their successors, an approach first proposed by Yudkowsky and Herreshoff (2013). In particular, following these authors, we consider agents which will only perform actions (such as self-modifications) if they can prove that these actions are, in some formal sense, “safe”. We do not argue that this is a realistic way for smarter-than-human agents to reason about potential actions; rather, formal proofs seem to be the best formal model of abstract reasoning available at present, and hence currently the most promising vehicle for studying Vingean reflection.
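The shape of such an agent can be sketched very crudely. In the models discussed here the gate is provability in a formal theory; in the toy sketch below the "proof system" is replaced by an ordinary Python predicate, so the function names, the certified set, and the safety condition are all illustrative assumptions rather than the paper's construction.

```python
def safe_fallback():
    # A default action assumed (by stipulation) to be safe.
    return "do nothing"

def act(candidate_actions, verifier):
    """Perform the first suggested action the verifier certifies as safe;
    otherwise fall back to the default safe action. The verifier stands in
    for a formal proof search for the sentence 'executing this action is
    safe'."""
    for action in candidate_actions:
        if verifier(action):
            return action()
    return safe_fallback()

# A stand-in verifier: it only certifies actions on an explicit whitelist
# (a crude substitute for possessing a formal safety proof).
certified = {"increment_counter"}
verifier = lambda action: action.__name__ in certified

def increment_counter():
    return "counter += 1"

def rewrite_own_source():
    return "self-modified"

# The uncertified self-modification is rejected; the certified action runs.
print(act([rewrite_own_source, increment_counter], verifier))  # counter += 1
```

The interesting case, taken up in the rest of the paper, is when a candidate action is itself the construction of a successor agent: the verifier must then certify a program at least as capable as the one doing the verifying.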

There is, of course, no guarantee that results obtained in this setting will generalize to whatever forms of reasoning realistic artificial agents will employ. However, there is some reason for optimism: at least one such result (the procrastination paradox [Yudkowsky 2013], discussed in Section 4) both has an intuitive interpretation that makes it seem likely to be relevant beyond the domain of formal proofs, and has been shown to apply to one existing model of self-referential reasoning under logical uncertainty (Fallenstein 2014b).

The study of Vingean reflection in a formal logic framework also has merit in its own right. While formal logic is not a good tool for reasoning about a complex environment, it is a useful tool for reasoning about the properties of computer programs. Indeed, when humans require extremely high confidence in a computer program, they often resort to systems based on formal logic, such as model checkers and theorem provers (US DoD 1985; UK MoD 1991). Smarter-than-human machines attempting to gain high confidence in a computer program may need to use similar techniques. While smarter-than-human agents must ultimately reason under logical uncertainty, there is some reason to expect that high-confidence logically uncertain reasoning about computer programs will require something akin to formal logic.

The remainder of this paper is structured as follows. In the next section, we discuss in more detail the idea of requiring an agent to produce formal proofs that its actions are safe, and discuss a problem that arises in this context, the Löbian obstacle (Yudkowsky and Herreshoff 2013): Due to Gödel’s second incompleteness theorem, an agent using formal proofs cannot trust the reasoning of future versions using the same proof system. In Section 4, we discuss the procrastination paradox, an intuitive example of what can go wrong in a system that trusts its own reasoning too much. In Section 5, we introduce a concrete toy model of self-rewriting agents, and discuss the Löbian obstacle in this context. Section 6 reviews partial solutions to this problem, and Section 7 concludes.
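For readers who want the formal core of the Löbian obstacle, it is a consequence of Löb’s theorem, a standard strengthening of Gödel’s second incompleteness theorem (the notation below is the usual one from provability logic, not taken from the paper itself):

```latex
% Löb's theorem: for a consistent, sufficiently strong theory T with
% provability predicate \Box_T,
%
%   if  T \vdash \Box_T \varphi \rightarrow \varphi,  then  T \vdash \varphi.
%
% Consequence: T cannot prove the soundness schema
%   \Box_T \varphi \rightarrow \varphi
% for any \varphi that T does not already prove. So an agent reasoning in T
% cannot accept "if my successor (also using T) proves an action is safe,
% then it is safe" as a blanket license, which is the Löbian obstacle.
```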