Safety engineering, target selection, and alignment theory

This post is the latest in a series introducing the basic ideas behind MIRI’s research program. To contribute, or learn more about what we’ve been up to recently, see the MIRI fundraiser page. Our 2015 winter funding drive concludes tonight (31 Dec 15) at midnight.


Artificial intelligence capabilities research is aimed at making computer systems more intelligent — able to solve a wider range of problems more effectively and efficiently. We can distinguish this from research specifically aimed at making AI systems at various capability levels safer, or more “robust and beneficial.” In this post, I distinguish three kinds of direct research that might be thought of as “AI safety” work: safety engineering, target selection, and alignment theory.

Imagine a world where humans somehow developed heavier-than-air flight before developing a firm understanding of calculus or celestial mechanics. In a world like that, what work would be needed in order to safely transport humans to the Moon?

In this case, we can say that the main task at hand is one of engineering a rocket and refining fuel such that the rocket, when launched, accelerates upwards and does not explode. The boundary of space can be compared to the boundary between narrowly intelligent and generally intelligent AI. Both boundaries are fuzzy, but have engineering importance: spacecraft and aircraft have different uses and face different constraints.

Paired with this task of developing rocket capabilities is a safety engineering task. Safety engineering is the art of ensuring that an engineered system provides acceptable levels of safety. When it comes to achieving a soft landing on the Moon, there are many different roles for safety engineering to play. One team of engineers might ensure that the materials used in constructing the rocket are capable of withstanding the stress of a rocket launch with significant margin for error. Another might design escape systems that ensure the humans in the rocket can survive even in the event of failure. Another might design life support systems capable of supporting the crew in dangerous environments.

A separate important task is target selection, i.e., picking where on the Moon to land. In the case of a Moon mission, targeting research might entail things like designing and constructing telescopes (if they didn’t exist already) and identifying a landing zone on the Moon. Of course, only so much targeting can be done in advance, and the lunar landing vehicle may need to be designed so that it can alter the landing target at the last minute as new data comes in; this again would require feats of engineering.

Beyond the task of (safely) reaching escape velocity and figuring out where you want to go, there is one more crucial prerequisite for landing on the Moon. This is rocket alignment research, the technical work required to reach the correct final destination. We’ll use this as an analogy to illustrate MIRI’s research focus, the problem of artificial intelligence alignment.

The alignment challenge

Hitting a certain target on the Moon isn’t as simple as carefully pointing the nose of the rocket at the relevant lunar coordinate and hitting “launch” — not even if you trust your pilots to make course corrections as necessary. There’s also the important task of plotting trajectories between celestial bodies.

This rocket alignment task may require a distinct body of theoretical knowledge that isn’t required just for getting a payload off of the planet. Without calculus, designing a functional rocket would be enormously difficult. Still, with enough tenacity and enough resources to spare, we could imagine a civilization reaching space after many years of trial and error — at which point they would be confronted with the problem that reaching space isn’t sufficient for steering toward a specific location.1

The first rocket alignment researchers might ask, “What trajectory would we have our rocket take under ideal conditions, without worrying about winds or explosions or fuel efficiency?” If even that question were beyond their current abilities, they might simplify the problem still further, asking, “At what angle and velocity would we fire a cannonball such that it enters a stable orbit around Earth, assuming that Earth is perfectly spherical and has no atmosphere?”
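For concreteness, here is a minimal sketch of what that simplified cannonball question amounts to, using standard physical constants (the numbers and code are illustrative, not from the original post): for a circular orbit just above the surface of an airless, spherical Earth, the cannonball would be fired horizontally at roughly 7.9 km/s.

```python
import math

# Idealized cannonball-orbit question: on a perfectly spherical, airless
# Earth, what horizontal speed puts a projectile into a circular orbit
# just above the surface? Gravity must supply the centripetal force:
# G*M/r^2 = v^2/r, so v = sqrt(G*M/r).
G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
M_EARTH = 5.972e24   # mass of Earth, kg
R_EARTH = 6.371e6    # mean radius of Earth, m

def circular_orbit_speed(radius_m: float) -> float:
    """Orbital speed for a circular orbit of the given radius."""
    return math.sqrt(G * M_EARTH / radius_m)

print(f"Fire horizontally at about {circular_orbit_speed(R_EARTH) / 1000:.1f} km/s")
# -> about 7.9 km/s
```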

To an early rocket engineer, for whom even the problem of building any vehicle that makes it off the launch pad remains a frustrating task, the alignment theorist’s questions might look out of touch. The engineer may ask “Don’t you know that rockets aren’t going to be fired out of cannons?” or “What does going in circles around the Earth have to do with getting to the Moon?” Yet understanding rocket alignment is quite important when it comes to achieving a soft landing on the Moon. If you don’t yet know at what angle and velocity to fire a cannonball such that it would end up in a stable orbit around a perfectly spherical planet with no atmosphere, then you may need to develop a better understanding of celestial mechanics before you attempt a Moon mission.

Three forms of AI safety research

The case is similar with AI research. AI capabilities work comes part and parcel with associated safety engineering tasks. Working today, an AI safety engineer might focus on making the internals of large classes of software more transparent and interpretable by humans. They might ensure that the system fails gracefully in the face of adversarial observations. They might design security protocols and early warning systems that help operators prevent or handle system failures.2

AI safety engineering is indispensable work, and it’s infeasible to separate safety engineering from capabilities engineering. Day-to-day safety work in aerospace engineering doesn’t rely on committees of ethicists peering over engineers’ shoulders. Some engineers will happen to spend their time on components of the system that are there for reasons of safety — such as failsafe mechanisms or fallback life support — but safety engineering is an integral part of engineering for safety-critical systems, rather than a separate discipline.

In the domain of AI, target selection addresses the question: if one could build a powerful AI system, what should one use it for? The potential development of superintelligence raises a number of thorny questions in theoretical and applied ethics. Some of those questions can plausibly be resolved in the near future by moral philosophers and psychologists, and by the AI research community. Others will undoubtedly need to be left to the future. Stuart Russell goes so far as to predict that “in the future, moral philosophy will be a key industry sector.” We agree that this is an important area of study, but it is not the main focus of the Machine Intelligence Research Institute.

Researchers at MIRI focus on problems of AI alignment. We ask questions analogous to “at what angle and velocity would we fire a cannonball to put it in a stable orbit, if Earth were perfectly spherical and had no atmosphere?”

Selecting promising AI alignment research paths is not a simple task. With the benefit of hindsight, it’s easy enough to say that early rocket alignment researchers should begin by inventing calculus and studying gravitation. For someone who doesn’t yet have a clear understanding of what “calculus” or “gravitation” are, however, choosing research topics might be quite a bit more difficult. The fruitful research directions would need to compete with fruitless ones, such as studying aether or Aristotelian physics; and which research programs are fruitless may not be obvious in advance.

Toward a theory of alignable agents

What are some plausible candidates for the role of “calculus” or “gravitation” in the field of AI?

At MIRI, we currently focus on subjects such as good reasoning under deductive limitations, decision theories that work well even for agents embedded in large environments, and reasoning procedures that approve of the way they reason. This research often involves building toy models and studying problems under dramatic simplifications, analogous to assuming a perfectly spherical Earth with no atmosphere.

One common question we hear about alignment research runs analogously to: “If you don’t develop calculus, what bad thing happens to your rocket? Do you think the pilot will be struggling to make a course correction, and find that they simply can’t add up the tiny vectors fast enough? That scenario just doesn’t sound plausible.”

This misunderstanding perhaps stems from an attempt to draw too direct a line between alignment theory and specific present-day engineering tasks. The point of developing calculus is not to allow the pilot to make course corrections quickly; the point is to make it possible to discuss curved rocket trajectories in a world where the best tools available assume that rockets move in straight lines.

The case is similar with, e.g., attempts to develop theories of logical uncertainty. The problem is not that we visualize a specific AI system encountering a catastrophic failure because it mishandles logical uncertainty; the problem is that all our existing tools for describing the behavior of rational agents assume that those agents are logically omniscient, making our best theories incommensurate with our best practical AI designs.
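As a loose illustration of the gap being described (a toy sketch of my own, not MIRI’s formalism): a logically omniscient reasoner assigns every mathematical claim probability 0 or 1, while a bounded reasoner that hasn’t done the relevant computation may need to assign intermediate, calibrated credences, just as it would for an empirical unknown.

```python
from fractions import Fraction

# Toy contrast between a bounded reasoner and a logically omniscient one.
# The claim "the 1000th decimal digit of 1/7 is d" is settled by logic alone,
# but a reasoner that declines to spend compute on it can still bet sensibly
# by spreading credence uniformly over the ten possible digits.

def bounded_credence(claimed_digit: int) -> Fraction:
    """Credence without doing the computation: uniform over digits 0-9."""
    return Fraction(1, 10)

def omniscient_credence(claimed_digit: int) -> int:
    """Credence after doing the computation: always exactly 0 or 1."""
    # 1/7 = 0.142857 142857 ..., so the 1000th digit is "142857"[(1000 - 1) % 6].
    actual = int("142857"[(1000 - 1) % 6])
    return 1 if claimed_digit == actual else 0

print(bounded_credence(8))      # 1/10
print(omniscient_credence(8))   # 1 (and 0 for every other digit)
```

Standard formal accounts of rational agency describe only the second kind of reasoner; practical AI systems are necessarily the first kind.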

At this point, the goal of alignment research is not to solve particular engineering problems. The goal of early rocket alignment research is to develop shared language and tools for generating and evaluating rocket trajectories, which will require developing calculus and celestial mechanics if they do not already exist. Similarly, the goal of AI alignment research is to develop shared language and tools for generating and evaluating methods by which powerful AI systems could be designed to act as intended.

One might worry that it is difficult to set benchmarks of success for alignment research. Is a Newtonian understanding of gravitation sufficient to attempt a Moon landing, or must one develop a complete theory of general relativity before believing that one can land softly on the Moon?3

In the case of AI alignment, there is at least one obvious benchmark to focus on initially. Imagine we had an incredibly powerful computer with access to the internet, an automated factory, and large sums of money. If we could program that computer to reliably achieve some simple goal (such as producing as much diamond as possible), then a large share of the AI alignment research would be completed.
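A deliberately naive outline of that benchmark, written as sketch-level Python (my own illustrative framing, not a MIRI proposal), makes it easier to see where the open problems live: every component below is a placeholder that we do not currently know how to fill in for a real, powerful system.

```python
from typing import Any, Callable, Iterable

def choose_plan(
    candidate_plans: Iterable[Any],
    predict_outcome: Callable[[Any], Any],      # world model: plan -> predicted world state
    amount_of_diamond: Callable[[Any], float],  # goal evaluation over predicted states
) -> Any:
    """Pick the plan whose predicted outcome contains the most diamond."""
    return max(candidate_plans, key=lambda plan: amount_of_diamond(predict_outcome(plan)))

# Trivial stand-ins, just to show the interface; the hard part is specifying
# a trustworthy world model, a robust "amount of diamond" evaluation, and a
# tractable search over plans for an agent embedded in the real world.
plans = ["do nothing", "build a factory"]
toy_world_model = {"do nothing": 0.0, "build a factory": 1.0}.get
print(choose_plan(plans, toy_world_model, lambda state: state))  # build a factory
```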

The pursuit of a goal such as this one is more or less MIRI’s approach to AI alignment research. We think of this challenge as our version of the question, “Could you hit the Moon with a rocket if fuel and winds were no concern?” Answering that question, on its own, won’t ensure that smarter-than-human AI systems are aligned with our goals; but it would represent a major advance over our current knowledge, and it doesn’t look like the kind of basic insight that we can safely skip over.

What next?

Over the past year, we’ve seen a massive increase in attention towards the task of ensuring that future AI systems are robust and beneficial. AI safety work is being taken very seriously, and AI engineers are stepping up and acknowledging that safety engineering is not separable from capabilities engineering. It is becoming apparent that as the field of artificial intelligence matures, safety engineering will become a more and more firmly embedded part of AI culture. Meanwhile, new investigations of target selection and other safety questions will be showcased at an AI and Ethics workshop at AAAI-16, one of the larger annual conferences in the field.

A fourth variety of safety work is also receiving increased support: strategy research. If your nation is currently engaged in a cold war and locked in a space race, you may well want to consult with game theorists and strategists so as to ensure that your attempts to put a person on the Moon do not upset a delicate political balance and lead to a nuclear war.4 If international coalitions will be required in order to establish treaties regarding the use of space, then diplomacy may also become a relevant aspect of safety work. The same principles hold when it comes to AI, where coalition-building and global coordination may play an important role in the technology’s development and use.

Strategy research has been on the rise this year. AI Impacts is producing strategic analyses relevant to the designers of this potentially world-changing technology, and will soon be joined by the Strategic Artificial Intelligence Research Centre. The new Leverhulme Centre for the Future of Intelligence will be pulling together people across many different disciplines to study the social impact of AI, forging new collaborations. The Global Priorities Project, meanwhile, is analyzing what types of interventions might be most effective at ensuring positive outcomes from the development of powerful AI systems.

The field is moving fast, and these developments are quite exciting. Throughout it all, though, AI alignment research in particular still seems largely underserved.

MIRI is not the only group working on AI alignment; a handful of researchers from other organizations and institutions are also beginning to ask similar questions. MIRI’s particular approach to AI alignment research is by no means the only one available — when first thinking about how to put humans on the Moon, one might want to consider both rockets and space elevators. Regardless of who does the research or where they do it, it is important that alignment research receive attention.

Smarter-than-human AI systems may be many decades away, and they may not closely resemble any existing software. This limits our ability to identify productive safety engineering approaches. At the same time, the difficulty of specifying our values makes it difficult to identify productive research in moral theory. Alignment research has the advantage of being abstract enough to be potentially applicable to a wide variety of future computing systems, while being formalizable enough to admit of unambiguous progress. By prioritizing such work, therefore, we believe that the field of AI safety will be able to ground itself in technical work without losing sight of the most consequential questions in AI.

Safety engineering, moral theory, strategy, and general collaboration-building are all important parts of the project of developing safe and useful AI. On the whole, these areas look poised to thrive as a result of the recent rise in interest in long-term outcomes, and I’m thrilled to see more effort and investment going towards those important tasks.

The question is: What do we need to invest in next? The type of growth I most want to see in the AI community is growth in AI alignment research, via the formation of new groups or organizations focused primarily on AI alignment and the expansion of existing AI alignment teams at MIRI, UC Berkeley, the Future of Humanity Institute at Oxford, and other institutions.

Before trying to land a rocket on the Moon, it’s important that we know how we would put a cannonball into a stable orbit. Absent a good theoretical understanding of rocket alignment, it might well be possible for a civilization to eventually reach escape velocity; but getting somewhere valuable and exciting and new, and getting there reliably, is a whole extra challenge.


1 Similarly, we could imagine a civilization that lives on the only planet in its solar system, or lives on a planet with perpetual cloud cover obscuring all objects except the Sun and Moon. Such a civilization might have an adequate understanding of terrestrial mechanics while lacking a model of celestial mechanics and lacking the knowledge that the same dynamical laws hold on Earth and in space. There would then be a gap in its theoretical understanding of rocket alignment, distinct from limitations in its understanding of how to reach escape velocity.

2 Roman Yampolskiy has used the term “AI safety engineering” to refer to the study of AI systems that can provide proofs of their safety for external verification, including some theoretical research that we would term “alignment research.” His usage differs from the usage here.

3 In either case, of course, we wouldn’t want to put a moratorium on the space program while we wait for a unified theory of quantum mechanics and general relativity. We don’t need a perfect understanding of gravity.

4 This was a role historically played by the RAND Corporation.