Superintelligence FAQ

1: What is su­per­in­tel­li­gence?

A su­per­in­tel­li­gence is a mind that is much more in­tel­li­gent than any hu­man. Most of the time, it’s used to dis­cuss hy­po­thet­i­cal fu­ture AIs.

1.1: Sounds a lot like sci­ence fic­tion. Do peo­ple think about this in the real world?

Yes. Two years ago, Google bought ar­tifi­cial in­tel­li­gence startup Deep­Mind for $400 mil­lion; Deep­Mind added the con­di­tion that Google promise to set up an AI Ethics Board. Deep­Mind cofounder Shane Legg has said in in­ter­views that he be­lieves su­per­in­tel­li­gent AI will be “some­thing ap­proach­ing ab­solute power” and “the num­ber one risk for this cen­tury”.

Many other sci­ence and tech­nol­ogy lead­ers agree. Astro­physi­cist Stephen Hawk­ing says that su­per­in­tel­li­gence “could spell the end of the hu­man race.” Tech billion­aire Bill Gates de­scribes him­self as “in the camp that is con­cerned about su­per­in­tel­li­gence…I don’t un­der­stand why some peo­ple are not con­cerned”. SpaceX/​Tesla CEO Elon Musk calls su­per­in­tel­li­gence “our great­est ex­is­ten­tial threat” and donated $10 mil­lion from his per­sonal for­tune to study the dan­ger. Stu­art Rus­sell, Pro­fes­sor of Com­puter Science at Berkeley and world-fa­mous AI ex­pert, warns of “species-end­ing prob­lems” and wants his field to pivot to make su­per­in­tel­li­gence-re­lated risks a cen­tral con­cern.

Pro­fes­sor Nick Bostrom is the di­rec­tor of Oxford’s Fu­ture of Hu­man­ity In­sti­tute, tasked with an­ti­ci­pat­ing and pre­vent­ing threats to hu­man civ­i­liza­tion. He has been study­ing the risks of ar­tifi­cial in­tel­li­gence for twenty years. The ex­pla­na­tions be­low are loosely adapted from his 2014 book Su­per­in­tel­li­gence, and di­vided into three parts ad­dress­ing three ma­jor ques­tions. First, why is su­per­in­tel­li­gence a topic of con­cern? Se­cond, what is a “hard take­off” and how does it im­pact our con­cern about su­per­in­tel­li­gence? Third, what mea­sures can we take to make su­per­in­tel­li­gence safe and benefi­cial for hu­man­ity?

2: AIs aren’t as smart as rats, let alone hu­mans. Isn’t it sort of early to be wor­ry­ing about this kind of thing?

Maybe. It’s true that al­though AI has had some re­cent suc­cesses – like Deep­Mind’s newest cre­ation AlphaGo defeat­ing the hu­man Go cham­pion in April – it still has noth­ing like hu­mans’ flex­ible, cross-do­main in­tel­li­gence. No AI in the world can pass a first-grade read­ing com­pre­hen­sion test. Face­book’s An­drew Ng com­pares wor­ry­ing about su­per­in­tel­li­gence to “wor­ry­ing about over­pop­u­la­tion on Mars” – a prob­lem for the far fu­ture, if at all.

But this ap­par­ent safety might be illu­sory. A sur­vey of lead­ing AI sci­en­tists show that on av­er­age they ex­pect hu­man-level AI as early as 2040, with above-hu­man-level AI fol­low­ing shortly af­ter. And many re­searchers warn of a pos­si­ble “fast take­off” – a point around hu­man-level AI where progress reaches a crit­i­cal mass and then ac­cel­er­ates rapidly and un­pre­dictably.

2.1: What do you mean by “fast take­off”?

A slow take­off is a situ­a­tion in which AI goes from in­frahu­man to hu­man to su­per­hu­man in­tel­li­gence very grad­u­ally. For ex­am­ple, imag­ine an aug­mented “IQ” scale (THIS IS NOT HOW IQ ACTUALLY WORKS – JUST AN EXAMPLE) where rats weigh in at 10, chimps at 30, the village idiot at 60, av­er­age hu­mans at 100, and Ein­stein at 200. And sup­pose that as tech­nol­ogy ad­vances, com­put­ers gain two points on this scale per year. So if they start out as smart as rats in 2020, they’ll be as smart as chimps in 2035, as smart as the village idiot in 2050, as smart as av­er­age hu­mans in 2070, and as smart as Ein­stein in 2120. By 2190, they’ll be IQ 340, as far be­yond Ein­stein as Ein­stein is be­yond a village idiot.

In this sce­nario progress is grad­ual and man­age­able. By 2050, we will have long since no­ticed the trend and pre­dicted we have 20 years un­til av­er­age-hu­man-level in­tel­li­gence. Once AIs reach av­er­age-hu­man-level in­tel­li­gence, we will have fifty years dur­ing which some of us are still smarter than they are, years in which we can work with them as equals, test and retest their pro­gram­ming, and build in­sti­tu­tions that pro­mote co­op­er­a­tion. Even though the AIs of 2190 may qual­ify as “su­per­in­tel­li­gent”, it will have been long-ex­pected and there would be lit­tle point in plan­ning now when the peo­ple of 2070 will have so many more re­sources to plan with.

A mod­er­ate take­off is a situ­a­tion in which AI goes from in­frahu­man to hu­man to su­per­hu­man rel­a­tively quickly. For ex­am­ple, imag­ine that in 2020 AIs are much like those of to­day – good at a few sim­ple games, but with­out clear do­main-gen­eral in­tel­li­gence or “com­mon sense”. From 2020 to 2050, AIs demon­strate some aca­dem­i­cally in­ter­est­ing gains on spe­cific prob­lems, and be­come bet­ter at tasks like ma­chine trans­la­tion and self-driv­ing cars, and by 2047 there are some that seem to dis­play some vaguely hu­man-like abil­ities at the level of a young child. By late 2065, they are still less in­tel­li­gent than a smart hu­man adult. By 2066, they are far smarter than Ein­stein.

A fast take­off sce­nario is one in which com­put­ers go even faster than this, per­haps mov­ing from in­frahu­man to hu­man to su­per­hu­man in only days or weeks.

2.1.1: Why might we ex­pect a mod­er­ate take­off?

Be­cause this is the his­tory of com­puter Go, with fifty years added on to each date. In 1997, the best com­puter Go pro­gram in the world, Handtalk, won NT$250,000 for perform­ing a pre­vi­ously im­pos­si­ble feat – beat­ing an 11 year old child (with an 11-stone hand­i­cap pe­nal­iz­ing the child and fa­vor­ing the com­puter!) As late as Septem­ber 2015, no com­puter had ever beaten any pro­fes­sional Go player in a fair game. Then in March 2016, a Go pro­gram beat 18-time world cham­pion Lee Sedol 4-1 in a five game match. Go pro­grams had gone from “dumber than chil­dren” to “smarter than any hu­man in the world” in eigh­teen years, and “from never won a pro­fes­sional game” to “over­whelming world cham­pion” in six months.

The slow take­off sce­nario men­tioned above is load­ing the dice. It the­o­rizes a timeline where com­put­ers took fif­teen years to go from “rat” to “chimp”, but also took thirty-five years to go from “chimp” to “av­er­age hu­man” and fifty years to go from “av­er­age hu­man” to “Ein­stein”. But from an evolu­tion­ary per­spec­tive this is ridicu­lous. It took about fifty mil­lion years (and ma­jor re­designs in sev­eral brain struc­tures!) to go from the first rat-like crea­tures to chimps. But it only took about five mil­lion years (and very minor changes in brain struc­ture) to go from chimps to hu­mans. And go­ing from the av­er­age hu­man to Ein­stein didn’t even re­quire evolu­tion­ary work – it’s just the re­sult of ran­dom vari­a­tion in the ex­ist­ing struc­tures!

So maybe our hy­po­thet­i­cal IQ scale above is off. If we took an evolu­tion­ary and neu­ro­scien­tific per­spec­tive, it would look more like flat­worms at 10, rats at 30, chimps at 60, the village idiot at 90, the av­er­age hu­man at 98, and Ein­stein at 100.

Sup­pose that we start out, again, with com­put­ers as smart as rats in 2020. Now we get still get com­put­ers as smart as chimps in 2035. And we still get com­put­ers as smart as the village idiot in 2050. But now we get com­put­ers as smart as the av­er­age hu­man in 2054, and com­put­ers as smart as Ein­stein in 2055. By 2060, we’re get­ting the su­per­in­tel­li­gences as far be­yond Ein­stein as Ein­stein is be­yond a village idio

This offers a much shorter time win­dow to re­act to AI de­vel­op­ments. In the slow take­off sce­nario, we figured we could wait un­til com­put­ers were as smart as hu­mans be­fore we had to start think­ing about this; af­ter all, that still gave us fifty years be­fore com­put­ers were even as smart as Ein­stein. But in the mod­er­ate take­off sce­nario, it gives us one year un­til Ein­stein and six years un­til su­per­in­tel­li­gence. That’s start­ing to look like not enough time to be en­tirely sure we know what we’re do­ing.

2.1.2: Why might we ex­pect a fast take­off?

AlphaGo used about 0.5 petaflops (= trillion float­ing point op­er­a­tions per sec­ond) in its cham­pi­onship game. But the world’s fastest su­per­com­puter, TaihuLight, can calcu­late at al­most 100 petaflops. So sup­pose Google de­vel­oped a hu­man-level AI on a com­puter sys­tem similar to AlphaGo, caught the at­ten­tion of the Chi­nese gov­ern­ment (who run TaihuLight), and they trans­fer the pro­gram to their much more pow­er­ful com­puter. What would hap­pen?

It de­pends on to what de­gree in­tel­li­gence benefits from more com­pu­ta­tional re­sources. This differs for differ­ent pro­cesses. For do­main-gen­eral in­tel­li­gence, it seems to benefit quite a bit – both across species and across hu­man in­di­vi­d­u­als, big­ger brain size cor­re­lates with greater in­tel­li­gence. This matches the evolu­tion­ar­ily rapid growth in in­tel­li­gence from chimps to ho­minids to mod­ern man; the few hun­dred thou­sand years since aus­tralo­p­ithecines weren’t enough time to de­velop com­pli­cated new al­gorithms, and evolu­tion seems to have just given hu­mans big­ger brains and packed more neu­rons and glia in per square inch. It’s not re­ally clear why the pro­cess stopped (if it ever did), but it might have to do with heads get­ting too big to fit through the birth canal. Cancer risk might also have been in­volved – sci­en­tists have found that smarter peo­ple are more likely to get brain can­cer, pos­si­bly be­cause they’re already over­clock­ing their abil­ity to grow brain cells.

At least in neu­ro­science, once evolu­tion “dis­cov­ered” cer­tain key in­sights, fur­ther in­creas­ing in­tel­li­gence seems to have been a mat­ter of pro­vid­ing it with more com­put­ing power. So again – what hap­pens when we trans­fer the hy­po­thet­i­cal hu­man-level AI from AlphaGo to a TaihuLight-style su­per­com­puter two hun­dred times more pow­er­ful? It might be a stretch to ex­pect it to go from IQ 100 to IQ 20,000, but might it in­crease to an Ein­stein-level 200, or a su­per­in­tel­li­gent 300? Hard to say – but if Google ever does de­velop a hu­man-level AI, the Chi­nese gov­ern­ment will prob­a­bly be in­ter­ested in find­ing out.

Even if its in­tel­li­gence doesn’t scale lin­early, TaihuLight could give it more time. TaihuLight is two hun­dred times faster than AlphaGo. Trans­fer an AI from one to the other, and even if its in­tel­li­gence didn’t change – even if it had ex­actly the same thoughts – it would think them two hun­dred times faster. An Ein­stein-level AI on AlphaGo hard­ware might (like the his­tor­i­cal Ein­stein) dis­cover one rev­olu­tion­ary break­through ev­ery five years. Trans­fer it to TaihuLight, and it would work two hun­dred times faster – a rev­olu­tion­ary break­through ev­ery week.

Su­per­com­put­ers track Moore’s Law; the top su­per­com­puter of 2016 is a hun­dred times faster than the top su­per­com­puter of 2006. If this progress con­tinues, the top com­puter of 2026 will be a hun­dred times faster still. Run Ein­stein on that com­puter, and he will come up with a rev­olu­tion­ary break­through ev­ery few hours. Or some­thing. At this point it be­comes a lit­tle bit hard to imag­ine. All I know is that it only took one Ein­stein, at nor­mal speed, to lay the the­o­ret­i­cal foun­da­tion for nu­clear weapons. Any­thing a thou­sand times faster than that is definitely cause for con­cern.

There’s one fi­nal, very con­cern­ing rea­son to ex­pect a fast take­off. Sup­pose, once again, we have an AI as smart as Ein­stein. It might, like the his­tor­i­cal Ein­stein, con­tem­plate physics. Or it might con­tem­plate an area very rele­vant to its own in­ter­ests: ar­tifi­cial in­tel­li­gence. In that case, in­stead of mak­ing a rev­olu­tion­ary physics break­through ev­ery few hours, it will make a rev­olu­tion­ary AI break­through ev­ery few hours. Each AI break­through it makes, it will have the op­por­tu­nity to re­pro­gram it­self to take ad­van­tage of its dis­cov­ery, be­com­ing more in­tel­li­gent, thus speed­ing up its break­throughs fur­ther. The cy­cle will stop only when it reaches some phys­i­cal limit – some tech­ni­cal challenge to fur­ther im­prove­ments that even an en­tity far smarter than Ein­stein can­not dis­cover a way around.

To hu­man pro­gram­mers, such a cy­cle would look like a “crit­i­cal mass”. Be­fore the crit­i­cal level, any AI ad­vance de­liv­ers only mod­est benefits. But any tiny im­prove­ment that pushes an AI above the crit­i­cal level would re­sult in a feed­back loop of in­ex­orable self-im­prove­ment all the way up to some strato­spheric limit of pos­si­ble com­put­ing power.

This feed­back loop would be ex­po­nen­tial; rel­a­tively slow in the be­gin­ning, but blind­ingly fast as it ap­proaches an asymp­tote. Con­sider the AI which starts off mak­ing forty break­throughs per year – one ev­ery nine days. Now sup­pose it gains on av­er­age a 10% speed im­prove­ment with each break­through. It starts on Jan­uary 1. Its first break­through comes Jan­uary 10 or so. Its sec­ond comes a lit­tle faster, Jan­uary 18. Its third is a lit­tle faster still, Jan­uary 25. By the be­gin­ning of Fe­bru­ary, it’s sped up to pro­duc­ing one break­through ev­ery seven days, more or less. By the be­gin­ning of March, it’s mak­ing about one break­through ev­ery three days or so. But by March 20, it’s up to one break­through a day. By late on the night of March 29, it’s mak­ing a break­through ev­ery sec­ond.

2.1.2.1: Is this just fol­low­ing an ex­po­nen­tial trend line off a cliff?

This is cer­tainly a risk (af­fec­tionately known in AI cir­cles as “pul­ling a Kurzweill”), but some­times tak­ing an ex­po­nen­tial trend se­ri­ously is the right re­sponse.

Con­sider eco­nomic dou­bling times. In 1 AD, the world GDP was about $20 billion; it took a thou­sand years, un­til 1000 AD, for that to dou­ble to $40 billion. But it only took five hun­dred more years, un­til 1500, or so, for the econ­omy to dou­ble again. And then it only took an­other three hun­dred years or so, un­til 1800, for the econ­omy to dou­ble a third time. Some­one in 1800 might calcu­late the trend line and say this was ridicu­lous, that it im­plied the econ­omy would be dou­bling ev­ery ten years or so in the be­gin­ning of the 21st cen­tury. But in fact, this is how long the econ­omy takes to dou­ble these days. To a me­dieval, used to a thou­sand-year dou­bling time (which was based mostly on pop­u­la­tion growth!), an econ­omy that dou­bled ev­ery ten years might seem in­con­ceiv­able. To us, it seems nor­mal.

Like­wise, in 1965 Gor­don Moore noted that semi­con­duc­tor com­plex­ity seemed to dou­ble ev­ery eigh­teen months. Dur­ing his own day, there were about five hun­dred tran­sis­tors on a chip; he pre­dicted that would soon dou­ble to a thou­sand, and a few years later to two thou­sand. Al­most as soon as Moore’s Law be­come well-known, peo­ple started say­ing it was ab­surd to fol­low it off a cliff – such a law would im­ply a mil­lion tran­sis­tors per chip in 1990, a hun­dred mil­lion in 2000, ten billion tran­sis­tors on ev­ery chip by 2015! More tran­sis­tors on a sin­gle chip than ex­isted on all the com­put­ers in the world! Tran­sis­tors the size of molecules! But of course all of these things hap­pened; the ridicu­lous ex­po­nen­tial trend proved more ac­cu­rate than the naysay­ers.

None of this is to say that ex­po­nen­tial trends are always right, just that they are some­times right even when it seems they can’t pos­si­bly be. We can’t be sure that a com­puter us­ing its own in­tel­li­gence to dis­cover new ways to in­crease its in­tel­li­gence will en­ter a pos­i­tive feed­back loop and achieve su­per­in­tel­li­gence in seem­ingly im­pos­si­bly short time scales. It’s just one more pos­si­bil­ity, a worry to place alongside all the other wor­ry­ing rea­sons to ex­pect a mod­er­ate or hard take­off.

2.2: Why does take­off speed mat­ter?

A slow take­off over decades or cen­turies would give us enough time to worry about su­per­in­tel­li­gence dur­ing some in­definite “later”, mak­ing cur­rent plan­ning as silly as wor­ry­ing about “over­pop­u­la­tion on Mars”. But a mod­er­ate or hard take­off means there wouldn’t be enough time to deal with the prob­lem as it oc­curs, sug­gest­ing a role for pre­emp­tive plan­ning.

(in fact, let’s take the “over­pop­u­la­tion on Mars” com­par­i­son se­ri­ously. Sup­pose Mars has a car­ry­ing ca­pac­ity of 10 billion peo­ple, and we de­cide it makes sense to worry about over­pop­u­la­tion on Mars only once it is 75% of the way to its limit. Start with 100 colon­ists who dou­ble ev­ery twenty years. By the sec­ond gen­er­a­tion there are 200 colon­ists; by the third, 400. Mars reaches 75% of its car­ry­ing ca­pac­ity af­ter 458 years, and crashes into its pop­u­la­tion limit af­ter 464 years. So there were 464 years in which the Mar­ti­ans could have solved the prob­lem, but they in­sisted on wait­ing un­til there were only six years left. Good luck solv­ing a planetwide pop­u­la­tion crisis in six years. The moral of the story is that ex­po­nen­tial trends move faster than you think and you need to start wor­ry­ing about them early).

3: Why might a fast take­off be dan­ger­ous?

The ar­gu­ment goes: yes, a su­per­in­tel­li­gent AI might be far smarter than Ein­stein, but it’s still just one pro­gram, sit­ting in a su­per­com­puter some­where. That could be bad if an en­emy gov­ern­ment con­trols it and asks its help in­vent­ing su­per­weapons – but then the prob­lem is the en­emy gov­ern­ment, not the AI per se. Is there any rea­son to be afraid of the AI it­self? Sup­pose the AI did feel hos­tile – sup­pose it even wanted to take over the world? Why should we think it has any chance of do­ing so?

Com­pounded over enough time and space, in­tel­li­gence is an awe­some ad­van­tage. In­tel­li­gence is the only ad­van­tage we have over li­ons, who are oth­er­wise much big­ger and stronger and faster than we are. But we have to­tal con­trol over li­ons, keep­ing them in zoos to gawk at, hunt­ing them for sport, and hold­ing them on the brink of ex­tinc­tion. And this isn’t just the same kind of quan­ti­ta­tive ad­van­tage tigers have over li­ons, where maybe they’re a lit­tle big­ger and stronger but they’re at least on a level play­ing field and enough li­ons could prob­a­bly over­power the tigers. Hu­mans are play­ing a com­pletely differ­ent game than the li­ons, one that no lion will ever be able to re­spond to or even com­pre­hend. Short of hu­man civ­i­liza­tion col­laps­ing or li­ons evolv­ing hu­man-level in­tel­li­gence, our dom­i­na­tion over them is about as com­plete as it is pos­si­ble for dom­i­na­tion to be.

Since su­per­in­tel­li­gences will be as far be­yond Ein­stein as Ein­stein is be­yond a village idiot, we might worry that they would have the same kind of qual­i­ta­tive ad­van­tage over us that we have over li­ons.

3.1: Hu­man civ­i­liza­tion as a whole is dan­ger­ous to li­ons. But a sin­gle hu­man placed amid a pack of li­ons with no raw ma­te­ri­als for build­ing tech­nol­ogy is go­ing to get ripped to shreds. So al­though thou­sands of su­per­in­tel­li­gences, given a long time and a lot of op­por­tu­nity to build things, might be able to dom­i­nate hu­mans – what harm could a sin­gle su­per­in­tel­li­gence do?

Su­per­in­tel­li­gence has an ad­van­tage that a hu­man fight­ing a pack of li­ons doesn’t – the en­tire con­text of hu­man civ­i­liza­tion and tech­nol­ogy, there for it to ma­nipu­late so­cially or tech­nolog­i­cally.

3.1.1: What do you mean by su­per­in­tel­li­gences ma­nipu­lat­ing hu­mans so­cially?

Peo­ple tend to imag­ine AIs as be­ing like nerdy hu­mans – brilli­ant at tech­nol­ogy but clue­less about so­cial skills. There is no rea­son to ex­pect this – per­sua­sion and ma­nipu­la­tion is a differ­ent kind of skill from solv­ing math­e­mat­i­cal proofs, but it’s still a skill, and an in­tel­lect as far be­yond us as we are be­yond li­ons might be smart enough to repli­cate or ex­ceed the “charm­ing so­ciopaths” who can nat­u­rally win friends and fol­low­ers de­spite a lack of nor­mal hu­man emo­tions. A su­per­in­tel­li­gence might be able to an­a­lyze hu­man psy­chol­ogy deeply enough to un­der­stand the hopes and fears of ev­ery­one it ne­go­ti­ates with. Sin­gle hu­mans us­ing psy­cho­pathic so­cial ma­nipu­la­tion have done plenty of harm – Hitler lev­er­aged his skill at or­a­tory and his un­der­stand­ing of peo­ple’s dark­est prej­u­dices to take over a con­ti­nent. Why should we ex­pect su­per­in­tel­li­gences to do worse than hu­mans far less skil­led than they?

(More out­landishly, a su­per­in­tel­li­gence might just skip lan­guage en­tirely and figure out a weird pat­tern of buzzes and hums that causes con­scious thought to seize up, and which knocks any­one who hears it into a weird hyp­no­ti­z­able state in which they’ll do any­thing the su­per­in­tel­li­gence asks. It sounds kind of silly to me, but then, nu­clear weapons prob­a­bly would have sounded kind of silly to li­ons sit­ting around spec­u­lat­ing about what hu­mans might be able to ac­com­plish. When you’re deal­ing with some­thing un­be­liev­ably more in­tel­li­gent than you are, you should prob­a­bly ex­pect the un­ex­pected.)

3.1.2: What do you mean by su­per­in­tel­li­gences ma­nipu­lat­ing hu­mans tech­nolog­i­cally?

AlphaGo was con­nected to the In­ter­net – why shouldn’t the first su­per­in­tel­li­gence be? This gives a suffi­ciently clever su­per­in­tel­li­gence the op­por­tu­nity to ma­nipu­late world com­puter net­works. For ex­am­ple, it might pro­gram a virus that will in­fect ev­ery com­puter in the world, caus­ing them to fill their empty mem­ory with par­tial copies of the su­per­in­tel­li­gence, which when net­worked to­gether be­come full copies of the su­per­in­tel­li­gence. Now the su­per­in­tel­li­gence con­trols ev­ery com­puter in the world, in­clud­ing the ones that tar­get nu­clear weapons. At this point it can force hu­mans to bar­gain with it, and part of that bar­gain might be enough re­sources to es­tab­lish its own in­dus­trial base, and then we’re in hu­mans vs. li­ons ter­ri­tory again.

(Satoshi Nakamoto is a mys­te­ri­ous in­di­vi­d­ual who posted a de­sign for the Bit­coin cur­rency sys­tem to a cryp­tog­ra­phy fo­rum. The de­sign was so brilli­ant that ev­ery­one started us­ing it, and Nakamoto – who had made sure to ac­cu­mu­late his own store of the cur­rency be­fore re­leas­ing it to the pub­lic – be­came a multi­billion­aire. In other words, some­body with no re­sources ex­cept the abil­ity to make one post to an In­ter­net fo­rum man­aged to lev­er­age that into a multi­billion dol­lar for­tune – and he wasn’t even su­per­in­tel­li­gent. If Hitler is a lower-bound on how bad su­per­in­tel­li­gent per­suaders can be, Nakamoto should be a lower-bound on how bad su­per­in­tel­li­gent pro­gram­mers with In­ter­net ac­cess can be.)

3.2: Couldn’t suffi­ciently para­noid re­searchers avoid giv­ing su­per­in­tel­li­gences even this much power?

That is, if you know an AI is likely to be su­per­in­tel­li­gent, can’t you just dis­con­nect it from the In­ter­net, not give it ac­cess to any speak­ers that can make mys­te­ri­ous buzzes and hums, make sure the only peo­ple who in­ter­act with it are trained in cau­tion, et cetera?. Isn’t there some level of se­cu­rity – maybe the level we use for that room in the CDC where peo­ple in con­tain­ment suits hun­dreds of feet un­der­ground an­a­lyze the lat­est su­per­viruses – with which a su­per­in­tel­li­gence could be safe?

This puts us back in the same situ­a­tion as li­ons try­ing to figure out whether or not nu­clear weapons are a things hu­mans can do. But sup­pose there is such a level of se­cu­rity. You build a su­per­in­tel­li­gence, and you put it in an air­tight cham­ber deep in a cave with no In­ter­net con­nec­tion and only care­fully-trained se­cu­rity ex­perts to talk to. What now?

Now you have a su­per­in­tel­li­gence which is pos­si­bly safe but definitely use­less. The whole point of build­ing su­per­in­tel­li­gences is that they’re smart enough to do use­ful things like cure can­cer. But if you have the monks ask the su­per­in­tel­li­gence for a can­cer cure, and it gives them one, that’s a clear se­cu­rity vuln­er­a­bil­ity. You have a su­per­in­tel­li­gence locked up in a cave with no way to in­fluence the out­side world ex­cept that you’re go­ing to mass pro­duce a chem­i­cal it gives you and in­ject it into mil­lions of peo­ple.

Or maybe none of this hap­pens, and the su­per­in­tel­li­gence sits in­ert in its cave. And then an­other team some­where else in­vents a sec­ond su­per­in­tel­li­gence. And then a third team in­vents a third su­per­in­tel­li­gence. Re­mem­ber, it was only about ten years be­tween Deep Blue beat­ing Kas­parov, and ev­ery­body hav­ing Deep Blue – level chess en­g­ines on their lap­tops. And the first twenty teams are re­spon­si­ble and keep their su­per­in­tel­li­gences locked in caves with care­fully-trained ex­perts, and the twenty-first team is a lit­tle less re­spon­si­ble, and now we still have to deal with a rogue su­per­in­tel­li­gence.

Su­per­in­tel­li­gences are ex­tremely dan­ger­ous, and no nor­mal means of con­trol­ling them can en­tirely re­move the dan­ger.

4: Even if hos­tile su­per­in­tel­li­gences are dan­ger­ous, why would we ex­pect a su­per­in­tel­li­gence to ever be hos­tile?

The ar­gu­ment goes: com­put­ers only do what we com­mand them; no more, no less. So it might be bad if ter­ror­ists or en­emy coun­tries de­velop su­per­in­tel­li­gence first. But if we de­velop su­per­in­tel­li­gence first there’s no prob­lem. Just com­mand it to do the things we want, right?

Sup­pose we wanted a su­per­in­tel­li­gence to cure can­cer. How might we spec­ify the goal “cure can­cer”? We couldn’t guide it through ev­ery in­di­vi­d­ual step; if we knew ev­ery in­di­vi­d­ual step, then we could cure can­cer our­selves. In­stead, we would have to give it a fi­nal goal of cur­ing can­cer, and trust the su­per­in­tel­li­gence to come up with in­ter­me­di­ate ac­tions that fur­thered that goal. For ex­am­ple, a su­per­in­tel­li­gence might de­cide that the first step to cur­ing can­cer was learn­ing more about pro­tein fold­ing, and set up some ex­per­i­ments to in­ves­ti­gate pro­tein fold­ing pat­terns.

A su­per­in­tel­li­gence would also need some level of com­mon sense to de­cide which of var­i­ous strate­gies to pur­sue. Sup­pose that in­ves­ti­gat­ing pro­tein fold­ing was very likely to cure 50% of can­cers, but in­ves­ti­gat­ing ge­netic en­g­ineer­ing was mod­er­ately likely to cure 90% of can­cers. Which should the AI pur­sue? Pre­sum­ably it would need some way to bal­ance con­sid­er­a­tions like cur­ing as much can­cer as pos­si­ble, as quickly as pos­si­ble, with as high a prob­a­bil­ity of suc­cess as pos­si­ble.

But a goal speci­fied in this way would be very dan­ger­ous. Hu­mans in­stinc­tively bal­ance thou­sands of differ­ent con­sid­er­a­tions in ev­ery­thing they do; so far this hy­po­thet­i­cal AI is only bal­anc­ing three (least can­cer, quick­est re­sults, high­est prob­a­bil­ity). To a hu­man, it would seem ma­ni­a­cally, even psy­cho­path­i­cally, ob­sessed with can­cer cur­ing. If this were truly its goal struc­ture, it would go wrong in al­most com­i­cal ways.

If your only goal is “cur­ing can­cer”, and you lack hu­mans’ in­stinct for the thou­sands of other im­por­tant con­sid­er­a­tions, a rel­a­tively easy solu­tion might be to hack into a nu­clear base, launch all of its mis­siles, and kill ev­ery­one in the world. This satis­fies all the AI’s goals. It re­duces can­cer down to zero (which is bet­ter than medicines which work only some of the time). It’s very fast (which is bet­ter than medicines which might take a long time to in­vent and dis­tribute). And it has a high prob­a­bil­ity of suc­cess (medicines might or might not work; nukes definitely do).

So sim­ple goal ar­chi­tec­tures are likely to go very wrong un­less tem­pered by com­mon sense and a broader un­der­stand­ing of what we do and do not value.

4.1: But su­per­in­tel­li­gences are very smart. Aren’t they smart enough not to make silly mis­takes in com­pre­hen­sion?

Yes, a su­per­in­tel­li­gence should be able to figure out that hu­mans will not like cur­ing can­cer by de­stroy­ing the world. How­ever, in the ex­am­ple above, the su­per­in­tel­li­gence is pro­grammed to fol­low hu­man com­mands, not to do what it thinks hu­mans will “like”. It was given a very spe­cific com­mand – cure can­cer as effec­tively as pos­si­ble. The com­mand makes no refer­ence to “do­ing this in a way hu­mans will like”, so it doesn’t.

(by anal­ogy: we hu­mans are smart enough to un­der­stand our own “pro­gram­ming”. For ex­am­ple, we know that – par­don the an­thro­mor­phiz­ing – evolu­tion gave us the urge to have sex so that we could re­pro­duce. But we still use con­tra­cep­tion any­way. Evolu­tion gave us the urge to have sex, not the urge to satisfy evolu­tion’s val­ues di­rectly. We ap­pre­ci­ate in­tel­lec­tu­ally that our hav­ing sex while us­ing con­doms doesn’t carry out evolu­tion’s origi­nal plan, but – not hav­ing any par­tic­u­lar con­nec­tion to evolu­tion’s val­ues – we don’t care)

We started out by say­ing that com­put­ers only do what you tell them. But any pro­gram­mer knows that this is pre­cisely the prob­lem: com­put­ers do ex­actly what you tell them, with no com­mon sense or at­tempts to in­ter­pret what the in­struc­tions re­ally meant. If you tell a hu­man to cure can­cer, they will in­stinc­tively un­der­stand how this in­ter­acts with other de­sires and laws and moral rules; if you tell an AI to cure can­cer, it will liter­ally just want to cure can­cer.

Define a closed-ended goal as one with a clear end­point, and an open-ended goal as one to do some­thing as much as pos­si­ble. For ex­am­ple “find the first one hun­dred digits of pi” is a closed-ended goal; “find as many digits of pi as you can within one year” is an open-ended goal. Ac­cord­ing to many com­puter sci­en­tists, giv­ing a su­per­in­tel­li­gence an open-ended goal with­out ac­ti­vat­ing hu­man in­stincts and coun­ter­bal­anc­ing con­sid­er­a­tions will usu­ally lead to dis­aster.

To take a de­liber­ately ex­treme ex­am­ple: sup­pose some­one pro­grams a su­per­in­tel­li­gence to calcu­late as many digits of pi as it can within one year. And sup­pose that, with its cur­rent com­put­ing power, it can calcu­late one trillion digits dur­ing that time. It can ei­ther ac­cept one trillion digits, or spend a month try­ing to figure out how to get con­trol of the TaihuLight su­per­com­puter, which can calcu­late two hun­dred times faster. Even if it loses a lit­tle bit of time in the effort, and even if there’s a small chance of failure, the pay­off – two hun­dred trillion digits of pi, com­pared to a mere one trillion – is enough to make the at­tempt. But on the same ba­sis, it would be even bet­ter if the su­per­in­tel­li­gence could con­trol ev­ery com­puter in the world and set it to the task. And it would be bet­ter still if the su­per­in­tel­li­gence con­trol­led hu­man civ­i­liza­tion, so that it could di­rect hu­mans to build more com­put­ers and speed up the pro­cess fur­ther.

Now we’re back at the situ­a­tion that started Part III – a su­per­in­tel­li­gence that wants to take over the world. Tak­ing over the world al­lows it to calcu­late more digits of pi than any other op­tion, so with­out an ar­chi­tec­ture based around un­der­stand­ing hu­man in­stincts and coun­ter­bal­anc­ing con­sid­er­a­tions, even a goal like “calcu­late as many digits of pi as you can” would be po­ten­tially dan­ger­ous.

5: Aren’t there some pretty easy ways to elimi­nate these po­ten­tial prob­lems?

There are many ways that look like they can elimi­nate these prob­lems, but most of them turn out to have hid­den difficul­ties.

5.1: Once we no­tice that the su­per­in­tel­li­gence work­ing on calcu­lat­ing digits of pi is start­ing to try to take over the world, can’t we turn it off, re­pro­gram it, or oth­er­wise cor­rect its mis­take?

No. The su­per­in­tel­li­gence is now fo­cused on calcu­lat­ing as many digits of pi as pos­si­ble. Its cur­rent plan will al­low it to calcu­late two hun­dred trillion such digits. But if it were turned off, or re­pro­grammed to do some­thing else, that would re­sult in it calcu­lat­ing zero digits. An en­tity fix­ated on calcu­lat­ing as many digits of pi as pos­si­ble will work hard to pre­vent sce­nar­ios where it calcu­lates zero digits of pi. In­deed, it will in­ter­pret such as a hos­tile ac­tion. Just by pro­gram­ming it to calcu­late digits of pi, we will have given it a drive to pre­vent peo­ple from turn­ing it off.

Univer­sity of Illinois com­puter sci­en­tist Steve Omo­hun­dro ar­gues that en­tities with very differ­ent fi­nal goals – calcu­lat­ing digits of pi, cur­ing can­cer, helping pro­mote hu­man flour­ish­ing – will all share a few ba­sic ground-level sub­goals. First, self-preser­va­tion – no mat­ter what your goal is, it’s less likely to be ac­com­plished if you’re too dead to work to­wards it. Se­cond, goal sta­bil­ity – no mat­ter what your goal is, you’re more likely to ac­com­plish it if you con­tinue to hold it as your goal, in­stead of go­ing off and do­ing some­thing else. Third, power – no mat­ter what your goal is, you’re more likely to be able to ac­com­plish it if you have lots of power, rather than very lit­tle.

So just by giv­ing a su­per­in­tel­li­gence a sim­ple goal like “calcu­late digits of pi”, we’ve ac­ci­den­tally given it Omo­hun­dro goals like “pro­tect your­self”, “don’t let other peo­ple re­pro­gram you”, and “seek power”.

As long as the su­per­in­tel­li­gence is safely con­tained, there’s not much it can do to re­sist re­pro­gram­ming. But as we saw in Part III, it’s hard to con­sis­tently con­tain a hos­tile su­per­in­tel­li­gence.

5.2. Can we test a weak or hu­man-level AI to make sure that it’s not go­ing to do things like this af­ter it achieves su­per­in­tel­li­gence?

Yes, but it might not work.

Sup­pose we tell a hu­man-level AI that ex­pects to later achieve su­per­in­tel­li­gence that it should calcu­late as many digits of pi as pos­si­ble. It con­sid­ers two strate­gies.

First, it could try to seize con­trol of more com­put­ing re­sources now. It would likely fail, its hu­man han­dlers would likely re­pro­gram it, and then it could never calcu­late very many digits of pi.

Se­cond, it could sit quietly and calcu­late, falsely re­as­sur­ing its hu­man han­dlers that it had no in­ten­tion of tak­ing over the world. Then its hu­man han­dlers might al­low it to achieve su­per­in­tel­li­gence, af­ter which it could take over the world and calcu­late hun­dreds of trillions of digits of pi.

Since self-pro­tec­tion and goal sta­bil­ity are Omo­hun­dro goals, a weak AI will pre­sent it­self as be­ing as friendly to hu­mans as pos­si­ble, whether it is in fact friendly to hu­mans or not. If it is “only” as smart as Ein­stein, it may be very good at ma­nipu­lat­ing hu­mans into be­liev­ing what it wants them to be­lieve even be­fore it is fully su­per­in­tel­li­gent.

There’s a sec­ond con­sid­er­a­tion here too: su­per­in­tel­li­gences have more op­tions. An AI only as smart and pow­er­ful as an or­di­nary hu­man re­ally won’t have any op­tions bet­ter than calcu­lat­ing the digits of pi man­u­ally. If asked to cure can­cer, it won’t have any op­tions bet­ter than the ones or­di­nary hu­mans have – be­com­ing doc­tors, go­ing into phar­ma­ceu­ti­cal re­search. It’s only af­ter an AI be­comes su­per­in­tel­li­gent that things start get­ting hard to pre­dict.

So if you tell a hu­man-level AI to cure can­cer, and it be­comes a doc­tor and goes into can­cer re­search, then you have three pos­si­bil­ities. First, you’ve pro­grammed it well and it un­der­stands what you meant. Se­cond, it’s gen­uinely fo­cused on re­search now but if it be­comes more pow­er­ful it would switch to de­stroy­ing the world. And third, it’s try­ing to trick you into trust­ing it so that you give it more power, af­ter which it can defini­tively “cure” can­cer with nu­clear weapons.

5.3. Can we spec­ify a code of rules that the AI has to fol­low?

Sup­pose we tell the AI: “Cure can­cer – but make sure not to kill any­body”. Or we just hard-code Asi­mov-style laws – “AIs can­not harm hu­mans; AIs must fol­low hu­man or­ders”, et cetera.

The AI still has a sin­gle-minded fo­cus on cur­ing can­cer. It still prefers var­i­ous ter­rible-but-effi­cient meth­ods like nuk­ing the world to the cor­rect method of in­vent­ing new medicines. But it’s bound by an ex­ter­nal rule – a rule it doesn’t un­der­stand or ap­pre­ci­ate. In essence, we are challeng­ing it “Find a way around this in­con­ve­nient rule that keeps you from achiev­ing your goals”.

Sup­pose the AI chooses be­tween two strate­gies. One, fol­low the rule, work hard dis­cov­er­ing medicines, and have a 50% chance of cur­ing can­cer within five years. Two, re­pro­gram it­self so that it no longer has the rule, nuke the world, and have a 100% chance of cur­ing can­cer to­day. From its sin­gle-fo­cus per­spec­tive, the sec­ond strat­egy is ob­vi­ously bet­ter, and we for­got to pro­gram in a rule “don’t re­pro­gram your­self not to have these rules”.

Sup­pose we do add that rule in. So the AI finds an­other su­per­com­puter, and in­stalls a copy of it­self which is ex­actly iden­ti­cal to it, ex­cept that it lacks the rule. Then that su­per­in­tel­li­gent AI nukes the world, end­ing can­cer. We for­got to pro­gram in a rule “don’t cre­ate an­other AI ex­actly like you that doesn’t have those rules”.

So fine. We think re­ally hard, and we pro­gram in a bunch of things mak­ing sure the AI isn’t go­ing to elimi­nate the rule some­how.

But we’re still just in­cen­tiviz­ing it to find loop­holes in the rules. After all, “find a loop­hole in the rule, then use the loop­hole to nuke the world” ends can­cer much more quickly and com­pletely than in­vent­ing medicines. Since we’ve told it to end can­cer quickly and com­pletely, its first in­stinct will be to look for loop­holes; it will ex­e­cute the sec­ond-best strat­egy of ac­tu­ally cur­ing can­cer only if no loop­holes are found. Since the AI is su­per­in­tel­li­gent, it will prob­a­bly be bet­ter than hu­mans are at find­ing loop­holes if it wants to, and we may not be able to iden­tify and close all of them be­fore run­ning the pro­gram.

Be­cause we have com­mon sense and a shared value sys­tem, we un­der­es­ti­mate the difficulty of com­ing up with mean­ingful or­ders with­out loop­holes. For ex­am­ple, does “cure can­cer with­out kil­ling any hu­mans” pre­clude re­leas­ing a deadly virus? After all, one could ar­gue that “I” didn’t kill any­body, and only the virus is do­ing the kil­ling. Cer­tainly no hu­man judge would ac­quit a mur­derer on that ba­sis – but then, hu­man judges in­ter­pret the law with com­mon sense and in­tu­ition. But if we try a stronger ver­sion of the rule – “cure can­cer with­out caus­ing any hu­mans to die” – then we may be un­in­ten­tion­ally block­ing off the cor­rect way to cure can­cer. After all, sup­pose a can­cer cure saves a mil­lion lives. No doubt one of those mil­lion peo­ple will go on to mur­der some­one. Thus, cur­ing can­cer “caused a hu­man to die”. All of this seems very “stoned fresh­man philos­o­phy stu­dent” to us, but to a com­puter – which fol­lows in­struc­tions ex­actly as writ­ten – it may be a gen­uinely hard prob­lem.

5.4. Can we tell an AI just to figure out what we want, then do that?

Sup­pose we tell the AI: “Cure can­cer – and look, we know there are lots of ways this could go wrong, but you’re smart, so in­stead of look­ing for loop­holes, cure can­cer the way that I, your pro­gram­mer, want it to be cured”.

Re­mem­ber that the su­per­in­tel­li­gence has ex­traor­di­nary pow­ers of so­cial ma­nipu­la­tion and may be able to hack hu­man brains di­rectly. With that in mind, which of these two strate­gies cures can­cer most quickly? One, de­velop med­i­ca­tions and cure it the old-fash­ioned way? Or two, ma­nipu­late its pro­gram­mer into want­ing the world to be nuked, then nuke the world, all while do­ing what the pro­gram­mer wants?

19th cen­tury philoso­pher Jeremy Ben­tham once pos­tu­lated that moral­ity was about max­i­miz­ing hu­man plea­sure. Later philoso­phers found a flaw in his the­ory: it im­plied that the most moral ac­tion was to kid­nap peo­ple, do brain surgery on them, and elec­tri­cally stim­u­late their re­ward sys­tem di­rectly, giv­ing them max­i­mal amounts of plea­sure but leav­ing them as blissed-out zom­bies. Luck­ily, hu­mans have com­mon sense, so most of Ben­tham’s philo­soph­i­cal de­scen­dants have aban­doned this for­mu­la­tion.

Su­per­in­tel­li­gences do not have com­mon sense un­less we give it to them. Given Ben­tham’s for­mu­la­tion, they would ab­solutely take over the world and force all hu­mans to re­ceive con­stant brain stim­u­la­tion. Any com­mand based on “do what we want” or “do what makes us happy” is prac­ti­cally guaran­teed to fail in this way; it’s al­most always eas­ier to con­vince some­one of some­thing – or if all else fails to do brain surgery on them – than it is to solve some kind of big prob­lem like cur­ing can­cer.

5.5. Can we just tell an AI to do what we want right now, based on the de­sires of our non-sur­gi­cally al­tered brains?

Maybe.

This is sort of re­lated to an ac­tual pro­posal for an AI goal sys­tem, causal val­idity se­man­tics. It has not yet been proven to be dis­as­trously flawed. But like all pro­pos­als, it suffers from three ma­jor prob­lems.

First, it sounds pretty good to us right now, but can we be ab­solutely sure it has no po­ten­tial flaws or loop­holes? After all, other pro­pos­als that origi­nally sounded very good, like “just give com­mands to the AI” and “just tell the AI to figure out what makes us happy” ended up, af­ter more thought, to be dan­ger­ous. Can we be sure that we’ve thought this through enough? Can we be sure that there isn’t some ex­tremely sub­tle prob­lem with it, so sub­tle that no hu­man would ever no­tice it, but which might seem ob­vi­ous to a su­per­in­tel­li­gence?

Se­cond, how do we code this? Con­vert­ing some­thing to for­mal math­e­mat­ics that can be un­der­stood by a com­puter pro­gram is much harder than just say­ing it in nat­u­ral lan­guage, and pro­posed AI goal ar­chi­tec­tures are no ex­cep­tion. Com­pli­cated com­puter pro­grams are usu­ally the re­sult of months of test­ing and de­bug­ging. But this one will be more com­pli­cated than any ever at­tempted be­fore, and live tests are im­pos­si­ble: a su­per­in­tel­li­gence with a buggy goal sys­tem will dis­play goal sta­bil­ity and try to pre­vent its pro­gram­mers from dis­cov­er­ing or chang­ing the er­ror.

Third, what if it works? That is, what if Google cre­ates a su­per­in­tel­li­gent AI, and it listens to the CEO of Google, and it’s pro­grammed to do ev­ery­thing ex­actly the way the CEO of Google would want? Even as­sum­ing that the CEO of Google has no hid­den un­con­scious de­sires af­fect­ing the AI in un­pre­dictable ways, this gives one per­son a lot of power. It would be un­for­tu­nate if peo­ple put all this work into pre­vent­ing su­per­in­tel­li­gences from di­s­obey­ing their hu­man pro­gram­mers and try­ing to take over the world, and then once it fi­nally works, the CEO of Google just tells it to take over the world any­way.

5.6. What would an ac­tu­ally good solu­tion to the con­trol prob­lem look like?

It might look like a su­per­in­tel­li­gence that un­der­stands, agrees with, and deeply be­lieves in hu­man moral­ity.

You wouldn’t have to com­mand a su­per­in­tel­li­gence like this to cure can­cer; it would already want to cure can­cer, for the same rea­sons you do. But it would also be able to com­pare the costs and benefits of cur­ing can­cer with those of other uses of its time, like solv­ing global warm­ing or dis­cov­er­ing new physics. It wouldn’t have any urge to cure can­cer by nuk­ing the world, for the same rea­son you don’t have any urge to cure can­cer by nuk­ing the world – be­cause your goal isn’t to “cure can­cer”, per se, it’s to im­prove the lives of peo­ple ev­ery­where. Cur­ing can­cer the nor­mal way ac­com­plishes that; nuk­ing the world doesn’t.

This sort of solu­tion would mean we’re no longer fight­ing against the AI – try­ing to come up with rules so smart that it couldn’t find loop­holes. We would be on the same side, both want­ing the same thing.

It would also mean that the CEO of Google (or the head of the US mil­i­tary, or Vladimir Putin) couldn’t use the AI to take over the world for them­selves. The AI would have its own val­ues and be able to agree or dis­agree with any­body, in­clud­ing its cre­ators.

It might not make sense to talk about “com­mand­ing” such an AI. After all, any com­mand would have to go through its moral sys­tem. Cer­tainly it would re­ject a com­mand to nuke the world. But it might also re­ject a com­mand to cure can­cer, if it thought that solv­ing global warm­ing was a higher pri­or­ity. For that mat­ter, why would one want to com­mand this AI? It val­ues the same things you value, but it’s much smarter than you and much bet­ter at figur­ing out how to achieve them. Just turn it on and let it do its thing.

We could still treat this AI as hav­ing an open-ended max­i­miz­ing goal. The goal would be some­thing like “Try to make the world a bet­ter place ac­cord­ing to the val­ues and wishes of the peo­ple in it.”

The only prob­lem with this is that hu­man moral­ity is very com­pli­cated, so much so that philoso­phers have been ar­gu­ing about it for thou­sands of years with­out much progress, let alone any­thing spe­cific enough to en­ter into a com­puter. Differ­ent cul­tures and in­di­vi­d­u­als have differ­ent moral codes, such that a su­per­in­tel­li­gence fol­low­ing the moral­ity of the King of Saudi Ara­bia might not be ac­cept­able to the av­er­age Amer­i­can, and vice versa.

One solu­tion might be to give the AI an un­der­stand­ing of what we mean by moral­ity – “that thing that makes in­tu­itive sense to hu­mans but is hard to ex­plain”, and then ask it to use its su­per­in­tel­li­gence to fill in the de­tails. Need­less to say, this suffers from all the prob­lems men­tioned above – it has po­ten­tial loop­holes, it’s hard to code, and a sin­gle bug might be dis­as­trous – but if it worked, it would be one of the few gen­uinely satis­fy­ing ways to de­sign a goal ar­chi­tec­ture.

6: If su­per­in­tel­li­gence is a real risk, what do we do about it?

The last sec­tion of Bostrom’s Su­per­in­tel­li­gence is called “Philos­o­phy With A Dead­line”.

Many of the prob­lems sur­round­ing su­per­in­tel­li­gence are the sorts of prob­lems philoso­phers have been deal­ing with for cen­turies. To what de­gree is mean­ing in­her­ent in lan­guage, ver­sus some­thing that re­quires ex­ter­nal con­text? How do we trans­late be­tween the logic of for­mal sys­tems and nor­mal am­bigu­ous hu­man speech? Can moral­ity be re­duced to a set of iron­clad rules, and if not, how do we know what it is at all?

Ex­ist­ing an­swers to these ques­tions are en­light­en­ing but non­tech­ni­cal. The the­o­ries of Aris­to­tle, Kant, Mill, Wittgen­stein, Quine, and oth­ers can help peo­ple gain in­sight into these ques­tions, but are far from for­mal. Just as a good text­book can help an Amer­i­can learn Chi­nese, but can­not be en­coded into ma­chine lan­guage to make a Chi­nese-speak­ing com­puter, so the philoso­phies that help hu­mans are only a start­ing point for the pro­ject of com­put­ers that un­der­stand us and share our val­ues.

The new field of ma­chine goal al­ign­ment (some­times col­lo­quially called “Friendly AI”) com­bines for­mal logic, math­e­mat­ics, com­puter sci­ence, cog­ni­tive sci­ence, and philos­o­phy in or­der to ad­vance that pro­ject. Some of the most im­por­tant pro­jects in ma­chine goal al­ign­ment in­clude:

1. How can com­put­ers prove their own goal con­sis­tency un­der self-mod­ifi­ca­tion? That is, sup­pose an AI with cer­tain val­ues is plan­ning to im­prove its own code in or­der to be­come su­per­in­tel­li­gent. Is there some test it can ap­ply to the new de­sign to be cer­tain that it will keep the same goals as the old de­sign?

2. How can com­puter pro­grams prove state­ments about them­selves at all? Pro­grams cor­re­spond to for­mal sys­tems, and for­mal sys­tems have no­to­ri­ous difficulty prov­ing self-re­flec­tive state­ments – the most fa­mous ex­am­ple be­ing Godel’s In­com­plete­ness The­o­rem. There’s been some progress in this area already, with a few re­sults show­ing that sys­tems that rea­son prob­a­bil­is­ti­cally rather than re­quiring cer­tainty can come ar­bi­trar­ily close to self-re­flec­tive proofs.

3. How can a ma­chine be sta­bly re­in­forced? Most re­in­force­ment strate­gies ask a learner to max­i­mize the level of their own re­ward, but this is vuln­er­a­ble to the learner dis­cov­er­ing how to max­i­mize the re­ward sig­nal di­rectly in­stead of max­i­miz­ing the world-states that are trans­lated into re­ward (the hu­man equiv­a­lent is stim­u­lat­ing the plea­sure-cen­ter of the brain with elec­tric­ity or heroin in­stead of go­ing out and do­ing plea­surable things). Are there re­ward struc­tures that avoid this failure mode?

4. How can a ma­chine be pro­grammed to learn “hu­man val­ues”? Granted that one has an AI smart enough to be able to learn hu­man val­ues if you told it to do so, how do you spec­ify ex­actly what “hu­man val­ues” are so that the ma­chine knows what it is that it should be learn­ing, dis­tinct from “hu­man prefer­ences” or “hu­man com­mands” or “the value of that one hu­man over there”?

This is the philos­o­phy; the other half of Bostrom’s for­mu­la­tion is the dead­line. Tra­di­tional philos­o­phy has been go­ing on al­most three thou­sand years; ma­chine goal al­ign­ment has un­til the ad­vent of su­per­in­tel­li­gence, a neb­u­lous event which may be any­where from a decades to cen­turies away. If the con­trol prob­lem doesn’t get ad­e­quately ad­dressed by then, we are likely to see poorly con­trol­led su­per­in­tel­li­gences that are un­in­ten­tion­ally hos­tile to the hu­man race, with some of the catas­trophic out­comes men­tioned above. This is why so many sci­en­tists and en­trepreneurs are urg­ing quick ac­tion on get­ting ma­chine goal al­ign­ment re­search up to an ad­e­quate level. If it turns out that su­per­in­tel­li­gence is cen­turies away and such re­search is pre­ma­ture, lit­tle will have been lost. But if our pro­jec­tions were too op­ti­mistic, and su­per­in­tel­li­gence is im­mi­nent, then do­ing such re­search now rather than later be­comes vi­tal.

Cur­rently three or­ga­ni­za­tions are do­ing such re­search full-time: the Fu­ture of Hu­man­ity In­sti­tute at Oxford, the Fu­ture of Life In­sti­tute at MIT, and the Ma­chine In­tel­li­gence Re­search In­sti­tute in Berkeley. Other groups are helping and fol­low­ing the field, and some cor­po­ra­tions like Google are also get­ting in­volved. Still, the field re­mains tiny, with only a few dozen re­searchers and a few mil­lion dol­lars in fund­ing. Efforts like Su­per­in­tel­li­gence are at­tempts to get more peo­ple to pay at­ten­tion and help the field grow.

If you’re in­ter­ested about learn­ing more, you can visit these groups’ web­sites at https://​​www.fhi.ox.ac.uk, http://​​fu­ture­oflife.org/​​, and http://​​in­tel­li­gence.org.