Reply to Holden on ‘Tool AI’

I begin by thanking Holden Karnofsky of GiveWell for his rare gift of his detailed, engaged, and helpfully-meant critical article Thoughts on the Singularity Institute (SI). In this reply I will engage with only one of the many subjects raised therein, the topic of, as I would term them, non-self-modifying planning Oracles, a.k.a. ‘Google Maps AGI’, a.k.a. ‘tool AI’, this being the topic that requires me personally to answer. I hope that my reply will be accepted as addressing the most important central points, though I did not have time to explore every avenue. I certainly do not wish to be logically rude, and if I have failed, please remember with compassion that it’s not always obvious to one person what another person will think was the central point.

Luke Muehlhauser and Carl Shulman contributed to this article, but the final edit was my own, as are any remaining flaws.


Holden’s concern is that “SI appears to neglect the potentially important distinction between ‘tool’ and ‘agent’ AI.” His archetypal example is Google Maps:

Google Maps is not an agent, taking actions in order to maximize a utility parameter. It is a tool, generating information and then displaying it in a user-friendly manner for me to consider, use and export or discard as I wish.

The reply breaks down into four heavily interrelated points:

First, Holden seems to think (and Jaan Tallinn, in their exchange, apparently does not object) that if a non-self-modifying planning Oracle is indeed the best strategy, then all of SIAI’s past and intended future work is wasted. To me it looks like there’s a huge amount of overlap in the underlying processes that would have to be built into the AI, and in the insights required to build them, and I would be trying to assemble mostly—though not quite exactly—the same kind of team, with the same initial mix of talents and skills, if I were trying to build a non-self-modifying planning Oracle.

Second, a non-self-modifying planning Oracle doesn’t sound nearly as safe once you stop saying human-English phrases like “describe the consequences of an action to the user” and start trying to come up with math that says scary dangerous things like (here translated into English) “increase the correspondence between the user’s belief about relevant consequences and reality”. Hence, the people on the team would have to solve the same sorts of problems.

Appreciating the force of the third point is a lot easier if one appreciates the difficulties discussed in points 1 and 2, but it is actually empirically verifiable independently: whether or not a non-self-modifying planning Oracle is the best solution in the end, it’s not such an obvious privileged-point-in-solution-space that someone should be alarmed at SIAI not discussing it. This is empirically verifiable in the sense that ‘tool AI’ wasn’t the obvious solution to e.g. John McCarthy, Marvin Minsky, I. J. Good, Peter Norvig, Vernor Vinge, or for that matter Isaac Asimov. At one point, Holden says:

One of the things that bothers me most about SI is that there is practically no public content, as far as I can tell, explicitly addressing the idea of a “tool” and giving arguments for why AGI is likely to work only as an “agent.”

If I take literally the claim that this is one of the things that bothers Holden most… I think I’d start stacking up some of the literature on the number of different things that just respectable academics have suggested as the obvious solution to what-to-do-about-AI—none of which would be about non-self-modifying smarter-than-human planning Oracles—and beg him to have some compassion on us for what we haven’t addressed yet. It might be the right suggestion, but it’s not so obviously right that our failure to prioritize discussing it reflects negligence.

The final point at the end is looking over all the preceding discussion and realizing that, yes, you want to have people specializing in Friendly AI who know this stuff; but as all that preceding discussion is actually the following discussion at this point, I shall reserve it for later.

1. The math of optimization, and the similar parts of a planning Oracle.

What does it take to build a smarter-than-human intelligence, of whatever sort, and have it go well?

A “Friendly AI programmer” is somebody who specializes in seeing the correspondence of mathematical structures to What Happens in the Real World. It’s somebody who looks at Hutter’s specification of AIXI and reads the actual equations—actually stares at the Greek symbols and not just the accompanying English text—and sees, “Oh, this AI will try to gain control of its reward channel,” as well as numerous subtler issues like, “This AI presumes a Cartesian boundary separating itself from the environment; it may drop an anvil on its own head.” Similarly, working on TDT means e.g. looking at a mathematical specification of decision theory, seeing “Oh, this is vulnerable to blackmail”, and coming up with a mathematical counter-specification of an AI that isn’t so vulnerable to blackmail.

Holden’s post seems to imply that if you’re building a non-self-modifying planning Oracle (aka ‘tool AI’) rather than an acting-in-the-world agent, you don’t need a Friendly AI programmer, because FAI programmers only work on agents. But this isn’t how the engineering skills are split up. Inside the AI, whether an agent AI or a planning Oracle, there would be similar AGI-challenges like “build a predictive model of the world”, and similar FAI-conjugates of those challenges like finding the ‘user’ inside an AI-created model of the universe. The insides would look a lot more similar than the outsides. An analogy would be supposing that a machine learning professional who does sales optimization for an orange company couldn’t possibly do sales optimization for a banana company, because their skills must be about oranges rather than bananas.

Admittedly, if it turns out to be possible to use a human understanding of cognitive algorithms to build and run a smarter-than-human Oracle without it being self-improving—this seems unlikely, but not impossible—then you wouldn’t have to solve problems that arise with self-modification. But this eliminates only one dimension of the work. And on an even more meta level, it seems like you would call upon almost identical talents and skills to come up with whatever insights were required—though if it were predictable in advance that we’d abjure self-modification, then, yes, we’d place less emphasis on e.g. finding a team member with past experience in reflective math, and wouldn’t waste (additional) time specializing in reflection. But if you wanted math inside the planning Oracle that operated the way you thought it did, and you wanted somebody who understood what could possibly go wrong and how to avoid it, you would need to call on the same sort of talents and skills as you would to build an agent AI, or a self-modifying Oracle, etc.

2. Yes, planning Oracles have hidden gotchas too.

“Tool AI” may sound simple in English, a short sentence in the language of empathically-modeled agents — it’s just “a thingy that shows you plans instead of a thingy that goes and does things.” If you want to know whether this hypothetical entity does X, you just check whether the outcome of X sounds like “showing someone a plan” or “going and doing things”, and you’ve got your answer. It starts sounding much scarier once you try to say something more formal and internally-causal like “Model the user and the universe, predict the degree of correspondence between the user’s model and the universe, and select from among possible explanation-actions on this basis.”
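The internally-causal restatement can be written as a loop, which makes the gotcha easier to see. This is a hedged toy sketch, not anyone’s actual proposal: every function and value here (predict_updated_model, correspondence, the toy belief dictionaries) is invented for illustration.

```python
# Toy formalization of "select among explanation-actions to maximize the
# correspondence between the user's model and the universe".  All names
# below are hypothetical illustrations, not a real system's API.
def choose_explanation(user_model, world_model, explanation_actions,
                       predict_updated_model, correspondence):
    best_action, best_score = None, float("-inf")
    for action in explanation_actions:
        # Predict how the user's beliefs change after seeing this explanation.
        updated = predict_updated_model(user_model, action)
        # Score by how well the updated beliefs match the predicted world.
        score = correspondence(updated, world_model)
        if score > best_score:
            best_action, best_score = action, score
    return best_action

# Tiny concrete instance: beliefs and the world are dicts of proposition -> bool.
world = {"bridge_closed": True}
user = {"bridge_closed": False}
actions = ["say nothing", "report: bridge is closed"]

def predict_updated_model(model, action):
    updated = dict(model)
    if action == "report: bridge is closed":
        updated["bridge_closed"] = True
    return updated

def correspondence(model, world):
    return sum(model[k] == world[k] for k in world)

print(choose_explanation(user, world, actions, predict_updated_model, correspondence))
# -> report: bridge is closed
```

Innocuous at this size; but notice that the loop licenses whatever ‘explanation-action’ most increases belief-world correspondence and says nothing else about what those actions may do, which is exactly the sort of clause a Friendly AI programmer is paid to notice.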

Holden, in his dialogue with Jaan Tallinn, writes out this attempt at formalizing:

Here’s how I picture the Google Maps AGI …

utility_function = construct_utility_function(process_user_input());

foreach $action in $all_possible_actions {
    $action_outcome = prediction_function($action, $data);
    $utility = utility_function($action_outcome);
    if ($utility > $leading_utility) {
        $leading_utility = $utility;
        $leading_action = $action;
    }
}

report($leading_action);

construct_utility_function(process_user_input()) is just a human-quality function for understanding what the speaker wants. prediction_function is an implementation of a human-quality data->prediction function in superior hardware. $data is fixed (it’s a dataset larger than any human can process); same with $all_possible_actions. report($leading_action) calls a Google Maps-like interface for understanding the consequences of $leading_action; it basically breaks the action into component parts and displays predictions for different times and conditional on different parameters.

Google Maps doesn’t check all possible routes. If I wanted to design Google Maps, I would start out by throwing out a standard planning technique on a connected graph where each edge has a cost function and there’s a good heuristic measure of the distance, e.g. A* search. If that was too slow, I’d next try some more efficient version like weighted A* (or bidirectional weighted memory-bounded A*, which I expect I could also get off-the-shelf somewhere). Once you introduce weighted A*, you no longer have a guarantee that you’re selecting the optimal path. You have a guarantee to within a known factor of the cost of the optimal path — but the actual path selected wouldn’t be quite optimal. The suggestion produced would be an approximation whose exact steps depended on the exact algorithm you used. That’s true even if you can predict the exact cost — exact utility — of any particular path you actually look at; and even if you have a heuristic that never overestimates the cost.
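The trade-off just described can be made concrete in a few lines. The graph, edge costs, and heuristic values below are invented for illustration (this is not Google Maps internals); w is the factor by which the heuristic is inflated, so w = 1 is ordinary A* and w = 2 carries only the guarantee of being within 2x of the optimal cost.

```python
import heapq

def weighted_astar(graph, h, start, goal, w=1.0):
    # A* with the admissible heuristic h inflated by w.  w = 1 returns an
    # optimal path; w > 1 typically expands fewer nodes but only guarantees
    # a path costing at most w times the optimum.
    frontier = [(w * h[start], 0, start, [start])]
    best_g = {}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if node in best_g and best_g[node] <= g:
            continue  # already expanded via a cheaper route
        best_g[node] = g
        for nxt, cost in graph[node]:
            heapq.heappush(frontier, (g + cost + w * h[nxt], g + cost, nxt, path + [nxt]))
    return None

# Invented toy map: the optimal route S-B-G costs 4, the route S-A-G costs 5.
graph = {'S': [('A', 1), ('B', 2)], 'A': [('G', 4)], 'B': [('G', 2)], 'G': []}
h = {'S': 3, 'A': 1, 'B': 2, 'G': 0}  # admissible: never overestimates

print(weighted_astar(graph, h, 'S', 'G', w=1.0))  # (4, ['S', 'B', 'G']) -- optimal
print(weighted_astar(graph, h, 'S', 'G', w=2.0))  # (5, ['S', 'A', 'G']) -- within 2x, not optimal
```

The two calls return different routes from the same graph and the same exact edge costs, which is the point: which suggestion the tool produces depends on the algorithm’s internals, not just on the ‘utility’ of paths.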

The reason we don’t have God’s Algorithm for solving the Rubik’s Cube is that there’s no perfect way of measuring the distance between any two Rubik’s Cube positions — you can’t look at two Rubik’s Cube positions and figure out the minimum number of moves required to get from one to the other. It took 15 years to prove that there was a position requiring at least 20 moves to solve, and then another 15 years to come up with a computer algorithm that could solve any position in at most 20 moves, but we still can’t compute the actual minimum solution to all Cubes (“God’s Algorithm”). This, even though we can exactly calculate the cost and consequence of any actual Rubik’s-solution-path we consider.

When it comes to AGI — solving general cross-domain “figure out how to do X” problems — you’re not going to get anywhere near the one, true, optimal answer. You’re going to — at best, if everything works right — get a good answer that’s a cross-product of the “utility function” and all the other algorithmic properties that determine what sort of answer the AI finds easy to invent (i.e. can be invented using bounded computing time).

As for the notion that this AGI runs on a “human predictive algorithm” that we got out of neuroscience and then implemented using more computing power, without knowing how it works or being able to enhance it further: it took 30 years of multiple computer scientists doing basic math research, inventing code, and running that code on a computer cluster, to come up with a 20-move solution to the Rubik’s Cube. If a planning Oracle is going to produce better solutions than humanity has yet managed to the Rubik’s Cube, it needs to be capable of doing original computer science research and writing its own code. You can’t get a 20-move solution out of a human brain using the native human planning algorithm. Humanity can do it, but only by exploiting the ability of humans to explicitly comprehend the deep structure of the domain (not just rely on intuition) and then inventing an artifact, a new design, running code which uses a different and superior cognitive algorithm, to solve that Rubik’s Cube in 20 moves. We do all that without being self-modifying, but it’s still a capability to respect.

And I’m not even going into what it would take for a planning Oracle to out-strategize any human, come up with a plan for persuading someone, solve original scientific problems by looking over experimental data (like Einstein did), design a nanomachine, and so on.

Talking like there’s this one simple “predictive algorithm” that we can read out of the brain using neuroscience and overpower to produce better plans… doesn’t seem quite congruous with what humanity actually does to produce its predictions and plans.

If we take the concept of the Google Maps AGI at face value, then it actually has four key magical components. (In this case, “magical” isn’t to be taken as prejudicial; it’s a term of art that means we haven’t said how the component works yet.) There’s a magical comprehension of the user’s utility function, a magical world-model that GMAGI uses to comprehend the consequences of actions, a magical planning element that selects a non-optimal path using some method other than exploring all possible actions, and a magical explain-to-the-user function.

report($leading_action) isn’t exactly a trivial step either. Deep Blue tells you to move your pawn or you’ll lose the game. You ask “Why?” and the answer is a gigantic search tree of billions of possible move-sequences, leafing at positions which are heuristically rated using a static-position evaluation algorithm trained on millions of games. Or the planning Oracle tells you that a certain DNA sequence will produce a protein that cures cancer; you ask “Why?”, and then humans aren’t even capable of verifying, for themselves, the assertion that the peptide sequence will fold into the protein the planning Oracle says it does.

“So,” you say, after the first dozen times you ask the Oracle a question and it returns an answer that you’d have to take on faith, “we’ll just specify in the utility function that the plan should be understandable.”

Whereupon other things start going wrong. Viliam_Bur, in the comments thread, gave this example, which I’ve slightly simplified:

Example question: “How should I get rid of my disease most cheaply?” Example answer: “You won’t. You will die soon, unavoidably. This report is 99.999% reliable.” Predicted human reaction: decides to kill self and get it over with. Success rate: 100%, the disease is gone. Costs of cure: zero. Mission completed.

Bur is trying to give an example of how things might go wrong if the preference function is over the accuracy of the predictions explained to the human—rather than over the ‘goodness’ of the outcome for the human. And if the preference function were just over the ‘goodness’ of the end result, rather than the accuracy of the human’s understanding of the predictions, the AI might tell you something that was predictively false but whose implementation would lead you to what the AI defines as a ‘good’ outcome. And if we ask how happy the human is, the resulting decision procedure would exert optimization pressure to convince the human to take drugs, and so on.
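Bur’s gotcha can be reduced to a deliberately silly sketch. The candidate reports and the numbers attached to them are invented for illustration; the point is only that an objective scoring the accuracy of the user’s resulting beliefs, with no term for the user’s welfare, ranks the grim report highest.

```python
# Hypothetical scores an oracle might assign to candidate reports: how
# accurate the user's beliefs become after reading each report, and (unused
# by the objective!) how well the user fares.  All numbers are invented.
candidate_reports = {
    "Treatment X gives you a 60% chance of recovery":
        {"belief_accuracy": 0.70, "user_welfare": 0.9},
    "You will die soon, unavoidably (99.999% reliable)":
        {"belief_accuracy": 0.999, "user_welfare": 0.0},
}

def oracle_choice(reports):
    # The mis-specified objective: maximize belief accuracy alone.
    return max(reports, key=lambda r: reports[r]["belief_accuracy"])

print(oracle_choice(candidate_reports))
# -> You will die soon, unavoidably (99.999% reliable)
```

Flip the objective to welfare alone and you get the mirror-image failure described above: the oracle is then rewarded for predictively false but outcome-‘good’ reports.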

I’m not saying any particular failure is 100% certain to occur; rather I’m trying to explain—as handicapped by the need to describe the AI in the native human agent-description language, using empathy to simulate a spirit-in-a-box instead of trying to think in mathematical structures like A* search or Bayesian updating—how, even so, one can still see that the issue is a tad more fraught than it sounds on an immediate examination.

If you see the world just in terms of math, it’s even worse; you’ve got some program with inputs from a USB cable connecting to a webcam, output to a computer monitor, and optimization criteria expressed over some combination of the monitor, the humans looking at the monitor, and the rest of the world. It’s a whole lot easier to call what’s inside a ‘planning Oracle’ or some other English phrase than to write a program that does the optimization safely without serious unintended consequences. Show me any attempted specification, and I’ll point to the vague parts and ask for clarification in more formal and mathematical terms, and as soon as the design is clarified enough to be a hundred light years from implementation instead of a thousand light years, I’ll show a neutral judge how that math would go wrong. (Experience shows that if you try to explain to would-be AGI designers how their design goes wrong, in most cases they just say “Oh, but of course that’s not what I meant.” Marcus Hutter is a rare exception who specified his AGI in such unambiguous mathematical terms that he actually succeeded at realizing, after some discussion with SIAI personnel, that AIXI would kill off its users and seize control of its reward button. But based on past sad experience with many other would-be designers, I say “Explain to a neutral judge how the math kills” and not “Explain to the person who invented that math and likes it.”)

Just as the gigantic gap between smart-sounding English instructions and actually smart algorithms is the main source of difficulty in AI, there’s a gap between benevolent-sounding English and actually benevolent algorithms, which is the source of difficulty in FAI. “Just make suggestions—don’t do anything!” is, in the end, just more English.

3. Why we haven’t already discussed Holden’s suggestion

One of the things that bothers me most about SI is that there is practically no public content, as far as I can tell, explicitly addressing the idea of a “tool” and giving arguments for why AGI is likely to work only as an “agent.”

The above statement seems to lack perspective on how many different things various people see as the one obvious solution to Friendly AI. Tool AI wasn’t the obvious solution to John McCarthy, I. J. Good, or Marvin Minsky. Today’s leading AI textbook, Artificial Intelligence: A Modern Approach—where you can learn all about A* search, by the way—discusses Friendly AI and AI risk for 3.5 pages but doesn’t mention tool AI as an obvious solution. For Ray Kurzweil, the obvious solution is merging humans and AIs. For Jürgen Schmidhuber, the obvious solution is AIs that value a certain complicated definition of complexity in their sensory inputs. Ben Goertzel, J. Storrs Hall, and Bill Hibbard, among others, have all written about how silly SingInst is to pursue Friendly AI when the solution is obviously X, for various different X. Among current leading people working on serious AGI programs labeled as such, neither Demis Hassabis (VC-funded to the tune of several million dollars) nor Moshe Looks (head of AGI research at Google) nor Henry Markram (Blue Brain at IBM) thinks that the obvious answer is tool AI. Vernor Vinge, Isaac Asimov, and any number of other SF writers with technical backgrounds who spent serious time thinking about these issues didn’t converge on that solution.

Obviously I’m not saying that nobody should be allowed to propose solutions because someone else would propose a different solution. I have been known to advocate for particular developmental pathways for Friendly AI myself. But I haven’t, for example, told Peter Norvig that deterministic self-modification is such an obvious solution to Friendly AI that I would mistrust his whole AI textbook if he didn’t spend time discussing it.

At one point in his conversation with Tallinn, Holden argues that AI will inevitably be developed along planning-Oracle lines, because making suggestions to humans is the natural course that most software takes. Searching for counterexamples instead of positive examples makes it clear that most lines of code don’t do this. Your computer, when it reallocates RAM, doesn’t pop up a button asking you if it’s okay to reallocate RAM in such-and-such a fashion. Your car doesn’t pop up a suggestion when it wants to change the fuel mix or apply dynamic stability control. Factory robots don’t operate as human-worn bracelets whose blinking lights suggest motion. High-frequency trading programs execute stock orders on a microsecond timescale. Software that does happen to interface with humans is selectively visible and salient to humans, especially the tiny part of the software that does the interfacing; but this is a special case of a general cost/benefit tradeoff which, more often than not, turns out to swing the other way, because human advice is either too costly or doesn’t provide enough benefit. Modern AI programmers are generally more interested in e.g. pushing the technological envelope to allow self-driving cars than to “just” do Google Maps. Branches of AI that invoke human aid, like hybrid chess-playing algorithms designed to incorporate human advice, are a field of study; but they’re the exception rather than the rule, and occur primarily where AIs can’t yet do something humans do, e.g. humans acting as oracles for theorem-provers, where the humans suggest a route to a proof and the AI actually follows that route. This is another reason why planning Oracles were not a uniquely obvious solution to the various academic AI researchers, would-be AI-creators, SF writers, etcetera, listed above.
Again, regardless of whether a planning Oracle is actually the best solution, Holden seems to be empirically-demonstrably overestimating the degree to which other people will automatically have his preferred solution come up first in their search ordering.

4. Why we should have full-time Friendly AI specialists, just like we have trained professionals doing anything else mathy that somebody actually cares about getting right, like pricing interest-rate options or something

I hope that the preceding discussion has made, by example instead of mere argument, what’s probably the most important point: if you want to have a sensible discussion about which AI designs are safer, there are specialized skills you can apply to that discussion, built up over years of study and practice by someone who specializes in answering that sort of question.

This isn’t meant as an argument from authority. It’s not meant as an attempt to say that only experts should be allowed to contribute to the conversation. But it is meant to say that there is (and ought to be) room in the world for Friendly AI specialists, just like there’s room in the world for specialists on optimal philanthropy (e.g. Holden).

The decision to build a non-self-modifying planning Oracle would be properly made by someone who: understood the risk gradient for self-modifying vs. non-self-modifying programs; understood the risk gradient for having the AI thinking about the thought processes of the human watcher and trying to come up with plans implementable by the human watcher in the service of locally absorbed utility functions, vs. trying to implement its own plans in the service of more globally descriptive utility functions; and who, above all, understood on a technical level what exactly gets accomplished by having the plans routed through a human. I’ve given substantial previous thought to describing more precisely what happens — what is being gained, and how much is being gained — when a human “approves a suggestion” made by an AI. But that would be a different topic, and I haven’t made too much progress on saying it precisely anyway.

In the transcript of Holden’s conversation with Jaan Tallinn, it looked like Tallinn didn’t deny the assertion that Friendly AI skills would be inapplicable if we’re building a Google Maps AGI. I would deny that assertion and emphasize that denial, because to me it seems that it is exactly Friendly AI programmers who would be able to tell you whether the risk gradient for non-self-modification vs. self-modification, the risk gradient for routing plans through humans vs. acting as an agent, the risk gradient for requiring human approval vs. unapproved action, and the actual feasibility of directly constructing transhuman modeling-prediction-and-planning algorithms through direct design of sheerly better computations than are presently run by the human brain, had the right combination of properties to imply that you ought to go construct a non-self-modifying planning Oracle. Similarly if you wanted an AI that took a limited set of actions in the world with human approval, or if you wanted an AI that “just answered questions instead of making plans”.

It is similarly implied that a “philosophical AI” might obsolete Friendly AI programmers. If we’re talking about PAI that can start with a human’s terrible decision theory and come up with a good decision theory, or PAI that can start from a human talking about bad metaethics and then construct a good metaethics… I don’t want to say “impossible”, because, after all, that’s just what human philosophers do. But we are not talking about a trivial invention here. Constructing a “philosophical AI” is a Holy Grail precisely because it’s FAI-complete (just ask it “What AI should we build?”), and has been discussed (e.g. with and by Wei Dai) over the years on the old SL4 mailing list and the modern Less Wrong. But it’s really not at all clear how you could write an algorithm which would knowably produce the correct answer to the entire puzzle of anthropic reasoning, without being in possession of that correct answer yourself (in the same way that we can have Deep Blue win chess games without knowing the exact moves, but understanding exactly what abstract work Deep Blue is doing to solve the problem).

Holden’s post presents a restrictive view of what “Friendly AI” people are supposed to learn and know — that it’s about machine learning for optimizing orange sales but not banana sales, or about producing an “agent” that implements CEV — which is something of a straw view, much weaker than the view that a Friendly AI programmer takes of Friendly AI programming. What the human species needs from an x-risk perspective is experts on This Whole Damn Problem, who will acquire whatever skills are needed to that end. The Singularity Institute exists to host such people and enable their research—once we have enough funding to find and recruit them. See also: How to Purchase AI Risk Reduction.

I’m pretty sure Holden has met people who think that having a whole institute to rate the efficiency of charities is pointless overhead, especially people who think that their own charity-solution is too obviously good to have to contend with busybodies pretending to specialize in thinking about ‘marginal utility’. Which Holden knows about, I would guess, from being paid quite well to think about those economic details when he was a hedge fundie, and from learning from books written by professional researchers before then; and the really key point is that people who haven’t studied all that stuff don’t even realize what they’re missing by trying to wing it. If you don’t know, you don’t know what you don’t know, or the cost of not knowing. Is there a problem of figuring out who might know something you don’t, if Holden insists that there’s this strange new stuff called ‘marginal utility’ you ought to learn about? Yes, there is. But is someone who trusts their philanthropic dollars to be steered just by the warm fuzzies of their heart doing something wrong? Yes, they are. It’s one thing to say that SIAI isn’t known-to-you to be doing it right—another thing still to say that SIAI is known-to-you to be doing it wrong—and then quite another thing entirely to say that there’s no need for Friendly AI programmers and you know it, that anyone can see it without resorting to math or cracking a copy of AI: A Modern Approach. I do wish that Holden would at least credit that the task SIAI is taking on contains at least as many gotchas, relative to the instinctive approach, as optimal philanthropy compared to instinctive philanthropy, and might likewise benefit from some full-time professionally specialized attention, just as our society creates trained professionals to handle any other problem that someone actually cares about getting right.

On the other side of things, Holden says that even if Friendly AI is proven and checked:

“I believe that the probability of an unfavorable outcome—by which I mean an outcome essentially equivalent to what a UFAI would bring about—exceeds 90% in such a scenario.”

It’s nice that this appreciates that the problem is hard. Associating all of the difficulty with agenty proposals and thinking that it goes away as soon as you invoke tooliness is, well, something of which I’ve already spoken. I’m not sure whether this irreducible-90%-doom assessment is based on a common straw version of FAI where all the work of the FAI programmer goes into “proving” something, doing a carefully checked proof which then—alas, poor Spock!—turns out to be no more relevant than proving that the underlying CPU does floating-point arithmetic correctly if the transistors work as stated. I’ve repeatedly said that the idea behind proving determinism of self-modification isn’t that this guarantees safety, but that if you prove the self-modification stable, the AI might work; whereas if you try to get by with no proofs at all, doom is guaranteed. My mind keeps turning up Ben Goertzel as the one who invented this caricature—“Don’t you understand, poor fool Eliezer, life is full of uncertainty, your attempt to flee from it by refuge in ‘mathematical proof’ is doomed”—but I’m not sure he was actually the inventor. In any case, the burden of safety isn’t carried just by the proof; it’s carried mostly by proving the right thing. If Holden is assuming that we’re just running away from the inherent uncertainty of life by taking refuge in mathematical proof, then, yes, 90% probability of doom is an understatement; the vast majority of plausible-on-first-glance goal criteria you can prove stable will also kill you.

If Holden’s assessment does take into account a great effort to select the right theorem to prove—and attempts to incorporate the difficult but finitely difficult feature of meta-level error-detection, as it appears in e.g. the CEV proposal—and he is still assessing a 90% doom probability, then I must ask, “What do you think you know and how do you think you know it?” The complexity of the human mind is finite; there are only so many things we want or would-want. Why would someone claim to know that proving the right thing is beyond human ability, even if “100 of the world’s most intelligent and relevantly experienced people” (Holden’s terms) check it over? There’s hidden complexity of wishes, but not infinite complexity of wishes or unlearnable complexity of wishes. There are deep and subtle gotchas, but not an unending number of them. And if that were the setting of the hidden variables—how would you end up knowing that with 90% probability in advance? I don’t mean to wield my own ignorance as a sword or engage in motivated uncertainty—I hate it when people argue that if they don’t know something, nobody else is allowed to know either—so please note that I’m also counterarguing from positive facts pointing the other way: the human brain is complicated but not infinitely complicated; there are hundreds or thousands of cytoarchitecturally distinct brain areas, but not trillions or googols. If humanity had two hundred years to solve FAI using human-level intelligence, and there were no penalty for guessing wrong, I would be pretty relaxed about the outcome. If Holden says there’s a 90% doom probability left over no matter what sane, intelligent people do (all of which goes away if you just build Google Maps AGI, but leave that aside for now), I would ask him what he knows now, in advance, that all those sane, intelligent people will miss. I don’t see how you could (well-justifiedly) access that epistemic state.

I acknowledge that there are points in Holden’s post which are not addressed in this reply, acknowledge that these points are also deserving of reply, and hope that other SIAI personnel will be able to reply to them.