Three Approaches to “Friendliness”

I put “Friendliness” in quotes in the title, because I think what we really want, and what MIRI seems to be working towards, is closer to “optimality”: create an AI that minimizes the expected amount of astronomical waste. In what follows I will continue to use “Friendly AI” to denote such an AI since that’s the established convention.

I’ve often stated my objections to MIRI’s plan to build an FAI directly (instead of after human intelligence has been substantially enhanced). But that’s not because, as some have suggested while criticizing MIRI’s FAI work, we can’t foresee what problems need to be solved. I think it’s because we can largely foresee what kinds of problems need to be solved to build an FAI, but they all look superhumanly difficult, either due to their inherent difficulty, or the lack of opportunity for “trial and error”, or both.

When people say they don’t know what problems need to be solved, they may be mostly talking about “AI safety” rather than “Friendly AI”. If you think in terms of “AI safety” (i.e., making sure some particular AI doesn’t cause a disaster) then that does look like a problem that depends on what kind of AI people will build. “Friendly AI”, on the other hand, is really a very different problem, where we’re trying to figure out what kind of AI to build in order to minimize astronomical waste. I suspect this may explain the apparent disagreement, but I’m not sure. I’m hoping that explaining my own position more clearly will help figure out whether there is a real disagreement, and what’s causing it.

The basic issue I see is that there is a large number of serious philosophical problems facing an AI that is meant to take over the universe in order to minimize astronomical waste. The AI needs a full solution to moral philosophy to know which configurations of particles/fields (or perhaps which dynamical processes) are most valuable and which are not. Moral philosophy in turn seems to have dependencies on the philosophy of mind, consciousness, metaphysics, aesthetics, and other areas. The FAI also needs solutions to many problems in decision theory, epistemology, and the philosophy of mathematics, in order to not be stuck making wrong or suboptimal decisions for eternity. These essentially cover all the major areas of philosophy.

For an FAI builder, there are three ways to deal with the presence of these open philosophical problems, as far as I can see. (There may be other ways for the future to turn out well without the AI builders making any special effort, for example if being philosophical is just a natural attractor for any superintelligence, but I don’t see any way to be confident of this ahead of time.) I’ll name them for convenient reference, but keep in mind that an actual design may use a mixture of approaches.

  1. Normative AI—Solve all of the philosophical problems ahead of time, and code the solutions into the AI.

  2. Black-Box Metaphilosophical AI—Program the AI to use the minds of one or more human philosophers as a black box to help it solve philosophical problems, without the AI builders understanding what “doing philosophy” actually is.

  3. White-Box Metaphilosophical AI—Understand the nature of philosophy well enough to specify “doing philosophy” as an algorithm and code it into the AI.

The problem with Normative AI, besides the obvious inherent difficulty (as evidenced by the slow progress of human philosophers after decades, sometimes centuries of work), is that it requires us to anticipate all of the philosophical problems the AI might encounter in the future, from now until the end of the universe. We can certainly foresee some of these, like the problems associated with agents being copyable, or the AI radically changing its ontology of the world, but what might we be missing?

Black-Box Metaphilosophical AI is also risky, because it’s hard to test/debug something that you don’t understand. Besides that general concern, designs in this category (such as Paul Christiano’s take on indirect normativity) seem to require that the AI achieve superhuman levels of optimizing power before being able to solve its philosophical problems, which seems to mean that a) there’s no way to test them in a safe manner, and b) it’s unclear why such an AI won’t cause disaster in the time period before it achieves philosophical competence.

White-Box Metaphilosophical AI may be the most promising approach. There is no strong empirical evidence that solving metaphilosophy is superhumanly difficult, simply because not many people have attempted to solve it. But I don’t think that a reasonable prior combined with what evidence we do have (i.e., absence of visible progress or clear hints as to how to proceed) gives much reason for optimism either.

To recap, I think we can largely already see what kinds of problems must be solved in order to build a superintelligent AI that will minimize astronomical waste while colonizing the universe, and it looks like they probably can’t be solved correctly with high confidence until humans become significantly smarter than we are now. I think I understand why some people disagree with me (e.g., Eliezer thinks these problems just aren’t that hard, relative to his abilities), but I’m not sure why some others say that we don’t yet know what the problems will be.