Formal Metaethics and Metasemantics for AI Alignment


A Brief Introduction to MetaEthical.AI

tl;dr: AIXI for Friendliness

[Crossposted from my new blog.]

Abstract: We construct a fully technical ethical goal function for AI by directly tackling the philosophical problems of metaethics and mental content. To simplify our reduction of these philosophical challenges into “merely” engineering ones, we suppose that unlimited computation and a complete low-level causal model of the world and the adult human brains in it are available.

Given such a model, the AI attributes beliefs and values to a brain in two stages. First, it identifies the syntax of a brain’s mental content by selecting a decision algorithm which is i) isomorphic to the brain’s causal processes and ii) best compresses its behavior while iii) maximizing charity. The semantics of that content then consists first in sense data that primitively refer to their own occurrence and then in logical and causal structural combinations of such content.

The resulting decision algorithm can capture how we decide what to do, but it can also identify the ethical factors that we seek to determine when we decide what to value or even how to decide. Unfolding the implications of those factors, we arrive at what we should do. Altogether, this allows us to imbue the AI with the necessary concepts to determine and do what we should program it to do.
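
To give a flavor of the selection step just described, here is a minimal illustrative sketch in setlX, the language the project itself is written in. It is not taken from the MetaEthical.AI codebase: isomorphicTo, compressionCost and charityScore are hypothetical stubs standing in for the real isomorphism, compression and charity criteria, and the candidates in the toy usage are mere stand-in strings rather than genuine decision algorithms.

    // Illustrative sketch only, not from the MetaEthical.AI codebase.
    isomorphicTo    := procedure(d, brainModel) { return true; };  // stub: admit every candidate
    compressionCost := procedure(d, brainModel) { return #d;   };  // stub: length as a crude proxy for description length
    charityScore    := procedure(d)             { return 0;    };  // stub: neutral charity

    // Among candidates isomorphic to the brain's causal model, pick the one
    // that best trades off compressing its behavior against maximizing charity.
    selectDecisionAlgorithm := procedure(candidates, brainModel) {
        admissible := { d : d in candidates | isomorphicTo(d, brainModel) };
        best := ""; bestScore := 0; found := false;
        for (d in admissible) {
            s := charityScore(d) - compressionCost(d, brainModel);
            if (!found || s > bestScore) {
                best := d; bestScore := s; found := true;
            }
        }
        return best;
    };

    // Toy usage: with the stubs above, the shorter "algorithm" wins.
    print(selectDecisionAlgorithm({ "aLongWindedCandidate", "aTerseCandidate" }, "brainModel"));  // prints aTerseCandidate

In the real construction, of course, the candidates are formal decision algorithms and the admissibility and scoring criteria are far subtler; the sketch only mirrors the shape of the selection.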

See the open source code and commentary at www.metaethical.ai

Motivation

Researchers like Eliezer Yudkowsky and Nick Bostrom, among others, have argued for the urgent need to develop a rigorous framework for ensuring that smarter-than-human intelligence will be beneficial for humanity. I think of them and the community around their respective nonprofits, the Machine Intelligence Research Institute and the Oxford Future of Humanity Institute, as tending to share a cluster of views, including the following:

  • We should have a wide probability distribution over when human-level AI will be developed. If we go by expert predictions, most predict it will arrive within this century. A cautionary approach should prepare for even more rapid timelines.

  • Once AI that is at least as intelligent as its human creators has been developed, there is a positive feedback loop in which it can take over the task of improving its own intelligence, quickly resulting in a superintelligence vastly greater than that of humans.

  • A silicon-based intelligence will have many natural advantages that would further compound this process, e.g. ease of faithful replication, readily available additional hardware, an already million-fold increase in serial computational speed relative to biological neurons, and the long-term exponential Moore’s law trend as well as realistic plans and models for its continuation.

  • Virtually any sufficiently advanced intelligence will converge upon certain instrumental goals to persist and acquire more resources and power, if only to better serve whatever intrinsic goals it may have.

  • There is no automatic guarantee that greater intelligence coincides with better ethics. There is also tremendous economic incentive to develop ever smarter AI but not necessarily to make it safer or beneficial in the long run. If anything, each private party’s incentive may be to cut corners on safety to get to market quicker.

  • Many naive approaches to aligning AI with our values fail. Human values have a great deal of hidden complexity, and missing just one dimension can lead to very undesirable outcomes. Therefore, a metaethical approach seems to be more promising than hoping to capture all ethical principles at the object level.

My own inquiry into metaethics began long before these ideas were written. In fact, I had even reached a point where I felt I could explain my reductionist metaethics to other philosophers. But having followed the development of the AI safety literature with great interest, I felt a renewed sense of purpose and urgency. It seemed we would need not only to solve perennial philosophical problems but to do so with sufficient precision to make a computer understand them. What is more, it looked like we were in a race to accomplish it all before the arguably exponential advancement in AI crossed some unknown threshold.

Having mentioned all this, I will not be arguing for the above claims here. And while this forms my primary motivation, I actually don’t think agreement with any of them is necessary to appreciate the metaethics and metasemantics I develop here. I have spent enough time in academic philosophy to appreciate such theories in the theoretical spirit in which they have often been developed. Formulating them in code as I have done could be seen as just a notational variant of the more conventional expression of certain rigorous philosophical theories in mathematical logic. Doing so helps us avoid misleading vagueness and ambiguity and ensures maximal precision in our thinking and communicating, all of which can be appreciated without regard to any practical applications.

Still, I hope many of you have already come to appreciate, or soon will, some of the backdrop of this MIRI/FHI cluster of views. It’s said that necessity is the mother of invention. It has certainly driven me to be more ambitious and aim for higher precision than I thought possible in philosophy. To have any hope of success, I realized I would need to delve into mathematics and computer science and bridge the conceptual divide. In doing so, I was excited to discover new inspiration and draw connections I doubt I would have made otherwise. And taking on an engineering mindset, I found myself pruning search trees that initially sounded appealing but turned out not to be fruitful, while finding new appreciation for theories that shed new light and enabled further technical progress.

While many areas of philosophy can benefit from a more technical mindset, I think that, conversely, many in computer science and other mathematical fields may be too eager to apply whatever technical tools they currently have at their disposal without pausing to ponder whether a problem is still at a philosophical stage in which important conceptual advances must first take place. Perhaps those advances have even been made already in academic philosophy, but they are unaware of them, while the philosophers in turn are unaware of how to formalize them.

What follows is a mixture of original contributions to philosophical problems, some standard or not-so-standard components borrowed from across computer science and philosophy, and novel ways of weaving them all together. Throughout it all, I have tried my best to balance faithfulness to the subtleties of philosophical reality, the rigor of formalizing these theories, the urgency of making and communicating this progress, and the practicalities of engineering an initial prototype of a wildly ambitious project.

While I don’t necessarily trust our civilization to get philosophy right, I think it is quite good at making progress on well-defined technical problems. I hope I have largely succeeded in turning the philosophical problems of getting an AI to understand and share our values into an engineering problem, and hopefully one we can solve in time.

See the open source code and commentary at www.metaethical.ai

An Apology

In an ideal world, I would have accomplished the above while explaining and justifying each philosophical step up to the standards of contemporary analytic philosophy and relating it to the current and historical literature. Moreover, on the technical side, the mathematical formulas would be written and typeset in beautiful LaTeX with ample diagrams and gentle tutorials.

Or, you know, I could have at least written it in English. Instead, I chose to write it in a little-known programming language called setlX (although I’ve since interspersed the code with considerable philosophical comments). The choice at the time, and perhaps even now, was between struggling to write math with limited experience or institutional support, and leveraging my long experience and intuition with programming to write essentially the same content in a language with clear semantics in set theory, the classic lingua franca of mathematics. On top of that, I’d have a compiler to check for bugs and an interactive console serving as a concrete interface by which to manipulate very abstract objects and functions.

In further defense of setlX, I find it to be a very elegant and powerful language. Its relatively few primitives are sufficient to concisely construct complex data and algorithms, yet the language is small enough to pick up fairly quickly if you have some experience with programming, mathematics or logic. Not surprisingly, writing in it feels like programming close to the mathematical essence.
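
If you have never seen setlX, a tiny stand-alone snippet (again, illustrative only and not from the project) may convey its set-theoretic flavor:

    // Sets, comprehensions and procedures in setlX.
    square := procedure(n) { return n * n; };
    evens  := { n : n in {1..20} | n % 2 == 0 };   // the even numbers from 2 to 20
    table  := { [n, square(n)] : n in evens };     // set of [n, n^2] pairs
    print(table);                                  // {[2, 4], [4, 16], ..., [20, 400]}
    print(#table);                                 // cardinality of the set: 10

The actual codebase builds its much larger structures out of these same primitives: sets, lists, comprehensions and first-class procedures.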

Despite its initial unfamiliarity, I hope you will give setlX, my code and my commentary a chance. Even if you are not technically inclined, I expect that with a little patience a synopsis can be gleaned from the comments, the most important of which I’ve gathered into a Key Concepts outline beside the code. Other than that, I have not imposed much of an ordering but have tried to enable freer exploration by hyperlinking each procedure call to that procedure’s definition, which often has at least a short explanatory comment.

Where I have left important explanations and justifications sparse, I’ve tried to include links to those of others who have likely done a better job than I would have. I wish I could have done more, but I have mainly been optimizing for solving the problem rather than communicating the solution.

See the open source code and commentary at www.metaethical.ai