RFC: Meta-ethical uncertainty in AGI alignment

I’m working on a paper about an idea I previously outlined for addressing false positives in AI alignment research. Below is the first completed draft of one of its subsections, which argues for adopting a particular, necessary hinge proposition for reasoning about aligned AGI. I’d appreciate feedback on this subsection, especially on whether you agree with the line of reasoning and whether I’ve ignored anything important that should be addressed here. Thanks!

AGI alignment is typically phrased in terms of aligning AGI with human interests, but this phrasing hides some of the complexity of the problem behind the question of what “human interests” are. Taking “interests” as a synonym for “values”, we can begin to make progress by treating alignment as, at least in part, the problem of teaching AGI human values (Soares, 2016). Unfortunately, what constitutes human values is currently unknown, since humans may not be aware of the full extent of their own values or may not hold reflectively consistent values (Scanlon, 2003). Further complicating matters, humans are not rational, so their values cannot be deduced from their behavior unless some normative assumptions are made (Tversky, 1969; Armstrong and Mindermann, 2017). This is a special case of Hume’s is-ought problem, that axiology cannot be inferred from ontology alone, and it complicates the problem of training AGI on human values (Hume, 1739).

Perhaps some of this difficulty could be circumvented if a few normative assumptions were made, such as assuming that rational preferences are always better than irrational preferences, or that suffering supervenes on preference satisfaction. But this poses an immediate problem for our false-positive reduction strategy: it introduces additional variables that necessarily increase the chance of a false positive. Alternatively, we might avoid making any specific normative assumptions prior to the creation of aligned AGI by expecting the AGI to discover them via a process like Yudkowsky’s coherent extrapolated volition (Yudkowsky, 2004). This may reduce the number of assumptions needed, but it still requires making at least one, namely that moral facts exist to permit the correct choice of normative assumptions, and it reveals a deep philosophical problem at the heart of AGI alignment: meta-ethical uncertainty.

Meta-ethical uncertainty stems from epistemic circularity and the problem of the criterion: it is not possible to know the criteria by which to assess which moral facts are true, or even whether any moral facts exist, without first assuming to know what is good and true (Chisholm, 1982). We cannot hope to resolve meta-ethical uncertainty here, but we can at least decide what impact particular assumptions about the existence of moral facts have on false positives in AGI alignment: specifically, whether or not moral facts exist and, if they do, which moral facts should be assumed to be true.

On the one hand, suppose we assume that moral facts exist. Then we could build aligned AGI on the presupposition that it can discover moral facts, even if no moral facts were specified in advance, and then use knowledge of these facts to constrain its values so that they align with humanity’s values. Now suppose this assumption is false and moral facts do not exist. Then our moral-facts-assuming AGI would either never discover any moral facts with which to constrain its values, or would constrain itself with arbitrary “moral facts” that are not guaranteed to produce value alignment with humanity.

On the other hand, suppose we assume that moral facts do not exist. Then we must build aligned AGI to reason about and align itself with humanity’s axiology in the absence of any normative assumptions, likely on a non-cognitivist basis such as emotivism. Now suppose this assumption is false and moral facts do exist. Then our moral-facts-denying AGI would discover the existence of moral facts, at least implicitly, through their influence on humanity’s axiology, and would align itself with humanity as if it had started out assuming moral facts existed, but at the cost of solving the much harder problem of learning axiology without the use of normative assumptions.
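The case analysis in the last two paragraphs can be summarized as a 2x2 decision table. Here is a minimal Python sketch of it; the labels and outcome descriptions are my own paraphrase of the argument above, not part of any alignment formalism:

```python
# Decision table for the hinge proposition about moral facts.
# First key: which assumption we build the AGI on.
# Second key: how the world actually is.
outcomes = {
    ("assume_facts", "facts_exist"): (
        "AGI discovers moral facts and constrains its values accordingly"
    ),
    ("assume_facts", "no_facts"): (
        "AGI finds no facts, or adopts arbitrary ones; false positive risk"
    ),
    ("assume_no_facts", "facts_exist"): (
        "AGI recovers facts implicitly via human axiology; harder but aligned"
    ),
    ("assume_no_facts", "no_facts"): (
        "AGI learns human axiology without normative assumptions; harder but aligned"
    ),
}

def false_positive_possible(assumption: str) -> bool:
    """Whether some state of the world makes this assumption fail silently."""
    return any(
        "false positive" in outcome
        for (a, _), outcome in outcomes.items()
        if a == assumption
    )

print(false_positive_possible("assume_facts"))     # True
print(false_positive_possible("assume_no_facts"))  # False
```

The asymmetry the next paragraph argues for is visible in the table: only the moral-facts-assuming row contains a branch where alignment can silently fail, while the moral-facts-denying row trades that risk for extra implementation difficulty in both branches.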

Based on this analysis, it seems that assuming the existence of moral facts, let alone assuming any particular moral facts, is more likely to produce false positives than assuming that moral facts do not exist, because denying the existence of moral facts gives up the pursuit of a class of alignment schemes that may fail, namely those that depend on the existence of moral facts. Doing so likely makes finding and implementing a successful alignment scheme harder, but it does so by trading difficulty tied to uncertainty around a metaphysical question that may never be resolved in favor of alignment for uncertainty around implementation issues that sufficient effort may overcome. Barring a result showing that moral nihilism (the assumption that no moral facts exist) implies the impossibility of building aligned AGI, it seems the best hinge proposition to hold in order to reduce false positives in AGI alignment due to meta-ethical uncertainty.


  • Nate Soares. “The Value Learning Problem.” In Ethics for Artificial Intelligence Workshop at the 25th International Joint Conference on Artificial Intelligence, 2016.

  • T. M. Scanlon. “Rawls on Justification.” In The Cambridge Companion to Rawls, ch. 3, p. 139. Cambridge University Press, 2003.

  • Amos Tversky. “Intransitivity of Preferences.” Psychological Review 76, 31–48. American Psychological Association, 1969.

  • Stuart Armstrong and Sören Mindermann. “Impossibility of Deducing Preferences and Rationality from Human Policy.” 2017.

  • David Hume. A Treatise of Human Nature. Oxford University Press, 1739.

  • Eliezer Yudkowsky. “Coherent Extrapolated Volition.” 2004.

  • Roderick M. Chisholm. The Foundations of Knowing. University of Minnesota Press, 1982.