Disambiguating “alignment” and related notions

I recently had an ongoing and undetected inferential gap with someone over our use of the term “value aligned”.

Holistic alignment vs. parochial alignment

I was talking about “holistic alignment”:
Agent R is holistically aligned with agent H iff R and H have the same terminal values.
This is the “classic AI safety (CAIS)” (as exemplified by Superintelligence) notion of alignment, and the CAIS view is roughly: “a superintelligent AI (ASI) that is not holistically aligned is an Xrisk”; this view is supported by the instrumental convergence thesis.
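As a rough sketch of how this might be formalized (my own gloss, assuming both agents’ terminal values can be represented as von Neumann–Morgenstern utility functions $U_R$ and $U_H$ over outcomes $\Omega$), holistic alignment would mean the two utility functions induce identical preferences, i.e. they agree up to a positive affine transformation:

$$\exists\, a > 0,\ b \in \mathbb{R}: \quad U_R(\omega) = a\, U_H(\omega) + b \quad \forall\, \omega \in \Omega.$$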

My friend was talking about “parochial alignment”. I’m lacking a satisfyingly crisp definition of parochial alignment, but intuitively, it refers to how you’d want a “genie” to behave:
R is parochially aligned with agent H and task T iff R’s terminal values are to accomplish T in accordance with H’s preferences over the intended task domain.
We both agreed that a parochially aligned ASI is not safe by default (it might paperclip), but that it might be possible to make one safe using various capability control mechanisms (for instance, anything that effectively restricts it to operating within the task domain).
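To make the intuition slightly more concrete (again a sketch of my own, assuming H’s task-relevant preferences over an intended task domain $D \subseteq \Omega$ can be summarized by a utility function $U_H^T$), parochial alignment would only constrain R’s values inside the domain:

$$U_R(\omega) = U_H^T(\omega) \quad \forall\, \omega \in D,$$

with $U_R$ indifferent to (or unconstrained over) what happens outside $D$. Optimizing such a $U_R$ can still have arbitrarily bad side effects outside the domain, which is the paperclipping worry, and it is why capability control that keeps R operating within $D$ is doing the safety work here.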

Sufficient alignment

We might further consider a notion of “sufficient alignment”:
R is sufficiently aligned with H iff optimizing R’s terminal values would not induce a nontrivial Xrisk (according to H’s definition of Xrisk).
For example, an AI whose terminal values are “maintain meaningful human control over the future” is plausibly sufficiently aligned. A sufficiently aligned ASI is safe in the absence of capability control. It’s worth considering what might constitute sufficient alignment, short of holistic alignment. For instance, Paul seems to argue that corrigible agents are sufficiently aligned. As another example, we don’t know how bad a perverse instantiation to expect from an ASI whose values are almost correct (e.g. within epsilon in L-infinity norm over possible futures).
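To unpack the “within epsilon in L-infinity norm” example (a sketch, again assuming values can be represented as utility functions over possible futures $\Omega$): “almost correct” values would mean

$$\|U_R - U_H\|_\infty \;=\; \sup_{\omega \in \Omega} \big|\, U_R(\omega) - U_H(\omega) \,\big| \;\le\; \epsilon,$$

and the open question is whether, or for what $\epsilon$, optimizing such a $U_R$ avoids nontrivial Xrisk by H’s lights, i.e. whether this condition constitutes sufficient alignment.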

Intentional alignment and “benignment”

Paul Christiano’s version of alignment I’m calling “intentional alignment”:
R is intentionally aligned with H if R is trying to do what H wants it to do.
Although it feels intuitive, I’m not satisfied with the crispness of this definition, since we don’t have a good way of determining a black-box system’s intentions. We can apply the intentional stance, but that doesn’t provide a clear way of dealing with irrationality.

Paul also talks about benign AI, which is about what an AI is optimized for (which is closely related to what it “values”). Inspired by this, I’ll define a complementary notion to Paul’s notion of alignment:
R is benigned with H if R is not actively trying to do something that H doesn’t want it to do.

Take-aways

1) Be clear what notion of alignment you are using.

2) There might be sufficiently aligned ASIs that are not holistically aligned.

3) Try to come up with crisper definitions of parochial and intentional alignment.