Disambiguating “alignment” and related notions
I recently discovered a long-running, undetected inferential gap between me and a friend over our use of the term “value aligned”.
Holistic alignment vs. parochial alignment
I was talking about “holistic alignment”:
Agent R is holistically aligned with agent H iff R and H have the same terminal values.
This is the “classic AI safety (CAIS)” notion of alignment (as exemplified by Superintelligence). The CAIS view is roughly: “a superintelligent AI (ASI) that is not holistically aligned is an Xrisk”. This view is supported by the instrumental convergence thesis.
My friend was talking about “parochial alignment”. I lack a satisfyingly crisp definition of parochial alignment, but intuitively, it refers to how you’d want a “genie” to behave:
R is parochially aligned with agent H and task T iff R’s terminal values are to accomplish T in accordance with H’s preferences over the intended task domain.
We both agreed that a parochially aligned ASI is not safe by default (it might paperclip), but that it might be possible to make one safe using various capability control mechanisms (for instance, anything that effectively restricts it to operating within the task domain).
We might further consider a notion of “sufficient alignment”:
R is sufficiently aligned with H iff optimizing R’s terminal values would not induce a nontrivial Xrisk (according to H’s definition of Xrisk).
For example, an AI whose terminal values are “maintain meaningful human control over the future” is plausibly sufficiently aligned. A sufficiently aligned ASI is safe in the absence of capability control. It’s worth considering what might constitute sufficient alignment, short of holistic alignment. For instance, Paul Christiano seems to argue that corrigible agents are sufficiently aligned. As another example, we don’t know how bad a perverse instantiation to expect from an ASI whose values are almost correct (e.g. within epsilon in L-infinity norm over possible futures).
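To make the “within epsilon in L-infinity norm” example concrete, here is a minimal toy sketch. It assumes (these are my assumptions, not part of the definitions above) that there are finitely many possible futures, that H and R each assign a single real-valued utility to every future, and that R simply picks the future it rates highest. Under those assumptions, if the two utility tables never differ by more than epsilon, the future R picks is at most 2 * epsilon worse, by H’s lights, than H’s best future. All names (`linf_distance`, `regret_under_h`, `sample_pair`) are hypothetical helpers for illustration.

```python
import random

def linf_distance(u_r, u_h):
    """Largest absolute disagreement between two utility tables over the same futures."""
    return max(abs(u_r[f] - u_h[f]) for f in u_h)

def regret_under_h(u_r, u_h):
    """H-utility lost when the future that maximizes R's utility is the one chosen."""
    chosen = max(u_r, key=u_r.get)
    return max(u_h.values()) - u_h[chosen]

def sample_pair(n_futures, epsilon, rng):
    """Toy data (an assumption): H's utilities in [0, 1], R's perturbed by at most epsilon."""
    u_h = {f: rng.random() for f in range(n_futures)}
    u_r = {f: u + rng.uniform(-epsilon, epsilon) for f, u in u_h.items()}
    return u_r, u_h

TOL = 1e-12  # slack for floating-point rounding
rng = random.Random(0)
epsilon = 0.05
worst = 0.0
for _ in range(1000):
    u_r, u_h = sample_pair(50, epsilon, rng)
    assert linf_distance(u_r, u_h) <= epsilon + TOL
    worst = max(worst, regret_under_h(u_r, u_h))

print(f"worst observed regret: {worst:.4f} (bound: {2 * epsilon:.4f})")
```

The bound only says something useful when epsilon is small relative to the differences in utility that matter to H, and when closeness really holds over every possible future, including the extreme ones an ASI might steer toward. The open question in the text is whether anything like that can be achieved or verified in practice; the sketch does not address that.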
Intentional alignment and “benignment”
I’m calling Paul Christiano’s version of alignment “intentional alignment”:
R is intentionally aligned with H if R is trying to do what H wants it to do.
Although it feels intuitive, I’m not satisfied with the crispness of this definition, since we don’t have a good way of determining a black box system’s intentions. We can apply the intentional stance, but that doesn’t provide a clear way of dealing with irrationality.
Paul also talks about benign AI, which concerns what an AI is optimized for (which is closely related to what it “values”). Inspired by this, I’ll define a complementary notion to Paul’s notion of alignment:
R is benigned with H if R is not actively trying to do something that H doesn’t want it to do.
Takeaways
1) Be clear what notion of alignment you are using.
2) There might be sufficiently aligned ASIs that are not holistically aligned.
3) Try to come up with crisper definitions of parochial and intentional alignment.