Re-Define Intent Alignment?
I think Evan’s Clarifying Inner Alignment Terminology is quite clever; it is better optimized than it may at first appear. However, I do think there are a couple of things which don’t work as well as they could:
What exactly does the modifier “intent” mean?
Based on how “intent alignment” is defined (basically, the optimal policy of its behavioral objective would be good for humans), capability robustness is exactly what it needs to combine with in order to achieve impact alignment. However, we could instead define “intent alignment” as “the optimal policy of the mesa objective would be good for humans”. In this case, capability robustness is not exactly what’s needed; instead, what I’ll provisionally call inner robustness (IE, strategies for achieving the mesa-objective generalize well) would be put in its place.
(I find myself flipping between these two views, and thereby getting confused.)
Furthermore, I would argue that the second alternative (making “intent alignment” about the mesa-objective) is more true to the idea of intent alignment. Making it about the behavioral objective turns it into a fact about the actual impact of the system, since “behavioral objective” is defined by looking at what the system actually accomplishes. But then, why the divide between intent alignment and impact alignment?
Any definition where “inner alignment” isn’t directly paired with “outer alignment” is going to be confusing for beginners.
In Evan’s terms, objective robustness is basically a more clever (more technically accurate and more useful) version of “the behavioral objective equals the outer objective”, whereas inner alignment is “the mesa-objective equals the outer objective”.
(It’s clear that “behavioral” is intended to imply generalization, here—the implication of objective robustness is supposed to be that the objective is stable under distributional shift. But this is obscured by the definition, which does not explicitly mention any kind of robustness/generalization.)
By making this distinction, Evan highlights the assumption that solving inner alignment will solve behavioral alignment: he thinks that the most important cases of catastrophic bad behavior are intentional (ie, come from misaligned objectives, either outer objective or inner objective).
In contrast, the generalization-focused approach puts less emphasis on the assumption that the worst catastrophes are intentional—which could be an advantage, if this assumption isn’t so good!
However, although I find the decomposition insightful, I dread explaining it to beginners in this way. I find that I would prefer to gloss over objective robustness and pretend that intent alignment simply factors into outer alignment and inner alignment.
I also find myself constantly thinking as if inner/outer alignment were a pair, intuitively!
My current proposal would be the following:
Re-define “intent alignment” to refer to the mesa-objective.
Now, inner alignment + outer alignment directly imply intent alignment, provided that there is a mesa-objective at all (IE, assuming that there’s an inner optimizer).
This fits with the intuitive picture that inner and outer are supposed to be complementary!
If we wish, we could replace “capability robustness” with (or re-define it as) “inner robustness”: the robustness of pursuit of the mesa-objective under distributional shift.
This is exactly what we need to pair with the new “intent alignment” in order to achieve impact alignment.
However, this is clearly a narrower concept than capability robustness (it assumes there is a mesa-objective).
This is a complex and tricky issue, and I’m eager to get thoughts on it.
There is a post which discusses Evan’s decomposition as the “objective-focused approach”, contrasting it with Rohin’s “generalization-focused approach”. My proposal would make the two diagrams more different from each other. I’m also interested in trying to merge the diagrams or otherwise “bridge the conceptual gap” between the two approaches.
As a reminder, here are Evan’s definitions. Nested children are subgoals; it’s supposed to be the case that if you can achieve all the children, you can achieve the parent.
Impact Alignment: An agent is impact aligned (with humans) if it doesn’t take actions that we would judge to be bad/problematic/dangerous/catastrophic.
Outer Alignment: An objective function is outer aligned if all models that perform optimally on it in the limit of perfect training and infinite data are intent aligned.
So we split impact alignment into intent alignment and capability robustness; we split intent alignment into outer alignment and objective robustness; and we achieve objective robustness through inner alignment.
Here’s what my proposed modifications do:
Inner Robustness: An agent is inner-robust if it performs well on its mesa-objective even in deployment/off-distribution.
Intent Alignment: An agent is intent aligned if the optimal policy for its mesa-objective is impact aligned with humans.
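One way to see the difference is to write both hierarchies down as goal-to-subgoal maps. The following is only an illustrative sketch: the node names come from the post, but the dictionary encoding and the helper names (`leaves`, `achieved`, `nodes`) are assumptions made for the example, and the proposed hierarchy silently assumes that a mesa-objective exists at all.

```python
# Each hierarchy maps a goal to the subgoals that are meant to jointly imply it
# ("if you can achieve all the children, you can achieve the parent").
# Node names follow the post; the encoding itself is an illustrative assumption.
EVAN = {
    "impact alignment": ["intent alignment", "capability robustness"],
    "intent alignment": ["outer alignment", "objective robustness"],
    "objective robustness": ["inner alignment"],
}

# Proposed version: intent alignment is about the mesa-objective, so it assumes
# there is an inner optimizer with a mesa-objective in the first place.
PROPOSED = {
    "impact alignment": ["intent alignment", "inner robustness"],
    "intent alignment": ["outer alignment", "inner alignment"],
}


def leaves(tree, goal):
    """Return the bottom-level subgoals that `goal` ultimately decomposes into."""
    children = tree.get(goal)
    if not children:
        return {goal}
    result = set()
    for child in children:
        result |= leaves(tree, child)
    return result


def achieved(tree, goal, established):
    """True if `goal` is established directly or all of its children are achieved."""
    if goal in established:
        return True
    children = tree.get(goal)
    return bool(children) and all(achieved(tree, c, established) for c in children)


def nodes(tree):
    """Every goal or subgoal that appears anywhere in the hierarchy."""
    return set(tree) | {c for cs in tree.values() for c in cs}


print(sorted(leaves(EVAN, "impact alignment")))
print(sorted(leaves(PROPOSED, "impact alignment")))
# Concepts that appear in Evan's hierarchy but not in the proposed one.
print(sorted(nodes(EVAN) - nodes(PROPOSED)))
```

In this encoding, inner and outer alignment sit side by side as the two children of intent alignment in the proposed hierarchy, whereas in Evan's hierarchy inner alignment sits one level lower, underneath objective robustness.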
“Objective Robustness” disappears from this, because inner alignment plus outer alignment now gives intent alignment directly. This is a bit of a shame, as I think objective robustness is an important subgoal. But I think the idea of objective robustness fits better with the generalization-focused approach:
Outer Alignment: For this approach, outer alignment is re-defined to be only on-training-distribution (we could call it “on-distribution alignment” or something).
And it’s fine for there to be multiple different subgoal hierarchies, since there may be multiple paths forward.