Clarifying inner alignment terminology
Here’s my diagram of how I think the various concepts should fit together:
The idea of this diagram is that the arrows are implications—that is, for any problem in the diagram, if its direct subproblems are solved, then it should be solved as well (though not necessarily vice versa). Thus, we get:
And here are all my definitions of the relevant terms which I think produce those implications:
(Impact) Alignment: An agent is impact aligned (with humans) if it doesn’t take actions that we would judge to be bad/problematic/dangerous/catastrophic.
And an explanation of each of the diagram’s implications:
: If a model is a mesa-optimizer, then its behavioral objective should match its mesa-objective, which means if it’s mesa-objective is aligned with the base, then it’s behavioral objective should be too.
: Outer alignment ensures that the base objective is measuring what we actually care about and objective robustness ensures that the model’s behavioral objective is aligned with that base objective. Thus, putting them together, we get that the model’s behavioral objective must be aligned with humans, which is precisely intent alignment.
: Intent alignment ensures that the behavioral objective is aligned with humans and capability robustness ensures that the model actually pursues that behavioral objective effectively—even off-distribution—which means that the model will actually always take aligned actions, not just have an aligned behavioral objective.
If a model is both outer and inner aligned, what does that imply?
Intent alignment. Reading off the implications from the diagram, we can see that the conjunction of outer and inner alignment gets us to intent alignment, but not all the way to impact alignment, as we’re missing capability robustness.
Can impact alignment be split into outer alignment and inner alignment?
No. As I just mentioned, the conjunction of both outer and inner alignment only gives us intent alignment, not impact alignment. Furthermore, if the model is not a mesa-optimizer, then it can be objective robust (and thus intent aligned) without being inner aligned.
Does a model have to be inner aligned to be impact aligned?
No—we only need inner alignment if we’re dealing with mesa-optimization. While we can get impact alignment through a combination of inner alignment, outer alignment, and capability robustness, the diagram tells us that we can get the same exact thing if we substitute objective robustness for inner alignment—and while inner alignment implies objective robustness, the converse is not true.
How does this breakdown distinguish between the general concept of inner alignment as failing “when your capabilities generalize but your objective does not” and the more specific concept of inner alignment as “eliminating the base-mesa objective gap?”
Only the more specific definition is inner alignment. Under this set of terminology, the more general definition instead refers to objective robustness, of which inner alignment is only a subproblem.
What type of problem is deceptive alignment?
Inner alignment—assuming that deception requires mesa-optimization. If we relax that assumption, then it becomes an objective robustness problem. Since deception is a problem with the model trying to do the wrong thing, it’s clearly an intent alignment problem rather than a capability robustness problem—and see here for an explanation of why deception is never an outer alignment problem. Thus, it has to be an objective robustness problem—and if we’re dealing with a mesa-optimizer, an inner alignment problem.
What type of problem is training a model to maximize paperclips?
Outer alignment—maximizing paperclips isn’t an aligned objective even in the limit of infinite data.
How does this picture relate to a more robustness-centric version?
The above diagram can easily be reorganized into an equivalent, more robustness-centric version, which I’ve included below. This diagram is intended to be fully compatible with the above diagram—using the exact same definitions of all the terms as given above—but with robustness given a more central role, replacing the central role of intent alignment in the above diagram.
Edit: Previously I had this diagram only in a footnote, but I decided it was useful enough to promote it to the main body.
The point of talking about the “optimal policy for a behavioral objective” is to reference what an agent’s behavior would look like if it never made any “mistakes.” Primarily, I mean this just in that intuitive sense, but we can also try to build a somewhat more rigorous picture if we imagine using perfect IRL in the limit of infinite data to recover a behavioral objective and then look at the optimal policy under that objective. ↩︎
What I mean by perfect training and infinite data here is for the model to always have optimal loss on all data points that it ever encounters. That gets a bit tricky for reinforcement learning, though in that setting we can ask for the model to act according to the optimal policy on the actual MDP that it experiences. ↩︎
Note that robustness as a whole isn’t included in the diagram as I thought it made it too messy. For an implication diagram with robustness instead of intent alignment, see the alternative diagram in the FAQ. ↩︎