Planned summary for the Alignment Newsletter:

Mesa optimization and inner alignment have become central topics in AI alignment since the <@2019 paper@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@) on the subject was published. However, there are two quite different interpretations of inner alignment concerns:
1. **Objective-focused:** This approach considers _structural_ properties of the computation executed by the learned model. In particular, the risk argument is that sufficiently capable learned models will be executing some form of optimization algorithm (such as a search algorithm), guided by an explicit objective called the mesa-objective. This mesa-objective may not be identical to the base objective (though it should incentivize similar behavior on the training distribution), which can then lead to bad behavior out of distribution.
The natural decomposition is then to separate alignment into two problems: first, how do we specify an outer (base) objective that incentivizes good behavior in all situations the model will ever encounter, and second, how do we ensure that the mesa-objective equals the base objective.
2. **Generalization-focused:** This approach instead talks about the behavior of the model out of distribution. The risk argument is that sufficiently capable learned models, when run out of distribution, will take actions that are still competent and high-impact, but that are not targeted at accomplishing what we want: in other words, their capabilities generalize, but their objectives do not.
Alignment can then be decomposed into two problems: first, how do we get the behavior that we want on the training distribution, and second, how do we ensure the model never behaves catastrophically on any input.
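As a toy sketch of the failure mode both framings are pointing at (this is my own illustration, not from the post; the coin-and-goal setup is loosely inspired by the CoinRun-style examples often used to discuss objective misgeneralization), consider a "learned" policy that internally runs a search guided by a mesa-objective that happens to match the base objective on the training distribution:

```python
def base_objective(state, goal):
    """What the designers want: reach the goal position."""
    return -abs(state - goal)

def mesa_objective(state, coin):
    """What the model actually learned to pursue: reach the coin."""
    return -abs(state - coin)

def policy(state, coin, horizon=10):
    """A toy mesa-optimizer: searches over actions to maximize the
    mesa-objective. The search itself is a competent, goal-directed
    computation regardless of which objective guides it."""
    best_action, best_value = None, float("-inf")
    for action in (-1, +1):
        value = mesa_objective(state + action * horizon, coin)
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Training distribution: the coin always sits at the goal (position 10),
# so maximizing the mesa-objective also maximizes the base objective.
print(policy(state=0, coin=10))   # +1: moves toward the goal (aligned)

# Off distribution: the coin is moved to -10 while the goal stays at 10.
# The search still runs competently, but the objective fails to
# generalize: capabilities generalize, the objective does not.
print(policy(state=0, coin=-10))  # -1: competent, but misaligned
```

The objective-focused framing asks why `mesa_objective` differs from `base_objective`; the generalization-focused framing asks only about the resulting off-distribution behavior, without assuming the model contains an explicit objective at all.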
Planned opinion:
I strongly prefer the second framing, though I'll note that this is not independent evidence: the description of the second framing in the post comes from some of my presentations, comments, and conversations with the authors. The post describes some of the reasons for this preference; I recommend reading it in full if you're interested in inner alignment.