Thanks for the interesting paper. I feel that the risks described are entirely plausible.
What is valuable for me in particular is that the paper re-casts many alignment risks that have already been discussed in a programmer-agent context into a new ‘inner alignment’ context. To quote the key description and separation of concerns:
In this post, we outline reasons to think that a mesa-optimizer may not optimize the same objective function as its base optimizer. Machine learning practitioners have direct control over the base objective function—either by specifying the loss function directly or training a model for it—but cannot directly specify the mesa-objective developed by a mesa-optimizer. We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.
That being said, I sometimes have trouble understanding how the paper defines, or does not define, the time-based relation between the base optimizer and the mesa-optimizer. I started out with a mental model where there is a one-time ‘batch’ creation operation in which the base optimizer creates the mesa-optimizer (or rather the agent which might contain a mesa-optimizer) by using simulations over a training set to compare the performance of candidate agents. The agent that scores best on the base objective is then run in the real world. However, some of Evan’s comments on mesa-optimization lead me to believe that there is sometimes a more real-time, continuous adjustment relation between the base optimizer and the agent that is created. I am unclear on whether this would create additional problems, or block certain solutions.
The base-to-mesa fidelity loss problem is similar to the problem where there is a loss of fidelity between a) what the programmers actually want and b) what they encode into the base objective. However, when considering fidelity loss between b) the base objective and c) the mesa-objective, I feel there is an important extra dimension. Unlike objective a), the objective function b) is by nature computable: it has to be computable or else the base optimizer cannot use it to select between candidates. But if the base objective function is computable at mesa-optimizer design time, it should typically also be computable at mesa-optimizer run time.
Say that the mesa-optimizer is trained to control a self-driving car, or a racing car in a video game. Then while the mesa-optimizer is driving, it should be possible to evaluate the quality of the driving by using the base objective function. Whenever the base objective function shows a very low value, a safety protocol can kick in, e.g. to stop the car. The threshold of ‘very low value’ can be calibrated using the values computed over the training set at design time.
(I can imagine some special cases where the base objective function is not computable while the mesa-agent runs, e.g. if the base objective function was created by hand-labeling all instances of the training set. But for many economically relevant scenarios, especially for agents that need to be good at ‘planning’, that is, good at optimizing sequences of actions that work towards a goal, I expect that the base objective will be perfectly computable in the real world.)
So overall, while I appreciate that the paper identifies and highlights inner alignment risks, my feeling is that the analysis provided is implicitly too pessimistic about the inner alignment problem. It seems to me that some very plausible and interesting risk mitigation options, options that leverage the availability of a computable base objective function, are not being identified. The obvious statement applies: future work to chart these options would be most welcome.
The simple reason why that doesn’t work is that computing the base objective function during deployment will likely require something like actually running the model, letting it act in the real world, and then checking how good that action was. But in the case of actions that are catastrophically bad, the fact that you can check after the fact that the action was in fact bad isn’t much consolation: just having the agent take the action in the first place could be exceptionally dangerous during deployment.
I believe that in your reply above you might have been thinking about a subset of all possible base objective functions: functions that compute a single ‘pass/fail’ value at a natural end of the training run, e.g. ‘the car never crashed’ or ‘all parts of the floor have been swept’. I was thinking of incrementally scoring objective functions, basically functions that sum utility increments achieved over time. So at any time during a run you can measure and compute the base objective function score up to that time. Monitoring this score should allow you to detect many forms of non-alignment between the base objective and the mesa-objective automatically.
As mentioned, I see this as a promising technique for risk mitigation; it is not supposed to be a watertight way to eliminate all risks. The technique considered only looks at the score achieved so far. It does not run models to extrapolate and score the long-term consequences of every action: this would indeed be difficult. While observed past good performance does not guarantee future good performance, an observation of past bad performance does give you a very useful safety signal.
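To make the monitoring idea concrete, here is a minimal Python sketch. It is an illustration under assumptions, not something taken from the paper: the function names `calibrate_threshold` and `monitor_run`, the 'mean minus k standard deviations' calibration rule, and the sliding window over recent utility increments are all hypothetical choices. It only looks at the score achieved so far and does not try to predict the consequences of future actions.

```python
from collections import deque
from statistics import mean, stdev
from typing import Iterable, Optional, Sequence


def calibrate_threshold(training_scores: Sequence[float], k: float = 3.0) -> float:
    """Derive a 'very low value' threshold from per-step scores seen in training.

    Assumption: anything more than k standard deviations below the
    training mean counts as 'very low'. Other rules are equally possible.
    """
    return mean(training_scores) - k * stdev(training_scores)


def monitor_run(
    utility_increments: Iterable[float],
    threshold: float,
    window: int = 5,
) -> Optional[int]:
    """Return the step at which the safety protocol would trigger, or None.

    utility_increments: the base objective's utility increment at each step
    of the run so far (assumed to be computable at run time).
    The check averages the last `window` increments, so it reacts to a
    recent drop in the score rather than to a single noisy step.
    """
    recent = deque(maxlen=window)
    for step, increment in enumerate(utility_increments):
        recent.append(increment)
        if len(recent) == window and mean(recent) < threshold:
            return step  # caller would invoke the safety protocol here
    return None
```

A caller would compute the threshold once at design time from the training runs, then feed the live increments into `monitor_run` (or an equivalent streaming loop) and stop the agent when a step index is returned. As noted above, this detects bad performance only after some of it has occurred, so it mitigates rather than eliminates the risk.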