Is the term mesa optimizer too narrow?
In the post introducing mesa optimization, the authors defined an optimizer as
a system [that is] internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.
The paper continues by defining a mesa optimizer as an optimizer that was selected by a base optimizer.
However, there are a number of issues with this definition, as some have already pointed out.
First, I think by this definition humans are clearly not mesa optimizers. Most optimization we do is implicit. Yet, humans are the supposed to be the prototypical examples of mesa optimizers, which appears be a contradiction.
Second, the definition excludes perfectly legitimate examples of inner alignment failures. To see why, consider a simple feedforward neural network trained by deep reinforcement learning to navigate my Chests and Keys environment. Since “go to the nearest key” is a good proxy for getting the reward, the neural network simply returns the action, that when given the board state, results in the agent getting closer to the nearest key.
Is the feedforward neural network optimizing anything here? Hardly, it’s just applying a heuristic. Note that you don’t need to do anything like an internal A* search to find keys in a maze, because in many environments, following a wall until the key is within sight, and then performing a very shallow search (which doesn’t have to be explicit) could work fairly well.
As far as I can tell, Hjalmar Wijk introduced the term “malign generalization” to describe the failure mode that I think is most worth worrying about here. In particular, malign generalization happens when you trained a system with objective function X, that at deployment has the actual outcome of doing Y, where Y is so bad that we’d prefer the system to fail completely. To me at least, this seems like a far more intuitive and less theory-laden way of framing inner alignment failures.
This way of reframing the issue allows us to keep the old terminology that we are concerned with capability robustness without alignment robustness, but drops all unnecessary references to mesa optimization.
Mesa optimizers could still form a natural class of things that are prone to malign generalization. But if even humans are not mesa optimizers, why should we expect mesa optimizers to be the primary real world examples of such inner alignment failures?