[Question] Why is pseudo-alignment “worse” than other ways ML can fail to generalize?

I have only just read the mesa-optimizers paper, and I don’t understand what it adds to the pre-existing picture that “ML can fail to generalize outside the train distribution and this is bad.”

The discussion in the paper generally assumes a background distinction between “training” and “deployment” (or an equivalent distinction), and discusses models that succeed on the base objective during “training” but not during “deployment.”

In the sections about “deception,” this happens in a special way quite unlike the ordinary failures to generalize that we see in ML today (and arises under correspondingly exotic conditions). But, in cases other than “deception,” the paper describes dynamics and outcomes that seem identical to the ordinary generalization problem in ML:

  • Training finds a model that scores well on the base objective, assessed over the training distribution

  • But, this model may not score well on the base objective, assessed over other distributions

For example, the following is just a textbook generalization failure, which can happen with or without a mesa-optimizer:

For a toy example of what pseudo-alignment might look like, consider an RL agent trained on a maze navigation task where all the doors during training happen to be red. Let the base objective (reward function) be $O_{\text{base}} =$ (1 if reached a door, 0 otherwise). On the training distribution, this objective is equivalent to $O_{\text{mesa}} =$ (1 if reached something red, 0 otherwise). Consider what would happen if an agent, trained to high performance on $O_{\text{base}}$ on this task, were put in an environment where the doors are instead blue, and with some red objects that are not doors. It might generalize on $O_{\text{base}}$, reliably navigating to the blue door in each maze (robust alignment). But it might also generalize on $O_{\text{mesa}}$ instead of $O_{\text{base}}$, reliably navigating each maze to reach red objects (pseudo-alignment).
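
To spell out the structure of that example (a minimal sketch of my own, not code from the paper; the names are invented for illustration), here is the sense in which the two objectives are indistinguishable on the training distribution and only come apart off it:

```python
# Toy reduction of the red-door example (my construction, not the paper's code).
# o_base rewards reaching a door; o_mesa rewards reaching something red.

def o_base(reached):
    return 1 if reached["is_door"] else 0   # 1 iff the reached object is a door

def o_mesa(reached):
    return 1 if reached["is_red"] else 0    # 1 iff the reached object is red

# Training distribution: every door is red and nothing else is red.
train_outcomes = [
    {"is_door": True,  "is_red": True},    # reached a (red) door
    {"is_door": False, "is_red": False},   # reached a plain wall tile
]

# Deployment distribution: doors are blue, and some red objects are not doors.
test_outcomes = [
    {"is_door": True,  "is_red": False},   # a blue door
    {"is_door": False, "is_red": True},    # a red non-door
]

# In training the two objectives are the same function of the outcome;
# at deployment an agent pursuing o_mesa no longer scores well on o_base.
assert all(o_base(o) == o_mesa(o) for o in train_outcomes)
assert any(o_base(o) != o_mesa(o) for o in test_outcomes)
```

Nothing in this sketch says anything about whether the learned policy contains an optimizer; it is just the usual story of a proxy that happens to coincide with the target on the training distribution.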

Additionally, when the paper makes remarks that seem to be addressing my question, I find these remarks confused (or perhaps just confusing).

For instance, in this remark

The possibility of mesa-optimizers has important implications for the safety of advanced machine learning systems. When a base optimizer generates a mesa-optimizer, safety properties of the base optimizer may not transfer to the mesa-optimizer.

I don’t understand what “safety properties of the base optimizer” could be, apart from facts about the optima it tends to produce. That is, I can’t think of a property that appears to confer “safety” only so long as we ignore the possibility of producing mesa-optimizers, and stops appearing to do so once we take that possibility into account.

A safety property of an optimizer is some kind of assurance about the properties of the optima; if such a property only holds for a subset of optima (the ones that are not mesa-optimizers), we’ll see this appear mathematically in the definition of the property or in theorems about it, whether or not we have explicitly considered the possibility of mesa-optimizers. (I suppose the argument could be that some candidate safety properties implicitly assume no optimum is a mesa-optimizer, and thus appear to apply to all optima while not really doing so—somewhat analogous to early notions of continuity which implicitly assumed away the Weierstrass function. But if so, I need a real example of such a case to convince me.)
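
To make the kind of property I mean concrete (my own schematic notation, not the paper’s), a safety property of the base optimizer would be a claim quantified over the models it can return, something like

\[
\forall\, \theta \in \operatorname{Opt}\!\big(O_{\text{base}},\, D_{\text{train}}\big):\; P(\theta),
\]

where $\operatorname{Opt}(O_{\text{base}}, D_{\text{train}})$ is the set of models the base optimizer can produce when trained on $O_{\text{base}}$ over the training distribution $D_{\text{train}}$, and $P$ is whatever property is being guaranteed. If some of those models are mesa-optimizers for which $P$ fails, then the universal claim is simply false, and that failure shows up in the (missing) proof whether or not anyone has used the word “mesa-optimizer.”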

The following seems to offer a different answer to my question:

Pseudo-alignment, therefore, presents a potentially dangerous robustness problem since it opens up the possibility of a machine learning system that competently takes actions to achieve something other than the intended goal when off the training distribution. That is, its capabilities might generalize while its objective does not.

This seems to contrast two ways of failing on samples from a non-training distribution. Supposing a model has learned to “understand” train samples and use that understanding to aim for a target, it can then

  • fail to understand non-train samples, thus losing the ability to aim for any target (capabilities fail to generalize)

  • understand non-train samples and aim for its internalized target, which matched the base target in training but does not match it here (objective fails to generalize)

But even without mesa-optimizers, cases of ML generalization failure often involve the latter, not just the former. A dumb classifier with no internal search, trained on the red-door setting described above, would “understand” the blue-door test data well enough to apply its internalized objective perfectly; even in this textbook-like case, exhibited by arbitrarily simple classifiers, the “capabilities generalize.” This kind of problem is definitely bad; I just don’t see what’s new about it.
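
To be concrete about that last claim (a toy construction of my own; the feature names are invented for illustration), here is a depth-1 decision stump, about as far from an internal search process as a model can get, showing exactly the “capabilities generalize, objective doesn’t” pattern:

```python
# A toy version of the "dumb classifier" case (my construction, not the paper's):
# a depth-1 decision stump has no internal search, yet it still shows the
# "capabilities generalize, objective doesn't" pattern.
from sklearn.tree import DecisionTreeClassifier

# Features: [is_red, is_tall, has_handle]; a door is something tall with a handle.
X_train = [
    [1, 1, 1],  # red door
    [1, 1, 1],  # red door
    [0, 1, 0],  # tall wall segment (not a door)
    [0, 0, 1],  # handle-shaped bracket (not a door)
    [0, 0, 0],  # plain wall
]
y_train = [1, 1, 0, 0, 0]  # 1 = door, 0 = not a door

# "is_red" is the only single feature that separates the training data perfectly,
# so the stump latches onto it.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)

# Deployment distribution: doors are blue, and there are red non-doors.
X_test = [
    [0, 1, 1],  # blue door
    [1, 0, 0],  # red decoy object
]
print(stump.predict(X_test))  # prints [0 1]: misses the blue door, chases the red decoy
```

The stump applies its learned rule to the shifted inputs flawlessly and confidently; what fails to generalize is the proxy it internalized, not any capability.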