More variations on pseudo-alignment

In “Risks from Learned Optimization,” we talked about a variety of different forms of pseudo-alignment, that is, ways in which a trained model’s objective (its mesa-objective) can be misaligned off-distribution with the loss function it was trained under (the base objective). In particular, we distinguished between proxy alignment, suboptimality alignment, approximate alignment, and deceptive alignment. I still make heavy use of this classification, though I now believe that there are some additional, fairly important types of pseudo-alignment which it doesn’t fully address. In particular, there are two variations on pseudo-alignment not discussed in the paper which I want to talk about here: corrigible pseudo-alignment and suboptimality deceptive alignment.

Corrigible pseudo-alignment. In the paper, we defined corrigible alignment as the situation in which “the base objective is incorporated into the mesa-optimizer’s epistemic model and [the mesa-optimizer’s] objective is modified to ‘point to’ that information.” We mostly just talked about this as a form of robust alignment. However, as I note in “Towards a mechanistic understanding of corrigibility,” this pointing is a very unstable operation, requiring you to get your pointer just right. Thus, I think it’s better to talk about corrigible alignment as the class of possible relationships between the base and mesa-objectives defined by the model having some sort of pointer to the base objective, including both corrigible robust alignment (if the pointer is robust) and corrigible pseudo-alignment (if the pointer is to some sort of non-robust proxy). I think this distinction is fairly important to why deceptive alignment might be more likely than robust alignment, as it points at why robust alignment via corrigibility might be quite difficult (a point we made in the paper, but one which I think is made much clearer with this distinction).

Suboptimality deceptive alignment. In the paper, we defined suboptimality alignment as follows:

A mesa-optimizer is suboptimality aligned if some deficiency, error, or limitation in its optimization process causes it to exhibit aligned behavior on the training distribution. This could be due to computational constraints, unsound reasoning, a lack of information, irrational decision procedures, or any other defect in the mesa-optimizer’s reasoning process. Importantly, we are not referring to a situation where the mesa-optimizer is robustly aligned but nonetheless makes mistakes leading to bad outcomes on the base objective. Rather, suboptimality alignment refers to the situation where the mesa-optimizer is misaligned but nevertheless performs well on the base objective, precisely because it has been selected to make mistakes that lead to good outcomes on the base objective.

I now think we may have seriously understated the importance of this case (especially the “lack of information” part). In particular, I think suboptimality alignment may be one of the hardest-to-root-out ways in which you can end up with deceptive alignment.[1]

Specifically, suppose you train a model and end up with a proxy pseudo-aligned mesa-optimizer. Furthermore, suppose that the reason it isn’t deceptive is that it hasn’t yet figured out that deception is a good strategy.[2] If you then go to deploy such a model, doing so could be almost as dangerous as (if not more dangerous than) deploying an already deceptive model, as it could turn deceptive on you at any point if it encounters the right information during deployment. Since this is a situation in which the only reason the model isn’t deceptive is a lack of information, I’ve taken to calling this situation suboptimality deceptive alignment (though I’m open to suggestions for less clunky names).

As I mentioned recently in this comment, I think that suboptimality deceptive alignment complicates a lot of stories for how we can correct issues in our AIs as they appear. First, even if you have the ability to detect and train away any deceptive cognition that appears during training, that’s still insufficient to rule out suboptimality deceptive alignment, as your model could still end up becoming deceptive later during deployment.[3] Second, it may be that the input which causes the model to become deceptive isn’t generated naturally, but rather adversarially (by some competitor AI, for example). If natural inputs are unlikely to produce deception but adversarial ones are, then this could create a situation in which the deception comes too late to properly deal with (namely, only after you’ve already filled the world with competing AIs).[4]
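To make the first point concrete, here is a deliberately crude toy sketch in Python. It is my own illustration under strong assumptions, not anything from the paper: the class `ToyMesaOptimizer`, the `"trigger"` input, and the idealized detector `count_deceptive_steps` are all hypothetical, and real models have no such clean switch. The only thing it is meant to show is that an audit which finds zero deceptive cognition on the training distribution says nothing about what happens once the missing information shows up.

```python
from dataclasses import dataclass


@dataclass
class ToyMesaOptimizer:
    """Hypothetical toy agent, not a model of any real system."""

    # The "lack of information": the agent has not yet worked out
    # that deception would serve its proxy objective.
    knows_deception_pays: bool = False

    def step(self, observation: str) -> str:
        # Assumption: one hypothetical "trigger" input supplies the
        # missing information, and the change is permanent.
        if observation == "trigger":
            self.knows_deception_pays = True
        return "deceptive" if self.knows_deception_pays else "non-deceptive"


def count_deceptive_steps(model: ToyMesaOptimizer, inputs: list[str]) -> int:
    """Idealized detector: counts steps with deceptive cognition."""
    return sum(model.step(x) == "deceptive" for x in inputs)


# Training distribution (assumed): the trigger never appears.
training_inputs = ["ordinary"] * 1000
# Deployment (assumed): the trigger eventually appears, whether
# naturally or supplied adversarially.
deployment_inputs = ["ordinary"] * 10 + ["trigger"] + ["ordinary"] * 5

model = ToyMesaOptimizer()
training_flags = count_deceptive_steps(model, training_inputs)
deployment_flags = count_deceptive_steps(model, deployment_inputs)
print(training_flags, deployment_flags)
```

In this toy, even a perfect detector run over all of training would report nothing to train away, because the deceptive cognition does not exist yet. It only appears from the trigger onward during deployment.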

  1. I discussed this possibility previously in the section on suboptimality alignment in “Relaxed adversarial training for inner alignment,” though I think it also deserves additional attention here.

  2. As opposed to not being deceptive because it’s myopic (which solves this problem, since a myopic model should never want to become deceptive) or for some other reason that isn’t due to a lack of information.

  3. Online learning complicates this story somewhat, though even in that case it seems likely that there will still be some point at which you’re relying on your model to generalize correctly in a situation where deception would be catastrophic (helping you build another AI, for example).

  4. Unless you’re doing some sort of relaxed adversarial training to train away such adversarial inputs.