Therefore, we could try to revise the framework slightly, essentially omitting the notions of robust alignment and ‘internalization of the base objective’ and focusing instead on revised versions of ‘proxy alignment’ and ‘approximate alignment’ as descriptors of what is essentially the best possible situation in terms of alignment.
Have you seen Hubinger’s more recent post, “More variations on pseudo-alignment”? It amends the list of pseudo-alignment types originally given in “Risks of Learned Optimization” to include a couple more.
Your claim above that the best we could hope for may be a form of proxy alignment or approximate alignment reminds me of the following pseudo-alignment type he introduced in that more recent post. In his description of this type, he also seems to agree with you that robust alignment is very difficult or “unstable” (though perhaps you go further in saying it’s impossible):
Corrigible pseudo-alignment. In the paper, we defined corrigible alignment as the situation in which “the base objective is incorporated into the mesa-optimizer’s epistemic model and [the mesa-optimizer’s] objective is modified to ‘point to’ that information.” We mostly just talked about this as a form of robust alignment—however, as I note in “Towards a mechanistic understanding of corrigibility,” this is a very unstable operation, requiring you to get your pointer just right. Thus, I think it’s better to talk about corrigible alignment as the class of possible relationships between the base and mesa-objectives defined by the model having some sort of pointer to the base objective, including both corrigible robust alignment (if the pointer is robust) and corrigible pseudo-alignment (if the pointer is to some sort of non-robust proxy). In particular, I think this distinction is fairly important to why deceptive alignment might be more likely than robust alignment, as it points at why robust alignment via corrigibility might be quite difficult (which is a point we made in the paper, but one which I think is made much clearer with this distinction).
Nice post.
Thanks for the comments and pointers!