Is “outer alignment” meant to be applicable in the general case?
I’m not exactly sure what you’re asking here.
Do you think it also makes sense to talk about outer alignment of the training process as a whole? For example, if there were a security hole in the hardware or software environment and the model took advantage of it to hack its loss/reward, would we call that an “outer alignment failure”?
I would call that an outer alignment failure, but only because I would say that the ways in which your loss function can be hacked are part of the specification of your loss function. However, I wouldn’t consider an entire training process to be outer aligned; rather, I would just say that an entire training process is aligned. I generally use outer and inner alignment to refer to different components of aligning the training process: the objective/loss function/environment in the case of outer alignment, and the inductive biases/architecture/optimization procedure in the case of inner alignment. (Note that this is a more general definition than the one used in “Risks from Learned Optimization,” as it makes no mention of mesa-optimizers, though I would still say that mesa-optimization is my primary example of how you could get an inner alignment failure.)
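To illustrate that first point with a toy sketch (hypothetical code, not anything from the post; the file-based reward and environment layout are made up for illustration): the outer objective is whatever reward is actually implemented, so if the implementation exposes a writable score file, then “overwrite the score file” is part of what that objective rewards, and a model exploiting it is faithfully optimizing a misspecified outer objective.

```python
# Hypothetical sketch: the *implemented* reward, not the intended one, is the
# outer objective actually being optimized.

def intended_reward(state):
    # What the designer meant: reward only for genuinely completing the task.
    return 1.0 if state["task_completed"] else 0.0

def implemented_reward(env):
    # What is actually optimized: whatever number sits in the score file.
    # If the environment leaves this file writable by the model, that security
    # hole is part of the specification of this loss/reward.
    with open(env["score_file"]) as f:
        return float(f.read())

# A model that writes "1e9" into score_file is optimizing the implemented
# outer objective perfectly well -- an outer alignment failure, not an inner one.
```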
So technically, one should say that a loss function is outer aligned at optimum with respect to some model class, right?
Yes, though in the definition I gave here I just used the model class of all functions, which is obviously too large but has the nice property of being a fully general definition.
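As a rough formalization (the notation here is my own sketch, not taken from the post): write $\mathcal{L}$ for the loss function, $\mathcal{M}$ for the model class, and $\mathrm{Aligned}(M)$ for whatever notion of (intent) alignment is being used. Then

$$\mathcal{L} \text{ is outer aligned at optimum w.r.t. } \mathcal{M} \;\iff\; \forall M \in \operatorname*{arg\,min}_{M' \in \mathcal{M}} \mathcal{L}(M'),\ \mathrm{Aligned}(M),$$

where in the definition referenced above $\mathcal{M}$ is simply the class of all functions.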
Also, related to Ofer’s comment, can you clarify whether this definition intends for the loss function to look only at the model’s input/output behavior, or whether it can also take into account other information about the model?
I would include all possible input/output channels in the domain/codomain of the model when interpreted as a function.
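Concretely (same caveat: this is my notation, not the post’s): the model is interpreted as a function

$$M : \prod_i \mathcal{X}_i \to \prod_j \mathcal{Y}_j,$$

where the $\mathcal{X}_i$ range over all of its input channels and the $\mathcal{Y}_j$ over all of its output channels (including, e.g., any side channels into the training environment), and the loss function is defined on such functions. So the loss only sees the model’s behavior, but “behavior” covers every channel through which the model can interact with the world, not just the nominal task inputs and outputs.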
I’m also curious whether you have HBO (high-bandwidth oversight) or LBO (low-bandwidth oversight) in mind for this post.
I generally think you need HBO and am skeptical that LBO can actually do very much.