making sure that if you train on such a penalty the resulting model will actually learn to always behave according to it (I see this as the hardest part).
I agree that this is a huge problem if the penalty is applied at the level of the base optimizer. I see some promise in the approach of explicitly whitelisting mesa optimizers that are designed to be safe. I discussed this in more detail in my post yesterday, so I will quote directly from it:
To oversimplify a bit, there are a few ways we could ameliorate the issue of misaligned mesa optimization. One is to find a way to robustly align arbitrary mesa objectives with base objectives. I am somewhat pessimistic about this strategy working without radical insights, because it currently seems really hard: succeeding would amount to solving a huge chunk of the alignment problem.
Alternatively, we could whitelist our search space so that only certain optimizers, designed to be safe, can be discovered. This is a task where I think impact measures could be helpful.
When we do some type of search over models, we could construct an explicit optimizer that forms the core of each model. The parameters we perform gradient descent over would need to be constrained enough that we could transparently see what kind of “utility function” is being inner optimized, but not so constrained that the model search itself becomes useless.
If we could constrain and control this space of optimizers enough, then we should be able to explicitly add safety precautions to these mesa objectives. Exactly how this would be done is difficult for me to imagine. Still, I think that as long as we can place some explicit constraint on what type of optimization is allowed, it should be possible to penalize mesa optimizers in a way that could potentially avoid catastrophe.
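To make the idea concrete, here is a minimal, purely hypothetical sketch of what a transparent optimizer core might look like. All names here (the feature set, the whitelist, the linear utility form) are illustrative assumptions, not a proposal for how this would actually be implemented: the only learned parameters are weights over named features, the argmax “optimizer core” is fixed and auditable, and a whitelist check flags any weight placed on a non-whitelisted feature.

```python
# Hypothetical sketch: the model's only learned parameters are the
# weights of a transparent linear "utility function" over named
# features; the optimizer core (argmax over candidate actions) is
# fixed code that we can audit directly.

FEATURES = ["reward_proxy", "resource_acquisition", "self_preservation"]
WHITELIST = {"reward_proxy"}  # features the mesa objective is allowed to weight


def utility(weights, action_features):
    """Transparent inner objective: a linear function of named features."""
    return sum(weights[f] * action_features[f] for f in FEATURES)


def choose_action(weights, candidate_actions):
    """Fixed, auditable optimizer core: pick the utility-maximizing action."""
    return max(candidate_actions, key=lambda a: utility(weights, a))


def whitelist_violation(weights, tol=1e-6):
    """Audit step: report any significant weight on a non-whitelisted feature."""
    return {f: w for f, w in weights.items()
            if f not in WHITELIST and abs(w) > tol}


weights = {"reward_proxy": 1.0, "resource_acquisition": 0.3,
           "self_preservation": 0.0}
print(whitelist_violation(weights))  # → {'resource_acquisition': 0.3}
```

The point of the sketch is only that, because the inner objective is explicit, the audit is a trivial inspection of parameters rather than an interpretability problem over an opaque network.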
During training, the model will start out unaligned and gradually shift toward performing better on the base objective. At no point during training would we want the model to attempt anything extremely impactful, both because it is initially unaligned, and because we are uncertain about the safety of the trained model itself. An impact penalty could thus help us create a safe testing environment.
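The penalized objective described above can be sketched as follows. This is a toy illustration under stated assumptions, not a real impact measure: `impact` here is just total deviation from a no-op baseline state, and `lam` is a hypothetical coefficient trading task performance against impact.

```python
# Hypothetical sketch: an impact-penalized training objective. The
# impact term stands in for any impact measure; here it is the total
# absolute deviation of the environment state from a no-op baseline.

def impact(state, baseline_state):
    """Toy impact measure: total deviation from the no-op baseline."""
    return sum(abs(s - b) for s, b in zip(state, baseline_state))


def penalized_loss(task_loss, state, baseline_state, lam=1.0):
    """Total training loss = task loss + lam * impact penalty."""
    return task_loss + lam * impact(state, baseline_state)


baseline = [0.0, 0.0]
low_impact = penalized_loss(0.5, [0.1, 0.0], baseline)   # 0.5 + 0.1 = 0.6
high_impact = penalized_loss(0.2, [2.0, 1.0], baseline)  # 0.2 + 3.0 = 3.2
```

The second case shows the intended effect: a policy that achieves a lower task loss through high-impact behavior can still come out worse under the penalized objective, discouraging extreme actions from a model we do not yet trust.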