It seems to me that a mesa-optimizer could be aligned with the same methods used to align a base model. If so, the distinction between the “can create” and “can control” steps would be minimized. What concerns me more is how we would detect the existence of mesa-optimizers, especially if they emerge randomly.