If you have something that’s smart enough to figure out the training environment, the dynamics of gradient descent, and its own parameters, then yes, I expect it could do a pretty good job at preserving its goals while being modified. But that’s explicitly not what we have here. An agent that isn’t smart enough to not know how to trick out your training process so it doesn’t get modified to have human values probably also isn’t smart enough to then tell you how to preserve these values during further training.
Or at least, the chance of the intelligence thresholds working out like that does not sound to me like something you want to base a security strategy on.
If you have something that’s smart enough to figure out the training environment, the dynamics of gradient descent, and its own parameters, then yes, I expect it could do a pretty good job at preserving its goals while being modified. But that’s explicitly not what we have here. An agent that isn’t smart enough to not know how to trick out your training process so it doesn’t get modified to have human values probably also isn’t smart enough to then tell you how to preserve these values during further training.
Or at least, the chance of the intelligence thresholds working out like that does not sound to me like something you want to base a security strategy on.