I think I mostly agree with what you’re saying here, except that I do have some concerns about applying a corrigibility layer to a learned model after the fact. Here are some ways in which I could see that going wrong:
1. The learned model might be so opaque that you can’t figure out how to apply corrigibility unless you have some way of purposefully training it to be transparent in the right way.
2. If you aren’t applying corrigibility during training, then your model could act dangerously during the training process.
3. If corrigibility is applied as a separate layer, the model could figure out how to disable it.
I think that if you don’t have some way of training your model to be transparent, you’re not going to be able to make it corrigible after the fact; and if you do have a way of training your model to be transparent, then you might as well apply corrigibility during training rather than at the end. Furthermore, ensuring your model is always corrigible during training can itself help with the task of making it transparent, since you can use your corrigible model to work on making itself more transparent. In particular, I mostly think of informed oversight via an amplified overseer as how you would do something like that.
2. If you aren’t applying corrigibility during training, then your model could act dangerously during the training process.
I agree for certain types of training. To clarify: I wrote the above while thinking about a type of base-optimiser that constructs a mesa-optimiser by running many candidate mesa-optimisers in a virtual environment, and then picking the highest-scoring one. If the training process involves running in a real environment, additional safety measures need to be considered.
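To make the type of training process I have in mind concrete, here is a minimal sketch in Python. All names (`VirtualEnv`, `base_optimiser`, the toy guessing task) are hypothetical illustrations, not anything from the paper: the point is only that every candidate mesa-optimiser runs inside a sandboxed virtual environment, so nothing acts in the real world during training.

```python
import random

class VirtualEnv:
    """Toy sandboxed environment: a candidate policy guesses a hidden target."""
    def __init__(self, target=7, seed=0):
        self.target = target
        self.rng = random.Random(seed)

    def run_episode(self, candidate):
        guess = candidate(self.rng.random())
        return -abs(guess - self.target)  # higher score = closer guess

def evaluate(candidate, env, episodes=10):
    """Average score over several sandboxed rollouts."""
    return sum(env.run_episode(candidate) for _ in range(episodes)) / episodes

def base_optimiser(candidates, env):
    """Run every candidate mesa-optimiser in the virtual environment and
    return the highest-scoring one; no candidate ever acts outside `env`."""
    return max(candidates, key=lambda c: evaluate(c, env))

# Candidate mesa-optimisers: constant-guess policies that ignore their input.
candidates = [lambda x, b=b: b for b in range(10)]
best = base_optimiser(candidates, VirtualEnv(target=7))
```

The safety-relevant property is structural: the only channel by which a candidate influences anything during training is its score inside `env`.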
3. If corrigibility is applied as a separate layer, the model could figure out how to disable it.
Well, the interesting thing here is that the button-corrigible agent produced by the software transformation may be fully aware that there exist actions it can take to disable the corrigibility layer, but this does not matter. The software transformation not only adds the layer, it also adds an incentive to preserve the layer, so the agent will not take any actions to disable it. (This broad statement is no longer true if the agent succumbs to an attack in a hostile universe; see e.g. section 7.2 of the paper.)
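The "adds the layer plus an incentive to preserve it" idea can be sketched as a utility transformation. This is a hedged toy illustration, not the exact construction from the paper: all names (`base_utility`, `corrigible_utility`, the action labels, the 0.1 margin) are hypothetical, and the real construction uses carefully balanced terms rather than an ad-hoc penalty.

```python
from dataclasses import dataclass

@dataclass
class State:
    button_pressed: bool

def base_utility(state, action):
    """The learned model's original objective (toy stand-in, ignores state).
    Note it strictly prefers disabling the layer."""
    return {"work": 1.0, "shutdown": 0.0, "disable_layer": 1.5}[action]

def corrigible_utility(state, action):
    """Hypothetical corrigibility-layer transformation: after the button is
    pressed, only shutting down pays; before that, disabling the layer is
    scored strictly below the best layer-preserving action, so the agent has
    no incentive to tamper even though it knows tampering is possible."""
    if state.button_pressed:
        return 1.0 if action == "shutdown" else 0.0
    if action == "disable_layer":
        best_preserving = max(base_utility(state, a) for a in ("work", "shutdown"))
        return best_preserving - 0.1  # hypothetical margin: tampering never wins
    return base_utility(state, action)

def pick_action(state, utility, actions=("work", "shutdown", "disable_layer")):
    return max(actions, key=lambda a: utility(state, a))
```

Under `base_utility` the agent disables the layer; under the transformed utility it works normally until the button is pressed and then shuts down, despite knowing the disabling action exists.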