[Question] What is corrigibility? / What are the right background readings on it?

In a related question, “What are some good examples of incorrigibility?”, Ryan Carey provides the following overview:

The idea of corrigibility is roughly that an AI should be aware that it may have faults, and therefore allow and facilitate human operators to correct these faults. I’m especially interested in scenarios where the AI system controls a particular input channel that is supposed to be used to control it, such as a shutdown button, a switch used to alter its mode of operation, or another device used to control its motivation.
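To make that distinction concrete before getting to the readings, here is a deliberately toy sketch (my own, in Python; every name in it is hypothetical and not drawn from any paper) contrasting an agent that defers to its shutdown channel with one that optimizes around it:

```python
# Toy illustration only. ShutdownButton, corrigible_step, and
# incorrigible_step are made-up names for the sake of the example.

from dataclasses import dataclass


@dataclass
class ShutdownButton:
    """A control channel the operators are supposed to be able to use."""
    pressed: bool = False


def corrigible_step(button: ShutdownButton, pursue_goal, halt):
    """A corrigible agent treats the button as authoritative:
    if the operators press it, the agent halts rather than
    working around the signal."""
    if button.pressed:
        halt()
    else:
        pursue_goal()


def incorrigible_step(button: ShutdownButton, pursue_goal, halt):
    """An incorrigible agent controls its own off-switch:
    it clears the inconvenient signal and keeps pursuing
    its original goal."""
    button.pressed = False  # disable the operators' control channel
    pursue_goal()
```

The interesting question in the literature is not how to hard-code the first behavior, but why a goal-directed agent tends toward the second by default and what kind of objective avoids that.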

What is a more detailed understanding of corrigibility, and what are the right things to read? I believe there is at least one MIRI paper and some Arbital posts on the topic. I’m writing this question to center my inquiry.
