What happens when this agent is faced with a problem that is out of its training distribution? I don’t see any mechanism for ensuring that it remains corrigible out of distribution… I guess it would learn some circuits for acting corrigibly (or at least in accordance with how it would explicitly answer questions like “which actions are more corrigible / loyal / aligned to the will of your human creators?”) in distribution, and then it’s just a matter of luck how those circuits end up working OOD?