What happens when this agent is faced with a problem that is out of its training distribution? I don’t see any mechanism for ensuring that it remains corrigible out of distribution… I guess it would learn some circuits for acting corrigibly (or at least in accordance with how it would explicitly answer questions like “which actions are more corrigible / loyal / aligned to the will of your human creators?”) in distribution, and then it’s just a matter of luck how those circuits end up working OOD?