Recently, Shah et al. described various benefits they see in CIRL (or “assistance games”) over reward learning, though this doesn’t address the corrigibility question head-on.
(Indeed, this was because I didn’t see shutdown corrigibility as a difference between assistance games and reward learning—optimal policies for both would tend to avoid shutdown.)