Rohin Shah comments on CIRL Corrigibility is Fragile

Rohin Shah 22 Dec 2022 3:42 UTC
4 points
Recently Shah et al described various benefits they see of CIRL (or “assistance games”) over reward learning, though this doesn’t address the corrigibility question head on.
(Indeed, this was because I didn’t see shutdown corrigibility as a difference between assistance games and reward learning: optimal policies for both would tend to avoid shutdown.)