I’m not sure how necessary it is to explicitly aim to avoid catastrophic behavior—it seems that even a low capability corrigible agent would still know enough to avoid catastrophic behavior in practice.
Paul gave a bit more motivation here (it's a bit confusing that these two posts are reposted here out of order; ETA 1/28/19: strange, the date on that repost just changed to today's date, when yesterday it was dated November 2018):
If powerful ML systems fail catastrophically, they may be able to quickly cause irreversible damage. To be safe, it’s not enough to have an average-case performance guarantee on the training distribution — we need to ensure that even if our systems fail on new distributions or with small probability, they will never fail too badly.
My interpretation of this is that learning with catastrophes / optimizing worst-case performance (I believe these refer to the same thing, which is also confusing) is needed to train an agent that can be called corrigible in the first place. Without it, we could end up with an agent that looks corrigible on the training distribution, but would do something malign ("applies its intelligence in the service of an unintended goal") after deployment.
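To make the average-case vs. worst-case contrast in the quoted passage concrete, here is a minimal sketch. All the loss values are made up for illustration (nothing here is from the post); the point is just that a rare catastrophic failure can be invisible under an average-case guarantee while dominating a worst-case one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Losses on 1,000 inputs drawn from the training distribution: uniformly small.
train_losses = rng.uniform(0.0, 0.1, size=1000)

# Two rare, off-distribution inputs on which the model fails catastrophically.
# (Hypothetical values, chosen only to make the contrast stark.)
catastrophic_losses = np.array([50.0, 80.0])

all_losses = np.concatenate([train_losses, catastrophic_losses])

# An average-case guarantee looks fine: the rare failures barely move the mean.
print(f"mean loss: {all_losses.mean():.3f}")

# A worst-case guarantee is what actually surfaces the catastrophe.
print(f"max loss:  {all_losses.max():.3f}")
```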
Yeah, that makes sense, and the distinction between benign and malign failures in that post seems right. It now makes much more sense to me that learning with catastrophes is necessary for corrigibility.