A lot of the corrigibility properties of CIRL come from uncertainty: the robot wants to defer to the human because the human knows more about the preferences the robot is trying to optimize. Recently, Yudkowsky and others described the problem of fully updated deference: if the AI has learned everything it can, it may have no uncertainty left, at which point this corrigibility goes away. If the AI has learned your preferences perfectly, perhaps this is OK. But here Carey's critique of model misspecification rears its head again: if the AI is convinced you love vanilla ice cream, saying "no, please give me chocolate" will not convince it (perhaps it models you as having a cognitive bias against admitting your plain, vanilla preferences; it knows the real you), whereas an AI that retained uncertainty might listen.
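To make that failure concrete, here is a minimal toy sketch of Bayesian preference inference. The flavor setup, the honesty parameter, and all numbers are hypothetical illustrations, not from any CIRL implementation; the point is only that a hypothesis with zero prior mass can never be resurrected by evidence.

```python
import numpy as np

FLAVORS = ["vanilla", "chocolate"]

def posterior(prior, statements, p_honest):
    """Bayesian update: each statement names the true flavor w.p. p_honest."""
    post = np.array(prior, dtype=float)
    for s in statements:
        likelihood = np.array(
            [p_honest if f == s else 1 - p_honest for f in FLAVORS]
        )
        post *= likelihood
        post /= post.sum()
    return post

statements = ["chocolate"] * 10  # the human keeps asking for chocolate

# Well-specified robot: open-minded prior, statements are informative.
print(posterior([0.5, 0.5], statements, p_honest=0.9))
# -> ~[0., 1.]: essentially certain of chocolate, so it complies.

# Misspecified robot: "you love chocolate" has zero prior mass, so
# Bayes' rule can never recover it. Ten requests change nothing.
print(posterior([1.0, 0.0], statements, p_honest=0.9))
# -> [1., 0.]: still certain you love vanilla. It knows the real you.
```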
The standard approaches to dealing with this are nonparametric models, safe Bayes, and enlarging the model space so that many different models of the human (ideally including the true one) receive nonzero prior mass.
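The second of these can be sketched in the same toy setting. A safe-Bayes-style tempered update raises each likelihood to a power η < 1, slowing concentration so that residual uncertainty (and hence deference) survives longer; note that in Grünwald's actual SafeBayes η is learned from data, so the fixed η and prior below are illustrative assumptions.

```python
def tempered_posterior(prior, statements, p_honest, eta):
    """Generalized-Bayes update: likelihood ** eta with eta < 1 tempers
    each update, so the posterior concentrates more cautiously."""
    post = np.array(prior, dtype=float)
    for s in statements:
        likelihood = np.array(
            [p_honest if f == s else 1 - p_honest for f in FLAVORS]
        )
        post *= likelihood ** eta
        post /= post.sum()
    return post

# A robot that starts out nearly sure of vanilla, but not dogmatically so:
prior = [0.999, 0.001]
print(posterior(prior, statements, p_honest=0.9))           # -> ~[0., 1.]
print(tempered_posterior(prior, statements, 0.9, eta=0.2))  # -> ~[0.92, 0.08]
```

The tempered robot still assigns meaningful probability to chocolate after ten requests, so it never reaches the dogmatic certainty that shuts off deference, and further statements keep moving it. Tempering does not rescue a zero-mass hypothesis, though; that is what enlarging the model space is for.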