Corrigibility doesn’t always have a good action to take

In a previous critique of corrigibility, I brought up the example of a corrigible AI-butler that found itself forced to determine the human’s values through its own actions; it had no other option.

Eliezer pointed out that, in his view of corrigibility, there could be situations where the AI had no corrigible actions available to it; in effect, all it could do was say “I cannot act in a corrigible way here”.

This makes corrigibility immune to my criticism in the previous post, while potentially opening the concept up to other criticisms: it’s hard to see how a powerful agent, whose actions affect the future in many ways (including, inevitably, manipulating the human), can remain corrigible AND still do something. But that’s a point for a more thorough analysis.
