I want to bring to your attention a question that came up here: is corrigibility incompatible with enhanced AI-based cooperation? One hope for positive differential progress from AI is that AIs may be able to cooperate/coordinate with each other better than humans can, because they can be more transparent. But if corrigibility implies that humans are ultimately in control of resources, and can therefore override any binding commitments an AI makes, that would make it impossible for an AI to trust another AI more than it trusts that AI’s user/operator.
If you were building a “treaty AI” tasked with enforcing an agreement between two agents, that AI could not be corrigible by either agent, and this is a big reason such a treaty AI seems a bit scary. Similarly if I am trying to delegate power to an AI that will honor a treaty by construction.
I often imagine a treaty AI being corrigible by some judiciary (which need not be fast/cheap enough to act as an enforcer), but of course this leaves the question of how to construct that judiciary, and the same questions come up there.
But if corrigibility implies that humans are ultimately in control of resources and therefore can override any binding commitments that an AI may make
I view this as: the problem of making binding agreements is separate from the problem of delegating to an AI. We can split the two up and ask separately: “can we delegate effectively to an AI?” and “can we use AI to make binding commitments?” The division seems clean: if we can make binding commitments by any mechanism, then we can have the committed human delegate to a (likely corrigible) AI, rather than having the original human delegate directly.