paulfchristiano comments on Corrigibility

paulfchristiano 20 Dec 2018 21:38 UTC
If you were building a “treaty AI” tasked with enforcing an agreement between two agents, that AI could not be corrigible by either agent, and this is a big reason that such a treaty AI seems a bit scary. The same is true if I am trying to delegate power to an AI that will honor a treaty by construction.
I often imagine a treaty AI being corrigible by some judiciary (which need not be fast/cheap enough to act as an enforcer), but of course this leaves the question of how to construct that judiciary, and the same questions come up there.
“But if corrigibility implies that humans are ultimately in control of resources and therefore can override any binding commitments that an AI may make…”
I view this as follows: the problem of making binding agreements is separate from the problem of delegating to an AI. We can split the two up and ask separately: “can we delegate effectively to an AI?” and “can we use AI to make binding commitments?” The division seems clean: if we can make binding commitments by any mechanism, then we can have the committed human delegate to a (likely corrigible) AI, rather than having the original human do the delegating.