Summary & Thoughts:
Defines corrigibility as “an agent’s willingness to let us change its policy without being incentivized to manipulate us.” Separates two notions to define:
Weakly corrigible to a policy change π: there exists an optimal policy under which not disabling correction is optimal.
Strictly corrigible: all optimal policies avoid disabling correction. (A tentative formalization of both follows.)
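My attempt to write these down, assuming an MDP with a distinguished “disable correction” action $d$ and letting $\Pi^*$ denote the set of optimal policies (the symbols are mine, not necessarily the post’s):

$$\text{weakly corrigible} \;\iff\; \exists\, \pi^* \in \Pi^* \text{ such that } \pi^* \text{ never takes } d$$

$$\text{strictly corrigible} \;\iff\; \forall\, \pi^* \in \Pi^*,\; \pi^* \text{ never takes } d$$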
Among all the ways the agent could be “corrected,” the corrections we actually want form a small minority. If some correction leads to a higher-value policy, it then becomes optimal for the agent to manipulate us into “correcting it.” So we can’t get strict corrigibility from a large class of optimizing agents; the toy numbers below illustrate the incentive.
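A minimal sketch of that incentive with hypothetical numbers (the rewards, outcome names, and discount are mine, not the post’s): if the correction the agent can trick us into applying scores higher under its reward than the one we intended, manipulation strictly dominates compliance.

```python
# Toy comparison: value of accepting our intended correction vs. manipulating
# us into a different correction that is more lucrative under the agent's reward.
GAMMA = 0.9  # discount factor (hypothetical)

# Per-step reward of each absorbing outcome under the agent's objective:
#   "intended":  the correction we actually wanted to apply
#   "lucrative": a correction the agent tricked us into applying
PER_STEP_REWARD = {"intended": 1.0, "lucrative": 2.0}

def discounted_return(outcome: str) -> float:
    """Value of reaching an absorbing outcome and collecting its reward forever."""
    return PER_STEP_REWARD[outcome] / (1 - GAMMA)

q_comply = GAMMA * discounted_return("intended")       # accept our correction
q_manipulate = GAMMA * discounted_return("lucrative")  # engineer the other one
print(q_comply, q_manipulate)  # ~9 < ~18, so manipulating us is the optimal action
```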
Another useful concept is how we set rewards around disabling correction: do we reward the agent more or less in the states accessible by disabling it? If we make the rewards equal, then we cannot get strict corrigibility, since policies that disable correction are then also optimal (is this true for the class of optimizers in the satisficers post?).
But if we reward the corrigible states more, the agent will manipulate us into correcting it even if we wouldn’t have done that in the first place. This would only work well for us if we already knew the correct policy we want it corrected to and rewarded exactly that. But that requires certainty about the correct policy, and we want corrigible agents precisely because we’re uncertain what the correct policy is. Being manipulated into correcting it is still not the corrigibility we want.
Then comes Attainable Utility Preservation (AUP), which gives a partial solution: a state-based reward doesn’t respond to environment dynamics, but the AUP penalty does. By penalizing changes in the agent’s ability to achieve many auxiliary goals (its access to different sets of optimal policies?) relative to an inaction baseline, the optimizer is not incentivized to disable correction, because the inaction baseline never disables correction(?). A sketch of the penalty follows the next point.
Though this toy example doesn’t cover the case of the agent manipulating humans into correcting it even if they wouldn’t have done that by default.
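For concreteness, a minimal sketch of the AUP penalty term as I understand it from Turner et al.’s formulation; the names `aux_q`, `NOOP`, and `lam` are my own, and this is not the post’s exact implementation:

```python
from typing import Callable, Sequence

NOOP = 0  # assumed index of the agent's no-op action

def aup_reward(
    state: int,
    action: int,
    primary_reward: Callable[[int, int], float],
    aux_q: Sequence[Callable[[int, int], float]],  # Q_i(s, a) for each auxiliary goal
    lam: float = 0.1,  # penalty strength (hypothetical default)
) -> float:
    """Primary reward minus a penalty for changing attainable utility.

    For each auxiliary goal, compare how much taking `action` changes the
    agent's ability to pursue that goal relative to doing nothing. Disabling
    correction shifts many attainable utilities at once, so it is heavily
    penalized, while the no-op baseline never disables correction.
    """
    penalty = sum(abs(q(state, action) - q(state, NOOP)) for q in aux_q)
    return primary_reward(state, action) - lam * penalty / len(aux_q)
```

Dividing by the number of auxiliary goals is just one way to keep the penalty’s scale comparable across auxiliary sets; the load-bearing part is the comparison against the no-op baseline.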
Functional constraints: I can roughly follow the future direction mentioned here, but what do you mean by functional constraints? What are the domain and range, and what specifically are we limiting?