Huh. I see a lot of the work I’m excited about as trying to avoid the Agency Hand-Off paradigm, and the Value Learning sequence was all about why we might hope to be able to avoid it.
Your definition of corrigibility is the MIRI version. If you instead go by Paul’s post, the introduction goes:
> I would like to build AI systems which help me:
>
> - Figure out whether I built the right AI and correct any mistakes I made
> - Remain informed about the AI’s behavior and avoid unpleasant surprises
> - Make better decisions and clarify my preferences
> - Acquire resources and remain in effective control of them
> - Ensure that my AI systems continue to do all of these nice things
> - …and so on
This seems pretty clearly outside of the Agency Hand-Off paradigm to me (that’s the way I interpret it at least). Similarly for e.g. approval-directed agents.
I do agree that assistance games look like they’re within the Agency Hand-Off paradigm, especially the way Stuart talks about them; this is one of the main reasons I’m more excited about corrigibility.