Other strategies I want to put in this cluster include formal verification, informed oversight, and factorization.
Why informed oversight? It doesn’t feel like a natural fit to me. Perhaps you think any oversight fits in this category, as opposed to the specific problem pointed to by informed oversight? Or perhaps there was no better place to put it?
Corrigibility is largely about making systems that are superintelligent without being themselves fully agentic.
This seems very different from the notion of corrigibility that is “a system that is trying to help its operator”. Do you think that these are two different notions, or are they different ways of pointing at the same thing?
I think informed oversight fits better with MtG white than it does with boxing. I agree that the three main examples are boxing-like and informed oversight is not, but it still feels white to me.
I do think that corrigibility done right is a thing that is in some sense less agentic. I think that things that have their goals outside of them are less agentic than things that have their goals inside of them, but I think corrigibility is stronger than that. I want to say something like: a corrigible agent not only has its goals partially on the outside (in the human), but also has its decision theory partially on the outside. Idk.