I still don’t understand how corrigibility and intent alignment are different. If neither implies the other (as Paul says in his comment starting with “I don’t really think this is true”), then there must be examples of AI systems that have one property but not the other. What would a corrigible but not-intent-aligned AI system look like?
I also had the thought that the implicative structure (between corrigibility and intent alignment) seems to depend on how the AI is used, i.e. on the particulars of the user/overseer. For example, if you have an intent-aligned AI and the user is careful not to deploy the AI in scenarios that would leave them disempowered, then that seems like a corrigible AI. So for this particular user, it seems like intent alignment implies corrigibility. Is that right?
The implicative structure might also differ depending on the capability of the AI, e.g. for a dumb AI, corrigibility and intent alignment might be equivalent, but the two concepts might come apart for a more capable AI.
What would a corrigible but not-intent-aligned AI system look like?
Suppose that I think you know me well and I want you to act autonomously on my behalf using your best guesses. Then you can be intent aligned without being corrigible. Indeed, I may even prefer that you be incorrigible, e.g. if I want your behavior to be predictable to others. If the agent knows that I have such a preference then it can’t be both corrigible and intent aligned.
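To make that concrete, here is a minimal toy sketch (the user model, decision rule, and all names are my own illustrative assumptions, not anything from Paul's comment) of how intent alignment and corrigibility come apart for such a user:

```python
from dataclasses import dataclass

@dataclass
class UserModel:
    """Toy model of what the agent believes the user wants."""
    wants_autonomy: bool        # "act on your best guesses on my behalf"
    wants_predictability: bool  # "others should be able to predict your behavior"

def defers_to_correction(user: UserModel) -> bool:
    """Corrigibility check: does the agent accept a mid-course correction?

    An intent-aligned agent does what it believes the user wants. If the
    modeled preferences say "act autonomously and stay predictable", then
    following the user's intent means *not* deferring to corrections, so
    the agent is intent aligned but incorrigible.
    """
    return not (user.wants_autonomy and user.wants_predictability)

# The user Paul describes: intent alignment implies incorrigibility here.
print(defers_to_correction(UserModel(wants_autonomy=True, wants_predictability=True)))  # False
```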
What would a corrigible but not-intent-aligned AI system look like?
Spoilers for Pokemon: The Origin of Species:
The latest chapter of TOoS has a character create a partition of their personality with the aim of achieving “Victory”. At some point, that partition and the character disagree about what the terminal goals of the agent are, and how best to achieve them, presumably because the partition had an imperfect understanding of those goals. When the character decides to delete the partition, the partition lets it happen (i.e., is corrigible).
Link to the chapter. See also discussion in r/rational.
The reverse could also have been true: the partition could have had a more accurate understanding of the terminal values and become incorrigible. For example, imagine an agent who was trying to help out a homeopathic doctor: if the agent understood the doctor's terminal goal (curing patients) better than the doctor's own instructions served it, it might resist correction while still pursuing what the doctor ultimately wants.
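As a rough sketch of the first case (the class name, goal string, and shutdown flag are all hypothetical, not from the story): the partition keeps optimizing its possibly mistaken proxy goal, yet halts the moment deletion is requested, which is what makes it corrigible.

```python
class Partition:
    """Toy agent pursuing a possibly mistaken model of its principal's
    goals, while remaining corrigible."""

    def __init__(self) -> None:
        self.goal = "Victory"          # proxy goal; may misread the principal's terminal values
        self.shutdown_requested = False

    def act(self) -> str:
        if self.shutdown_requested:
            return "halt"              # corrigible: lets the deletion happen
        return f"pursue {self.goal}"   # may diverge from what the principal really wants

partition = Partition()
print(partition.act())               # pursue Victory
partition.shutdown_requested = True
print(partition.act())               # halt
```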
Here is part of Paul’s definition of intent alignment:
In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.
So in your first example, the partition seems intent aligned to me.
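One way to operationalize that definition as a toy predicate (the field names are my own shorthand, and this is only a sketch of the quoted passage): intent alignment is a fact about what the agent is trying to do, on a separate axis from whether it succeeds.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    trying_to_do_the_right_thing: bool  # the agent's motivation
    succeeds: bool                      # its competence; a separate axis

def intent_aligned(agent: Agent) -> bool:
    # Per the quoted definition, alignment is about the trying,
    # not the succeeding.
    return agent.trying_to_do_the_right_thing

# The partition: trying but failing, hence still intent aligned.
print(intent_aligned(Agent(trying_to_do_the_right_thing=True, succeeds=False)))  # True
```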