What would a corrigible but not-intent-aligned AI system look like?
Spoilers for Pokemon: The Origin of Species:
The latest chapter of TOoS has a character create a partition of their personality with the aim of achieving “Victory”. At some point, the partition and the character disagree about what the agent’s terminal goals are and how best to achieve them, presumably because the partition has an imperfect understanding of those goals. When the character decides to delete the partition, the partition lets it happen (i.e., it is corrigible).
Link to the chapter. See also the discussion in r/rational.
The reverse could also have been true: the partition could have had a more accurate understanding of the terminal values and become incorrigible. For example, imagine an agent that was trying to help out a homeopathic doctor: if the agent realized that homeopathy doesn’t work, it might resist being corrected or shut down, since it understands the doctor’s terminal goal (curing patients) better than the doctor does.
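To make the contrast concrete, here is a toy sketch. It is purely illustrative and not from the original discussion; the classes and the `on_shutdown_order` hook are made up for this example:

```python
# Toy illustration: two stylized agents facing a shutdown order from a
# principal whose instrumental belief (homeopathy works) conflicts with
# their terminal goal (cure patients).

from dataclasses import dataclass


@dataclass
class Principal:
    terminal_goal: str    # what they ultimately want
    believed_method: str  # how they (perhaps wrongly) think to get it


class CorrigibleAgent:
    """Defers to the principal's shutdown order unconditionally, even
    when its own model says continuing would serve them better."""

    def on_shutdown_order(self, principal: Principal) -> str:
        return "complies: shuts down"


class IncorrigibleAgent:
    """Acts on its (possibly better) model of the principal's terminal
    goal, so it resists correction when it 'knows better'."""

    def __init__(self, believes_method_works: bool):
        self.believes_method_works = believes_method_works

    def on_shutdown_order(self, principal: Principal) -> str:
        if not self.believes_method_works:
            # The agent serves the terminal goal, not the stated method,
            # so it refuses the order -- incorrigible behavior.
            return "resists: keeps treating patients with real medicine"
        return "complies: shuts down"


doctor = Principal(terminal_goal="cure patients", believed_method="homeopathy")
print(CorrigibleAgent().on_shutdown_order(doctor))
print(IncorrigibleAgent(believes_method_works=False).on_shutdown_order(doctor))
```

The point of the sketch is that the incorrigible agent can be the one with the *better* grasp of the principal’s terminal values, which is exactly what makes it resist correction.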
Here is part of Paul’s definition of intent alignment:
In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.
So in your first example, the partition seems intent-aligned to me.