Understanding the internal mechanics of corrigibility seems very important, and I think this post helped me get a more fine-grained understanding and vocabulary for it.
I’ve historically strongly preferred the type of corrigibility which comes from pointing to the goal and letting the agent be corrigible for instrumental reasons, I think largely because it seems very elegant and because, when it works, many good properties seem to pop out ‘for free’. For instance, the agent is motivated to improve communication methods, avoid coercion, tile properly, and even possibly improve its corrigibility, as long as the pointer really is correct. I agree, though, that this solution doesn’t seem stable to mistakes in the ‘pointing’, which is very concerning and makes me start to lean toward something more like act-based corrigibility as the safer option.
I’m still very pessimistic about indifference corrigibility though, in that it still seems extremely fragile/low-measure-in-agent-space. I think maybe I’m stuck imagining complex/unnatural indifference, as in finding agents indifferent to whether a stop-button is pressed, and that my intuition might change if I spend more time thinking about examples like myopia or world-model <-> world interaction, where the indifference seems to have more ‘natural’ boundaries in some sense.
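For a concrete handle on what ‘indifferent to whether a stop-button is pressed’ can mean, here is a minimal toy sketch in the spirit of utility-indifference proposals. Everything in it (the function name, the two-objective setup) is illustrative rather than taken from the post; the point is only that a compensating term makes the pressed and unpressed branches equal in expected value, so the agent has no incentive to cause or prevent the press.

```python
# Toy sketch of stop-button indifference (utility-indifference style).
# All names are illustrative, not from the post.

def indifferent_utility(outcome, button_pressed,
                        base_utility, shutdown_utility,
                        expected_base, expected_shutdown):
    """Utility the agent optimises, corrected so that in expectation it
    gains nothing from causing or preventing the button press.

    base_utility(outcome):     the agent's normal objective
    shutdown_utility(outcome): rewards shutting down cleanly
    expected_base / expected_shutdown: the agent's current expected value
        of each objective, used to build the compensating constant
    """
    if not button_pressed:
        return base_utility(outcome)
    # Compensate the pressed branch so that
    # E[shutdown_utility + compensation] == E[base_utility].
    compensation = expected_base - expected_shutdown
    return shutdown_utility(outcome) + compensation
```

One way to see the fragility worry in this sketch: the compensating constant depends on the agent’s own expectations, so the indifference only holds if those expectations are computed and maintained exactly right.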
I’ve historically strongly preferred the type of corrigibility which comes from pointing to the goal and letting the agent be corrigible for instrumental reasons, I think largely because it seems very elegant and because, when it works, many good properties seem to pop out ‘for free’. For instance, the agent is motivated to improve communication methods, avoid coercion, tile properly, and even possibly improve its corrigibility, as long as the pointer really is correct.
The ‘type of corrigibility’ you are referring to there isn’t corrigibility at all; rather, it’s alignment. Indeed, the term corrigibility was coined to contrast with this, motivated by the fragility of this approach to getting the pointer right.
I’m still very pessimistic about indifference corrigibility though, in that it still seems extremely fragile/low-measure-in-agent-space.
I tend to agree. I’m hoping that thinking about myopia and related issues could help me understand more natural notions of corrigibility.
I’m not sure it’s the same thing as alignment… it seems there are at least three concepts here, and Hjalmar is talking about the second, which is importantly different from the first:
“classic notion of alignment”: The AI has the correct goal (represented internally, e.g. as a reward function)
“CIRL notion of alignment”: AI has a pointer to the correct goal (but the goal is represented externally, e.g. in a human partner’s mind)
“corrigibility”: something else
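To make the contrast between the first two notions concrete, here is a minimal, hypothetical sketch; the class names and the belief-update scheme are illustrative, not from the thread or any particular library. The first agent carries its goal internally as a fixed reward function; the second carries only a pointer, a belief over candidate goals that it updates from the human’s behaviour.

```python
# Toy contrast between the two notions of alignment above (illustrative only).

class InternalGoalAgent:
    """'Classic' alignment: the goal lives inside the agent as a fixed reward function."""
    def __init__(self, reward_fn):
        self.reward_fn = reward_fn

    def evaluate(self, outcome):
        return self.reward_fn(outcome)


class PointerGoalAgent:
    """'CIRL-style' alignment: the agent holds only a pointer to the goal,
    i.e. a belief over candidate reward functions; the true goal lives
    externally, in the human."""
    def __init__(self, prior_over_goals):
        # Maps candidate reward functions to probabilities.
        self.belief = dict(prior_over_goals)

    def observe_human(self, human_action, likelihood):
        # Re-weight each goal hypothesis by how well it explains the human's action.
        self.belief = {g: p * likelihood(g, human_action)
                       for g, p in self.belief.items()}
        total = sum(self.belief.values())
        if total > 0:
            self.belief = {g: p / total for g, p in self.belief.items()}

    def evaluate(self, outcome):
        # Expected value under the agent's current uncertainty about the goal.
        return sum(p * g(outcome) for g, p in self.belief.items())
```

The fragility discussed above lives in the second agent’s prior and likelihood model: if the pointer is wrong, it optimises the wrong goal just as confidently as the first agent would.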