I think we have a strong intuitive disagreement here, which explains our differing judgements.
I think we both agree that a) there is a sense of corrigibility for humans interacting with humans in typical situations, and b) there are thought experiments (e.g. a human given more time to reflect) that extend this beyond typical situations.
We possibly also agree on c) corrigibility is not uniquely defined.
I intuitively feel that there is no well-defined version of corrigibility that works for arbitrary agents interacting with arbitrary agents, or even for arbitrary agents interacting with humans (except for one example, see below).
One of the reasons for this is my experience of how hard intuitive human concepts are to scale up—at least, without considering human preferences. See this comment for an example in the “low impact” setting.
So corrigibility feels like it’s in the same informal category as low impact. It also has a lot of possible contradictions in how it’s applied, depending on which corrigibility preferences and meta-preferences the AI chooses to use. Contradictions are opportunities for the AI to choose an outcome, randomly or with some optimisation pressure.
But haven’t I argued that human preferences themselves are full of contradictions? Indeed, and resolving those contradictions is an important part of the challenge. But I’m much more optimistic about getting to a good place, for human preferences, by explicitly resolving the contradictions in humans’ overall preferences, rather than by resolving the contradictions in human corrigibility preferences (and if “human corrigibility preferences” have to include enough general human preferences to make them safe, is it really corrigibility we’re talking about?).
To develop that point slightly: I see optimising for anything that doesn’t include safety or alignment as likely to sacrifice safety or alignment; so optimising for corrigibility will either sacrifice them, or else the concepts of safety (and most of alignment) are already present in “corrigibility”.
I do know one version of corrigibility that makes sense: it explicitly looks at what human preferences will end up being, and attempts to minimise the rigging of that process. That’s one of the reasons I keep coming back to the outcome.
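To make the “minimise rigging” idea concrete, here is a minimal toy sketch (my own illustration, not a formal definition from the post): the agent’s policy for eliciting preferences shouldn’t shift the expected outcome of the learning process relative to a neutral reference policy. All names here (`TRUE_P`, `ask_human`, `rigging`, the “neutral” vs “leading” styles) are hypothetical and purely illustrative.

```python
# Toy sketch: measure how much a preference-elicitation policy "rigs"
# the expected outcome of the learning process, relative to a neutral
# reference policy. An unrigged policy should score (close to) zero.

import random

TRUE_P = 0.4  # the human's underlying preference, unknown to the agent


def ask_human(style: str) -> float:
    """Noisy report of the human's preference; a 'leading' question
    biases the report upwards (i.e. rigs the process)."""
    noise = random.gauss(0, 0.05)
    bias = 0.2 if style == "leading" else 0.0
    return min(1.0, max(0.0, TRUE_P + bias + noise))


def expected_learned_preference(style: str, n: int = 10_000) -> float:
    """Monte Carlo estimate of the final learned preference under a policy."""
    return sum(ask_human(style) for _ in range(n)) / n


def rigging(style: str, reference: str = "neutral") -> float:
    """How far the policy shifts the expected outcome of the learning
    process away from the reference policy's expected outcome."""
    return abs(expected_learned_preference(style)
               - expected_learned_preference(reference))


if __name__ == "__main__":
    print("rigging(neutral):", round(rigging("neutral"), 3))  # ~0.0
    print("rigging(leading):", round(rigging("leading"), 3))  # ~0.2
```

The point of the sketch is only that “don’t rig the process” can be stated as a property of the expected outcome, which is why the outcome keeps coming up.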
I would prefer it if you presented an example of a setup, perhaps one with some corrigibility-like features, rather than presenting a general setup and saying “corrigibility will solve the problems with this”.