By “corrigible” here did you mean Paul’s definition which doesn’t include competence in modeling the user and detecting ambiguities, or what we thought “corrigible” meant (where it does include those things)?
Thinking of “corrigible” as “whatever Paul means when he says corrigible”. The idea applies to any notion of corrigibility which allows for multiple actions and does not demand that the action returned be the best possible one for the user.
In that case, your answer doesn’t seem to address my question, which was about:
“Would a community of AI researchers with a long time to think about answering this be able to provide training to someone so that they and a bunch of assistants find the ambiguity”?
If someone doesn’t know how to use a bunch of assistants to “find the ambiguity”, how does training them to avoid certain actions help? (Perhaps we’re no longer discussing that topic, which was in part based on a misunderstanding of what Paul meant?)
Was thinking of things more in line with Paul’s version, not this finding-ambiguity definition, where the goal is to avoid doing some kind of malign optimization during search (i.e., an untrained assistant thinks it’s a good idea to use the universal prior; then you show them “What does the universal prior actually look like?”, and afterwards they know not to do that).