I think this is under-discussed, but I have also seen many discussions in this area. E.g., I have seen it come up, and have brought it up myself, in the context of Paul’s research agenda, where success relies on humans being able to play their part safely in the amplification system. Many people say they are more worried about misuse than accident on the basis of the corruption issues (and much discussion about CEV and idealization, superstimuli, etc. addresses the kind of path-dependence and adversarial search you mention).
However, those varied problems mostly aren’t formulated as ‘ML safety problems in humans’ (I have seen robustness and distributional shift discussion for Paul’s amplification, and daemons/wireheading/safe-self-modification for humans and human organizations), and that seems like a productive framing for systematic exploration, going through the known inventories and trying to see how they cross-apply.
I agree with all of this but I don’t think it addresses my central point/question. (I’m not sure if you were trying to, or just making a more tangential comment.) To rephrase, it seems to me that ‘ML safety problems in humans’ is a natural/obvious framing that makes clear that alignment to human users/operators is likely far from sufficient to ensure the safety of human-AI systems, that in some ways corrigibility is actually opposed to safety, and that there are likely technical angles of attack on these problems. It seems surprising that someone like me had to point out this framing to people who are intimately familiar with ML safety problems, and also surprising that they largely respond with silence.
in some ways corrigibility is actually opposed to safety
We can talk about “corrigible by X” for arbitrary X. I don’t think these considerations imply a tension between corrigibility and safety; they just suggest that “humans in the real world” may not be the optimal X. You might prefer to use an appropriate idealization of humans / humans in some safe environment / etc.
To the extent that even idealized humans are not perfectly safe (e.g., perhaps a white-box metaphilosophical approach is even safer), and that corrigibility seems to conflict with greater transparency and hence cooperation between AIs, there still seems to be some tension between corrigibility and safety even when X = idealized humans.
ETA: Do you think IDA can be used to produce an AI that is corrigible by some kind of idealized human? That might be another approach that’s worth pursuing if it looks feasible.
Yes.