Figuring out whether to act or to ask questions feels like a fundamentally epistemic judgement: How confident am I in my knowledge that this is what my operator wants me to do? How important do I believe this aspect of my task to be, and how confident am I in my importance assessment? What is the likely cost of delaying in order to ask my operator a question? Etc. My intuition is that this problem is therefore best viewed within an epistemic framework (trying to have well-calibrated knowledge) rather than a behavioral one (trying to mimic instances of question-asking in the training data). Giving an agent examples of cases where it should ask questions feels like about as much of a solution to the problem of corrigibility as the use of soft labels (probability targets that are neither 0 nor 1) is a solution to the problem of calibration in a supervised learning context. It's a good start, but I'd prefer a solution with a stronger justification behind it. However, if we did have a solution with a strong justification, FAI would start looking pretty easy to me.
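To make the epistemic framing concrete, here is a minimal sketch of the kind of expected-cost comparison I have in mind. Everything here (the names, the numbers, the particular formula) is hypothetical and just for illustration; it is not a proposal for how the comparison should actually be computed.

```python
from dataclasses import dataclass

@dataclass
class TaskBelief:
    # P(this action is what the operator actually wants)
    p_action_is_intended: float
    # Estimated cost of getting this step wrong, in arbitrary utility units
    importance: float
    # Confidence in the importance estimate itself (0..1)
    importance_confidence: float
    # Cost of interrupting the operator / delaying the task to ask
    cost_of_asking: float

def should_ask(belief: TaskBelief) -> bool:
    """Ask when the expected cost of acting on a possibly-wrong belief
    exceeds the cost of delaying to ask. Low confidence in the importance
    estimate inflates the effective stakes, as a crude form of caution."""
    effective_importance = belief.importance / max(belief.importance_confidence, 1e-6)
    expected_cost_of_acting = (1.0 - belief.p_action_is_intended) * effective_importance
    return expected_cost_of_acting > belief.cost_of_asking

# Fairly confident about intent, low stakes: just act.
print(should_ask(TaskBelief(0.95, 1.0, 0.9, 0.5)))   # False
# Uncertain about intent on a high-stakes step: ask first.
print(should_ask(TaskBelief(0.6, 10.0, 0.5, 0.5)))   # True
```

The point is not this particular formula, which is made up, but that the decision bottoms out in calibrated probability estimates rather than in pattern-matching against examples of question-asking.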
My impression (shaped by this example of amplification) is that the agents in the amplification tree would be considering exactly these sorts of epistemic questions. (There is then the separate question of how faithfully this behaviour is reproduced/generalized during distillation.)