Hm. I’ve often imagined a “keep the diamond safe” planner just choosing a plan which a narrow-ELK-solving reporter says is OK.
But where does the plan come from? If you’re imagining that the planner creates N different plans and then executes the one that the reporter says is OK, then I have the same objection:
The planner “knows” how and why it chose that action sequence while the predictor doesn’t, so it’s very plausible that this asymmetry lets the planner choose some bad or deceptive sequence that nonetheless looks good to the predictor. (The classic example is that plagiarism is easy to commit but hard to detect just from the output; see this post.)
How do you imagine the reporter being used?
Planner proposes some actions, call them A. The human raters use the reporter to understand the probable consequences of A, how those consequences should be valued, etc. This allows them to provide good feedback on A, creating a powerful and aligned oversight process that can be used as a training signal for the planner.