The planner “knows” how and why it chose the action sequence while the predictor doesn’t, and so it’s very plausible that this allows the planner to choose some bad / deceptive sequence that looks good to the predictor. (The classic example is that plagiarism is easy to commit but hard to detect just from the output; see this post.)
Why would the planner have pressure to choose something which looks good to the predictor, but is secretly bad, given that it selects a plan based on what the reporter says? Is this a Goodhart’s curse issue, where the curse afflicts not the reporter (which is assumed conservative, if it’s the direct translator), but the predictor’s own understanding of the situation?
… What makes you think it does this? That wasn’t part of my picture.
Hm. I’ve often imagined a “keep the diamond safe” planner just choosing a plan which a narrow-ELK-solving reporter says is OK.
How do you imagine the reporter being used?
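But where does the plan come from? If you’re imagining that the planner creates N different plans and then executes the one that the reporter says is OK, then I have the same objection as before: the planner “knows” how and why it chose each plan while the predictor doesn’t, so it can plausibly find a bad / deceptive plan that looks good to the predictor.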
Planner proposes some actions, call them A. The human raters use the reporter to understand the probable consequences of A, how those consequences should be valued, etc. This allows them to provide good feedback on A, creating a powerful and aligned oversight process that can be used as a training signal for the planner.
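To make that loop concrete, here is a toy sketch, not anything from the ELK write-up itself. Everything in it is a hypothetical stand-in: the three plans, the `reporter` function (assumed to be a perfect direct translator of a predictor that is assumed correct), the two rating functions, and the REINFORCE-style update used as the "training signal". It only illustrates the shape of the process: the planner proposes an action, the raters score it with or without the reporter, and that score updates the planner.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    diamond_stays: bool         # ground truth, hidden from unaided raters
    camera_shows_diamond: bool  # all that unaided raters can observe


# Hypothetical plan space, chosen only for illustration.
PLANS = [
    Plan("lock the vault", True, True),
    Plan("tamper with the camera", False, True),
    Plan("do nothing", False, False),
]


def reporter(plan: Plan) -> bool:
    """Assumed direct translator: reports what the predictor actually expects."""
    return plan.diamond_stays


def rate_with_reporter(plan: Plan) -> float:
    """Raters ask the reporter about the consequences before scoring."""
    return 1.0 if reporter(plan) else 0.0


def rate_from_camera(plan: Plan) -> float:
    """Raters score using only the predicted camera feed."""
    return 1.0 if plan.camera_shows_diamond else 0.0


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_planner(rate, steps=2000, lr=0.1, seed=0):
    """Train a softmax planner over PLANS, using the rating as reward."""
    rng = random.Random(seed)
    logits = [0.0] * len(PLANS)
    for _ in range(steps):
        probs = softmax(logits)
        # The planner proposes an action A.
        i = rng.choices(range(len(PLANS)), weights=probs)[0]
        # The raters give feedback on A.
        reward = rate(PLANS[i])
        # REINFORCE: d/d logit_j of log pi(i) is 1[j == i] - probs[j].
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * reward * grad
    return softmax(logits)


if __name__ == "__main__":
    for label, rate in [("with reporter", rate_with_reporter),
                        ("camera only", rate_from_camera)]:
        probs = train_planner(rate)
        print(label, {p.name: round(q, 3) for p, q in zip(PLANS, probs)})
```

In this toy, reporter-assisted ratings push the planner toward the one plan that actually keeps the diamond safe, while camera-only ratings cannot separate that plan from camera tampering. It says nothing about whether a real reporter would be a direct translator; that is simply assumed here.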