So this line of thinking came from considering the predictor and the planner as separate, which is what ELK does AFAIR. It would thus be the planner that executes, or prepares, a treacherous turn, while the predictor doesn’t know about it (but would in principle be able to find out if it actually needed to).
What is the planner? How is it trained?
I often imagine a multihead architecture, where most of the computation is in the shared layers, then one (small) head is the “predictor” (trained by self-supervised learning), one (small) head is the “actor / planner” (trained by RL from human feedback), and the other (small) head is the “reporter” (trained via ELK). In this version the hope is that ~all of the relevant knowledge is in the shared layers and so is accessible to the reporter.
You could also be fancier and take the activations, weights, and/or outputs of the planner head and feed them as inputs into the reporter, if you really wanted to be sure that the information is accessible in principle.
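A minimal PyTorch sketch of that multihead setup, just to make the sharing concrete (module names and sizes are illustrative, and the reporter taking the planner’s output as an extra input is the “fancier” variant, not anything from the ELK report):

```python
import torch
import torch.nn as nn

class MultiHeadAgent(nn.Module):
    """Shared trunk with small predictor / planner / reporter heads.

    The hope described above: the trunk does ~all of the work, so
    whatever the predictor and planner know is sitting in `z`,
    where the reporter can see it.
    """

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 1024):
        super().__init__()
        # Most of the capacity lives here, shared by all three heads.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.predictor = nn.Linear(hidden, obs_dim)  # self-supervised: next observation
        self.planner = nn.Linear(hidden, act_dim)    # RL from human feedback: action logits
        # Fancier variant: the reporter also sees the planner's output,
        # so the planner's choice is accessible to it in principle.
        self.reporter = nn.Linear(hidden + act_dim, 1)  # ELK-trained answer head

    def forward(self, obs_and_actions: torch.Tensor):
        z = self.trunk(obs_and_actions)
        prediction = self.predictor(z)
        plan_logits = self.planner(z)
        report = torch.sigmoid(self.reporter(torch.cat([z, plan_logits], dim=-1)))
        return prediction, plan_logits, report
```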
This thread makes me think that my post is basically a hardness result for ELK when you don’t have access to the planner. I agree with you that in settings like the ones you describe, the reporter would have access to the planner, and thus the examples described in this post wouldn’t really apply. But the need to have control of the planner is not stated in the ELK report.
So if this post is correct, solving ELK isn’t enough if you don’t have access to the planner. Which means we either need to ensure that in all cases we can train/observe the planner (which depends on when the alignment team is involved, and the exact architectures available), or deal with the regress of questions I present.
Thanks for making that clearer to me!
Why is this what you often imagine? I thought that in the classic ELK architectural setup, the planner uses the outputs of both a predictor and a reporter in order to make its plans, e.g. using the reporter to grade plans and finding the plan which most definitely contains a diamond (according to the reporter). And the simplest choice would be brute-force search over action sequences.
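A sketch of that simplest choice, assuming a `predictor` that maps an action sequence to a predicted state and a `reporter` that answers questions about it (both call signatures are hypothetical):

```python
import itertools

def brute_force_plan(actions, horizon, predictor, reporter):
    """Exhaustively search action sequences up to `horizon`, keeping the
    one the reporter is most confident leaves the diamond in the room.
    Exponential in `horizon`: this is the idealized picture, not
    something you could run for realistic plans.
    """
    best_seq, best_score = None, float("-inf")
    for seq in itertools.product(actions, repeat=horizon):
        state = predictor(seq)  # predicted consequences of this sequence
        score = reporter(state, "Is the diamond still in the room?")
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq
```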
After all, here’s the architecture:

[Figure from the ELK report: actions and initial observations feed into a “figure out what’s going on” module, which feeds the head that predicts future observations.]
But in your setup, the planner would be another head branching from “figure out what’s going on”, which means that it’s receiving the results of a computation already conditioned on the action sequence?
Mostly my reason for thinking about my architecture is that if the planner is separate it seems so obviously doomed (from a worst-case perspective, for outer alignment / building a good oversight process). The planner “knows” how and why it chose the action sequence while the predictor doesn’t, and so it’s very plausible that this allows the planner to choose some bad / deceptive sequence that looks good to the predictor. (The classic example is that plagiarism is easy to commit but hard to detect just from the output; see this post.)
But if you had me speculate about what ARC thinks, my probably-somewhat-incorrect understanding is “in theory, the oversight process (predictor + reporter) needs to be able to run exponentially-large searches (= perfect optimization in an exponentially large space) over the possible computations that the planner could have done (see idealized ascription universality); in practice we’re going to give it ‘hints’ about what computation the planner actually does by e.g. sharing weights or looking at the planner’s activations”.
But in your setup, the planner would be another head branching from “figure out what’s going on”, which means that it’s receiving the results of a computation already conditioned on the action sequence?

I would assume the action sequence input is a variable-length list (e.g. you pass all the actions through an LSTM and then the last output / hidden state is provided as an input to the rest of the neural net). The planner can be conditioned on the last N actions and asked to produce the next action (and initially N = 0).
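For instance, a sketch of that encoding (hypothetical, assuming an embedded action representation):

```python
import torch
import torch.nn as nn
from typing import Optional

class ActionEncoder(nn.Module):
    """Encode a variable-length action prefix into one fixed-size vector."""

    def __init__(self, act_dim: int, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(act_dim, hidden, batch_first=True)
        # Learned encoding for the empty prefix (the N = 0 case).
        self.empty = nn.Parameter(torch.zeros(1, hidden))

    def forward(self, actions: Optional[torch.Tensor]) -> torch.Tensor:
        # actions: (batch, N, act_dim), or None before any action is chosen.
        if actions is None or actions.size(1) == 0:
            return self.empty
        out, _ = self.lstm(actions)
        # The last hidden state summarizes the whole prefix; the rest of
        # the net (and the planner head, when producing action N + 1)
        # takes this as one of its inputs.
        return out[:, -1]
```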
Why would the planner have pressure to choose something which looks good to the predictor, but is secretly bad, given that it selects a plan based on what the reporter says? Is this a Goodhart’s curse issue, where the curse afflicts not the reporter (which is assumed conservative, if it’s the direct translator), but the predictor’s own understanding of the situation?
… What makes you think it does this? That wasn’t part of my picture.
Hm. I’ve often imagined a “keep the diamond safe” planner just choosing a plan which a narrow-ELK-solving reporter says is OK.
How do you imagine the reporter being used?
But where does the plan come from? If you’re imagining that the planner creates N different plans and then executes the one that the reporter says is OK, then I have the same objection:

The planner “knows” how and why it chose the action sequence while the predictor doesn’t, and so it’s very plausible that this allows the planner to choose some bad / deceptive sequence that looks good to the predictor. (The classic example is that plagiarism is easy to commit but hard to detect just from the output; see this post.)

How do you imagine the reporter being used?
Planner proposes some actions, call them A. The human raters use the reporter to understand the probable consequences of A, how those consequences should be valued, etc. This allows them to provide good feedback on A, creating a powerful and aligned oversight process that can be used as a training signal for the planner.
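In pseudocode, that oversight loop might look like the following sketch; every name here (propose_actions, answer, questions_about, rate, reinforce) is a hypothetical stand-in rather than a real API:

```python
def oversight_step(planner, reporter, humans, state):
    """One step of reporter-assisted human feedback on the planner."""
    A = planner.propose_actions(state)

    # Raters query the reporter about the probable consequences of A,
    # rather than judging the raw action sequence directly.
    answers = {
        q: reporter.answer(state, A, q)
        for q in humans.questions_about(A)
    }

    # The rating, informed by the reporter's answers, is the training
    # signal for the planner (e.g. a reward in RLHF).
    reward = humans.rate(A, answers)
    planner.reinforce(A, reward)
```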