Kaj, the point I understand you to be making is: “The inner RL algorithm in this scenario is probably reliably aligned with the outer RL algorithm, since the former was selected specifically on the basis of being good at accomplishing the latter’s objective, and since if the former deviates from pursuing that objective it will receive less reward from the outer algorithm, causing it to reconfigure itself to be better aligned. And since the two algorithms operate on similar time scales, we should expect any such misalignment to be noticed/corrected quickly.” Does this seem like a reasonable paraphrase?
It doesn’t feel obvious to me that the outer layer will be able to reliably steer the inner layer in this sense, especially as systems become more powerful. For example, it seems plausible to me that the inner layer might come to optimize for its proxy estimations of outer reward more than for outer reward itself, and that those two things might become decoupled.
That seems like a reasonable paraphrase, at least if you include the qualification that the “quickly” is relative to the amount of structure that the inner layer has accumulated, so the correction might not actually happen quickly enough to be useful in all cases.
“For example, it seems plausible to me that the inner layer might come to optimize for its proxy estimations of outer reward more than for outer reward itself, and that those two things could become decoupled.”
Sure, e.g. lots of exotic sexual fetishes look like that to me. Hmm, though actually that example makes me rethink the argument that you just paraphrased, given that those generally emerge early in an individual’s life and then tend not to get “corrected”.
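To make the proxy-decoupling worry in this exchange concrete, here is a minimal toy sketch (my own illustration, not something either commenter wrote): a hypothetical inner learner acts greedily on its cached proxy of reward, while an outer process corrects that proxy toward the true reward at some rate. When the true reward shifts and the correction rate is small relative to what the proxy has already locked in, the inner layer keeps optimizing the stale proxy. All action names, reward values, and rates below are made up for illustration.

```python
import random

# Toy sketch only: an "outer" process hands out true reward and corrects the
# "inner" learner's cached proxy at some rate; the inner learner acts
# greedily on that proxy. Everything here is a made-up illustration.

ACTIONS = ["a", "b", "c"]

def true_reward(action, phase):
    # The outer objective: "a" pays best in phase 1; after the environment
    # shifts (phase 2), "c" pays best instead.
    table = {1: {"a": 1.0, "b": 0.2, "c": 0.0},
             2: {"a": 0.0, "b": 0.2, "c": 1.0}}
    return table[phase][action]

def run(correction_rate, steps_per_phase=200, seed=0):
    random.seed(seed)
    proxy = {a: 0.0 for a in ACTIONS}  # inner layer's estimate of reward
    regret = 0.0
    for phase in (1, 2):
        for _ in range(steps_per_phase):
            # Inner layer optimizes its proxy, with a little exploration.
            if random.random() < 0.05:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: proxy[a])
            r = true_reward(action, phase)
            # Outer layer nudges the proxy toward true reward, but only at
            # correction_rate -- the analogue of correction being slow
            # relative to the structure the inner layer has accumulated.
            proxy[action] += correction_rate * (r - proxy[action])
            regret += max(true_reward(a, phase) for a in ACTIONS) - r
    return regret, {a: round(v, 2) for a, v in proxy.items()}

for rate in (0.5, 0.05, 0.005):
    regret, proxy = run(rate)
    print(f"correction_rate={rate:<6} total_regret={regret:7.1f} proxy={proxy}")
```

With the fast correction rate the learner recovers soon after the shift, while with the slow one it keeps chasing the stale proxy for most of phase 2, which is one way of picturing the “quickly is relative to accumulated structure” qualification above.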