When I say “optimize all the parameters at runtime”, I do not mean “take one gradient step in between each timestep”. I mean, at each timestep, fully optimize all of the parameters. Optimize θ all the way to convergence before every single action.
Think back to the central picture of mesa-optimization (at least as I understand it). The mesa-optimizer shows up because some data is only available at runtime, not during training, so it has to be processed at runtime using parameters selected during training. In the online RL setup you sketch here, “runtime” for mesa-optimization purposes is every time the system chooses its action—i.e. every timestep—and “training” is all the previous timesteps. A mesa-optimizer should show up if, at every timestep, some relevant new data comes in and the system has to process that data in order to choose the optimal action, using parameters inherited from previous timesteps.
Now, suppose we fully optimize all of the parameters at every timestep. The objective function for this optimization would presumably be ∑_t r_t log(P[a_t | π_θ]), with the sum taken over all previous data points, since that’s what the RL setup is approximating.
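To make this concrete, here is a toy numerical sketch (my own construction, not from the thread): a softmax policy over two bandit arms, where before each action we re-run gradient ascent on the surrogate ∑_t r_t log(P[a_t | π_θ]) over all data collected so far, approximately to convergence. All names and hyperparameters here are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def surrogate(theta, actions, rewards):
    # The objective from the text: sum_t r_t * log P[a_t | pi_theta],
    # summed over all data points collected so far.
    p = softmax(theta)
    return sum(r * np.log(p[a]) for a, r in zip(actions, rewards))

def optimize_fully(actions, rewards, n_arms=2, steps=2000, lr=0.5):
    # "Fully optimize all of the parameters": run gradient ascent on the
    # surrogate (approximately) to convergence before the next action.
    theta = np.zeros(n_arms)
    for _ in range(steps):
        p = softmax(theta)
        grad = np.zeros(n_arms)
        for a, r in zip(actions, rewards):
            grad += r * (np.eye(n_arms)[a] - p)  # d/dtheta of r * log p[a]
        theta += lr * grad
    return theta

# Hypothetical past data: arm 0 pulled once (total reward 1),
# arm 1 pulled twice (total reward 3).
actions, rewards = [0, 1, 1], [1.0, 1.0, 2.0]
theta = optimize_fully(actions, rewards)
p = softmax(theta)
# With all-positive rewards, the surrogate's true optimum puts probability on
# each arm proportional to its total collected reward, i.e. p ≈ [0.25, 0.75]
# here -- not all mass on the best arm.
```

Note the design point this surfaces: the surrogate’s exact optimum (reward-proportional probabilities) differs from the reward-maximizing policy (all mass on the best arm), which is one concrete sense in which the true optimum of this objective need not be the thing we want.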
This optimization would probably still “find” the same mesa-optimizer as before, but now it looks less like a mesa-optimizer problem and more like an outer alignment problem: that objective function is probably not actually the thing we want. The fact that the true optimum for that objective function probably has our former “mesa-optimizer” embedded in it is a pretty strong signal that the objective function itself is not outer aligned.

Does that make sense?
The RL process is actually optimizing E[∑_t r_t]; the log just comes from the REINFORCE trick. Regardless, I’m not sure I understand what you mean by optimizing fully to convergence at each timestep—convergence is a limiting property, so I don’t know what it could mean to do it at a single timestep. Perhaps you mean just taking the optimal policy π∗ such that

π∗ = argmax_π E[∑_t r_t | π]?
In that case, that is in fact the definition of outer alignment I’ve given in the past, so I agree that whether π∗ is aligned or not is an outer alignment question.
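The REINFORCE trick mentioned above is the score-function identity ∇_θ E[r] = E[r ∇_θ log π_θ(a)], which is where the log enters the surrogate. A quick numerical check of that identity for a one-step softmax bandit (again my own toy construction, with made-up reward values) compares the exact gradient of expected reward against the Monte Carlo score-function estimate:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One-step bandit: J(theta) = sum_a pi_theta(a) * r(a).
r = np.array([1.0, 3.0])          # hypothetical per-arm rewards
theta = np.array([0.2, -0.4])     # arbitrary parameter setting
p = softmax(theta)

# Exact gradient of J: for a softmax policy,
# grad_theta pi(a) = pi(a) * (onehot(a) - p).
exact = sum(r[a] * p[a] * (np.eye(2)[a] - p) for a in range(2))

# REINFORCE / score-function estimate: average of r(a) * grad log pi(a)
# over sampled actions, using grad log pi(a) = onehot(a) - p.
rng = np.random.default_rng(0)
n = 200_000
samples = rng.choice(2, size=n, p=p)
vals = r[samples, None] * (np.eye(2)[samples] - p)   # shape (n, 2)
est = vals.mean(axis=0)
# est and exact agree up to Monte Carlo noise.
```

The point of the identity is that sampling actions and weighting ∇ log π by reward gives an unbiased estimate of the gradient of E[∑_t r_t], even though the surrogate being differentiated contains the log.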
Sure, π∗ works for what I’m saying, assuming that sum over time only includes the timesteps taken thus far. In that case, I’m saying that either:

- the mesa-optimizer doesn’t appear in π∗, in which case the problem is fixed by fully optimizing everything at every timestep (i.e. by using π∗), or
- the mesa-optimizer does appear in π∗, in which case the problem was really an outer alignment issue all along.