Overall I think “simulators” names a useful concept. I also liked how you pointed out and deconfused type errors around “GPT-3 got this question wrong.” Other thoughts:

I wish you had more strongly ruled out “reward is the optimization target” as an interpretation of the following quotes:
> RL’s archetype of an agent optimized to maximize free parameters (such as action-trajectories) relative to a reward function.
>
> ...
>
> Simulators like GPT give us methods of instantiating intelligent processes, including goal-directed agents, with methods other than optimizing against a reward function.
>
> ...
>
> Does the simulator archetype converge with the RL archetype in the case where all training samples were generated by an agent optimized to maximize a reward function? Or are there still fundamental differences that derive from the training method?
For the last quote—I think people do reinforcement learning, and so are “updated by reward functions” in an appropriate sense. Then GPT-3 is already mostly trained against samples meeting your stipulated condition. (But perhaps you meant something else?)
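To make the convergence question concrete, here is a toy setup (my construction; the action set and reward are made up, not from your post): fit a predictive model only on actions emitted by a reward-maximizing agent, then compare it to the reward-argmax policy itself.

```python
# Toy framing of the convergence question. Everything here is an
# illustrative stand-in, not anything from the post.

ACTIONS = [0, 1, 2]

def reward(action):
    return {0: 0.0, 1: 1.0, 2: 0.5}[action]

# "Expert" data: every sample comes from an agent maximizing reward.
expert_data = [max(ACTIONS, key=reward) for _ in range(1_000)]

# Simulator-style training: fit the empirical action distribution.
counts = {a: expert_data.count(a) for a in ACTIONS}
predictive_policy = {a: counts[a] / len(expert_data) for a in ACTIONS}

# RL-style target: the reward-argmax policy itself.
rl_policy = max(ACTIONS, key=reward)

print(predictive_policy)  # all mass on action 1, matching rl_policy
```

On the training distribution the two coincide; any difference attributable to the training method would have to show up off-distribution.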
This brings me to another question you ask:
> Why mechanistically should mesaoptimizers form in predictive learning, versus for instance in reinforcement learning or GANs?

I think most of the alignment-relevant differences between RL and SSL might come from an independence assumption more strongly satisfied in SSL.
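To gesture at what I mean by the independence assumption (this is my gloss and toy code, not something from your post): in SSL the training distribution is fixed before training and does not depend on the model’s outputs, whereas an RL policy partly generates the data it is then updated on.

```python
import random

# Toy illustration of where each training batch comes from. All names
# here are illustrative stand-ins.

corpus = [random.gauss(0.0, 1.0) for _ in range(1_000)]  # fixed up front

def ssl_batch(model_params):
    # SSL: batches come from the fixed corpus; model_params is
    # deliberately unused, since the training distribution is
    # independent of the model's current behavior.
    return random.sample(corpus, k=32)

def rl_batch(policy_mean):
    # RL: the batch is a rollout generated by the current policy, so
    # the training distribution shifts whenever the policy does.
    return [random.gauss(policy_mean, 1.0) for _ in range(32)]
```

If that is the relevant difference, stories about mesaoptimizer formation that depend on the model influencing its own training data apply less directly to SSL.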
You also ask:

> What if the training data is a biased/limited sample, representing only a subset of all possible conditions? There may be many “laws of physics” which equally predict the training distribution but diverge in their predictions out-of-distribution.
I think this isn’t so much a problem with your “simulators” concept as a problem with the concept of outer alignment.