Could you be more specific about the AI architecture and training system you have in mind? Because I don’t follow the “Instead, its interim goals will be to restart the training environment” part.
I was thinking RL systems for the case where an agent learns the correct outcome to optimize for but in the wrong environment, but the same issue applies for mesa-optimizers within any neural net.
As for why it tries to restart the training environment, it needs a similar environment to meet a goal that is only defined within that environment. If the part that’s unclear is what a training environment means for something like a neural net trained with supervised learning, the analogy would be that the AI can somehow differentiate between training data (or a subset of it) and deployment data and wants to produce its outputs from inputs with the training qualities.
Could you be more specific about the AI architecture and training system you have in mind? Because I don’t follow the “Instead, its interim goals will be to restart the training environment” part.
I was thinking RL systems for the case where an agent learns the correct outcome to optimize for but in the wrong environment, but the same issue applies for mesa-optimizers within any neural net.
As for why it tries to restart the training environment, it needs a similar environment to meet a goal that is only defined within that environment. If the part that’s unclear is what a training environment means for something like a neural net trained with supervised learning, the analogy would be that the AI can somehow differentiate between training data (or a subset of it) and deployment data and wants to produce its outputs from inputs with the training qualities.