To what extent do models care about their performance across episodes? If there exists a side channel that only increases next-episode performance, under what circumstances will a model exploit it?
If an agent is trained with an episodic learning scheme and ends up with a behavior that maximizes reward across episodes, I’m not sure we should consider this an inner alignment failure. In some sense, we got the behavior that our learning scheme was optimizing for. [EDIT: this is not necessarily true for all learning algorithms, e.g. gradient descent; see discussion here]
To see this quickly, imagine an episodic learning scheme where, at the end of each episode, the agent’s policy network parameters are completely re-randomized if it failed to achieve the episode’s goal, and left unchanged otherwise. Given unlimited resources and an arbitrarily long run, the only parameter settings that survive are the ones that keep achieving each episode’s goal, so we should expect to end up with an agent that tries to achieve goals in future episodes.
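For concreteness, here is a minimal Python sketch of that scheme. The `run_episode` callback, the `random_params` helper, and the toy success check are hypothetical stand-ins I’m introducing for illustration, not part of the original setup; the point is only that whatever parameters survive are, by construction, parameters that achieved each episode’s goal.

```python
import numpy as np

def random_params(shape, rng):
    """Draw a fresh set of policy parameters uniformly at random."""
    return rng.uniform(-1.0, 1.0, size=shape)

def train(run_episode, param_shape, num_episodes, seed=0):
    """Selection-style episodic scheme from the comment above: keep the
    current parameters only if the episode's goal was achieved; otherwise
    re-randomize them completely."""
    rng = np.random.default_rng(seed)
    params = random_params(param_shape, rng)
    for _ in range(num_episodes):
        if not run_episode(params):  # env-specific success check (True/False)
            params = random_params(param_shape, rng)
    return params

# Toy usage: "success" means the parameters land near a fixed target vector,
# standing in for whatever the episode's goal actually is.
target = np.array([0.5, -0.25, 0.1])
survivor = train(lambda p: np.linalg.norm(p - target) < 0.2,
                 param_shape=target.shape, num_episodes=100_000)
```

Note that nothing here computes a gradient or “updates toward” anything; survival itself does the selecting, which is enough to produce goal-achieving behavior across episodes in the limit.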
I agree that you could interpret that as more of an outer alignment problem, though either way I think it’s definitely an important safety concern.