I don’t think this demonstration truly captures treacherous turns, precisely because the agent needs to learn about how it can misbehave over multiple trials. As I understand it, a treacherous turn involves the agent modeling the environment sufficiently well that it can predict the payoff of misbehaving before taking any overt actions. The Goertzel prediction is what is happening here.
It’s important to start getting a grasp on how treacherous turns may work, and this demonstration helps; my disagreement is on how to label it.
I agree that this could be presented differently in order to be “narratively” closer to the canonical tracherous turn. However, in my opinion, this still counts as a good demonstration; think of the first 1,999 episodes (out of 2,000) as happening in Link’s mind, before taking his “real” decisions in the last episode. Granted, in our world AI would not be able to predict the future, but it would have access to sophisticated predictive tools, including machine learning.
a treacherous turn involves the agent modeling the environment sufficiently well that it can predict the payoff of misbehaving before taking any overt actions.
I agree. To be able to make this prediction, it must already know about the preferences of the overseer, know that the overseer would punish unaligned behavior, potentially estimating the punishing reward or predicting the actions the overseer would take. To make this prediction it must therefore have some kind of knowledge about how overseers behave, what actions they are likely to punish. If this knowledge does not come from experience, it must come from somewhere else, maybe from reading books/articles/Wikipedia or oberving this behaviour somewhere else, but this is outside of what I can implement right now.
The Goertzel prediction is what is happening here.
I agree that this does not correctly illustrate a treacherous right now, but it is moving towards it.
I’d like to register an intuition that I could come up with a (toy, unrealistic) continual learning scenario that looks like a treacherous turn with today’s ML, perhaps by restricting the policies that the agent can learn, giving it a strong inductive bias that lets it learn the environment and the supervisor’s preferences quickly and accurately, and making it model-based. It would look something like Stuart Armstrong’s toy version of the AI alignment problem, but with a learned environment model (but maybe learned from a very strong prior, not a neural net).
This is just an intuition, not a strong belief, but it would be enough for me to work on this if I had the time to do so.