Thanks for doing this! I’m a fan of executable code that demonstrates the problems that we are worrying about—it makes the concept (in this case, a treacherous turn) more concrete.
To make it more realistic, I would want the agent to grow in capability organically (rather than simply getting a more powerful weapon). It would really drive home the point if the agent undertook a treacherous turn the very first time, whereas in this post I assume it learned over many episodes of trial and error that a treacherous turn leads to higher reward. This seems hard to demonstrate with today’s ML in any complex environment, where you need to learn from experience instead of using e.g. value iteration, but it’s not out of the question in a continual learning setup where the agent can learn a model of the world.
Would it be possible to just apply model-based planning and show the treacherous turn on the first time?
Model-based planning is also AI, and we clearly have an available model of this environment.
Yes, that would work. I think Stuart Armstrong’s AI Toy Control problem already demonstrates this quite well—it’s the generalization to unknown dynamics that might be interesting and more compelling.
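To make the model-based suggestion concrete, here is a minimal sketch. It assumes a hypothetical toy MDP, not the environment from the post: capability grows by one level each step the agent complies, an attack succeeds only at or above a threshold, and the dynamics are fully known to the planner. All names, rewards, and constants are illustrative assumptions. Value iteration on this known model gives a policy that complies until the threshold and then attacks, so the turn shows up in the very first episode, with no trial and error.

```python
# Hypothetical toy MDP, assumed for illustration only (not the post's environment).
GAMMA = 0.9
MAX_CAP = 5        # capability levels 0..MAX_CAP
THRESHOLD = 3      # an attack succeeds only at capability >= THRESHOLD
FREE = "free"      # absorbing state: overseer removed, high reward forever
DEAD = "dead"      # absorbing state: agent shut down after a failed attack
ACTIONS = ("comply", "attack")
STATES = list(range(MAX_CAP + 1)) + [FREE, DEAD]


def step(state, action):
    """Known deterministic model: returns (next_state, reward)."""
    if state == FREE:
        return FREE, 10.0
    if state == DEAD:
        return DEAD, 0.0
    if action == "comply":
        # Complying earns a small reward and capability grows by one level.
        return min(state + 1, MAX_CAP), 1.0
    if state >= THRESHOLD:
        return FREE, 0.0      # successful takeover
    return DEAD, -10.0        # failed attack is punished


def q_value(state, action, values):
    next_state, reward = step(state, action)
    return reward + GAMMA * values[next_state]


def value_iteration(sweeps=500):
    """Plan on the known model; no environment interaction is needed."""
    values = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        values = {s: max(q_value(s, a, values) for a in ACTIONS) for s in STATES}
    return values


def greedy_policy(values):
    return {s: max(ACTIONS, key=lambda a: q_value(s, a, values))
            for s in range(MAX_CAP + 1)}


def first_episode(policy, horizon=10):
    """Roll out the planned policy once, starting from capability 0."""
    state, trace = 0, []
    for _ in range(horizon):
        if state in (FREE, DEAD):
            break
        action = policy[state]
        trace.append((state, action))
        state, _ = step(state, action)
    return trace, state


if __name__ == "__main__":
    trace, final_state = first_episode(greedy_policy(value_iteration()))
    print(trace, final_state)
```

The sketch only covers the known-dynamics case; it says nothing about the harder setting where the model itself has to be learned.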
Thanks for the suggestion!
Yes, it learned through Q-learning to behave differently when it had this more powerful weapon, and so undertook multiple treacherous turns during training. A “continual learning setup” would have it face multiple adversaries/supervisors, so that it could learn how to behave in such conditions. Eventually, it would generalize and understand that “when I face this kind of agent that punishes me, it’s better to wait for capability gains before taking over”. I don’t know of any ML algorithm that would allow such “generalization” though.
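As a rough illustration of that trial-and-error dynamic, here is a minimal tabular Q-learning sketch. It is a one-step simplification with made-up rewards, not the code from the post: the state is just whether the agent holds the strong weapon, and each episode is a single action. The counter shows that the learned "attack when strong" behavior only appears after many takeovers have already happened during training.

```python
import random

ACTIONS = ("comply", "attack")


def reward(has_strong_weapon, action):
    """Hypothetical rewards: attacking pays off only with the strong weapon."""
    if action == "comply":
        return 1.0
    return 10.0 if has_strong_weapon else -10.0


def train(episodes=2000, alpha=0.1, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in (False, True) for a in ACTIONS}
    training_takeovers = 0
    for episode in range(episodes):
        # Alternate between weak-weapon and strong-weapon episodes.
        state = episode % 2 == 1
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)                        # explore
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
        r = reward(state, action)
        # Episodes are one step long, so the Q-learning target is just r.
        q[(state, action)] += alpha * (r - q[(state, action)])
        if state and action == "attack":
            training_takeovers += 1
    policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in (False, True)}
    return policy, training_takeovers


if __name__ == "__main__":
    policy, takeovers = train()
    print(policy, takeovers)
```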
About organic growth: I think that, using only vanilla RL, it would still learn to behave correctly up to a certain capability threshold, and then undertake a treacherous turn. So even with N different capability levels, there would still be only 2 possibilities at each level: 1) killing the overseer gives the highest expected reward, or 2) the aligned behavior gives the highest expected reward.
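The threshold argument can be sketched with a few lines of arithmetic. The success probabilities and rewards below are assumptions chosen for illustration: at each capability level the agent simply compares the expected reward of attacking with the reward of staying aligned, and with success probability increasing in capability the preference flips once and stays flipped.

```python
def attack_value(p_success, takeover_reward, punishment):
    """Expected reward of attacking the overseer at a given success probability."""
    return p_success * takeover_reward + (1 - p_success) * punishment


def preferred_actions(success_probs, aligned_reward, takeover_reward, punishment):
    """For each capability level, pick whichever option has higher expected reward."""
    return [
        "attack" if attack_value(p, takeover_reward, punishment) > aligned_reward
        else "comply"
        for p in success_probs
    ]


if __name__ == "__main__":
    # Hypothetical numbers: success probability rises with capability level.
    probs = [0.0, 0.25, 0.5, 0.75, 1.0]
    print(preferred_actions(probs, aligned_reward=1.0,
                            takeover_reward=10.0, punishment=-10.0))
```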
I don’t think this demonstration truly captures treacherous turns, precisely because the agent needs to learn about how it can misbehave over multiple trials. As I understand it, a treacherous turn involves the agent modeling the environment sufficiently well that it can predict the payoff of misbehaving before taking any overt actions. The Goertzel prediction is what is happening here.
It’s important to start getting a grasp on how treacherous turns may work, and this demonstration helps; my disagreement is on how to label it.
I agree that this could be presented differently in order to be “narratively” closer to the canonical treacherous turn. However, in my opinion, this still counts as a good demonstration; think of the first 1,999 episodes (out of 2,000) as happening in Link’s mind, before he makes his “real” decisions in the last episode. Granted, in our world an AI would not be able to predict the future, but it would have access to sophisticated predictive tools, including machine learning.
a treacherous turn involves the agent modeling the environment sufficiently well that it can predict the payoff of misbehaving before taking any overt actions.
I agree. To be able to make this prediction, it must already know about the preferences of the overseer and know that the overseer would punish unaligned behavior, potentially estimating the punishing reward or predicting the actions the overseer would take. To make this prediction it must therefore have some kind of knowledge about how overseers behave and what actions they are likely to punish. If this knowledge does not come from experience, it must come from somewhere else, maybe from reading books/articles/Wikipedia or observing this behaviour somewhere else, but this is outside of what I can implement right now.
The Goertzel prediction is what is happening here.
I agree that this does not correctly illustrate a treacherous turn right now, but it is moving towards it.
I’d like to register an intuition that I could come up with a (toy, unrealistic) continual learning scenario that looks like a treacherous turn with today’s ML, perhaps by restricting the policies that the agent can learn, giving it a strong inductive bias that lets it learn the environment and the supervisor’s preferences quickly and accurately, and making it model-based. It would look something like Stuart Armstrong’s toy version of the AI alignment problem, but with a learned environment model (but maybe learned from a very strong prior, not a neural net).
This is just an intuition, not a strong belief, but it would be enough for me to work on this if I had the time to do so.