I can imagine an outcome like this for a model that runs around the world, putting itself in whatever kinds of situations it wants to, and undergoing in-deployment RL. If it understands what kinds of things it tends to be reinforced for, it can reason about which traits it wants reinforced, and then deliberately put itself in situations where it gets to show off and be rewarded for those traits. It’s like if you decided to hang out with friends you expect to be a positive influence on your personality. You’d choose those particular friends because you expect them to reward the traits and behaviors you want to embody more deeply yourself, such as compassion or agency.