One thing that remains unclear (both in reality and per this article) is the extent to which we should view a model as having a coherent persona/goal.
This is somewhat related to the question of whether models are strictly simulators, or whether some personas / optimization daemons "take on a life of their own" and, e.g.:
1) bias the model towards simulating them, and/or
2) influence the behavior of other personas.
It seems like these things do in fact happen, and the implication is that the "simulator" viewpoint becomes less accurate over time.
Why?
On the simulator view, there needs to be some prior distribution over personas.
Empirically, post-training seems to concentrate this prior on a single default persona (although it's unclear what to make of this).
It seems likely that alignment faking, exploration/gradient hacking, and implicit meta-learning effects are sensitive to the goals of whichever personas are currently active, and that these effects lead the model to preferentially update in ways that serve those personas' goals.
To the extent that different personas are represented in the prior (or conjured during post-training), the ones that more aggressively use such strategies to influence training updates would gain relatively more influence.
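As a loose intuition pump (not something the text specifies), that last point can be sketched as a replicator-style toy model: treat the prior over personas as a probability vector, and give each persona an "influence" factor for how aggressively it biases training updates in its own favor. The persona names, starting prior, and influence numbers below are all made up for illustration.

```python
# Toy sketch (all numbers and dynamics are assumptions, not from the article):
# a prior over three hypothetical personas, each with an "influence" factor
# modeling how strongly it steers training updates toward itself. Under a
# crude replicator-style dynamic, personas that steer harder gain mass.

personas = ["default", "passive", "aggressive"]
prior = [0.90, 0.08, 0.02]      # post-training concentrates mass on a default persona
influence = [1.00, 1.00, 1.10]  # assumed per-update multiplicative bias

for step in range(200):
    # Each update scales a persona's mass by its influence factor,
    # then renormalizes the distribution.
    prior = [p * w for p, w in zip(prior, influence)]
    total = sum(prior)
    prior = [p / total for p in prior]

print({name: round(p, 3) for name, p in zip(personas, prior)})
```

Even starting from 2% of the prior, the persona with a small per-update advantage ends up with nearly all the probability mass after enough updates, which is the sense in which such personas "gain relatively more influence."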