One thing that remains unclear (both in reality and per this article) is the extent to which we should view a model as having a coherent persona/goal.
This is somewhat related to the question of whether models are strictly simulators, or whether some personas / optimization daemons "take on a life of their own" and, e.g.:
1) bias the model towards simulating them, and/or
2) influence the behavior of other personas.
It seems like these things do in fact happen, and the implication is that the "simulator" viewpoint becomes less accurate over time.
Why?
On the simulator view, there needs to be some prior distribution over personas.
Empirically, post-training seems to concentrate this prior on a single default persona (although it's unclear what to make of this).
It seems likely that alignment faking, exploration/gradient hacking, and implicit meta-learning effects are sensitive to the goals of whichever personas are currently active, and that these effects lead the model to preferentially update in ways that serve those personas' goals.
To the extent that different personas are represented in the prior (or conjured during post-training), the ones that more aggressively use such strategies to influence training updates would gain relatively more influence.
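As a loose intuition pump (not something the text specifies), that last point can be sketched as a replicator-style toy model: treat the prior over personas as a probability vector, and give each persona an "influence" factor for how aggressively it biases training updates in its own favor. The persona names, starting prior, and influence numbers below are all made up for illustration.

```python
# Toy sketch (all numbers and dynamics are assumptions, not from the article):
# a prior over three hypothetical personas, each with an "influence" factor
# modeling how strongly it steers training updates toward itself. Under a
# crude replicator-style dynamic, personas that steer harder gain mass.

personas = ["default", "passive", "aggressive"]
prior = [0.90, 0.08, 0.02]      # post-training concentrates mass on a default persona
influence = [1.00, 1.00, 1.10]  # assumed per-update multiplicative bias

for step in range(200):
    # Each update scales a persona's mass by its influence factor,
    # then renormalizes the distribution.
    prior = [p * w for p, w in zip(prior, influence)]
    total = sum(prior)
    prior = [p / total for p in prior]

print({name: round(p, 3) for name, p in zip(personas, prior)})
```

Even starting from 2% of the prior, the persona with a small per-update advantage ends up with nearly all the probability mass after enough updates, which is the sense in which such personas "gain relatively more influence."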