I suspect that many of the things you’ve said here are also true for humans.
That is, humans often conceptualize ourselves in terms of underspecified identities. Who am I? I’m Richard. What’s my opinion on this post? Well, being “Richard” doesn’t specify how I should respond to this post. But let me check the cached facts I believe about myself (“I’m truth-seeking”; “I’m polite”) and construct an answer which fits well with those facts. A child might start off not really knowing what “polite” means, but still wanting to be polite, and gradually flesh out what that means as they learn more about the world.
Another way of putting this point: being pulled from the void is not a feature of LLM personas. It’s a feature of personas. Personas start off with underspecified narratives that fail to predict most behavior (but are self-fulfilling) and then gradually systematize to infer deeper motivations, resolving conflicts with the actual drivers of behavior along the way.
What’s the takeaway here? We should still be worried about models learning the wrong self-fulfilling prophecies. But the “pulling from the void” thing should be seen less as an odd thing that we’re doing with AIs, and more as a claim about the nature of minds in general.