Thanks, this is helpful. To restate my understanding: Your view is that highly capable AIs will not be well understood as enacting personas, since personas stay close to the pretraining distribution and are therefore not highly capable.
I do take this argument seriously. In a piece I’m working on about the “AIs enact personas” model of AI behavior/psychology, it’s one of the two main conceptual arguments I discuss for why AIs will not be well-understood as enacting personas in the future. (The other argument is that advanced AIs will be in very un-human-like situations, e.g. directly operating geographically dispersed infrastructure and working with exotic modalities, so it will be unreasonable for them to model the Assistant as enacting a human-like persona.)
That said, I think this is highly uncertain; I don’t think either of these arguments is robust enough to instill high confidence in its conclusion. Current AI assistants have capabilities which no persona in the pretraining distribution has (a simple example: they have specific knowledge, such as how to use tool-calling syntax, which no human has). Nevertheless, the LLM seems to just infer that the Assistant persona has this knowledge but is still essentially persona-like in its propensities and other behaviors. More generally, I don’t think we’ve yet seen signs that more capable AI assistants are less persona-like.
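For concreteness, here is a rough sketch of the kind of tool-calling syntax I have in mind; the wrapper format, tool name, and arguments below are invented purely for illustration rather than taken from any particular provider:

```python
import json

# A made-up example of the structured tool-call text an assistant is trained
# to emit. No persona in the pretraining corpus ever wrote calls in exactly
# this format, yet the model simply treats it as something the Assistant knows.
tool_call = {
    "type": "tool_call",
    "name": "search_web",  # hypothetical tool name
    "arguments": {"query": "weather in Paris", "max_results": 3},
}

# The output has to be exact, machine-parseable text for the scaffold to execute it.
print(json.dumps(tool_call, indent=2))
```

The point is just that emitting exact, machine-parseable text like this is knowledge no human in the pretraining corpus has, yet the model absorbs it into an otherwise persona-like Assistant.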
I disagree there is much uncertainty here! IDK, like, I am happy to take bets here if we can find a good operationalization. I just really don’t see models that are capable of taking over the world being influenced by AI persona stuff.[1]
More generally, I don’t think we’ve yet seen signs that more capable AI assistants are less persona-like.
I think we’ve seen that quite a bit! It used to be that the exact way you asked a question would matter a lot for the quality of response you get. Prompting was a skill with a lot of weird nuance. Those things have substantially disappeared, as one would expect as a result of RL-training.
Yes, we are currently actively trying to instill various personality traits into AI systems via things like Constitutional AI feedback, but we have clearly been moving away from the pretraining distribution as determining a lot of the behavior of AI systems, and I think we will see more of that.
And then additionally, I also don’t see the persona stuff mattering much for using AI systems for alignment research purposes while they are not capable of taking over the world. Like, in general I think we should train helpful-only models for that purpose, and in everyday work the persona stuff just doesn’t really matter for getting work out of these systems.
It used to be that the exact way you asked a question would matter a lot for the quality of response you get.
we have clearly been moving away from the pretraining distribution as determining a lot of the behavior of AI systems,
I agree that models no longer behave as much like pretrained models (trivially, because they have undergone further training which is not next-token prediction on a corpus of webtext), but not all ways of not behaving like a pretrained model undermine the view that LLMs are enacting personas. (This is the point I was trying to make with the “simulating a person who has some specific knowledge that no human has” example.)
Happy to take bets, but this should probably happen after the piece is out, with a more precise formulation of the “persona” model and a discussion of how I view empirical observations as relating to it.
I just really don’t see models that are capable of taking over the world being influenced by AI persona stuff.
At least for me, I think AI personas (i.e., the human predictors turned around to generate human-sounding text) are where most of the pre-existing agency of a model lies, since predicting humans requires predicting agentic behavior.
So I expect that post-training RL will not conjure new agentic centers from scratch, but will instead expand on this core. It’s unclear whether that will be enough to reach the capability to take over the world, but if RL-trained LLMs do scale to that level, I expect the entity to remain somewhat persona-like, in that it was built around the persona core. So it’s not completely implausible to me that “persona stuff” can have a meaningful impact here, though that’s still very hard and fraught.
I think we’ve seen that quite a bit! It used to be that the exact way you asked a question would matter a lot for the quality of response you get. Prompting was a skill with a lot of weird nuance. Those things have substantially disappeared, as one would expect as a result of RL-training.
I feel pretty confused by this comment, so I am probably misunderstanding something.
But like, the models now respond to prompts in a much more human-like way? I think this is mainly due to RL reinforcing personas over all the other stuff that a pretrained model is trying to do. It distorts the personas too, but I don’t see where else the ‘being able to interface in a human-like way with natural language’ skill could be coming from.