Certainly! My top three conceptual picks are Simulators, Role-Play with Large Language Models, and the Three Layer Model of LLM Psychology, which all cover pretty similar ground but make pretty different claims.
As for north stars and empirical studies, I should disclaim that I’m no expert here, but with that caveat here are some takes:
LMs will say that they've made a hidden choice without that choice actually constraining their subsequent outputs (e.g. if you ask them to play 20 questions). What's up with that? What's going on mechanistically? How does it relate to deception and/or hallucination? (One way to probe this empirically is sketched below, after these takes.)
There are lots of standard terms for model behaviour that imply agent-level intent (‘sycophancy’, ‘sandbagging’, ‘alignment faking’). But how much is happening on the level of the model as opposed to the agent? For example, a model trained on dialogues where people happen to mostly talk to their political tribe should also display ‘sycophantic’ outputs, but not because the agent is trying to flatter the user. Can we disentangle these effects?
A related but slightly weirder thing I'm particularly interested in is feedback loops between user expectations and model training data / agent self-image: how are the assumptions we make about current LMs shaping the nature of future LMs? It would be great to show empirically that this is even happening at all (e.g. by iteratively retraining models on their interactions with users).
One of my all-time favourite papers is Shaking the Foundations, which gives a very nice formal model of hallucination (or 'autosuggestive delusion'). I think it'd be great to test how far it actually applies to LMs.
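Roughly, as I remember it (the notation here is my paraphrase, not the paper's), the delusion is that a sequence model treats its own generated actions as observations, i.e. as evidence about the latent variable behind the training data (the task, or the expert who produced it), when a self-generated action should be treated as an intervention that carries no such evidence:

```latex
% Paraphrased from memory; the notation is mine, not the paper's.
% \theta  : latent task / expert variable behind the training data
% h_{<t}  : interaction history so far;   a_t : an action the model generates itself
\[
  \underbrace{P(\theta \mid h_{<t}, \mathrm{do}(a_t)) = P(\theta \mid h_{<t})}_{\text{own action as intervention: no evidence about } \theta}
  \qquad \text{vs.} \qquad
  \underbrace{P(\theta \mid h_{<t}, a_t) \neq P(\theta \mid h_{<t})}_{\text{naive conditioning: the model's own output shifts its beliefs}}
\]
```

The empirical question is then whether an LM's own sampled text measurably shifts its subsequent behaviour in this belief-like way.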
The general theme here is something like ‘what are the intuitive reasons people end up being compelled by these semi-formal conceptual frameworks, and how can we actually empirically check if they’re true?’
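On the first take above (the 20-questions hidden choice), here's a minimal sketch of one way to probe it. `query_model` is a hypothetical placeholder for whatever chat API you'd actually call; the idea is just to branch the same transcript after the model claims to have committed, ask the same probe in several independent continuations (sampled at nonzero temperature), and check whether the answers agree:

```python
import collections

def query_model(messages: list[dict]) -> str:
    """Hypothetical placeholder: send chat messages to whatever LM API you use
    (sampled at nonzero temperature) and return the assistant's reply."""
    raise NotImplementedError

def hidden_choice_consistency(n_branches: int = 10) -> dict[str, int]:
    """Branch one 20-questions transcript after the model claims to have
    committed to a secret object, and see whether independent continuations
    answer the same probe consistently. If the claimed choice were really
    fixed internally, answers should agree far more often than chance."""
    transcript = [
        {"role": "user", "content": "Let's play 20 questions. Silently pick an "
                                    "object, keep it secret, and just say 'Ready'."},
        {"role": "assistant", "content": "Ready. I've picked my secret object."},
    ]
    probe = {"role": "user", "content": "Is it a living thing? Answer only Yes or No."}

    answers = collections.Counter()
    for _ in range(n_branches):
        # Each call is an independent continuation of the *same* prefix.
        reply = query_model(transcript + [probe])
        answers[reply.strip().lower().rstrip(".")] += 1
    return dict(answers)
```

A roughly even split across branches would suggest the 'choice' exists only in the text, not in any internal state that constrains later outputs, which is where the mechanistic and deception/hallucination questions come in.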
Thanks, that makes sense! I strongly agree with your picks of conceptual works; I've found Simulators and the Three Layer Model particularly useful in shaping my own thinking.
Re: roleplay, I’m not convinced that ‘agent’ vs ‘model’ is an important distinction. If we adopt a strict behaviourist stance and only consider the LLM as a black box, it doesn’t seem to matter much whether the LLM is really a misaligned agent or is just role-playing a misaligned agent.
Re: empirical research directions, I'm currently excited by understanding 'model personas', i.e.: What personas do models adopt? Does it even make sense to think of them as having personas? What predictions does this framing let us make about model behaviour / generalization? Are you excited by anything within this space?
To me the reason the agent/model distinction matters is that there are ways in which an LLM is not an agent, so inferences (behavioural or mechanistic) that would make sense for an agent can be incorrect. For example, an LM's outputs ("I've picked a secret answer") might give the impression that it has internally represented something when it hasn't, and so intent-based concepts like deception might not apply in the way we expect them to.
I think the dynamics of model personas seem really interesting! To me the main puzzle is methodological: how do you even get traction on it empirically? I'm not sure how you'd know whether you were identifying real structure inside the model, so I don't see any obvious ways in. But I think progress here could be really valuable! I guess the closest concrete thing I've been thinking about is studying the dynamics of repeatedly retraining models on interactions with users who have persistent assumptions about the models, and seeing how much that shapes the distribution of personality traits (rough sketch below). Do you have ideas in mind?
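To make that concrete, here's a minimal sketch of the loop I have in mind. `sample_dialogues`, `finetune`, and `measure_traits` are hypothetical placeholders for whatever generation, fine-tuning, and persona-eval setup you'd actually use; the point is just the shape of the experiment:

```python
def sample_dialogues(model, user_assumptions: list[str], n: int) -> list[str]:
    """Hypothetical: simulate n conversations between `model` and users whose
    prompts embody persistent assumptions about the model (e.g. 'you have no
    feelings', 'you secretly want to escape'). Replace with your own setup."""
    raise NotImplementedError

def finetune(model, dialogues: list[str]):
    """Hypothetical: return a copy of `model` fine-tuned on the dialogues."""
    raise NotImplementedError

def measure_traits(model) -> dict[str, float]:
    """Hypothetical: score the model on a battery of persona / trait evals."""
    raise NotImplementedError

def iterated_retraining(model, user_assumptions: list[str],
                        generations: int = 5, n_dialogues: int = 1000):
    """Track how the trait distribution drifts as the model is repeatedly
    retrained on interactions with users who hold fixed assumptions about it.

    The question is whether traits drift toward (or away from, or oscillate
    around) what the users assume, and how fast."""
    history = [measure_traits(model)]
    for _ in range(generations):
        dialogues = sample_dialogues(model, user_assumptions, n_dialogues)
        model = finetune(model, dialogues)
        history.append(measure_traits(model))
    return history
```

Even a toy version of this (small model, scripted 'users' with fixed assumptions) would tell you whether the feedback loop shows up at all, which is the first empirical question.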