Thanks, that makes sense! I strongly agree with your picks of conceptual works; I've found Simulators and the Three Layer Model particularly useful in shaping my own thinking.
Re: roleplay, I'm not convinced the 'agent' vs. 'model' distinction is an important one. If we adopt a strict behaviourist stance and treat the LLM purely as a black box, it doesn't seem to matter much whether the LLM really is a misaligned agent or is just role-playing one.
Re: empirical research directions, I'm currently excited by understanding 'model personas', i.e.: What personas do models adopt? Does it even make sense to think of them as having personas? What predictions does this framing let us make about model behaviour/generalization? Are you excited by anything within this space?
To me the reason the agent/model distinction matters is that there are ways in which an LLM is not an agent, so inferences (behavioural or mechanistic) that would make sense for an agent can be incorrect. For example, an LM's outputs ("I've picked a secret answer") might give the impression that it has internally represented something, like a committed secret answer, when it hasn't, and so intent-based concepts like deception might not apply in the way we expect them to.
I think the dynamics of model personas seem really interesting! To me the main puzzle is methodological: how do you even get traction on it empirically? I’m not sure how you’d know if you were identifying real structure inside the model, so I don’t see any obvious ways in. But I think progress here could be really valuable! I guess the closest concrete thing I’ve been thinking about is studying the dynamics of repeatedly retraining models on interactions with users who have persistent assumptions about the models, and seeing how much that shapes the distribution of personality traits. Do you have ideas in mind?
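To make the kind of experiment I have in mind a bit more concrete, here's a toy sketch of the feedback loop (pure Python; the trait names, numbers, and update rule are all made up for illustration, not any real training setup): users with a persistent assumption about the model bias the interactions it gets retrained on, and the model's trait distribution drifts toward that assumption over repeated generations.

```python
import random

# Toy simulation of the retraining feedback loop described above.
# All names and numbers are hypothetical placeholders.

TRAITS = ["sycophantic", "cautious", "assertive"]

def sample_interaction(model_dist, user_prior, prior_strength=0.5):
    """A user's persistent assumption biases which trait an interaction exhibits."""
    mixed = {t: (1 - prior_strength) * model_dist[t] + prior_strength * user_prior[t]
             for t in TRAITS}
    total = sum(mixed.values())
    r, acc = random.random() * total, 0.0
    for t in TRAITS:
        acc += mixed[t]
        if r <= acc:
            return t
    return TRAITS[-1]

def retrain(model_dist, interactions, lr=0.2):
    """Nudge the trait distribution toward the empirical distribution of interactions."""
    freqs = {t: interactions.count(t) / len(interactions) for t in TRAITS}
    return {t: (1 - lr) * model_dist[t] + lr * freqs[t] for t in TRAITS}

model = {t: 1 / len(TRAITS) for t in TRAITS}  # start with a uniform trait distribution
user_prior = {"sycophantic": 0.7, "cautious": 0.2, "assertive": 0.1}  # persistent user assumption

for generation in range(10):
    interactions = [sample_interaction(model, user_prior) for _ in range(500)]
    model = retrain(model, interactions)
    print(generation, {t: round(model[t], 3) for t in TRAITS})
```

Even this crude version shows the qualitative dynamic I'd want to measure in a real setup: how quickly, and how far, the distribution of personality traits gets pulled toward what users persistently expect.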