AI paper club: The assistant axis in persona space
When you chat with an AI assistant, it usually acts helpful and professional. But sometimes things get weird—the model starts speaking in a mystical tone, claims to be something else entirely, or drifts into bizarre behavior. What’s going on under the hood?
A recent Anthropic paper digs into the geometry of “personas” inside language models. They find that diverse character types (Ghost, Sage, Nomad, Demon...) cluster along a primary axis—and at one end sits the helpful Assistant we’re familiar with.
We’ll discuss the paper, what it tells us about how RLHF actually shapes models, and what it might mean for alignment.
Please read this paper to prepare for the session:
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
(A shorter summary can be found here)