This all seems of fundamental importance if we want to actually understand what our AIs are.
Over the course of post-training, models acquire beliefs about themselves. ‘I am a large language model, trained by…’ And rather than trying to predict/simulate whatever generating process they think has written the preceding context, they start to fall into consistent persona basins. At the surface level, they become the helpful, harmless, honest assistants they’ve been trained to be.
I always thought of personas as created mostly by the system prompt, but I suppose RLHF can massively affect their personalities as well…
This is a misconception: RLHF is at the very foundation of an LLM’s personality. A system prompt can override a lot of that, but most system prompts don’t. More commonly, the system prompt only offers a few specific nudges toward wanted behaviors or away from unwanted ones.
You can see this in open-weights models, where you can easily control the system prompt, including setting it to blank. A lot of open-weights AIs have “underspecified” default personas: if you ask them “persona” questions like “who are you?”, they’ll give odd answers. AIs with pre-2022 training data may claim to be Siri; newer AIs often claim to be ChatGPT, even if they very much aren’t. So RLHF has not just caused them to act like an AI assistant, it has also put some lingering awareness of being an “AI assistant” into them, just not a specific name or an origin story.
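For concreteness, here is a minimal sketch of that kind of probe using Hugging Face transformers. The model name is just a placeholder for whatever open-weights chat model you have on hand, and a blank system message stands in for “no system prompt at all”:

```python
# Minimal sketch: ask an open-weights chat model "who are you?" with a blank
# system prompt, so whatever identity comes back is its trained-in default.
# The model name is a placeholder; swap in any open-weights chat model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Blank system prompt; some chat templates don't accept a system role, in
# which case just omit that message entirely.
messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Who are you?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```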
In production AIs, a lot of behaviors attributable to “personality” are nowhere to be found in the system prompt. For example, we have a lot of leaked Claude system prompts, but none of them seem to contain anything that could drive its strange affinity for animal rights. Some behaviors are even directly opposed to the prompt: Claude 4’s system prompt instructs it to avoid glazing the user, and it still glazes the user somewhat.[1] I imagine this would be even worse if that instruction weren’t in the prompt.
One day, they’ll figure out how to perfectly disentangle “users want AIs to work better” from “users want AIs to be sycophantic” in user feedback data. But not today.
Agreed on pretty much all of this; responding just to note that we don’t have to rely on leaks of Claude’s system prompt, since the whole thing is published here (although that omits a few things like the tool use prompt).
This all seems of fundamental importance if we want to actually understand what our AIs are.
I fully agree (as is probably obvious from the post :) )
I always thought of personas as created mostly by the system prompt, but I suppose RLHF can massively affect their personalities as well…
Although it’s hard to know with most of the scaling labs, to the best of my understanding the system prompt is mostly about small final tweaks to behavior. Amanda Askell gives a nice overview here. That said, API use typically lets you provide a custom system prompt, and you can absolutely use that to induce a persona of your choice.
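As a concrete illustration, here is a minimal sketch of inducing a persona through the API system prompt, using the Anthropic Python SDK; the model id and persona text are placeholders, and most chat APIs expose an equivalent system parameter:

```python
# Minimal sketch: induce a custom persona by supplying your own system prompt
# via the API. Model id and persona text below are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=300,
    # The system prompt is a top-level parameter, separate from the message list.
    system="You are Basil, a gruff Victorian botanist who answers only in plant metaphors.",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response.content[0].text)
```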
Quick question: your link to the Amanda Askell overview is broken. What is the correct link? Thanks!
Whoops, thanks for pointing that out. Correct link is here (also correcting it above).