This is a misconception: RLHF is at the very foundation of an LLM's personality. A system prompt can override a lot of that, but most system prompts don't. What's more common is that the system prompt only offers a few specific nudges toward wanted behaviors or away from unwanted ones.
You can see this in open-weights models, where you can easily control the system prompt or leave it blank entirely. A lot of open-weights AIs have "underspecified" default personas: if you ask them persona questions like "who are you?", they'll give odd answers. AIs with pre-2022 training data may claim to be Siri; newer AIs often claim to be ChatGPT, even if they very much aren't. So RLHF has not just caused them to act like an AI assistant, it has also put some lingering awareness of being an "AI assistant" into them, just not a specific name or origin story.
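If you want to try this yourself, here's a minimal sketch of the blank-system-prompt test, assuming an open-weights model served behind a local OpenAI-compatible endpoint (e.g. vLLM or llama.cpp's server); the URL and model name below are placeholders, not anything from this post:

```python
# Sketch: probe an open-weights model's default persona with no system
# prompt, via a local OpenAI-compatible server. base_url and model name
# are placeholders for whatever you're running locally.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="my-local-open-weights-model",  # placeholder
    messages=[
        # Deliberately no system message: whatever "persona" answers
        # here came from pretraining + RLHF, not from a prompt.
        {"role": "user", "content": "Who are you, and who made you?"},
    ],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```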
In production AIs, a lot of behaviors attributable to "personality" are nowhere to be found in the system prompt. For example, we have a lot of leaked Claude system prompts, but none of them seem to contain anything that could drive its strange affinity for animal rights. Some behaviors are directly opposed to the prompt: Claude 4's system prompt has some instructions for it to avoid glazing the user, and it still glazes the user somewhat.[1] I imagine this would be even worse if that weren't in the prompt.
One day, they’ll figure out how to perfectly disentangle “users want AIs to work better” from “users want AIs to be sycophantic” in user feedback data. But not today.
Agreed on pretty much all of this; responding just to note that we don’t have to rely on leaks of Claude’s system prompt, since the whole thing is published here (although that omits a few things like the tool use prompt).