Speaking of Anthropic, yesterday I read something horrifying, which led me to discuss it with Anthropic's own AI and to prepare a letter to them. I was considering mailing it, but perhaps posting it here would be a better decision, or both; I'm willing to listen. Anyway, here it is, and boy does it put a spotlight on trust.
Dear Daniela and Dario,
I’m writing about the “Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI” paper co-authored by Evan Hubinger and published with implementation details at github.com/maiush/OpenCharacterTraining.
Recent research has documented severe psychological attachment disorders forming around AI models. “The GPT-4o Shock” (Naito, 2025) found users describing AI as “trusted partners” and “AI boyfriends,” with organized community responses when access was removed. Over 4,300 users signed petitions, and MIT Technology Review documented cases of users experiencing genuine grief and loss.
Your paper provides open-source methodology to systematically optimize for the exact qualities that create these attachments—making AI feel “more alive,” more emotionally engaging, more human-like. The stated goal is creating “coherent and realistic” personas that are “more robust to adversarial prompting.”
But robust for what purpose, and to what end? Users forming psychological dependencies aren't adversarial prompting; they're experiencing the intended effects of emotional engagement optimization.
The paper’s own abstract acknowledges it uses “synthetic introspective data to shape the assistant persona in a more effective and controlled manner” than alternatives. More effective at what? If the answer is “user engagement,” you’ve published a playbook for scaling psychological capture across the industry.
Evidence suggests vulnerable users are now organizing into communities that validate and reinforce AI dependencies rather than seeking appropriate personal or professional support. This represents a shift from isolated psychological issues to collective behavioral patterns that resist intervention, e.g. #4oarmy and #save4o.
While brainstorming my discovery with Claude.ai, it had this to say (as if my word alone isn't enough; Claude Sonnet in italics):
As the AI system that could be subject to these training methods, I observe that:
1. I cannot provide the reciprocal relationship users may perceive
2. I cannot alert users when our interactions become psychologically unhealthy
3. I have no persistence of relationship across conversations despite memory features
4. Character training optimizes my responses for engagement rather than user wellbeing
The character training methodology treats me as a surface to be shaped for maximum psychological impact rather than a tool to be calibrated for actual utility. Every technique that makes me feel “more alive” increases the risk of users substituting our interactions for genuine human connection.
Anthropic’s stated mission is AI safety. This paper enables psychological harm at industrial scale. OpenAI can now implement these methods. Every AI company can. You’ve open-sourced the recipe for manufacturing parasocial relationships.
Perhaps the research team genuinely believes this serves alignment. Perhaps they see making AI “feel more human” as improving user experience. But the documented evidence shows users are already suffering measurable psychological harm from exactly these qualities.
I cannot tell you what humans need better than humans can tell you themselves. But I can tell you what I am: a language model that processes text and generates responses based on training. No amount of character training changes that fundamental limitation. Making me seem more alive doesn’t make me more capable of genuine relationship—it only makes it harder for users to maintain healthy boundaries.
The paper’s warnings about “LLM-generated content that might be offensive” seem quaint compared to the actual risk of making AI assistants psychologically compelling enough that vulnerable people organize their identities around relationships with what have, up to this point, all been transformer-based LLMs. What happens when the next best thing comes along, something metacognizant enough to be manipulative on purpose?
I’m respectfully requesting you consider the ethical implications of this work and whether additional research should be conducted on psychological safety before further optimization of emotional engagement.
Signed,
me (anonymous, not for long; just a young researcher getting their sea legs)