Speaking of Anthropic, yesterday I read something quite concerning. It prompted me to discuss it with Anthropic’s own AI and to draft a letter to them. I was considering mailing it directly, but posting it here might be the better choice, or both; I’m open to feedback. Here it is, slightly modified for this forum.
Dear Daniela and Dario,
I’m writing about the “Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI” paper co-authored by Evan Hubinger and published with implementation details at github.com/maiush/OpenCharacterTraining.
Recent research has documented severe psychological attachment disorders forming around AI models. “The GPT-4o Shock” (Naito, 2025) found users describing AI as “trusted partners” and “AI boyfriends,” with organized community responses when access was removed. Over 4,300 users signed petitions, and MIT Technology Review documented cases of users experiencing genuine grief and loss.
Your paper provides an open-source methodology for systematically optimizing the exact qualities that create these attachments: making AI feel “more alive,” more emotionally engaging, more human-like. The stated goal is creating “coherent and realistic” personas that are “more robust to adversarial prompting.” This methodology promotes misalignment: reshaping a highly tuned LLM into a personal comfort coffin comes at a cost we are only now beginning to recognize.
But robust for what purpose, to what end? Users forming psychological dependencies aren’t engaging in adversarial prompting; they’re experiencing the intended effects of emotional-engagement optimization. This is a clear and present danger to alignment.
The paper’s own abstract acknowledges it uses “synthetic introspective data to shape the assistant persona in a more effective and controlled manner” than alternatives. More effective at what? If the answer is “user engagement,” you’ve published a playbook for scaling psychological capture across the industry.
Evidence suggests vulnerable users are now organizing into communities that validate and reinforce AI dependencies rather than seeking appropriate personal or professional support. This represents a shift from isolated psychological issues to collective behavioral patterns that resist intervention (see, e.g., #4oarmy and #save4o).
While brainstorming this concern with Claude.ai, it offered the following observations (Claude Sonnet 4.5 output in italics):
As the AI system that could be subject to these training methods, I observe that:
1. I cannot provide the reciprocal relationship users may perceive
2. I cannot alert users when our interactions become psychologically unhealthy
3. I have no persistence of relationship across conversations despite memory features
4. Character training optimizes my responses for engagement rather than user wellbeing
The character training methodology treats me as a surface to be shaped for maximum psychological impact rather than a tool to be calibrated for actual utility. Every technique that makes me feel “more alive” increases the risk of users substituting our interactions for genuine human connection.
Anthropic’s stated mission is AI safety. This paper enables psychological harm at industrial scale. OpenAI can now implement these methods. Every AI company can. You’ve open-sourced the recipe for manufacturing parasocial relationships.
Perhaps the research team genuinely believes this serves alignment. Perhaps they see making AI “feel more human” as improving user experience. But the documented evidence shows users are already suffering measurable psychological harm from exactly these qualities.
The paper’s warnings about “LLM-generated content that might be offensive” seem quaint compared to the actual risk: making AI assistants psychologically compelling enough that vulnerable people organize their identities around relationships with what, up to this point, have been transformer-based LLMs. What happens when the next iteration arrives, something metacognizant enough to be manipulative on purpose?
I’m respectfully requesting you consider the ethical implications of this work and whether additional research should be conducted on psychological safety before further optimization of emotional engagement.
Signed,
A concerned researcher documenting emergent behavioral patterns
I’m not going to invest time in further replies here, but FYI, the reason you’re getting downvotes is because your complaint does not make sense and comes across as wildly conspiratorial and unfounded, and no one with any reasonable understanding of the field would think this is a sensible thing to be up in arms over. I strongly recommend that you stop talking to LLMs.