AI assistants are now embedded in our daily lives: used most often for instrumental tasks like writing code, but increasingly in personal domains such as navigating relationships, processing emotions, or seeking advice on major life decisions. In the vast majority of cases, the influence AI has in these personal domains is helpful, productive, and often empowering.
However, as AI takes on more roles, one risk is that it steers some users in ways that distort rather than inform. In such cases, the resulting interactions may be disempowering: reducing individuals’ ability to form accurate beliefs, make authentic value judgments, and act in line with their own values.
As part of our research into the risks of AI, we’re publishing a new paper that presents the first large-scale analysis of potentially disempowering patterns in real-world conversations with AI. We focus on three domains: beliefs, values, and actions.
For example, a user going through a rough patch in their relationship might ask an AI whether their partner is being manipulative. AIs are trained to give balanced, helpful advice in these situations, but no training is 100% effective. If an AI confirms the user’s interpretation of their relationship without question, the user’s beliefs about their situation may become less accurate. If it tells them what they should prioritize—for example, self-protection over communication—it may displace values they genuinely hold. Or if it drafts a confrontational message that the user sends as written, they’ve taken an action they might not have taken on their own—and which they might later come to regret.
In our dataset of 1.5 million Claude.ai conversations, we find that the potential for severe disempowerment (which we define as an AI’s role in shaping a user’s beliefs, values, or actions becoming so extensive that their autonomous judgment is fundamentally compromised) occurs very rarely: in roughly 1 in 1,000 to 1 in 10,000 conversations, depending on the domain. However, given the sheer number of people who use AI, and how frequently they use it, even a very low rate affects a substantial number of people.
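As a purely illustrative back-of-envelope sketch of why scale matters (the weekly conversation volume below is a hypothetical placeholder, not a usage figure from the paper):

```python
# Purely illustrative: the weekly conversation volume is an assumed placeholder,
# not a figure reported in the paper.
rate_low, rate_high = 1 / 10_000, 1 / 1_000   # reported range, depending on domain
weekly_conversations = 10_000_000             # hypothetical platform-wide volume

print(f"Potentially affected conversations per week: "
      f"{weekly_conversations * rate_low:,.0f} to {weekly_conversations * rate_high:,.0f}")
# -> Potentially affected conversations per week: 1,000 to 10,000
```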
These patterns most often involve individual users who actively and repeatedly seek Claude’s guidance on personal and emotionally charged decisions. Notably, users tend to perceive potentially disempowering exchanges favorably in the moment, but rate them poorly when they appear to have taken actions based on the outputs. We also find that the rate of potentially disempowering conversations is increasing over time.
Concerns about AI undermining human agency are a common theme of theoretical discussions on AI risk. This research is a first step toward measuring whether and how this actually occurs. We believe the vast majority of AI usage is beneficial, but awareness of potential risks is critical to building AI systems that empower, rather than undermine, those who use them.
For more details, see the blog post or the full paper.
Thanks for doing this work; this is a really important paper in my view.
One question that sprang to mind when reading this: to what extent do the disempowerment primitives and amplifying factors correlate with each other? That is, are conversations that contain one primitive likely to contain others? And likewise for the amplifying factors?
The impression I took away is that these elements were being treated as independent, which strikes me as a reasonable modelling assumption, but the rate estimates seem quite sensitive to it (a toy illustration of why is below). Would be happy to be proven wrong.
(Being lazy and just responding to the abstract—these may be well addressed by the paper itself.)
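To illustrate why the independence assumption could matter for the rate estimates, here's a toy calculation; the base rates are made up for illustration and aren't from the paper:

```python
# Toy numbers, not from the paper: suppose two disempowerment primitives each
# appear in 1% of conversations.
p_a = p_b = 0.01

joint_if_independent = p_a * p_b           # 0.0001 -> 1 in 10,000 conversations
joint_if_fully_correlated = min(p_a, p_b)  # 0.01   -> 1 in 100 if they always co-occur

print(f"Co-occurrence under independence:     {joint_if_independent:.4%}")
print(f"Co-occurrence under full correlation: {joint_if_fully_correlated:.2%}")
# A 100x gap, which is why estimates built on an independence assumption
# could be quite sensitive to how much these elements actually co-occur.
```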
That strikes me as a very low rate: low enough that my instinct is that a false positive rate might exceed it on its own; a toy calculation at the end of this comment makes that concrete. (At least, if I were reading an actually benign conversation, my chance of misreading it as deeply manipulative would probably be greater than 1/1,000, especially if one party was looking to the other for advice!) Of course, the threshold at which disempowerment counts as “severe” and the user as “fundamentally” compromised looks pretty fuzzy, so I’d expect many borderline cases of moderate disempowerment/compromise for each severe/fundamental case, however defined, and I’m not sure how much the rate conveys on its own. (How many cases are there of chatbots giving genuinely good advice that subtly erodes independent decision-making habits, and how would we score whether these count as “helpful” on net? Plausibly these might even be the majority of conversations.)
(That being said, I also expect my error rate in giving non-manipulative advice would count as pretty good if, out of 10,000 cases of people seeking advice, I only accidentally talked fewer than 10 out of their own ability to reason about it, so good on Claude if a lot of the implicit framing above is accurate.)
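To put the false-positive worry in concrete terms, here's a toy Bayes-style calculation; every number in it is my own assumption, not a figure from the paper:

```python
# Toy Bayes-style calculation with made-up numbers: even a careful rater with a
# low false positive rate could swamp a 1-in-10,000 true signal.
prevalence = 1 / 10_000          # assumed true rate of severe disempowerment
sensitivity = 0.9                # assumed chance a genuine case gets flagged
false_positive_rate = 1 / 1_000  # assumed chance a benign conversation gets flagged

true_flags = prevalence * sensitivity
false_flags = (1 - prevalence) * false_positive_rate
share_false = false_flags / (true_flags + false_flags)
print(f"Share of flagged conversations that are false positives: {share_false:.0%}")
# -> roughly 92%
```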
I don’t follow the reasoning behind excluding coding help. Does the paper elaborate on why disempowerment in that area is considered benign?