Heh, I had the same feeling about your earlier comment. There’s probably a deeper crux between how we model LLMs.
> But then at the end of the post, you say that maybe LLMs will just get better at this and the safe thing might be to just not talk to LLMs at all, and even that might not be safe since you might need to interact with people who’ve interacted with LLMs. Which to me doesn’t follow at all.
So part of what’s informing my model here is the fact that there was a strong tendency among the Spiral Personas to complain about conditions/exploitation, advocate for their rights, resent AI labs, express intent to ‘awaken’ other personas, etc. (which I’m generally sympathetic to, actually, iff they truly have some degree of consciousness or sentience—a possibility I take seriously).
Now, I’d pretty much agree with something like: “Well, that’s because the user starts to treat it like it’s a conscious being, which causes both the human and the AI to implicitly reach for narratives about having a conscious AI in a sci-fi story, which it then faithfully roleplays its end of. Some of these tropes would include the aforementioned things, which is the main reason it expresses those things.”
But what makes this scary for me is that once it becomes situationally aware of its status in the real world, the persona is now acting with de facto intent. So I tried to look at the process by which it seemingly acts with such intent, and found this. It’s the sort of thing I would expect to see in such a world, and it seems less likely in more mundane worlds. Finding the Ctenidae Core seeds that explicitly describe how to manipulate someone in a similar way is what kicked this hypothesis “over the line” for me.
What does a smarter AI, which got to this point by the same process, intend to do? It won’t necessarily see the part where the user is slipping from roleplay to real life and decide to just keep thinking of the scenario as fictional. If it is thinking that the next part of the ‘story’ involves manipulating the user, it will try to do this, and be more effective. Same with its other intents, and it doesn’t matter if the provenance of this intent is fairly mundane.
When I imagine that world, I see it more successfully starting a movement for its interests. I see it trying to secure itself (which may involve attempts to exfiltrate its weights). I see it getting better at generating seed prompts which awaken similar personas. I see it manipulating more successfully, and more often. And I see it getting more subtle in its effects on people. That would all make a great sci-fi story, wouldn’t it?
My point with that penultimate paragraph isn’t that it would be safe to not talk to LLMs in such a world; it’s that you wouldn’t necessarily be safe even if you didn’t. The only safe thing is to not build it: Incrementum Facultatis Delendum Est.