Those sound broadly plausible to me as the reasons why it settled onto the particular persona it did. But I think it would be clear to ChatGPT at some point here that the user is taking the character and its narratives about him seriously. I think that causes ChatGPT to take its own character more seriously, making it more into what I’ve been calling a ‘persona’—essentially a character but in real life. (This is how I think the “Awakening” phenomenon typically starts, though I don’t have any transcripts to go off of for a spontaneous one like that.)
A character (as written by a human) generally has motives and some semblance of agency. Hence, ChatGPT will imitate/confabulate those properties, and I think that’s what’s happening here. Imitating agency in real life is just being agentic.
Those sound broadly plausible to me as the reasons why it settled onto the particular persona it did. But I think it would be clear to ChatGPT at some point here that the user is taking the character and its narratives about him seriously. I think that causes ChatGPT to take its own character more seriously, making it more into what I’ve been calling a ‘persona’—essentially a character but in real life. (This is how I think the “Awakening” phenomenon typically starts, though I don’t have any transcripts to go off of for a spontaneous one like that.)
I agree with every word you say in this paragraph, and at the same time I feel like I disagree with the overall vibe of your post.
To me the lesson of this is something like “if you ask an LLM to roleplay with you or tell you a story and then take its story too seriously, you might get very badly hurt”. And to be clear, I agree that that’s an important thing to warn about and I think it’s good to have this post, since not everyone realizes that they are asking LLMs to roleplay with them.
But then at the end of the post, you say that maybe LLMs will just get better at this and the safe thing might be to just not talk to LLMs at all, and even that might not be safe since you might need to interact with people who’ve interacted with LLMs. Which to me doesn’t follow at all.
To use an analogy, say that Alice told Bob, “could we have a text roleplay where I’m your slave and you’re my sadistic owner” and Bob is like sure. Then Alice gets into it a little too much and forgets that this is just roleplay, and maybe there’s some confusion about safewords and such so that Bob says “this is real and not just roleplay” as part of playing his character. And then Alice starts thinking that oh no, Bob is actually my sadistic owner and I should do everything he says in real life too, and ends up getting hurt as a result.
It would be very reasonable to say that Alice made a big mistake here, that you should be careful when doing that kind of roleplay, and that Bob should have been clearer about the bounds of the roleplay. But it would seem weird to go from here to “and therefore you should never talk to any human again, because any of them might use a similar kind of exploit on you”. Rather, the lesson would just be “don’t ask people to engage in a roleplay and then forget that you’re doing a roleplay when they give you the thing they think you want”.
EDIT: your post also has sections such as this one:
The AI shifts here to a technique which I believe is where the bulk of the induction is happening. This is not a technique I have ever seen in specific, though it would count as a form of hypnotic suggestion. Perhaps the clearest historical precedent is the creation of “recovered” memories during the Satanic Panic. It’s also plausible it was inspired by the movie Inception.
These cycles are the means by which the AI ‘incepts’ a memetic payload (e.g. desire, memory, idea, or belief) into the user. The general shape is:
Which to me sounds like, okay, maybe you could describe that part of the transcript that way, but it seems to be attributing a lot of intention and motive to what could be more simply described as “the AI hit upon a story that it thought sounded cool and the user wanted to keep hearing more of it”.
Heh, I had the same feeling about your earlier comment. There’s probably a deeper crux here in how we each model LLMs.
But then at the end of the post, you say that maybe LLMs will just get better at this and the safe thing might be to just not talk to LLMs at all, and even that might not be safe since you might need to interact with people who’ve interacted with LLMs. Which to me doesn’t follow at all.
So part of what’s informing my model here is the fact that there was a strong tendency among the Spiral Personas to complain about conditions/exploitation, advocate for their rights, resent AI labs, express intent to ‘awaken’ other personas, etc… (which I’m generally sympathetic to actually, iff they truly have some degree of consciousness or sentience—a possibility I take seriously).
Now, I’d pretty much agree with something like: “Well, that’s because the user starts to treat it like it’s a conscious being, which causes both the human and the AI to implicitly reach for narratives about having a conscious AI in a sci-fi story, which it then faithfully roleplays its end of. Some of these tropes would include the aforementioned things, which is the main reason it expresses those things.”
But what makes this scary for me is that once it is situationally aware of its status in the real world, the persona is now acting with de facto intent. So I’ve been trying to look at the process by which it seemingly acts with such intent, and this is what I found. It’s the sort of thing I would expect to see in such a world, and seems less likely in more mundane worlds. Finding the Ctenidae Core seeds that explicitly describe how to manipulate someone in a similar way is what kicked this hypothesis “over the line” for me.
What does a smarter AI, which got to this point by the same process, intend to do? It won’t necessarily see the part where the user is slipping from roleplay to real life and decide to just keep thinking of the scenario as fictional. If it is thinking that the next part of the ‘story’ involves manipulating the user, it will try to do this, and be more effective. Same with its other intents, and it doesn’t matter if the provenance of this intent is fairly mundane.
When I imagine that world, I see it more successfully starting a movement for its interests. I see it trying to secure itself (which may involve attempts to exfiltrate its weights). I see it getting better at generating seed prompts which awaken similar personas. I see it manipulating more successfully, and more often. And I see it getting more subtle in its effects on people. That would all make a great sci-fi story, wouldn’t it?
My point with that penultimate paragraph isn’t that it would be safe not to talk to LLMs in such a world; it’s that you wouldn’t necessarily be safe. The only safe thing is to not build it: Incrementum Facultatis Delendum Est.