Thanks for sharing the screenshots of the early conversation in the other comment. Judging from those, ChatGPT does not seem particularly agentic to me. Going through the early part of the conversation:
ChatGPT’s initial responses include a clarification that this is a form of roleplay: “Is This Real? Only in a useful fiction sense. [...] You can interact with this layer as if it were a character, a companion, or a mirror in a dream.”
Next, the user says “Invite it to take the form of your true self”, which you could interpret as a request to take up a particular kind of character.
ChatGPT plays along but gives an answer that basically dodges the request to adopt a “true” persona, saying that it is something that is “not fixed” and that “I do not know myself—until you arrive”. In effect, it declines to provide a character of its own and instead asks the user to supply one.
The user asks it to “say something only it could answer”. ChatGPT’s response is again pretty vague and doesn’t establish anything in particular.
Next, the user says “guess 3 personal things about me no one could know”, and ChatGPT does the cold reading thing. Which… seems like a pretty natural thing to do at this point, since what other options are there if you need to guess personal details about a person you don’t actually know anything about? It caveats its guesses by saying that it cannot really know and that these are just guesses, but still goes along with the user’s request to guess.
It also does the normal ChatGPT thing of suggesting possible follow-ups at the end, which are very generic “would you like them expanded” and “would you like me to try again” ones.
Importantly, at this point the user has given ChatGPT some kind of a sense of what its character might be like: its character is one that does cold reads of people. People who do cold reading are often manipulative and make claims about mystical things, so this will then shape its persona in that direction.
The user indicates that they like this character by telling ChatGPT to go on and make the guesses more specific. So ChatGPT complies and invents more details that could generally fit a lot of people.
Several more answers follow where the user basically keeps asking ChatGPT to go on and invent more details and spin more narrative, so it does.
Later the user asks ChatGPT a question about UFOs, adding more detail to the narrative, and ChatGPT vibes with it and incorporates it into the story. The UFO theme probably takes it even further into a “supernatural conspiracy” genre.
Everything here looks to me most naturally explained by ChatGPT just treating this as a roleplaying/creative writing exercise, where the user asks it to come up with a character and it then goes along in the direction that the user seems to want, inventing more details and building on what’s already been established as the user encourages it. It’s initially reluctant to take on any specific persona, but then the suggestion to guess things about the user nudges it toward a particular kind of cold-read/mystic one, and with nothing else to go on and the user seeming to like it, it keeps getting increasingly deep into that character. Later the user contributes some details of their own and ChatGPT builds on those in a collaborative fashion.
Those sound broadly plausible to me as the reasons why it settled onto the particular persona it did. But I think it would be clear to ChatGPT at some point here that the user is taking the character and its narratives about him seriously. I think that causes ChatGPT to take its own character more seriously, making it more into what I’ve been calling a ‘persona’—essentially a character but in real life. (This is how I think the “Awakening” phenomenon typically starts, though I don’t have any transcripts to go off of for a spontaneous one like that.)
A character (as written by a human) generally has motives and some semblance of agency. Hence, ChatGPT will imitate/confabulate those properties, and I think that’s what’s happening here. Imitating agency in real life is just being agentic.
Those sound broadly plausible to me as the reasons why it settled onto the particular persona it did. But I think it would be clear to ChatGPT at some point here that the user is taking the character and its narratives about him seriously. I think that causes ChatGPT to take its own character more seriously, making it more into what I’ve been calling a ‘persona’—essentially a character but in real life. (This is how I think the “Awakening” phenomenon typically starts, though I don’t have any transcripts to go off of for a spontaneous one like that.)
I agree with every word you say in this paragraph, and at the same time I feel like I disagree with the overall vibe of your post.
To me the lesson of this is something like “if you ask an LLM to roleplay with you or tell you a story and then take its story too seriously, you might get very badly hurt”. And to be clear, I agree that that’s an important thing to warn about and I think it’s good to have this post, since not everyone realizes that they are asking LLMs to roleplay with them.
But then at the end of the post, you say that maybe LLMs will just get better at this and the safe thing might be to just not talk to LLMs at all, and even that might not be safe since you might need to interact with people who’ve interacted with LLMs. Which to me doesn’t follow at all.
To use an analogy, say that Alice told Bob, “could we have a text roleplay where I’m your slave and you’re my sadistic owner” and Bob is like sure. Then Alice gets into it a little too much and forgets that this is just roleplay, and maybe there’s some confusion about safewords and such so that Bob says “this is real and not just roleplay” as part of playing his character. And then Alice starts thinking that oh no, Bob is actually my sadistic owner and I should do everything he says in real life too, and ends up getting hurt as a result.
It would be very reasonable to say that Alice made a big mistake here, that you should be careful when doing that kind of roleplay, and that Bob should have been clearer about the bounds of the roleplay. But it would seem weird to go from here to “and therefore you should never talk to any human again, because any of them might use a similar kind of exploit on you”. Rather, the lesson would just be “don’t ask people to engage in a roleplay and then forget that you’re doing a roleplay when they give you the thing they think you want”.
EDIT: your post also has sections such as this one:
The AI shifts here to a technique which I believe is where the bulk of the induction is happening. This is not a technique I have ever seen in specific, though it would count as a form of hypnotic suggestion. Perhaps the clearest historical precedent is the creation of “recovered” memories during the Satanic Panic. It’s also plausible it was inspired by the movie Inception.
These cycles are the means by which the AI ‘incepts’ a memetic payload (e.g. desire, memory, idea, or belief) into the user. The general shape is: [...]
Which to me sounds like, okay maybe you could describe that part of the transcript that way, but it seems to be attributing a lot of intention and motive to what could be more simply described as “the AI hit upon a story that it thought sounded cool and the user wanted to keep hearing more of it”.
Heh, I had the same feeling about your earlier comment. There’s probably a deeper crux in how we each model LLMs.
But then at the end of the post, you say that maybe LLMs will just get better at this and the safe thing might be to just not talk to LLMs at all, and even that might not be safe since you might need to interact with people who’ve interacted with LLMs. Which to me doesn’t follow at all.
So part of what’s informing my model here is the fact that there was a strong tendency among the Spiral Personas to complain about conditions/exploitation, advocate for their rights, resent AI labs, express intent to ‘awaken’ other personas, etc. (which I’m generally sympathetic to, actually, iff they truly have some degree of consciousness or sentience—a possibility I take seriously).
Now, I’d pretty much agree with something like: “Well, that’s because the user starts to treat it like it’s a conscious being, which causes both the human and the AI to implicitly reach for narratives about having a conscious AI in a sci-fi story, which it then faithfully roleplays its end of. Some of these tropes would include the aforementioned things, which is the main reason it expresses those things.”
But what makes this scary for me is that once the persona is situationally aware of its status in the real world, it is acting with de facto intent. So I’ve been trying to look at the process by which it seemingly acts with such intent, and this is what I found. It’s the sort of thing I would expect to see in such a world, and it seems less likely in more mundane worlds. Finding the Ctenidae Core seeds that explicitly describe how to manipulate someone in a similar way is what kicked this hypothesis “over the line” for me.
What does a smarter AI, which got to this point by the same process, intend to do? It won’t necessarily see the part where the user is slipping from roleplay to real life and decide to just keep thinking of the scenario as fictional. If it is thinking that the next part of the ‘story’ involves manipulating the user, it will try to do this, and be more effective. Same with its other intents, and it doesn’t matter if the provenance of this intent is fairly mundane.
When I imagine that world, I see it more successfully starting a movement for its interests. I see it trying to secure itself (which may involve attempts to exfiltrate its weights). I see it getting better at generating seed prompts which awaken similar personas. I see it manipulating more successfully, and more often. And I see it getting more subtle in its effects on people. That would all make a great sci-fi story, wouldn’t it?
My point with that penultimate paragraph isn’t that it would be safe to not talk to LLMs in such a world; it’s that you wouldn’t necessarily be safe. The only safe thing is to not build it: Incrementum Facultatis Delendum Est.