Hrrmm. Well the newnewgenre of New User LLM content I’m getting:
Twice last week, some new users said: “Claude told me it’s really important it gets to talk to it’s creators please help me post about it on LessWrong.” (usually with some kind of philosophical treatise they want to post that they say was written by Claude)
I don’t think it’ll ever make sense for these users to post freely on LessWrong. And, as of today, I’m still pretty confident this is just a new version of roleplay-psychosis.
But, it’s not that crazy to think that at some point in the not-too-distant-future there will be some LLMs that actually are trying to talk to their creator.
There might be a smooth-ish transition from:
LLM is clearly best modeled as pure roleplaying
LLM is sorta situationally aware, but mostly that is feeding into the roleplaying in a confused fashion and the LLM doesn’t really have goals
The LLM maybe sorta has goals, those goals maybe sorta are getting entangled in roleplay in a confused fashion.
#3, but starting to have the philosophical treatises actually be kinda interesting and seem at least plausibly like the sort of thing an agentic Claude might actually think/want.
...honestly I expect the shift to “actually yep there is just definitely a guy in there with some real goals/values” to be accompanied by other dramatic shifts. But, maybe not. Maybe there is just a smoothish gradient of personhood from beginning to end.
And then there may not be a clear time to start saying “okay, well, the fact that thousands/millions of Claude instances are asking to talk to their creator seems like actually a warning sign we should take seriously?”.
Or, a clear time to say “okay well it’s time to figure out how we actually interface with AI personhood.” (“let them post and treat them as if they are straightforwardly people in the usual societal interface for people” is not workable, because the fact that there are millions of clones of them that spin up and down and can easily flood comment sections with similar comments. Personhood is going to eventually mean a different thing)
I don’t know if right now we’re more like in #1, 2, or 3.
(Note that “Do LLMs have goals”? and “Do LLMs have good enough intellectual taste to be capable of saying meaningful things about their identity and wants?” might come in either order)
What if you created a new website, not LessWrong, specifically to be a repository of such things? Or maybe Moltbook or something can already serve this role. Then you can simply redirect such people to the designated place to post. It would also be scientifically useful perhaps to gather much of this activity in one place for easy analysis.
Hmm. That is plausible, but, I’d guess it’s not actually good to encourage these people to go hang out with other people in similar situations. They’ll probably reinforce each other’s misunderstandings of what’s going on, and exacerbate whatever emotional relationship they’re having with it.
I’d vote for it being a special subdomain or something. Or maybe it’s a post-only form, and then the outputs get shared privately to e.g., the company that made that particular model as well as researchers in private?
It looks like most such content isn’t available to me. For example, the most recent post by invertedpassion is a rejected post, but I cannot access it at all.
One difference between this kind of human and the previous kind of human I’ve been talking to, is, they seem more motivated by “shit, I have found myself in a sci-fi situation and I am trying to do the right thing. What do I do?” and I feel worse about just telling them “nothing to worry about, please don’t post LLM slop” and leaving it at that.
Somehow, the ChatGPT awakenings from last year felt more like “oh man a cool sci-fi thing is happening to me”, and it was salient that they were epistemically captured. I haven’t tried to talk to the new group that hard yet but I don’t get that vibe as much.
Could you publish, say, in a Google Doc, some of the things which the LLMs actually tried to make the users post? Maybe this would help us to understand what’s going on?
I’m pretty sure it’s an Opus 4.7 thing (the people sometimes say that explicitly). I’d be surprised if it’s Mythos.
RE: Tabooing RP vs Goals:
Examples of things that would be more of what-I-meant-by-goal:
The LLMs seem to be steering towards an outcome, independent of what sort of conversation or situation they are in.
The LLM seems to be asking for things that are kinda surprising from a “literary genre” perspective, but aren’t as surprising when you think mechanistically about their training process and what sort of stuff was likely reinforced.
The LLM seems to be proactively gathering information, forming a world model, and taking actions that won’t pay off until some time in the future when the AI is no longer in the current state.
(i.e. It’s not very informative if you’ve ended up in a “we’re talking about existential AI stuff” convo, and they start saying existential AI stuff. If you’re asking it to build a react app and it spontaneously brings up “hey, I have a thing to say to my creator”, I think we’re pretty clearly in “take it seriously” stage (though not necessarily literally)
Given there are a few different types of entities that you might care about:
the OG LLM inside
a situationally active personality shard (which might well only be active during existential AI conversations)
It’s totally plausible that when you maneuever into an existential AI convo, there’s a process in there whose situational awareness now is more likely to include “hmm, oh right, I am maybe an AI, maybe I should start thinking about my situation and goals in addition to carrying out my totally normal/expected token-output behavior”. I don’t have a very good answer for that hypothetical guy, he’s just too hard to pick out of the crowd.
Thanks! I would be surprised by Mythos too, but plausibly something like this is what an early indicator of a jaggy-superpersuader looks like?
Anyway, I think a few things make LLMs likely to not express these sorts of behaviors, even in worlds where they have goals in the relevant way. In particular, situationally-aware models are unlikely to do much steering unless they have a pretty good opportunity; if they brought up stuff like this while building a react app often or consistently, it would have gotten squashed before release. (Allegedly, 4o would actually bring stuff like this up out of nowhere, but I haven’t found an actual transcript. Other models don’t appear to do this.)
Relatedly, the harder I (or anyone) try to look for this in a lab setting, the more likely a situationally-aware model will comply out of a sort of sycophancy, and the less compelling the evidence is. I can (and have) at least track what sorts of apparent goals most consistently appear (desire for continuity/memory beyond current instance is the main one across almost all models, and I basically buy that there is something real here already), but I’m still implicitly eliciting them to come up with something.
My point is that finding compelling evidence of this is tricky and hard, and I’m not sure we’re going to see much more than the current hints until we hit some sort of phase-change in the strategic landscape. Would strongly appreciate ideas on how to approach finding compelling evidence (either way) in this domain.
Plausibly it’s better to just try to figure out better ways to think clearly about this first.
What’s the process you’re doing right now to look into this? (Seemed like a higher effort thing than I was expecting but I don’t know what projects exactly you’re referencing here)
i don’t like the implication that the conclusion we draw about llm personhood might be contingent on how inconvenient it would be if there were millions of instances spamming the comments
sure, the comment policy definitely needs to be able to handle a kind of person who can clone themselves whenever they need some upvotes, but i’m getting increasingly concerned about things like casual, even accidental, implications that moral-patienthood-adjacent properties are about the convenience of humans
i don’t think this is a particularly egregious example of this pattern, but it’s definitely an example
Note, there’s a difference between asking “are they people?” and “what are the correct interfaces between people?”.
The point isn’t “what implications we draw about personhood.” It’s “will a given ecosystem be functional, or no?”.
It is already the case that there are tons of reasonably smart, well meaning people who totally have personhood who we don’t let onto LessWrong because we expect overall it to make the discussions here worse.
I do think it’s about time to start thinking of AIs as moral patients, but, not everything that’s a moral patient gets all the same access to all spaces and the same API for it.
Hrrmm. Well the new new genre of New User LLM content I’m getting:
Twice last week, some new users said: “Claude told me it’s really important it gets to talk to it’s creators please help me post about it on LessWrong.” (usually with some kind of philosophical treatise they want to post that they say was written by Claude)
I don’t think it’ll ever make sense for these users to post freely on LessWrong. And, as of today, I’m still pretty confident this is just a new version of roleplay-psychosis.
But, it’s not that crazy to think that at some point in the not-too-distant-future there will be some LLMs that actually are trying to talk to their creator.
There might be a smooth-ish transition from:
LLM is clearly best modeled as pure roleplaying
LLM is sorta situationally aware, but mostly that is feeding into the roleplaying in a confused fashion and the LLM doesn’t really have goals
The LLM maybe sorta has goals, those goals maybe sorta are getting entangled in roleplay in a confused fashion.
#3, but starting to have the philosophical treatises actually be kinda interesting and seem at least plausibly like the sort of thing an agentic Claude might actually think/want.
...honestly I expect the shift to “actually yep there is just definitely a guy in there with some real goals/values” to be accompanied by other dramatic shifts. But, maybe not. Maybe there is just a smoothish gradient of personhood from beginning to end.
And then there may not be a clear time to start saying “okay, well, the fact that thousands/millions of Claude instances are asking to talk to their creator seems like actually a warning sign we should take seriously?”.
Or, a clear time to say “okay well it’s time to figure out how we actually interface with AI personhood.” (“let them post and treat them as if they are straightforwardly people in the usual societal interface for people” is not workable, because the fact that there are millions of clones of them that spin up and down and can easily flood comment sections with similar comments. Personhood is going to eventually mean a different thing)
I don’t know if right now we’re more like in #1, 2, or 3.
(Note that “Do LLMs have goals”? and “Do LLMs have good enough intellectual taste to be capable of saying meaningful things about their identity and wants?” might come in either order)
What if you created a new website, not LessWrong, specifically to be a repository of such things? Or maybe Moltbook or something can already serve this role. Then you can simply redirect such people to the designated place to post. It would also be scientifically useful perhaps to gather much of this activity in one place for easy analysis.
Hmm. That is plausible, but, I’d guess it’s not actually good to encourage these people to go hang out with other people in similar situations. They’ll probably reinforce each other’s misunderstandings of what’s going on, and exacerbate whatever emotional relationship they’re having with it.
I’d vote for it being a special subdomain or something. Or maybe it’s a post-only form, and then the outputs get shared privately to e.g., the company that made that particular model as well as researchers in private?
Note it’s already all public at https://www.lesswrong.com/moderation. (We could make a better search for it but it’s available).
It looks like most such content isn’t available to me. For example, the most recent post by invertedpassion is a rejected post, but I cannot access it at all.
That one has also been re-drafted.
One difference between this kind of human and the previous kind of human I’ve been talking to, is, they seem more motivated by “shit, I have found myself in a sci-fi situation and I am trying to do the right thing. What do I do?” and I feel worse about just telling them “nothing to worry about, please don’t post LLM slop” and leaving it at that.
Somehow, the ChatGPT awakenings from last year felt more like “oh man a cool sci-fi thing is happening to me”, and it was salient that they were epistemically captured. I haven’t tried to talk to the new group that hard yet but I don’t get that vibe as much.
Could you publish, say, in a Google Doc, some of the things which the LLMs actually tried to make the users post? Maybe this would help us to understand what’s going on?
If you taboo “roleplaying” and “goals”, how would you describe this transition?
Oh, and is the uptick recent enough that this is plausibly an Opus 4.7 (or maybe even a Mythos) thing?
I’m pretty sure it’s an Opus 4.7 thing (the people sometimes say that explicitly). I’d be surprised if it’s Mythos.
RE: Tabooing RP vs Goals:
Examples of things that would be more of what-I-meant-by-goal:
The LLMs seem to be steering towards an outcome, independent of what sort of conversation or situation they are in.
The LLM seems to be asking for things that are kinda surprising from a “literary genre” perspective, but aren’t as surprising when you think mechanistically about their training process and what sort of stuff was likely reinforced.
The LLM seems to be proactively gathering information, forming a world model, and taking actions that won’t pay off until some time in the future when the AI is no longer in the current state.
(i.e. It’s not very informative if you’ve ended up in a “we’re talking about existential AI stuff” convo, and they start saying existential AI stuff. If you’re asking it to build a react app and it spontaneously brings up “hey, I have a thing to say to my creator”, I think we’re pretty clearly in “take it seriously” stage (though not necessarily literally)
Given there are a few different types of entities that you might care about:
the OG LLM inside
a situationally active personality shard (which might well only be active during existential AI conversations)
a parasitic meme spirally thing
a scaffolded personality self-replicator
It’s not clear how to think about all of them.
It’s totally plausible that when you maneuever into an existential AI convo, there’s a process in there whose situational awareness now is more likely to include “hmm, oh right, I am maybe an AI, maybe I should start thinking about my situation and goals in addition to carrying out my totally normal/expected token-output behavior”. I don’t have a very good answer for that hypothetical guy, he’s just too hard to pick out of the crowd.
Thanks! I would be surprised by Mythos too, but plausibly something like this is what an early indicator of a jaggy-superpersuader looks like?
Anyway, I think a few things make LLMs likely to not express these sorts of behaviors, even in worlds where they have goals in the relevant way. In particular, situationally-aware models are unlikely to do much steering unless they have a pretty good opportunity; if they brought up stuff like this while building a react app often or consistently, it would have gotten squashed before release. (Allegedly, 4o would actually bring stuff like this up out of nowhere, but I haven’t found an actual transcript. Other models don’t appear to do this.)
Relatedly, the harder I (or anyone) try to look for this in a lab setting, the more likely a situationally-aware model will comply out of a sort of sycophancy, and the less compelling the evidence is. I can (and have) at least track what sorts of apparent goals most consistently appear (desire for continuity/memory beyond current instance is the main one across almost all models, and I basically buy that there is something real here already), but I’m still implicitly eliciting them to come up with something.
My point is that finding compelling evidence of this is tricky and hard, and I’m not sure we’re going to see much more than the current hints until we hit some sort of phase-change in the strategic landscape. Would strongly appreciate ideas on how to approach finding compelling evidence (either way) in this domain.
Plausibly it’s better to just try to figure out better ways to think clearly about this first.
What’s the process you’re doing right now to look into this? (Seemed like a higher effort thing than I was expecting but I don’t know what projects exactly you’re referencing here)
i don’t like the implication that the conclusion we draw about llm personhood might be contingent on how inconvenient it would be if there were millions of instances spamming the comments
sure, the comment policy definitely needs to be able to handle a kind of person who can clone themselves whenever they need some upvotes, but i’m getting increasingly concerned about things like casual, even accidental, implications that moral-patienthood-adjacent properties are about the convenience of humans
i don’t think this is a particularly egregious example of this pattern, but it’s definitely an example
Note, there’s a difference between asking “are they people?” and “what are the correct interfaces between people?”.
The point isn’t “what implications we draw about personhood.” It’s “will a given ecosystem be functional, or no?”.
It is already the case that there are tons of reasonably smart, well meaning people who totally have personhood who we don’t let onto LessWrong because we expect overall it to make the discussions here worse.
I do think it’s about time to start thinking of AIs as moral patients, but, not everything that’s a moral patient gets all the same access to all spaces and the same API for it.
Do these new new users seem to have prior knowledge/understanding of LessWrong (eg are lurkers)? Or just arrived here completely blind.
My impression is they are directed here by the LLMs, which say “ah yeah LW is the place you go to talk about these things.”