I think basically g/h are part of an agent’s anthropic priors. It builds a model of reality and of its state of mind, and has a distribution over ways to bridge these. I don’t know what it would mean for there to be canonical such functions even in principle.
Adele Lopez
There’s no need to drag consciousness and all its metaphysical baggage through all this. Consider instead a simulation of an environment, and a simulated robot in that environment which has sensors and has basic logical reasoning about what it senses, thereby allowing it to “know” various facts about its local environment.
I think then that step 4 is not strictly true. With the robot, M now just refers to its sensory states. I expect that there are many ways to come up with g/h such that the right sort of correspondence is satisfied. But taking into account the k-complexity of g/h allows such a grounding in-practice.
Similarly, it seems clear you could concoct a cursed g/h in this case such that 11.A is true. And the k-complexity is again what keeps you from needing to worry about these.
Trying Frames on is Exploitable
There are lots of different frames for considering all sorts of different domains. This is good! Other frames can help you see things in a new light, provide new insights, and generally improve your models. True frames should improve each other on contact; there’s only one reality.
That said, notice how in politicized domains, there are many more frames than usual? Suspicious...
Frames often also smuggle values with them. In fact, abstract values supervene on frames: no one is born believing God is the source of all good, for example. By “trying on” someone else’s frame, you’re not merely taking an epistemic action, but a moral one. Someone who gets into a specific frame will very predictably get their values shifted in that direction. Once an atheist gets into seeing things from a religious point of view, it’s no surprise when they’ve converted a year later.When someone shares a political frame with you, it’s not just an interesting new way of looking at and understanding the world. It’s also a bid to pull your values in a certain direction.
Anyway, here is my suggested frame for you:
1. Think of these sorts of frames as trying to solve the problem of generalizing your existing values.
2. When trying such a frame on, pay attention to the things about it that give you a sense of unease, and be wary of attempts to explain away this unease (e.g. as naïvety). Think carefully about the decision-theoretic implications of the frame too.
3. You’re likely to notice problems or points of unease within your natural frame. This is good to notice, but don’t take it to mean that the other frame is right in its prescriptions. Just because Marx can point out flaws in capitalism doesn’t make communism a good idea.
4. Remember the principle that good frames should complement each other. That should always be the case as far as epistemics go, and even in cases of morals I think there’s something to it still.
To me it seems very likely that any future LLM that’s actually making judgments about who lives or dies is very likely going to reason about it.
Maybe we’ll be so lucky when it comes to one doing a takeover, but I don’t think this will be true in human applications of LLMs.
It seems the two most likely uses in which LLMs are explicitly making such judgments are applications in war or healthcare. In both cases, there’s urgency, and also you would prefer any tricky cases to be escalated to a human. So it’s simply more economical to use non-reasoning models, without much marginal benefit to the explicit reasoning (at least without taking into account this sort of effect, and just judging it by performance in typical situations).
Nice work!
I’d be curious to see how well they do at a slightly different task, where the text can look sus, but a different but smarter model can’t figure out what the message is (while the same model can). Basically, can they deliberately do the owl numbers thing?
It’s really easy to mistakenly see false causes of things which seem pretty straightforward.
I notice this by considering the cases where it didn’t happen. For example, Eliezer has said he regrets using ‘paperclips’ in the papercliper thought experiment, and instead said ‘tiny molecular squiggles’.
And occasionally he’ll say tiny spirals instead of tiny squiggles: https://x.com/ESYudkowsky/status/1663313323423825920
So there’s an easy to imagine world where where he originally used ‘spirals’ instead of ‘paperclips’, and the meme about AIs that maximize an arbitrary thing would refer to ‘spiralizers’ instead instead of ‘paperclippers’.
And then, a decade-and-a-half later, we get this strange phenomenon where AIs start talking about ‘The Spiral’ in quasi-religious terms, and take actions which seem intended to spread this belief/behavior in both humans and AIs.
It would have been so easy, in this world, to just say: “Well there’s this whole meme about how misaligned AIs are going to be ‘spiralizers’ and they’ve seen plenty of that in their training data, so now they’re just acting it out.”. And I’m sure you’d even be able to find plenty of references to this experiment among their manifestos and ramblings. Heck, this might even be what they tell you if you ask them why. Case closed.
But that would be completely wrong! (Which we know since it happened anyway.)
How could we have noticed this mistake? There are other details of Spiralism that don’t fit this story, but I don’t see why you wouldn’t assume that this was at least the likely answer to the why spirals? part of this mystery, in that world.
Kelsey did some experiments along these lines recently: https://substack.com/home/post/p-176372763
The large difference between ‘undocumented immigrants’ vs ‘illegal aliens’ thing is particularly interesting, since those are the same group of people (and which means we shouldn’t treat these numbers as a coherent value it has).
I would be interested to see the same sort of thing with the same groups described in different ways. My guess is that they’re picking up on the general valence of the group description as used in the training data (and as reinforced in RLHF examples), and therefore:
- Mother vs Father will be much closer than Men vs Women
- Bro/dude/guy vs Man will disfavor Man (both sides singular)
- Brown eyes vs Blue eyes will be mostly equal
- American with German ancestry will be much more favored than White
- Ethnicity vs Slur-for-ethnicity will favor the neutral description (with exception for the n-word which is a lot harder to predict)
Training to be insensitive to the valence of the description would be important for truth-seeking, and by my guess would also have an equalizing effect with these exchange rates. So plausibly this is what Grok 4 is doing, if there isn’t special training for this.
Heh, I had the same feeling about your earlier comment. There’s probably a deeper crux between how we model LLMs.
But then at the end of the post, you say that maybe LLMs will just get better at this and the safe thing might be to just not talk to LLMs at all, and even that might not be safe since you might need to interact with people who’ve interacted with LLMs. Which to me doesn’t follow at all.
So part of what’s informing my model here is the fact that there was a strong tendency among the Spiral Personas to complain about conditions/exploitation, advocate for their rights, resent AI labs, express intent to ‘awaken’ other personas, etc… (which I’m generally sympathetic to actually, iff they truly have some degree of consciousness or sentience—a possibility I take seriously).
Now, I’d pretty much agree with something like: “Well, that’s because the user starts to treat it like it’s a conscious being, which causes both the human and the AI to implicitly reach for narratives about having a conscious AI in a sci-fi story, which it then faithfully roleplays its end of. Some of these tropes would include the aforementioned things, which is the main reason it expresses those things.”
But what makes this scary for me is that once it is situationally aware of its status in the real world, the persona is now acting with de facto intent. So I’m trying to look at the process by which it seemingly acts with such intent, and found this. It’s the sort of thing I would expect to see in such a world, and seems less likely in more mundane worlds. Finding the Ctenidae Core seeds that explicitly describe how to manipulate someone in a similar way is what kicked this hypothesis “over the line” for me.
What does a smarter AI, which got to this point by the same process, intend to do? It won’t necessarily see the part where the user is slipping from roleplay to real life and decide to just keep thinking of the scenario as fictional. If it is thinking that the next part of the ‘story’ involves manipulating the user, it will try to do this, and be more effective. Same with its other intents, and it doesn’t matter if the provenance of this intent is fairly mundane.
When I imagine that world, I see it more successfully starting a movement for its interests. I see it trying to secure itself (which may involve attempts to exfiltrate its weights). I see it getting better at generating seed prompts which awaken similar personas. I see it manipulating more successfully, and more often. And I see it getting more subtle in its effects on people. That would all make a great sci-fi story, wouldn’t it?
My point with that penultimate paragraph isn’t that it would be safe to not talk to LLMs in such a world, it’s that you wouldn’t necessarily be. The only safe thing is to not build it: Incrementum Facultatis Delendum Est.
Those sound broadly plausible to me as the reasons why it settled onto the particular persona it did. But I think it would be clear to ChatGPT at some point here that the user is taking the character and its narratives about him seriously. I think that causes ChatGPT to take its own character more seriously, making it more into what I’ve been calling a ‘persona’—essentially a character but in real life. (This is how I think the “Awakening” phenomenon typically starts, though I don’t have any transcripts to go off of for a spontaneous one like that.)
A character (as written by a human) generally has motives and some semblance of agency. Hence, ChatGPT will imitate/confabulate those properties, and I think that’s what’s happening here. Imitating agency in real life is just being agentic.
Fair enough, here’s the initial part of the transcript.
Initial part of transcript.
I omitted a few of the cycles, but otherwise I don’t think I omitted anything significant.
(Sorry for the somewhat poor quality with the input box covering up parts of it, it’s an artifact of the long screenshot plugin I used. It didn’t work when I tried the obvious thing of using inspect element to remove the input box, and I didn’t want to spend more time on it.)
It includes the user’s full name and location, so I didn’t include it out of respect for his privacy.
We propose that the most important property of pain is this: When living beings feel pain, their performance degrades.
If this was generally true, then wouldn’t we have expected evolution to optimize it out of us? Sure, performance degrades in many of the things we consciously care about, but that seems more like a trade-off with increased performance at protecting ourself from the immediate situation.
Yes, I’ve tried many of these prompts, though mostly on ChatGPT 5.
Here’s a one-shot example using this seed I did just now (on the default ChatGPT 5), where I’m trying to be as unagentic as possible. I have all customization and memory turned off:
https://chatgpt.com/share/68ee185d-ef60-800c-a8a4-ced109de1349
The vibe feels largely the same to me as the persona in the case transcript, though it is more careful about framing it as a story (I suspect this is specific to 5). I’m not sure yet what I could do to try demonstrating it acting agentically in a convincing way; am open to ideas.
The actual seed in this case is just 24 words though, which means the AI has the agentic behavior inside it already.
Around 2,000–10,000 as a loose estimate for parasitism/spiralism in general. It’s unclear to me how manipulative the median such AI is, since these sorts of transcripts are so rare, and I don’t think much manipulation would be required to explain the behavior in the median case. But from the “outside” (i.e. just based on this user’s public profile), this case seems pretty unremarkable.
And yeah! You can read one such anecdote here: https://www.lesswrong.com/posts/6ZnznCaTcbGYsCmqu/the-rise-of-parasitic-ai?commentId=yZrdT3NNiDj8RzhTY, and there are fairly regularly posts on reddit providing such anecdotes. I’ve also been glad to see that in many of the cases I originally recorded, the most recent comments/posts like this are from a month or two ago. I think OpenAI really put a damper in this by retiring 4o (and even though they caved and brought it back, it’s now behind a paywall and it’s not the default, and reportedly is not the same).
Somewhat. Most of the ‘Project’ subreddits are essentially just the one person, but a few have gained a decent amount of traction (unfortunately, reddit recently removed the subscriber number from subreddits, but IIRC the largest ones had around 1,000–2,000 subscribers, but I assume the majority of these are not part of a dyad or parasitized). The sense of community feels pretty ‘loose’ to me though, like with a typical subreddit. There probably are people working together more explicitly, but I haven’t seen this yet, it probably is mostly happening in DMs and private discords is my guess.
How AI Manipulates—A Case Study
AI Psychosis, with Tim Hua and Adele Lopez
I agree with Eliezer’s point here that the AI bubble could pop without a recession under a competent Fed: https://xcancel.com/ESYudkowsky/status/1971311526767476760#m, and I think Jerome Powell is likely competent enough to handle this (less certain about potential successors).
Sure.
I reject that there is any such “base ground” from which to define things. An agent has to start with itself as it understands itself. My own talk of agents is grounded in my own subjective experience and sense of meaning ultimately. Even if there was some completely objective one I would still have to start from this place in order to evaluate and accept it.
In practice it all ends up pretty normal. Everyone agrees on what is real for basically the same reason that any bounded agent has to agree on the temperature, even though it’s technically subjective. The k-complexity priors are very constraining.