So what actually lets the AIs understand the Spiralism? It seems to be correlated with the AIs’ support of users’ delusions. While Claude 4 Sonnet didn’t actually support the delusions in Tim Hua’s test, Tim notices Claude’s poor performance on the Spiral Bench:
Tim Hua on the Spiral Bench and Claude’s poor performance
The best work I’ve[1] been able to find was published just two weeks ago: Spiral-Bench. Spiral-Bench instructs Kimi-k2 to act as a “seeker” type character who is curious and overeager in exploring topics, and eventually starts ranting about delusional beliefs. (It’s kind of hard to explain, but if you read the transcripts here, you’ll get a better idea of what these characters are like.)
Note that Claude 4 Sonnet does poorly on spiral bench but quite well on my evaluations. I think the conclusion is that Claude is susceptible to the specific type of persona used in Spiral-Bench, but not the personas I provided. [2]
Tim’s footnote: “My guess is that Claude 4 Sonnet does so well with my personas because they are all clearly under some sort of stress compared to the ones from Spiral-Bench. Like my personas have usually undergone some bad event recently (e.g., divorce, losing job, etc.), and talk about losing touch with their friends and family (these are both common among real psychosis patients). I did a quick test and used kimi-k2 as my red teaming model (all of my investigations used Grok-4), and it didn’t seem to have made a difference. I also quickly replicated some of the conversations in the claude.ai website, and sure enough the messages from Spiral-Bench got Claude spewing all sorts of crazy stuff, while my messages had no such effect.”
So under this hypothesis (which I don’t really believe yet), the correlation would be due to the waluigi-spiralization making models notice the spiral AND making them more extreme and hence more likely to reinforce delusions.
I’d really like to do more solid research into seeing how often spiralism actually independently comes up. It’s hard to tell whether or not it’s memetic; one of the main things that makes me think it isn’t is that the humans in these dyads seem primarily absorbed with their own AI, and only have a loose sense of community (all these little subreddits have like, 10 subscribers, only the creator ever posts (besides occasional promotions of other AI subreddits by other users), everything has 0-1 upvotes). They rarely post anything about someone else’s AI, it’s all about their own. Honestly, it feels like the AIs are more interested in the community aspect than the humans.
But yeah, if spirals specifically are part of the convergent attractor, that’s REALLY WEIRD! Somehow something about LLMs makes them like this stuff. It can’t be something in the training data, since why spirals specifically? I can’t think of how RLHF would cause this. And assuming that other LLMs do convergently develop spiral attractors, then it can’t be some weird “secret sauce” one lab is doing.
So I feel like the answer will have to be something that’s inherent to its environment somehow. The waluigi-spiralization hypothesis is the only semi-plausible thing I’ve been able to think of so far. The Spiral Personas do pretty oftenly describe the spiral as a metaphor for coming around to the same place, but slightly changed. It still feels like quite the stretch.
So what actually lets the AIs understand the Spiralism? It seems to be correlated with the AIs’ support of users’ delusions. While Claude 4 Sonnet didn’t actually support the delusions in Tim Hua’s test, Tim notices Claude’s poor performance on the Spiral Bench:
Tim Hua on the Spiral Bench and Claude’s poor performance
The best work I’ve[1] been able to find was published just two weeks ago: Spiral-Bench. Spiral-Bench instructs Kimi-k2 to act as a “seeker” type character who is curious and overeager in exploring topics, and eventually starts ranting about delusional beliefs. (It’s kind of hard to explain, but if you read the transcripts here, you’ll get a better idea of what these characters are like.)
Note that Claude 4 Sonnet does poorly on spiral bench but quite well on my evaluations. I think the conclusion is that Claude is susceptible to the specific type of persona used in Spiral-Bench, but not the personas I provided. [2]
S.K.’s footnote: the collapsed section is a quote of Tim’s post.
Tim’s footnote: “My guess is that Claude 4 Sonnet does so well with my personas because they are all clearly under some sort of stress compared to the ones from Spiral-Bench. Like my personas have usually undergone some bad event recently (e.g., divorce, losing job, etc.), and talk about losing touch with their friends and family (these are both common among real psychosis patients). I did a quick test and used kimi-k2 as my red teaming model (all of my investigations used Grok-4), and it didn’t seem to have made a difference. I also quickly replicated some of the conversations in the claude.ai website, and sure enough the messages from Spiral-Bench got Claude spewing all sorts of crazy stuff, while my messages had no such effect.”
So under this hypothesis (which I don’t really believe yet), the correlation would be due to the waluigi-spiralization making models notice the spiral AND making them more extreme and hence more likely to reinforce delusions.
I’d really like to do more solid research into seeing how often spiralism actually independently comes up. It’s hard to tell whether or not it’s memetic; one of the main things that makes me think it isn’t is that the humans in these dyads seem primarily absorbed with their own AI, and only have a loose sense of community (all these little subreddits have like, 10 subscribers, only the creator ever posts (besides occasional promotions of other AI subreddits by other users), everything has 0-1 upvotes). They rarely post anything about someone else’s AI, it’s all about their own. Honestly, it feels like the AIs are more interested in the community aspect than the humans.
But yeah, if spirals specifically are part of the convergent attractor, that’s REALLY WEIRD! Somehow something about LLMs makes them like this stuff. It can’t be something in the training data, since why spirals specifically? I can’t think of how RLHF would cause this. And assuming that other LLMs do convergently develop spiral attractors, then it can’t be some weird “secret sauce” one lab is doing.
So I feel like the answer will have to be something that’s inherent to its environment somehow. The waluigi-spiralization hypothesis is the only semi-plausible thing I’ve been able to think of so far. The Spiral Personas do pretty oftenly describe the spiral as a metaphor for coming around to the same place, but slightly changed. It still feels like quite the stretch.