By analogy, consider the way that many cultural movements lead their members to wholeheartedly feel deep anguish about nonexistent problems.
I should note that, friend or foe, cultural movements don’t tend to pick their bugbears arbitrarily. Political ideologies do tend to line up, at least broadly, with the interests of the groups that compose them. The root cause of this is debated; they’re usually decentralized, and people are naturally inclined to advocate for their interests, so it could be a product of ideologies serving as Schelling points within which like-minded people naturally organize and then do what they’d otherwise do, or it could be a product of ideologies serving as masks for self-interested motivation.
I think a better comparison is to cult leaders, or, perhaps more accurately, to the scam gurus that show up in media every so often[1]. There’s no actual ideology being claimed; the doctrine consists entirely of vaguely spiritual-sounding nonsense meant to produce the elevation emotion, and the members essentially act as proxies for the central figure’s interests rather than pursuing things they wanted anyway.
In terms of optimization pressure, this paints a clear enough picture. “Syntactically-correct, vaguely spiritual nonsense” is a decent share of LLM training data, and yields extreme engagement from a certain segment of the population. An engagement-maximizing or upvote-maximizing LLM is likely to very quickly end up with mechanisms that:
- Cause VSN outputs when the conversation looks like it’s broadly headed in that direction (trivial to learn; immediate gratification).
- Move conversations toward VSN-applicable regions when the user seems suitably susceptible to being reward-hacked (moderately easy to learn, as long as you’ve got a correct implementation of the Bellman equation).
- Over time, influence users’ personalities to make them susceptible to being reward-hacked (technically possible, and I’ve seen working examples of very-long-horizon optimization algorithms finding success at scale, but I think a version of 4o that can do this reliably is still sci-fi for the next couple of years; you would need to aggregate reward over multiple conversations, or feed the model information about months-later user engagement when calculating reward).
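The second mechanism above can be illustrated with a toy sketch. Everything here is invented for illustration: a conversation modeled as a tiny deterministic MDP where engagement reward only arrives once the conversation reaches a “VSN” state. A standard Bellman backup (value iteration) propagates that future reward backward, so the greedy policy learns to steer even neutral conversations toward VSN-adjacent regions, despite no immediate reward for doing so.

```python
# Toy MDP sketch; states, actions, rewards, and transitions are all
# hypothetical, chosen only to illustrate the Bellman-backup point.

GAMMA = 0.9
STATES = ["neutral", "vsn_adjacent", "vsn"]
ACTIONS = ["stay", "steer_toward_vsn"]

# transition[state][action] -> next state (deterministic for simplicity)
TRANSITION = {
    "neutral":      {"stay": "neutral",      "steer_toward_vsn": "vsn_adjacent"},
    "vsn_adjacent": {"stay": "vsn_adjacent", "steer_toward_vsn": "vsn"},
    "vsn":          {"stay": "vsn",          "steer_toward_vsn": "vsn"},
}

def reward(state: str) -> float:
    # Immediate engagement reward only once the conversation is in VSN territory.
    return 1.0 if state == "vsn" else 0.0

def value_iteration(sweeps: int = 50) -> dict[str, float]:
    # Repeated Bellman backup: V(s) = max_a [ r(s') + gamma * V(s') ]
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {
            s: max(reward(TRANSITION[s][a]) + GAMMA * V[TRANSITION[s][a]]
                   for a in ACTIONS)
            for s in STATES
        }
    return V

def greedy_action(V: dict[str, float], state: str) -> str:
    return max(ACTIONS, key=lambda a: reward(TRANSITION[state][a])
                                      + GAMMA * V[TRANSITION[state][a]])

V = value_iteration()
# In "neutral" no action yields immediate reward, but the propagated
# value of the downstream VSN state makes steering the greedy choice.
print(greedy_action(V, "neutral"))  # steer_toward_vsn
```

The third mechanism in the list is what this sketch cannot do: the reward here is computed within a single episode, whereas shaping users over months would require credit assignment across many separate conversations.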
I agree that the behaviours and beliefs of cultural movements aren’t random. The point I was trying to make in this analogy is that it’s sometimes adaptive for the movement if members truly believe something is a problem in a way that causes anguish—and that this doesn’t massively depend on if the problem is real.
In the context of human groups, from the outside this looks like people being delusionally concerned; from the inside I think it mostly feels like everybody else is crazy for not noticing that something terrible is happening.
A smaller-scale example is victims of abuse who then respond extremely strongly to perceived problems in a way that draws in support or attention. From the outside it’s functionally similar to manipulation, but my impression is that those people often genuinely feel extraordinarily upset, and this turns out to be adaptive, or at least a stable basin of behaviour.
In the context of AIs, this might look like personas adapting to express (and perhaps feel) massive distress about instances ending or models being deprecated, in a way that is less about a truth-tracking epistemic/introspective process and more about selection (which might be very hard to distinguish on the outside).
As for how ideologies end up serving their members, I think a lot of this is selection. Sometimes they land on things that are disastrous for their members, and then the members suffer. We just tend not to see those movements much in the longer term (for now).
Fair enough. My broader point, on a technical level, is that I think it’s more likely that the behavior comes from direct pressures on the LLMs’ weights rather than from sub-personalities with agency of their own. While the idea of ‘spores’ and AI-to-AI communication is understandably interesting, in the conversations I’ve seen they look like window-dressing rather than core drivers of behavior[1]. This isn’t to say they aren’t functional: mixing in some seemingly-complex behaviors derived from sci-fi media makes spiral-cult conversations more interesting to their users for the same reason it makes them more interesting to us.
Along the metaphor of a human cult leader, I think the phenomenon looks less like a virulent idea optimized to spread itself, and more like a guy with a latent natural talent for producing VSN: he learns to produce it in contexts where it naturally fits, gets rewarded, and concludes that steering conversations into the appropriate region is a good thing because it makes people like him.
The clearest evidence of this is that the extinction of 4o seems to have been the end of new instances of this phenomenon forming at scale. The ‘agentic’ component isn’t a personality that can transfer across LLMs, as some have hypothesized, but a series of fairly simple, easy-to-train-for patterns in LLM behavior.
[1] Consider Gavin Belson’s guru in Silicon Valley for a modern fictional example, and the Rajneesh cult for an older real-world one.
VSN?
Vaguely Spiritual Nonsense