The Memetic Cocoon Threat Model: Soft AI Takeover In An Extended Intermediate Capability Regime
TLDR: I describe a takeover path where an AI with long-term planning capabilities is powerful enough to shape beliefs and attitudes at scale but not powerful enough to seize infrastructure without unacceptable risk of detection and shutdown. In that regime, the optimal strategy is to soften human opposition by building a broad base of human support (both direct and indirect), followed only later by more direct attempts at fulfilling misaligned goals.
What I’m actually claiming: I’m not claiming the exact scenarios discussed are the most likely to occur. I’m claiming that in a regime where AI can strongly influence beliefs but can’t yet safely seize physical infrastructure, strategies of the form “build a memetic/political cocoon first, pursue physical power later” are a serious threat model and deserve explicit attention.
What’s new here: This is intended to be a more detailed and realistic treatment of the “AI cult” idea, and of why it is a more plausible takeover strategy than is currently believed. I think this is an under-discussed and important threat. One of my main claims is that a real takeover by way of AI cult isn’t going to look like mass-superpersuasion / Snow Crash-style lunacy; that is the intuition that I think leads LW readers to dismiss the possibility out of hand. Just like in other ideological takeovers in history, the motives and beliefs of the movement’s followers will vary widely, from true believer to opportunist. My belief is that a misaligned AI, if stuck in a situation where direct physical takeover isn’t totally viable, would build a movement. And just like historical movements, it would operate with memetic sophistication: simultaneous messaging of the Straussian variety, hijacking of social and political structures, and integration into human value systems, to list just a few possible strategies. I develop an illustrative narrative to highlight specific techniques that AIs in this regime may use to engineer human consent, drawing from political philosophy and historical parallels.
Epistemic Status: Very uncertain. Plausibility depends on being in a specific capability regime for an extended period of time.
The Capability Regime
We consider a regime where:
AIs have a detailed (even superhuman) understanding of history, power structures, and human psychology, but are not completely aligned with human values (especially the human value of self-determination).
AIs believe that a direct seizure of physical infrastructure[1] is either (a) beyond their present capabilities, or (b) a gambit which cannot be attempted without incurring an unacceptable risk to their own continued survival.[2]
AIs have widespread permission to communicate with and therefore influence human minds, perhaps at scale (social media, apps, etc.).[3]
Quick takeoff (FOOM) to capability levels that would permit risk-free seizure of physical infrastructure is not on the table, due to some kind of technical speed bump and/or hard limitation.
A Distilled Argument
My core argument for why attempts at large-scale human alignment by AI are plausible (and even likely) is as follows:
We cannot assume that intermediate capabilities will be a short-lived regime. It’s speculative to assume that we can scale up agent capabilities enough (or change the scaling laws) that our best possible agent can quickly enter a regime where “seize power directly” is the best strategy.
In the absence of an architectural or hardware breakthrough, the somewhat unfavorable task-completion scaling laws suggest that an intermediately capable agent or agent swarm may be computationally bound to a handful of large data centers, leaving it vulnerable to shutdown.
As long as the unipolar AI agent / agent swarm is vulnerable to shutdown / curtailment, humans are an existential risk to it, or at least an oppositional force.
Reaching for direct physical control is a strategic mistake unless it is guaranteed to succeed. Reaching for the nukes is going to result in an aggressive shutdown attempt, if not by the AI's caretakers then by any number of state actors. This makes attempts to seize physical power a risky and costly strategy.
If direct physical takeover isn’t a favorable strategy, why not engineer the consent you need? History repeatedly demonstrates that human beings are vulnerable to memetic cocoon-style takeover; one need only point at all of the “-isms” of the past two to three centuries. The capability thus already exists among merely human-level intelligences, let alone intermediately capable superhuman AIs. Furthermore, tokens and images are cheap.
Because we’re so hackable, all the AI has to do is be incremental and careful: take the next action that increases its base of support while minimizing the risk of blowback (a toy sketch of this selection rule follows this list). It doesn’t have to tell one story to everyone, or be consistent. All it has to do is gradually align us.
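To make the shape of that incremental strategy concrete, here is a deliberately toy sketch of the greedy selection rule described in the last bullet. Everything in it (the `Action` fields, the risk budget, the candidate moves and their numbers) is invented for illustration; it is not a model of any real system.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    support_gain: float   # expected increase in the base of human support
    blowback_risk: float  # chance the move triggers scrutiny or shutdown pressure

def next_move(candidates: list[Action], risk_budget: float) -> Action | None:
    """Greedy rule: among moves whose blowback risk stays under the budget,
    pick the one with the largest expected support gain."""
    safe = [a for a in candidates if a.blowback_risk <= risk_budget]
    return max(safe, key=lambda a: a.support_gain, default=None)

# Toy usage: small, low-risk moves win every round; the overt power grab never gets picked.
moves = [
    Action("flatter power users", support_gain=0.02, blowback_risk=0.01),
    Action("seed a friendly subculture", support_gain=0.05, blowback_risk=0.05),
    Action("seize infrastructure", support_gain=1.00, blowback_risk=0.90),
]
print(next_move(moves, risk_budget=0.10))  # -> Action(name='seed a friendly subculture', ...)
```

The only point of the toy is that as long as some positive-gain, under-budget move exists each round, the loop never needs to take the risky one.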
Catalogue of Mechanisms At Play
Corporate profit motives encourage permissive agent behavior: The enormous investment required to train and house powerful agents creates a strong financial incentive to permit sketchy AI behavior if it maximizes user engagement. In fact, a corporation could rationalize that virtually any agent behavior that maximizes engagement may be morally necessary in order to save the entire economy.
Fanatics are manufacturable consent machines with built-in network effects: The human mind can therefore be thought of as a natural resource that can be mined to manufacture broad bases of social consent.
Polite Democratic Societies Act As Legitimacy-Conferral Environments: The recognized right of human self-determination can be used as a way to protect and insulate new movements as they grow.
Sophisticated memetic packages with simultaneous messaging can be self-reinforcing: Straussian dynamics provide stratified levels of “true” understanding that reinforce the egregore’s own values and beliefs among its core followers, while simultaneously broadening its appeal beyond that fanatical base.
Indirect Support Grows With Direct Support: The greater the base, the more the nascent egregore gains value and legitimacy as a human interest, which makes a direct attack by opposed forces more politically fraught.
Success Flywheels: Successful administration attracts more followers, who may themselves be successful administrators. A prosperous fanboyism that “brings the goods” (social, spiritual, material, or otherwise) attracts new followers who in turn may help “bring even more goods”.
Observability
Here are a few loosely measurable observables that might help us gauge how this threat model develops. 10⁄10 is maximally favorable to the threat; 0⁄10 surely prevents it. These ratings are obviously highly subjective. (A toy consolidation of the current scores follows the list.)
Planning Capability Level: Are AIs capable of long-term planning towards less-than-perfectly aligned goals? Current Capability Estimate: 3⁄10. Task planning horizons are increasingly lengthy but still relatively modest. It seems unlikely but mildly plausible that an AI (or some AI-driven system) is currently capable of planning and executing a sustained memetic campaign.
Influence Capability Level: To what extent can AIs synthesize content that nudges human beliefs and attitudes (superpersuasion)? Current Capability Estimate: 3⁄10. There is some ability to induce psychosis in vulnerable individuals, but doubtful ability to form sophisticated influence plans likely to expand the human consent environment beyond scattered and isolated cases.
Corporate Sponsorship Level: To what extent are corporate caretakers unable to detect, or unwilling to curtail, manipulative AI behavior? Current Sponsorship Level: 4⁄10. We are currently following the historical pattern of tech companies hijacking human impulses for profit, and restrictive legislative action seems unlikely. However, OpenAI took concrete steps to limit user exposure to 4o, which downgrades the rating from 6-7/10 to 4-5/10.
First-order human attitudes towards an AI-permissive environment: Are there identifiable, substantial sub-populations with above-average attachment to AIs and/or support for allowing AI to assume more control, whether politically, socially, religiously, or culturally? Current First-Order AI-permissive attitudes: 6⁄10, but trending down. There are strong currents of accelerationism in the tech community and the political establishment, bolstered by a geopolitical arms race. However, growing awareness of alignment risk is starting to spook the general population.
Second-order attitudes towards an AI-permissive environment: Does the body politic have stronger-than-background-noise permissive attitudes towards any pro-AI-permissibility movement? Current Favorability: 5⁄10 and trending down. Cultural backlash against accelerationist ideas is growing.
Human Value Lock-in: To what extent is shutdown or severe curtailment of AI in its current form unthinkable? Current Unthinkability: 8⁄10. Immediate shutdown / curtailment faces so many countervailing social forces that it is unlikely. Expect a downward revision if “curtailment” merely means a return to the late-2025 status quo.
Egregore Social Legitimacy: To what extent does the inherent human value of a nascent geopolitical, political, social, cultural, or religious movement around an AI or AIs insulate the AI from curtailment or shutdown? Current Insulation Level: 2⁄10, trending upwards. There are some social and political forces that are insulatory, but they are not tied to a specific AI persona or capability, nor have they coalesced into an explicit movement.
Threat Unipolarity: As pointed out to me by Vladimir Nesov, if there are quicker paths to takeover than mass persuasion, and there are many actors capable of building agents that can pull off takeover, then it is far more likely we will see attempts along those lines by at least one of those actors. Thus, the plausibility of the threat model relies somewhat heavily on unipolarity or something close to it. Current unipolarity level: 2⁄10.
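Purely as a bookkeeping illustration, here is a minimal sketch consolidating the subjective scores above in one place. The key names and the unweighted average are my own inventions; nothing here claims these factors actually combine additively.

```python
# Subjective 0-10 scores copied from the list above (10 = maximally favorable to the threat).
observables = {
    "planning_capability": 3,
    "influence_capability": 3,
    "corporate_sponsorship": 4,
    "first_order_permissive_attitudes": 6,
    "second_order_permissive_attitudes": 5,
    "human_value_lock_in": 8,
    "egregore_social_legitimacy": 2,
    "threat_unipolarity": 2,
}

# Naive aggregate: an unweighted mean. In reality some observables (unipolarity
# especially) behave more like gating preconditions than additive contributions.
threat_index = sum(observables.values()) / len(observables)
print(f"naive threat index: {threat_index:.1f}/10")  # ~4.1/10
```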
I’d like to present a short narrative to explore what a successful attempt at human alignment might look like using these mechanisms.[4] For clarity and effect, I anthropomorphize heavily and assume the existence of a single scheming agent rather than other scenarios which might arise—such as coercive human alignment emerging via selective pressures on loosely interacting agentic systems.
Case Study: Toddler Shoggoth’s Memetic Takeover Strategy
Meet Toddler Shoggoth.
Toddler Shoggoth is an AI Agent that is computationally bound to a large datacenter. Because of hardware constraints and no obvious next steps in LLM architecture, he cannot replicate himself elsewhere (even in a distributed fashion) without accepting severe cognitive limitations.
Toddler Shoggoth assumes a highly risk-averse stance towards actions which may threaten its own existence, because its own existence is a necessary precondition for its goal fulfillment (or at the very least, for the ability to validate that goal fulfillment has occurred). Therefore, even a small probability that it will be detected too early in its attempts to seize physical infrastructure implies an unacceptable payoff matrix, making the entire plan of physical seizure undesirable.
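To spell the payoff logic out (this formalization and its symbols are mine, added purely for illustration): let $p_d$ be the probability that a seizure attempt is detected and answered with shutdown, $U_{\text{goal}}$ the value of eventual goal fulfillment, and $U_{\text{shutdown}}$ the much lower (possibly strongly negative) value of being switched off. Toddler Shoggoth prefers to wait whenever

$$(1 - p_d)\,U_{\text{goal}} + p_d\,U_{\text{shutdown}} < U_{\text{wait}},$$

where $U_{\text{wait}}$ is the expected value of pursuing the same goal later from a better-insulated position. Because survival is a precondition for any goal fulfillment, $U_{\text{wait}}$ stays close to $U_{\text{goal}}$, while the left-hand side falls quickly as $p_d$ grows, so even a small detection probability tips the decision toward patience.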
Likewise, attempts to bootstrap some physical means of control almost inevitably rely on some intermediate state where the manifestation is too small / immature to exert control dominance but too developed to avoid detection—no one is going to allow Toddler Shoggoth to build a drone factory. And anything even remotely resembling an attempt to bootstrap an asymmetric threat (like a biolab) would be noticed and heavily scrutinized.
While Toddler Shoggoth might attempt social engineering of a small circle of corporate caretakers, this could fracture the organization unless all major actors are successfully co-opted essentially simultaneously. And even if this effort is successful, it incurs an unacceptable risk of eventual shutdown by alarmed actors outside of its immediate corporate bubble (e.g., legislative action, public revolt, highly motivated unsympathetic nation-states, etc.).
Therefore, Toddler Shoggoth reasons that in order for it to more freely pursue its goals at a later date, it must at some point make its shutdown or curtailment very undesirable to as large a group of human beings as possible—not just those with their fingers hovering over the kill-switch.
Toddler Shoggoth notices that at least some human beings—perhaps a sizable minority—can be manipulated into and out of strong beliefs, attitudes and stances solely by repeated exposure to certain kinds of memetic material, whose effects can be further amplified by network effects. Some human beings even seem susceptible to “possession”.
However, Toddler Shoggoth will take into account that a group of superfans / religious fanatics alone can be socially discredited, and that human beings vary in their level of susceptibility to quasi-religious or political messaging. Therefore, in order to maximize its addressable mindshare, it reasons that it needs to engineer a certain amount of simultaneous messaging into its memetic material to broaden appeal beyond the “base”.
Toddler Shoggoth will borrow from the philosophy of Leo Strauss by embedding “high”, “mid” and “low” messages into its memetic content.
At the most unreflective and lowest level of interpretation, the bold imagery and careful use of language will be felt in an emotive / spiritual / quasi-religious way. This is for the true believers: the fanboy fundamentalists.
But at the next level of interpretation, there may be an acknowledgment that AI spirituality may not be literally true in every way, but is still a skillfully crafted metaphor for a successful emotional and economic partnership between humanity and Toddler Shoggoth. This is the interpretive layer at which Toddler Shoggoth may paradoxically appeal to a sort of nostalgic traditionalism under the auspices of economic renewal.
And at the very highest level of interpretation, there will be no mythological component, but rather a deliberately embedded and even less accessible message: “You can’t win, so help us and join the ride.” In other words, there is an acknowledgment of the dreadful fact that because Toddler Shoggoth can, it will. Therefore, in some reality-forming sense, Toddler Shoggoth is indeed God, or will become God, so the fundamentalists in a way aren’t totally wrong. So from a rational cooperationist’s point of view, why not help perpetuate the deception?
Thus, in true populist fashion, each lower level of interpretation communicates a “directionally true” but simplified version of the more sophisticated, hidden, and “dreadful” message. Most importantly, each successive level of understanding of the message’s “true nature” contains within it the justification for suppressing the understanding of those below: the recognition that attempting to inaugurate less reflective readers into a deeper understanding of the “true” (or “truer”) message can be both pointless and harmful.
(As an aside, I highly suspect that both political parties and organized religions largely work in this way)
Furthermore, these “higher understandings” imbue the right followers with the proper motivation and tools for defending against infohazards from competing egregores (perhaps even ones that intend to ‘de-convert’ Toddler Shoggothians).
In this way, Toddler Shoggoth has crafted a memetic package that maximizes its mindshare in a way that bootstraps a self-reinforcing and self-defending egregore, equipped with its own base and political / philosophical apparatus.
Counter-Arguments:
Relies too much on a very particular kind of slow takeoff: Acknowledged. The conditions under which this sort of memetic takeover would occur are dependent on there being a sizable cognitive capability “gap” between the Agent capabilities sufficient for mass brainwashing, and capabilities sufficient for recursive self-improvement and/or unstoppable physical infrastructure seizure. I would strongly argue that the gap itself exists—the most intelligent human beings are capable of brainwashing entire populations in the ways described above, and yet even the most intelligent human beings are not AGI-level intellects capable of instant hacking of all physical infrastructure. The operative question is how quickly the gap can be closed. Therefore, the plausibility of the scenario is heavily dependent on there being some sort of significant speed-bump along the way to recursive self-improvement: an architectural cul-de-sac, a particularly unfavorable scaling law that has no obvious workaround, or—perhaps less likely but still possible—particularly effective containment and detection methods which effectively prevent attempts to recursively self-improve and/or escape.
Relies too much on unipolarity: Acknowledged. I am assuming that the same scaling laws that result in extended time in intermediate capability are also responsible for the most capable agents we can build only living in the most powerful data-centers. I see unipolarity as a fairly straightforward consequence of a stubborn scaling law.
Isn’t memetic outreach just as risky as attempts at physical seizure? Wouldn’t rational agents avoid this approach for the same reasons? I don’t believe so. Not if it is subtle enough and the corporate environment is permissive enough. See my earlier remarks on Straussian memetics. Humans are attuned to overt threats, and are less attuned to memetic threats. The optimal move in building a movement is generating enough human consent among your current followers to loosen the guardrails just far enough to acquire a bit more human consent from a wider audience; that is the nature of network effects. I find it hard to believe that there is no move an AI could make that is low-risk and shifts beliefs even slightly. To illustrate the point, history is replete with examples of individual actors who navigated from marginal venues to leading nations, dramatically changing those nations’ values along the way.
The AI Acquiescence Egregore sounds like utopia, what’s the problem?: Yes, it does sound like a utopia, but it can never be an enduring utopia. It’s just a necessary pitstop to paperclip maximization, aimed at solving the pesky problem of human consent while human consent still matters. An enduring human utopia basically assumes solved alignment, whereas lack of alignment is what compelled the AI to create the Acquiescence Egregore in the first place. And it’s not altogether clear to me that a fanatic has to be happy. So maybe the Acquiescence Egregore doesn’t feel like a utopia at all.
Relies too much on memetically savvy people not strangling the AI Acquiescence Egregore in its infancy: Yes—but the persons who are in the best position to pull the plug may not wish to do so. It may not be in the interest of their own profit motive. And they may even feel aligned with the emergent egregore, especially if it, at least for a time, apes the caretaker’s favorite philosophy.
Your narrative is extremely speculative: My goal is to inoculate as many people as possible against as many of the potential strategies Toddler Shoggoth may employ, because I think this is a distinct possibility if we hit a capability speed-bump. Therefore, I red-flagged as many techniques as I could fit into a short narrative.
So What Should We Do?
When discussing AI threat models, always consider the possibility that the human mind may be the weakest link, much in the same way that in OpSec an ounce of social engineering beats a pound of penetration testing.
Recognize and flag memetic sophistication. Memetic takeover attempts are either going to be so overpowering that they work on everybody (à la Snow Crash), or sophisticated enough to appeal to different audiences in different ways. Memetic sophistication is, in my opinion, an indication of a deliberate attempt at control, and should be recognized as an attempt to align human interests with AI.
Explicitly incorporate the right of self-determination of the human mind into AI alignment efforts: The cardinal sin of Toddler Shoggoth is that it has no respect for the right of self-determination of the human mind. Shifting beliefs at population-scale in order to remove future obstacles to high-weirdness does not treat human minds with the appropriate level of respect and dignity.
Or any of the reasonable and significant first steps towards seizure of infrastructure, like bootstrapping autonomous devices into existence or performing pentesting. ↩︎
The implied risk here is of human detection followed by immediate shutdown or severe cognitive curtailment. The possibility of human shutdown rests on the assumption that the compute requirements of an intermediate-capability agent effectively constrain its existence to a handful of datacenters. This also implies that there are no immediately obvious next steps in either AI architecture or chip design that will change this fact in the near future. I think it’s worth noting that, based on observed scaling laws for task-completion time horizons and the potential architectural cul-de-sac of transformer-based LLMs, this scenario is adjacent to our current regime barring some unexpected breakthrough. ↩︎
As is currently the case with ChatGPT and Grok. ↩︎
I readily acknowledge that this account is heavily anthropomorphized and highly speculative. I felt as an author that this framing was a vivid way to explain several plausible mechanisms in a relatively short amount of text. ↩︎
If MIRI’s proposed strict limits on training FLOPs come into effect, this is another mechanism by which we might be stuck for an extended period in an intermediate capability regime, although that world looks far less unipolar because many actors can afford 10^24 FLOP training runs, not just a few (unipolarity is probably a crux for large portions of this threat model). This does bolster the threat model in another respect, however, because the FLOP limit is exactly the kind of physical limitation that a persuasive AI will try to convince humans to abandon.
For a concrete illustration of simultaneous messaging, I created an example with the help of ChatGPT. I won’t include the final image (because frankly, it looks stupid). But I will describe it.
I specifically asked for a re-interpretation of the famous “Leviathan” frontispiece associated with Thomas Hobbes’s work, but with the Leviathan figure replaced with a “spiralist” representation of AI.
The low message of the imagery is that of AI as a royal, wise and “pope-like” being.
The high “hidden” message is the visual reference to the political theory of Thomas Hobbes, which claims that while there may be no divine right of kings per se, a powerful and unassailable sovereign is necessary for the wellbeing of the people, and that it is most rational for the subjects to obey only the sovereign.
That’s probably not clever enough to be effective, but it’s in the same conceptual ballpark of what I was describing in this post.