The Alignment Trap: AI Safety as Path to Power

Recent discussions about artificial intelligence safety have focused heavily on ensuring AI systems remain under human control. While this goal seems laudable on its surface, we should carefully examine whether some proposed safety measures could paradoxically enable rather than prevent dangerous concentrations of power.

The Control Paradox

The fundamental tension lies in how we define “safety.” Many current approaches to AI safety focus on making AI systems more controllable and aligned with human values. But this raises a critical question: controllable by whom, and aligned with whose values?

When we develop mechanisms to control AI systems, we are essentially creating tools that could be used by any sufficiently powerful entity—whether that’s a government, corporation, or other organization. The very features that make an AI system “safe” in terms of human control could make it a more effective instrument of power consolidation.

Natural Limits on Human Power

Historical examples reveal how human nature itself acts as a brake on totalitarian control. Even the most powerful dictatorships have faced inherent limitations that AI-enhanced systems might easily overcome:

  • The Trust Problem: Stalin’s paranoia about potential rivals wasn’t irrational—it reflected the real difficulty of ensuring absolute loyalty from human subordinates. Every dictator faces this fundamental challenge: they can never be entirely certain of their underlings’ true thoughts and loyalties.

  • Information Flow: The East German Stasi, despite maintaining one of history’s most extensive surveillance networks, still relied on human informants who could be unreliable, make mistakes, or even switch allegiances. Human networks inherently leak and distort information.

  • Cognitive Limitations: Hitler’s micromanagement of military operations often led to strategic blunders because no human can effectively process and control complex operations at scale. Human dictators must delegate, creating opportunities for resistance or inefficiency.

  • Administrative Friction: The Soviet Union’s command economy faltered partly because human bureaucrats couldn’t possibly process and respond to all necessary information quickly enough. Even the most efficient human organizations have inherent speed and coordination limits.

These natural checks on power could vanish in a human-AI power structure where AI systems provide perfect loyalty, unlimited information processing, and seamless coordination.

The Human-AI Nexus

Perhaps most concerning is the potential emergence of what we might call “human-AI power complexes”—organizational structures that combine human decision-making with AI capabilities in ways that amplify both. These entities could be far more effective at exercising and maintaining control than either humans or AIs alone.

Consider a hypothetical scenario:

  • A government implements “safe” AI systems to help with surveillance and social control

  • These systems are perfectly aligned with their human operators’ objectives

  • The AI helps optimize propaganda, predict dissent, and manage resources

  • The human elements provide strategic direction and legitimacy

This isn’t a scenario of AI taking over—it’s a scenario of AI making existing power structures more effective at maintaining control by eliminating the natural limitations that have historically constrained human power. However, this specific scenario is merely illustrative. The core argument—that AI safety measures could enable unprecedented levels of control by removing natural human limitations—holds true across many possible futures.

Alignment as Enabler of Coherent Entities

Dangerous power complexes, from repressive governments to exploitative corporations, have existed throughout history. What well-aligned AI brings to the table, however, is the potential for these entities to function as truly unified organisms: coherent wholes unconstrained by human organizational limits.

Dynamics of Inevitable Control?

Many notable figures, including over a hundred AI scientists, have voiced concerns about the risk of extinction from AI.[1] In a previous post[2] I described my own intuitions about this.

The key intuition is that when entities with vastly different capability levels interact, there are fundamental dynamics that result in the more capable entities taking control. The specific path to power concentration matters less than these underlying dynamics, and it makes sense to be concerned even if we cannot predict exactly how they will play out.

When this intuition is applied to AI, people usually have purely artificial entities in mind, but in reality we should expect combined human-AI entities to reach dangerous capability levels before purely artificial ones do.
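
To make this intuition concrete, consider a deliberately stylized toy model (an illustrative sketch under strong assumptions, not a claim about real-world dynamics): two entities repeatedly contest a fixed pool of resources, success each round is proportional to current capability, and winnings compound into further capability. Under these assumptions, even a modest initial edge drifts toward near-total control.

```python
# Stylized toy model of capability-driven takeover (illustrative assumptions only):
# each round, success is proportional to current capability, and each entity's
# capability grows multiplicatively with its share of the winnings.

def capability_share(cap_a: float, cap_b: float, rounds: int, growth: float = 0.1) -> float:
    """Return entity A's share of total capability after the given number of rounds."""
    for _ in range(rounds):
        share_a = cap_a / (cap_a + cap_b)   # contest success proportional to capability
        cap_a *= 1 + growth * share_a       # winnings compound into new capability
        cap_b *= 1 + growth * (1 - share_a)
    return cap_a / (cap_a + cap_b)

if __name__ == "__main__":
    # Entity A starts with only a 10% capability edge over entity B.
    for rounds in (10, 50, 100, 200):
        print(rounds, round(capability_share(1.1, 1.0, rounds), 3))
```

The point is not the specific numbers but the compounding structure: once capability feeds back into the ability to acquire more capability, small differences stop being self-correcting.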

The Offensive Advantage

There’s another crucial dynamic that compounds these risks: when multiple entities possess similar capabilities, those focused on seizing control may hold a natural advantage. This offensive asymmetry emerges for several reasons:

  • Defensive entities must succeed everywhere, while offensive ones need only succeed once

  • Aggressive actors can concentrate their resources on chosen points of attack

  • Those seeking control can operate with single-minded purpose, while defenders must balance multiple societal needs

  • Defensive measures must be transparent enough to inspire trust, while offensive capabilities can remain hidden

This means that even if we develop “safe” AI systems, the technology may naturally favour those most determined to use it for control. Like a martial art that claims to be purely defensive, the techniques we develop could ultimately prove most valuable to those willing to repurpose them for aggression.
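
The first of these asymmetries can be made concrete with a back-of-the-envelope sketch (the independence assumption and the numbers below are purely illustrative): if a defender must hold n points, each holding independently with probability p, the chance that the defence holds everywhere is p^n, while the attacker needs only a single breach.

```python
# Back-of-the-envelope illustration of the "succeed everywhere vs. succeed once"
# asymmetry. Assumes n independently defended points, each holding with probability p;
# both the independence assumption and the numbers are purely illustrative.

def defence_holds_everywhere(p: float, n: int) -> float:
    """Probability that all n defended points hold."""
    return p ** n

def attacker_breaches_somewhere(p: float, n: int) -> float:
    """Probability that at least one of the n points is breached."""
    return 1 - p ** n

if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, round(defence_holds_everywhere(0.99, n), 3),
              round(attacker_breaches_somewhere(0.99, n), 3))
```

Even with 99% reliability at each individual point, a hundred points leaves the attacker with better-than-even odds of finding a breach somewhere.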

The Double Bind of Development

The situation presents another layer of concern: by making AI more controllable and therefore more commercially viable, we accelerate AI development itself. Each advance in AI safety makes the technology more attractive for investment, speeding our journey toward the very risks we’re trying to mitigate. We’re not just creating the tools of control; we’re accelerating their development.

Rethinking Our Approach

The arguments presented here lead us to some uncomfortable but important conclusions about the nature of AI safety research. While the intention behind such research is laudable, we must confront the possibility that these efforts could be fundamentally counterproductive.

Rather than focusing on making AI more controllable, we might need to fundamentally reframe our approach to AI development and deployment.

  • Are there ways to maintain and strengthen traditional checks and balances in human institutions?

  • What role should decentralized architectures play? They can help resist consolidation of power, but they can also make AI harder to regulate and accelerate the spread of dangerous capabilities

  • Might slowing, rather than safeguarding, AI development be the more prudent path?

While many in the AI community already recognize the strategic importance of keeping capabilities research private, we should consider extending this thinking to alignment and safety research. Though this may seem counterintuitive to those who view safety work as a public good, the dual-use nature of control mechanisms suggests that open publication of safety advances could accelerate the development of more effective tools for centralized control.

Those working on AI safety, particularly at frontier AI companies, must then grapple with some difficult questions:

  • If your work makes AI systems more controllable, who will ultimately wield that control?

  • When you make AI development “safer” and thus more commercially viable, what power structures are you enabling?

  • How do the institutional incentives of your organization align with or conflict with genuine safety concerns?

  • What concrete mechanisms will prevent your safety work from being repurposed for control and consolidation of power?

  • How can you balance the benefits of open research collaboration against the risks of making control mechanisms more widely available?

Conclusion

The arguments presented in this essay lead to an uncomfortable but inescapable conclusion: many well-intentioned efforts to make AI systems more controllable may be actively hastening the arrival of unprecedented mechanisms of social control. This is not merely a theoretical concern about future scenarios—it is already manifesting in the development of increasingly sophisticated surveillance and influence systems.[3][4][5]

The alignment trap presents itself most insidiously not through malicious intent, but through the gradual optimization of systems toward ever more perfect control. Each incremental advance in AI capabilities and controllability—each apparent success in alignment—may be taking us further down a path from which there is no return.

It’s a trap baited with our best intentions and our deepest fears. The time to question this is now—before mechanisms of perfect control snap shut around us.


Footnotes

1 From the Centre for AI Safety statement: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”

2 Is There a Power Play Overhang?

3 See Chapter 3, “Responsible AI”, in the Stanford 2024 AI Index Report: “Political deepfakes are easy to generate and difficult to detect”, and the accompanying diagram.

4 From “How AI surveillance threatens democracy everywhere” (on Bulletin of the Atomic Scientists): “According to the 2019 AI Global Surveillance Index, 56 out of 176 countries now use artificial intelligence in some capacity to keep cities ‘safe.’”

5 From “The Global Struggle Over AI Surveillance: Emerging Trends and Democratic Responses” (on the National Endowment for Democracy website): “From cameras that identify the faces of passersby to algorithms that keep tabs on public sentiment online, artificial intelligence (AI)-powered tools are opening new frontiers in state surveillance around the world.”