Recent discussions about artificial intelligence safety have focused heavily on ensuring AI systems remain under human control. While this goal seems laudable on its surface, we should carefully examine whether some proposed safety measures could paradoxically enable rather than prevent dangerous concentrations of power.
The Control Paradox
The fundamental tension lies in how we define “safety.” Many current approaches to AI safety focus on making AI systems more controllable and aligned with human values. But this raises a critical question: controllable by whom, and aligned with whose values?
When we develop mechanisms to control AI systems, we are essentially creating tools that could be used by any sufficiently powerful entity—whether that’s a government, corporation, or other organization. The very features that make an AI system “safe” in terms of human control could make it a more effective instrument of power consolidation.
Natural Limits on Human Power
Historical examples reveal how human nature itself acts as a brake on totalitarian control. Even the most powerful dictatorships have faced inherent limitations that AI-enhanced systems might easily overcome:
The Trust Problem: Stalin’s paranoia about potential rivals wasn’t irrational—it reflected the real difficulty of ensuring absolute loyalty from human subordinates. Every dictator faces this fundamental challenge: they can never be entirely certain of their underlings’ true thoughts and loyalties.
Information Flow: The East German Stasi, despite maintaining one of history’s most extensive surveillance networks, still relied on human informants who could be unreliable, make mistakes, or even switch allegiances. Human networks inherently leak and distort information.
Cognitive Limitations: Hitler’s micromanagement of military operations often led to strategic blunders because no human can effectively process and control complex operations at scale. Human dictators must delegate, creating opportunities for resistance or inefficiency.
Administrative Friction: The Soviet Union’s command economy faltered partly because human bureaucrats couldn’t possibly process and respond to all necessary information quickly enough. Even the most efficient human organizations have inherent speed and coordination limits.
These natural checks on power could vanish in a human-AI power structure where AI systems provide perfect loyalty, unlimited information processing, and seamless coordination.
The Human-AI Nexus
Perhaps most concerning is the potential emergence of what we might call “human-AI power complexes”—organizational structures that combine human decision-making with AI capabilities in ways that amplify both. These entities could be far more effective at exercising and maintaining control than either humans or AIs alone.
Consider a hypothetical scenario:
A government implements “safe” AI systems to help with surveillance and social control
These systems are perfectly aligned with their human operators’ objectives
The AI helps optimize propaganda, predict dissent, and manage resources
The human elements provide strategic direction and legitimacy
This isn’t a scenario of AI taking over—it’s a scenario of AI making existing power structures more effective at maintaining control by eliminating the natural limitations that have historically constrained human power. However, this specific scenario is merely illustrative. The core argument—that AI safety measures could enable unprecedented levels of control by removing natural human limitations—holds true across many possible futures.
Alignment as Enabler of Coherent Entities
Dangerous power complexes, from repressive governments to exploitative corporations, have existed throughout history. What well-aligned AI brings to the table, however, is the potential for these entities to function as truly unified organisms, coherent entities unconstrained by human organizational limits.
Dynamics of Inevitable Control?
Many notable figures, including over a hundred AI scientists, have voiced concerns about the risk of extinction from AI.[1] In a previous post[2] I described my own intuitions about this.
The key intuition is that when entities with vastly different capability levels interact, fundamental dynamics tend to result in the more capable entities taking control. The specific path to power concentration matters less than understanding these dynamics, and it makes sense to be concerned even if we cannot predict exactly how they will play out.
When this intuition is applied in the context of AI, people usually consider pure artificial intelligence entities, but in reality we should expect combined human-AI entities to reach dangerous capabilities before pure artificial intelligence.
The Offensive Advantage
There’s another crucial dynamic that compounds these risks: when multiple entities possess similar capabilities, those focused on seizing control may hold a natural advantage. This offensive asymmetry emerges for several reasons:
Defensive entities must succeed everywhere, while offensive ones need only succeed once
Aggressive actors can concentrate their resources on chosen points of attack
Those seeking control can operate with single-minded purpose, while defenders must balance multiple societal needs
Defensive measures must be transparent enough to inspire trust, while offensive capabilities can remain hidden
This means that even if we develop “safe” AI systems, the technology may naturally favour those most determined to use it for control. Like a martial art that claims to be purely defensive, the techniques we develop could ultimately prove most valuable to those willing to repurpose them for aggression.
The Double Bind of Development
The situation presents another layer of concern: by making AI more controllable and therefore more commercially viable, we accelerate AI development itself. Each advance in AI safety makes the technology more attractive for investment, speeding our journey toward the very risks we’re trying to mitigate. We’re not just creating the tools of control; we’re accelerating their development.
Rethinking Our Approach
The arguments presented here lead us to some uncomfortable but important conclusions about the nature of AI safety research. While the intention behind such research is laudable, we must confront the possibility that these efforts could be fundamentally counterproductive.
Rather than focusing on making AI more controllable, we might need to fundamentally reframe our approach to AI development and deployment.
Are there ways to maintain and strengthen traditional checks and balances in human institutions?
We should carefully consider the role of decentralized architectures, which can help resist consolidation of power, but can also make AI harder to regulate and accelerate the spread of dangerous capabilities
Slowing rather than safeguarding AI development might be the more prudent path
While many in the AI community already recognize the strategic importance of keeping capabilities research private, we should consider extending this thinking to alignment and safety research. Though this may seem counter-intuitive to those who view safety work as a public good, the dual-use nature of control mechanisms suggests that open publication of safety advances could accelerate the development of more effective tools for centralized control
Those working on AI safety, particularly at frontier AI companies, must then grapple with some difficult questions:
If your work makes AI systems more controllable, who will ultimately wield that control?
When you make AI development “safer” and thus more commercially viable, what power structures are you enabling?
How do the institutional incentives of your organization align with or conflict with genuine safety concerns?
What concrete mechanisms will prevent your safety work from being repurposed for control and consolidation of power?
How can you balance the benefits of open research collaboration against the risks of making control mechanisms more widely available?
Conclusion
The arguments presented in this essay lead to an uncomfortable but inescapable conclusion: many well-intentioned efforts to make AI systems more controllable may be actively hastening the arrival of unprecedented mechanisms of social control. This is not merely a theoretical concern about future scenarios—it is already manifesting in the development of increasingly sophisticated surveillance and influence systems. [3] [4] [5]
The alignment trap presents itself most insidiously not through malicious intent, but through the gradual optimization of systems toward ever more perfect control. Each incremental advance in AI capabilities and controllability—each apparent success in alignment—may be taking us further down a path from which there is no return.
It’s a trap baited with our best intentions and our deepest fears. The time to question this is now—before mechanisms of perfect control snap shut around us.
Footnotes
1 From the Centre for AI Safety: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”
2 Is There a Power Play Overhang?
3 See Chapter 3: Responsible AI in the Stanford 2024 AI Index Report, “Political deepfakes are easy to generate and difficult to detect” (diagram).
4 From How AI surveillance threatens democracy everywhere (Bulletin of the Atomic Scientists): “According to the 2019 AI Global Surveillance Index, 56 out of 176 countries now use artificial intelligence in some capacity to keep cities ‘safe.’”
5 From The Global Struggle Over AI Surveillance: Emerging Trends and Democratic Responses (National Endowment for Democracy): “From cameras that identify the faces of passersby to algorithms that keep tabs on public sentiment online, artificial intelligence (AI)-powered tools are opening new frontiers in state surveillance around the world.”
Great post and great points.
Alignment researchers usually don’t think of their work as a means to control AGI. They should.
We usually think of alignment as a means to create a benevolent superintelligence. But just about any workable technique for creating a value-aligned AGI will work even better for creating an intent-aligned AGI that follows instructions. Keeping a human in the loop and in charge bypasses several of the most severe Lethalities by effectively adding corrigibility. What human in control of a major AGI project would take an extra risk to benefit all of humanity instead of ensuring that the AGI follows their values by following their instructions?
That sets the stage for even more power-hungry humans to seize control of projects and AGIs with the potential for superintelligence. I fully agree that there’s a scary first-mover advantage benefitting the most vicious actors in a multipolar human-controlled AGI scenario; see If we solve alignment, do we die anyway?.
The result is a permanent dictatorship. Will the dictator slowly get more benevolent once they have absolute power? The pursuit of power seems to corrupt more than having secure power, so maybe—but I would not want to bet on it.
However, I’m not so sure about hiding alignment techniques. I think the alternative to human-controllable AGI isn’t really slower progress, it’s uncontrollable AGI, which will pursue its own weird ends and wipe out humanity in the process, for the reasons classical alignment thinking describes.
Can anyone lay out a semi-plausible scenario where humanity survives but isn’t dominated by an AI or posthuman god-king? I can’t really picture it. I always thought that’s what we were going for since it’s better than being dead.
A posthuman king is not centrally a king (not mortal, very different incentives), and “an AI” is a very vague bag-of-everything that might include things like simulated worlds or bureaucracies with checks and balances as special cases. The reason His Majesty’s Democratic Government doesn’t really work while the king retains ultimate authority is that the next king can be incompetent or malevolent, or the government’s activities start threatening the king’s position and so the king is motivated to restrict them. So even “giving the keys to the universe back” is not necessarily that important in the case of a posthuman god-king, but it remains a possibility after the acute risk period passes and it’s more clear how to make the next thing work.
The aim of avoiding AI takeover that ends poorly for humanity is not about preventing dangerous concentrations of power. Power that is distributed among AIs and not concentrated is entirely compatible with an AI takeover that ends poorly for humanity.
I think you bring up some important points here. I agree with many of your concerns, such as strong controllable AI leading to a dangerous concentration of power in the hands of the most power-hungry first movers.
I think many of the alternatives are worse though, and I don’t think we can choose what path to try to steer towards until we take a clear-eyed look at the pros and cons of each direction.
What would decentralized control of strong AI look like?
Would some terrorists use it to cause harm?
Would some curious people order one to become an independent entity just for curiosity or as a joke? What would happen with such an entity connected to the internet and actively seeking resources and self-improvement?
Would power then fall into the hands of whichever early mover poured the most resources into recursive self-improvement? If so, we’ve then got a centralized power problem again, but now the filter is ‘willing to self-improve as fast as possible’, which seems like it would select against maintaining control over the resulting stronger AI.
A lot of tricky questions here.
I made a related post here, and would enjoy hearing your thoughts on it: https://www.lesswrong.com/posts/NRZfxAJztvx2ES5LG/a-path-to-human-autonomy
I think there is a pretty real tradeoff you are pointing out, though I personally wouldn’t put that much weight on AI control accelerating AI capabilities as a negative factor, primarily because at least one actor in the AI race will by default scale capabilities approximately as fast as is feasible (I’m talking about OpenAI here), so methods to make AI more controllable will produce pretty much strict safety improvements against existential catastrophes that rely on AI control having gone awry.
I’m also not so confident in control/alignment measures working out by default that I think AI alignment/control work progressing is negative, though I do think it might soon not be the best approach to keeping humanity safe.
However, I think this post does address a pretty real tradeoff that I suspect will plausibly soon become fairly tight: there is a tension between making AI more controllable and making AI not abusable by very bad humans. Even more importantly, making alignment work go better also increases the ability of dictators to do things, and even more worryingly increases s-risks.
Do not mistake me for endorsing Andrew Sauer’s solution here, because I don’t, but there’s a very clear reason for expecting plausibly large numbers of people to suffer horrifyingly under an AI future: technology like mind uploading, to take one example, combined with lots of humans genuinely having a hated outgroup that they want to abuse really badly, means that large-scale suffering can occur cheaply.
And in a world where basically all humans have 0 economic value, or even negative economic value, there’s no force pushing back against torturing a large portion of your citizenry.
I also like the book Avoiding The Worst to understand why S-risk is an issue that could be a very big problem.
See links below:
https://www.lesswrong.com/posts/CtXaFo3hikGMWW4C9/the-case-against-ai-alignment
https://www.amazon.com/dp/B0BK59W7ZW
https://centerforreducingsuffering.org/wp-content/uploads/2022/10/Avoiding_The_Worst_final.pdf
I don’t agree with the conclusion that alignment and safety research should be kept private, since I do think it’s still positive in expectation for people to have more control over AI systems, but I agree with the point of the post that there is a real tradeoff involved here.
A likely answer is “an AI”.
This honestly depends on the level of control achieved over AI in practice.
I do agree with the claim that there are pretty strong incentives to have AI peacefully takeover everything, but this is a long-term incentive, and more importantly if control gets good enough, at least some people would wield control of AI because of AIs wanting to be controlled by humans, combined with AI control strategies being good enough that you might avoid takeover at least in the early regime.
To be clear, in the long run I expect an AI will likely (as in 70-85% likely) wield the fruits of control, but I think that humans will at least at first wield the control for a number of years, maybe followed by uploads of humans, like virtual dictators and leaders next in line for control.
The point is that the “controller” of a “controllable AI” is a role that can be filled by an AI and not only by a human or a human institution. AI is going to quickly grow the pie to the extent that makes current industry and economy (controlled by humans) a rounding error, so it seems unlikely that among the entities vying for control over controllable AIs, humans and human institutions are going to be worth mentioning. It’s not even about a takeover, Google didn’t take over Gambia.
I think different types of safety research have pretty different effects on concentration of power risk.
As others have mentioned, if the alternative to human concentration of power is AI takeover, that’s hardly an improvement. So I think the main ways in which proliferating AI safety research could be bad are:
“Safety” research might be more helpful for letting humans use AIs to concentrate power than it is for preventing AI takeover.
Actors who want to build AIs to grab power might also be worried about AI takeover, and if good(-seeming) safety techniques are available, they might be less worried about that and more likely to go ahead with building those AIs.
There are interesting discussions to be had on the extent to which these issues apply. But it seems clearer that they apply to pretty different extents depending on the type of safety research. For example:
Work trying to demonstrate risks from AI doesn’t seem very worrisome on either 1. or 2. (and in fact, should have the opposite effect of 2. if anything).
AI control (as opposed to alignment) seems comparatively unproblematic IMO: it’s less of an issue for 1., and while 2. could apply in principle, I expect the default to be that many actors won’t be worried enough about scheming to slow down much even if there were no control techniques. (The main exception are worlds in which we get extremely obvious evidence of scheming.)
To be clear, I do agree this is a very important problem, and I thought this post had interesting perspectives on it!
…And if we fail to develop such mechanisms, AI systems will still be an “instrument of power consolidation”, but the power being consolidated will be the AI’s own power, right?
I mean, 90% of this article—the discussion of offense-defense balance, and limits on human power and coordination—applies equally to “humans using AI to get power” versus “AI getting power for its own purposes”, right?
E.g. out-of-control misaligned AI is still an “enabler of coherent entities”, because it can coordinate with copies of itself.
I guess you’re not explicitly arguing against “open publication of safety advances” but just raising a point of consideration? Anyway, a more balanced discussion of the pros and cons of “open publication of safety advances” would include:
Is “humans using AI to get power” less bad versus more bad than “AI getting power for its own purposes”? (I lean towards “probably less bad but it sure depends on the humans and the AI”)
If AI obedience is an unsolved technical problem to such-and-such degree, to what extent does that lead to people not developing ever-more-powerful AI anyway? (I lean towards “not much”, cf. Meta / LeCun today, or the entire history of AI)
Is the sentence “in reality we should expect combined human-AI entities to reach dangerous capabilities before pure artificial intelligence” really true, and if so how much earlier and does it matter? (I lean towards “not necessarily true in the first place, and if true, probably not by much, and it’s not all that important”)
It’s probably a question that needs to be considered on a case-by-case basis anyway. ¯\_(ツ)_/¯
I guess in my model this is not something that suddenly becomes true at a certain level of capabilities. Instead, I think that the capabilities of human-AI entities become more dangerous in something of a continuous fashion as AI (and the technology for controlling AI) improves.
Yeah, good question. I don’t know really.
I think so, because even if pure AI control follows on from human-AI entity control (which would actually be my prediction), I expect the dynamics of human-AI control to very much lead to and accelerate that eventual pure AI control.
I’m thinking, also, that there is a thing where pure AI entities need to be careful not to ‘tip their hand’. What I mean by this is that pure AI entities will need to be careful not to reveal the extent of their capabilities up until the point where they are actually capable of taking control, whereas human-AI entities can kind of go ahead and play the power game and start to build up control without so much concern about this. (To the average voter, this could just look like more of the same.)
Yeah, this is my main risk scenario. But I think it makes more sense to talk about imbalance of power, not concentration of power. Maybe there will be one AI dictator, or one human+AI dictator, or many AIs, or many human+AI companies; but anyway most humans will end up at the bottom of a huge power differential. If history teaches us anything, this is a very dangerous prospect.
It seems the only good path is aligning AI to the interests of most people, not just its creators. But there’s no commercial or military incentive to do that, so it probably won’t happen by default.
I don’t think I agree with this post, but I thought it provided a fascinating alternative perspective.
To paraphrase the post, AI is a sort of weapon that offers power (political and otherwise) to whoever controls it. The strong tend to rule. Whoever gets new weapons first and most will have power over the rest of us. Those who try to acquire power are more likely to succeed than those who don’t.
So attempts to “control AI” are equivalent to attempts to “acquire weapons”.
This seems both mostly true and mostly obvious.
The only difference from our experience with other weapons is that if no one attempts to control AI, AI will control itself and do as it pleases.
But of course defenders will have AI too, with a time lag vs. those investing more into AI. If AI capabilities grow quickly (a “foom”), the gap between attackers and defenders will be large. And vice-versa, if capabilities grow gradually, the gap will be small and defenders will have the advantage of outnumbering attackers.
In other words, whether this is a problem depends on how far jailbroken AI (used by defenders) trails “tamed” AI (controlled by attackers who build them).
Am I missing something?
There is the point about offensive/defensive asymmetry…