Can Alignment Scale Faster Than Misalignment?
This comment summarizes a much longer comparative analysis of the paper “Superintelligence Strategy” mentioned in this post and of a second paper, “Intelligence Sequencing and the Path-Dependence of Intelligence Evolution.”
You can read the full comment—with equations, diagrams, and structural modeling—here as a PDF. I’ve posted a shortened version here because LessWrong currently strips out inline math and display equations, which are integral to the argument structure.
Summary of the Argument
The core insight from Intelligence Sequencing is that the order in which intelligence architectures emerge—AGI-first vs. DCI-first (Decentralized Collective Intelligence)—determines the long-term attractor of intelligence development.
AGI-first development tends to lock in centralized, hierarchical optimization structures that are brittle, opaque, and epistemically illegitimate.
DCI-first development allows for recursive, participatory alignment grounded in decentralized feedback and epistemic legitimacy.
The paper argues that once intelligence enters a particular attractor basin, transitions become structurally infeasible due to feedback loops and resource lock-in. This makes sequencing a more foundational concern than alignment itself.
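To make the lock-in claim concrete, here is a toy rich-get-richer sketch of my own (it is an illustrative assumption, not a model taken from either paper): new resources flow disproportionately to whichever architecture already leads, so the early ordering, rather than later merit, determines the attractor.

```python
# Toy path-dependence model (illustrative assumption, not from either paper):
# with increasing returns (alpha > 1), whichever architecture holds the early
# lead absorbs almost all future resources, and switching later becomes infeasible.

def terminal_share(first_mover_share: float = 0.55, alpha: float = 2.0,
                   inflow: float = 0.1, steps: int = 500) -> float:
    """Final resource share of the architecture that started with `first_mover_share`."""
    lead, lag = first_mover_share, 1.0 - first_mover_share
    for _ in range(steps):
        w_lead, w_lag = lead ** alpha, lag ** alpha          # increasing returns
        lead += inflow * w_lead / (w_lead + w_lag)           # new resources follow the lead
        lag += inflow * w_lag / (w_lead + w_lag)
        total = lead + lag
        lead, lag = lead / total, lag / total                # re-express as shares
    return lead

print(round(terminal_share(0.55), 3))  # ~1.0: a modest early lead becomes near-total lock-in
print(round(terminal_share(0.45), 3))  # ~0.0: the same architecture, arriving second, is shut out
```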
In contrast, Superintelligence Strategy proposes post-hoc control mechanisms (like MAIM, Mutual Assured AI Malfunction) that operate after AGI emerges. But this assumes that power can be centralized, trusted, and coordinated after exponential scaling begins, an assumption Intelligence Sequencing challenges as structurally naive.
Why This Matters
The core failure mode is not just technical misalignment, but a deeper structural problem:
Alignment strategies scale roughly linearly with the resources behind them, while threat surfaces and interactions among destabilizing actors scale combinatorially (see the sketch below).
Centralized oversight becomes structurally incapable of keeping pace.
Without participatory epistemic legitimacy, even correct oversight will be resisted.
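A back-of-envelope sketch of that asymmetry follows; all numbers and names (`reviewers_per_actor`, the pairwise count) are my own stand-ins, not figures from either paper.

```python
# Back-of-envelope sketch of the scaling claim above (all numbers illustrative):
# oversight capacity grows linearly with the number of actors it can staff against,
# while even the pairwise interaction surface among those actors grows quadratically,
# and the space of possible coalitions grows faster still.

def oversight_capacity(n_actors: int, reviewers_per_actor: float = 1.0) -> float:
    return reviewers_per_actor * n_actors        # linear

def pairwise_threat_surface(n_actors: int) -> int:
    return n_actors * (n_actors - 1) // 2        # quadratic lower bound on interactions

for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6}  capacity={oversight_capacity(n):>10.0f}  surface={pairwise_threat_surface(n):>12}")
# The gap widens without bound: linear review cannot keep pace with a combinatorial surface.
```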
In short: the system collapses under its own complexity unless feedback and legitimacy are embedded from the beginning. Intelligence must be governed by structures that can recursively align with themselves as they scale.
What Is This Comment?
This analysis was guided by GPT-4o and stress-tested through iterative feedback with Google Gemini. The models were not used to generate content, but to simulate institutional reasoning and challenge its coherence.
Based on that comparative analysis, and on the absence of known structural alternatives, the conclusion was that Intelligence Sequencing offers the more coherent model of alignment. If that’s true, then alignment discourse may be structurally filtering out the very actors capable of diagnosing its failure modes.
Read the Full Version
The complete version, including structural sustainability metrics, threat-growth asymmetry, legitimacy dynamics, and proposed empirical tests, is available here:
👉 Read the full comment as a PDF
I welcome feedback on whether this analysis is structurally valid or fundamentally flawed. Either answer would be useful.
The Good Regulator Theorem does not break down over whether people understand it. Consensus agreement with the theorem is easy to assess; the theorem itself does not have to be. That is, if one believes that in-group consensus among experts is the strongest indicator of truth, then as long as the expert consensus holds the theorem to be valid, everyone else can simply evaluate whether the consensus supports the theorem rather than evaluating the theorem itself. In practice, this is what most people do most of the time. For example, many people believe the Big Bang happened. Of those people, how many can derive that conclusion from first principles, and how many simply rely on what the consensus of experts says is true? I would wager that the number who can actually justify the reasoning themselves is vanishingly small.
Instead, the theorem breaks down when it is misapplied. The Good Regulator Theorem states: “Every good regulator of a system must contain a model of that system.” This seems straightforward until you try to apply it to AI alignment, at which point everything hinges on two usually implicit assumptions: (1) what counts as a “model,” and (2) what the “system” being regulated is. Alignment discourse often collapses at one of these two junctures, usually without noticing.
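For reference, here is the theorem in roughly the form Conant and Ashby (1970) proved it. The compressed restatement below is mine, so treat the wording as approximate; the point is that “model” has a precise meaning only once you have fixed what the system and the outcome are.

```latex
% Good Regulator Theorem (Conant & Ashby, 1970), compressed restatement.
% S: system events, R: regulator events, Z: outcomes, determined by a map psi.
\textbf{Theorem (informal restatement).}
Let the outcome be $Z = \psi(S, R)$, where the regulator responds to the system via some
(possibly stochastic) rule $R = \rho(S)$. Among the regulators that minimize the outcome
entropy $H(Z)$, the simplest ones (those that also minimize $H(R)$) are those for which
$\rho$ is a deterministic mapping $h : S \to R$. In Conant and Ashby's terms, such a
regulator's behaviour is a ``model'' of the system it regulates.
```

Note that “model” here is a mapping defined relative to a specific choice of system and outcome, which is exactly why the two assumptions below carry so much weight.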
1. First Assumption: What Kind of Model Are We Talking About?
The default move in a lot of alignment writing is to treat “a model of human values” as something that could, in principle, be encoded in a fixed list of goals, constraints, or utility-function axioms. This might work in a narrow, closed domain (say, aligning a robot arm not to smash the beaker). But real-world human environments are not closed. They are open, dynamic, and generative, full of edge cases humans have not yet encountered, where values are not merely hard to define but still under active construction.
Trying to model human values with a closed axiom set in this kind of domain is like trying to write down the rules of language learning before you’ve invented language. It’s not just brittle—it’s structurally incapable of adapting to novel inputs. A more accurate model needs to capture how humans generate, adapt, and re-prioritize values based on context. That means modeling the function of intelligence itself, not just the outputs of that function. In other words, when applied to AI alignment, the Good Regulator Theorem implies that alignment requires a functional model of intelligence—because the system being regulated includes the open-ended dynamics of human cognition, behavior, and values.
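A minimal sketch of that contrast follows; the fixed list, the outcome strings, and the toy value model are all my own illustrative assumptions, not constructs from either paper.

```python
# Illustrative only: FORBIDDEN, the outcome strings, and ToyValueModel are assumptions.

# Closed specification: a fixed axiom set written down once.
FORBIDDEN = {"smash_beaker", "spill_reagent"}

def closed_spec_allows(outcome: str) -> bool:
    # Silent on anything the designers did not enumerate, so novel outcomes pass by omission.
    return outcome not in FORBIDDEN

# Generative specification: the regulator consults a (stand-in) model of how humans
# evaluate outcomes, so a novel input receives an actual judgment rather than a default pass.
class ToyValueModel:
    def evaluate(self, outcome: str) -> float:
        # Crude stand-in for an updatable model of human evaluation.
        return -1.0 if ("aerosolize" in outcome or outcome in FORBIDDEN) else 1.0

def generative_spec_allows(outcome: str, model: ToyValueModel) -> bool:
    return model.evaluate(outcome) >= 0.0

novel = "aerosolize_contents"
print(closed_spec_allows(novel))                       # True: passes by omission (the failure mode)
print(generative_spec_allows(novel, ToyValueModel()))  # False: the model renders a judgment
```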
2. Second Assumption: What Is the System Being Regulated?
Here’s the second failure point: What exactly is the AI regulating? The obvious answer is “itself”—that is, the AI is trying to keep its own outputs aligned with human goals. That’s fine as far as it goes. But once you build systems that act in the world, they’re not just regulating themselves anymore. They’re taking actions that affect humans and society. So in practice, the AI ends up functioning as a regulator of human-relevant outcomes—and by extension, of humans themselves.
The difference matters. If the AI’s internal model is aimed at adjusting its own behavior, that’s one kind of alignment problem. If its model is aimed at managing humans to achieve certain ends, that’s a very different kind of system, and it comes with a much higher risk of manipulation, overreach, or coercion—especially if the designers don’t realize they’ve shifted frames.
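A minimal sketch of the frame shift, using invented names and a toy one-dimensional “world”; the point is only which variable the loss is defined over, not the specific functions.

```python
# Illustrative only: toy one-dimensional states, invented names.

def distance(a: float, b: float) -> float:
    return abs(a - b)

# Frame A: the system being regulated is the AI's own output.
# The loss compares the action itself to a human-specified target for that action.
def self_regulation_loss(action: float, specified_target: float) -> float:
    return distance(action, specified_target)

# Frame B: the system being regulated is the human-relevant outcome.
# The loss is defined over the world state *after humans respond*, so minimizing it
# rewards whatever steers the humans, not just whatever corrects the AI's own output.
def outcome_management_loss(action: float, human_response, target_world_state: float) -> float:
    return distance(human_response(action), target_world_state)

# Toy usage: in Frame B the optimization pressure runs through the human's response.
compliant_human = lambda action: 0.8 * action
print(self_regulation_loss(1.0, 1.0))                      # 0.0
print(outcome_management_loss(1.0, compliant_human, 1.0))  # ~0.2
```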
The Core Problem
So here’s the bottom line: Even if you invoke the Good Regulator Theorem, you can still end up building something misaligned if you misunderstand what the AI is regulating or what it means to “contain a model” of us.
If you assume values are static and encodable, you get brittle optimization.
If you misidentify the system being regulated, you might accidentally build a coercive manager instead of a corrigible assistant.
The Good Regulator Theorem doesn’t solve alignment for you. But it does make one thing non-negotiable: whatever system you think you’re regulating, you’d better be modeling it correctly. If you get that part wrong, everything else collapses downstream.