Where the Good Regulator Theorem breaks down isn’t in whether or not people understand it. Consensus agreement with the theorem is easy to assess; the theorem itself need not be. That is, if one believes that in-group consensus among experts is the strongest indicator of truth, then as long as the expert consensus holds that the Good Regulator Theorem is valid, everyone else can simply check whether the consensus supports the theorem, rather than evaluating the theorem itself. In practice, this is what most people do most of the time. For example, many people believe the Big Bang happened. Of those people, how many can derive that belief from first principles, versus how many simply rely on what the consensus of experts says is true? I would wager that the number who can actually justify their reasoning is vanishingly small.
Instead, where the theorem breaks down is in being misapplied. The Good Regulator Theorem states: “Every good regulator of a system must contain a model of that system.” This seems straightforward until you try to apply it to AI alignment, at which point everything hinges on two usually implicit assumptions: (1) what the “system” is, and (2) what counts as a “model” of it. Alignment discourse often collapses at one of these two junctures, usually without noticing.
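For readers who want the formal version, here is a compact sketch of how the theorem is usually stated (the notation is mine, and it compresses Conant and Ashby’s 1970 setup considerably):

```latex
Let $S$ be the set of system states, $R$ the set of regulator states, and let
the outcome be determined by a map $\psi : S \times R \to Z$. Call a regulator
\emph{good} if it minimizes the entropy $H(Z)$ of the outcomes. The theorem
then says, roughly: if $R$ is a good regulator of $S$ and is as simple as
possible, its behavior is a function of the system's state,
\[
  \exists\, h : S \to R \quad \text{such that} \quad R = h(S),
\]
so the regulator's states form a (homomorphic) model of the system.
```

Everything below is about how to read S and h when the regulator is an AI and the system it regulates includes us.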
1. First Assumption: What Kind of Model Are We Talking About?
The default move in a lot of alignment writing is to treat “a model of human values” as something that could, in principle, be encoded in a fixed list of goals, constraints, or utility function axioms. This might work in a narrow, closed domain—say, aligning a robot arm to not smash the beaker. But real-world human environments are not closed. They’re open, dynamic, generative, and full of edge cases humans haven’t encountered yet, cases where human values aren’t just hard to define—they’re still under active construction.
Trying to model human values with a closed axiom set in this kind of domain is like trying to write down the rules of language learning before you’ve invented language. It’s not just brittle—it’s structurally incapable of adapting to novel inputs. A more accurate model needs to capture how humans generate, adapt, and re-prioritize values based on context. That means modeling the function of intelligence itself, not just the outputs of that function. In other words, when applied to AI alignment, the Good Regulator Theorem implies that alignment requires a functional model of intelligence—because the system being regulated includes the open-ended dynamics of human cognition, behavior, and values.
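To make the contrast concrete, here is a deliberately tiny sketch (the class names and weight scheme are mine, purely illustrative, not a proposal): a “closed” value model whose priorities are fixed at design time, next to an “open” one that revises its priorities as new contexts and feedback arrive.

```python
from dataclasses import dataclass, field

# Illustrative toy only. A "closed" value model scores outcomes against weights
# fixed at design time; an "open" one revises its weights as feedback arrives,
# i.e. it tracks the value-generating process rather than a finished value list.

@dataclass
class ClosedValueModel:
    weights: dict                              # fixed at design time

    def score(self, outcome: dict) -> float:
        # Features the designers never anticipated contribute nothing:
        # this is the brittleness described above.
        return sum(self.weights.get(k, 0.0) * v for k, v in outcome.items())

@dataclass
class OpenValueModel:
    weights: dict = field(default_factory=dict)

    def observe(self, feedback: dict) -> None:
        # Re-prioritize: create or adjust weights from feedback in context.
        for k, delta in feedback.items():
            self.weights[k] = self.weights.get(k, 0.0) + delta

    def score(self, outcome: dict) -> float:
        return sum(self.weights.get(k, 0.0) * v for k, v in outcome.items())

# Usage sketch:
closed = ClosedValueModel(weights={"safety": 1.0})
open_m = OpenValueModel()
open_m.observe({"safety": 1.0})
open_m.observe({"privacy": 0.8})               # a value the designers never listed
print(closed.score({"privacy": 1.0}))          # 0.0 -- invisible to the closed model
print(open_m.score({"privacy": 1.0}))          # 0.8
```

The code is trivial on purpose; the relevant difference is which object the word “model” refers to—a fixed list of values, or the process that produces and revises them.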
2. Second Assumption: What Is the System Being Regulated?
Here’s the second failure point: What exactly is the AI regulating? The obvious answer is “itself”—that is, the AI is trying to keep its own outputs aligned with human goals. That’s fine as far as it goes. But once you build systems that act in the world, they’re not just regulating themselves anymore. They’re taking actions that affect humans and society. So in practice, the AI ends up functioning as a regulator of human-relevant outcomes—and by extension, of humans themselves.
The difference matters. If the AI’s internal model is aimed at adjusting its own behavior, that’s one kind of alignment problem. If its model is aimed at managing humans to achieve certain ends, that’s a very different kind of system, and it comes with a much higher risk of manipulation, overreach, or coercion—especially if the designers don’t realize they’ve shifted frames.
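To see how easily the frame shifts, here is another minimal sketch (every function name and number is mine, purely hypothetical): the same selection machinery, aimed once at the agent’s own outputs and once at the human’s predicted state.

```python
# Illustrative toy only: the same selection code pointed at two different
# regulated systems.

def regulate_own_output(candidates, goal_fit):
    """Loop 1: the model picks the agent's OWN output that best fits the
    stated human goal (the AI regulating itself)."""
    return max(candidates, key=goal_fit)

def regulate_human_state(actions, predict_human_state, target_state):
    """Loop 2: the model picks the action predicted to push the HUMAN
    closest to a target state (the AI regulating the human)."""
    return min(actions, key=lambda a: abs(predict_human_state(a) - target_state))

# Usage sketch with stand-in values:
drafts = ["short summary", "a much longer and more detailed summary"]
print(regulate_own_output(drafts, goal_fit=len))            # self-regulation

nudges = ["no reminder", "gentle reminder", "daily nagging"]
engagement = {"no reminder": 0.2, "gentle reminder": 0.5, "daily nagging": 0.9}
print(regulate_human_state(nudges, engagement.get, target_state=0.9))
# Same code shape, but now the "system" in the theorem is the human.
```

Nothing in the second function is exotic; it is just the first function with the regulated system swapped out, which is exactly the shift that is easy to make without noticing.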
The Core Problem
So here’s the bottom line: Even if you invoke the Good Regulator Theorem, you can still end up building something misaligned if you misunderstand what the AI is regulating or what it means to “contain a model” of us.
If you assume values are static and encodable, you get brittle optimization.
If you misidentify the system being regulated, you might accidentally build a coercive manager instead of a corrigible assistant.
The Good Regulator Theorem doesn’t solve alignment for you. But it does make one thing non-negotiable: whatever system you think you’re regulating, you’d better be modeling it correctly. If you get that part wrong, everything else collapses downstream.