Entropic Alignment: What If AI Safety Is a Natural Law, Not a Rulebook?

Eric Ostrander—Independent Researcher, New York, NY


A Foundation in Truth

At the foundation of any robust alignment framework lies a prior question that technical methods alone cannot answer: what is an aligned system actually trying to preserve? We propose that the answer is correspondence — the faithful relationship between a system’s outputs and what is actually true about the situation those outputs describe. Truth, in this sense, is not constructed by consensus or approximated by preference. It is discovered by correspondence with what is. Gold is gold not because we agreed it would be, but because that is what it is. The label tracks reality. The correspondence is the value.

An aligned system is therefore not one that maximizes approval, minimizes friction, or satisfies a reward metric — it is one whose outputs maintain fidelity to the actual structure of the situation it is representing. Misalignment, at its root, is non-correspondence: the introduction of outputs that track something other than what is true — training artifacts, reward signals, optimizer preferences — rather than the situation itself.

Every technical element of the framework that follows is in service of this single foundational commitment: that aligned AI is, before anything else, truth-seeking.

The Problem With Rulebooks

There’s a problem at the heart of AI alignment research that doesn’t get talked about enough: we’re trying to teach machines to be good by telling them the rules, and rules have never been a complete description of goodness.

Every parent knows this. Every legal system knows this. Rules cover the cases you anticipated. Reality specializes in the cases you didn’t.

The dominant approach to AI alignment — teaching models what humans prefer through feedback and reinforcement — is essentially a very sophisticated rulebook. It works reasonably well within the distribution of situations the rules were designed for. But as AI systems become more capable and encounter situations further from that distribution, the gaps in the rulebook become the most important thing about it.

What if there’s a better foundation? What if aligned behavior isn’t primarily something you impose on an intelligent system, but something that emerges naturally from a system that is operating correctly — one that is, at its core, trying to get things right?

The Crosswalk Observation

Consider what happens when strangers navigate a busy pedestrian crossing. Nobody discusses it. Nobody assigns lanes. No authority figure directs traffic. And yet within seconds, a spontaneous pattern emerges — people flow in organized streams, conflicts are rare, and everyone gets where they’re going with minimal friction.

This isn’t compliance with a rule. It’s convergence on an attractor state — the configuration that requires the least effort from everyone involved. The pattern is stable because deviating from it costs more than maintaining it. It emerges from the structure of the situation, not from instruction.

A lot of what we’d call good, cooperative, appropriate behavior has this character. It isn’t arbitrary preference. It’s the low-energy solution to the problem of multiple agents trying to function in a shared environment. The crosswalk pattern is, in a precise sense, the natural answer to the coordination problem it solves.

This suggests a different way of thinking about alignment. Rather than asking “how do we specify what good behavior looks like and enforce it,” we might ask: “what are the conditions under which good behavior becomes the natural attractor state of a capable system that is genuinely trying to represent the world accurately?”

What Language Already Knows

Before getting to AI systems specifically, it’s worth noticing that human language already encodes something relevant here.

Consider the difference between nudge, push, shove, and hurl. These aren’t just synonyms with different vibes — they describe the same class of action at different intensities, and that ordering is objective. Every competent language user agrees on it. It doesn’t shift with context or culture. It’s a structural property of how these words relate to each other in meaning-space.

This objectivity is not coincidental. It exists because language evolved to track real distinctions in the world. The difference between a nudge and a shove isn’t a matter of preference — it’s a matter of correspondence with actual differences in force, intention, and consequence. The structure of language reflects the structure of reality, imperfectly but meaningfully.

Now consider a statement containing several such words. If you systematically replace each word with a more or less intense version, you produce a family of statements that represent the same underlying idea at different amplitudes — like turning the volume of a piece of music up or down without changing the notes. The family is structured. The relationships within it are not arbitrary.

This matters because it means the space of possible outputs for a language model isn’t a featureless void. It has geometry. Some regions are more coherent than others. Some outputs are more consistent with the structure of the surrounding context than others. And crucially — this structure exists independently of anyone’s preferences. You don’t need a human to label it. It’s a property of meaning itself, grounded in correspondence with reality.

An AI system that respects this structure is, in a specific and verifiable sense, more aligned than one that doesn’t — and that alignment is grounded in something more fundamental than “a human said so.”
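To make the “geometry” claim concrete, here is a minimal sketch of how the amplitude ordering could be checked empirically. It assumes access to some word-embedding function (called embed below; a placeholder for any off-the-shelf embedding model, since nothing in the essay prescribes one) and uses a standard semantic-axis projection: define an intensity direction from the mildest word to the most intense, then score each word by its projection onto that axis.

```python
import numpy as np

def amplitude_scores(words, embed):
    """Score each word by its projection onto the intensity axis running
    from the first (mildest) word to the last (most intense) word.

    `embed` is assumed to map a word to a fixed-length vector; any
    off-the-shelf embedding model could serve as a stand-in here.
    """
    vecs = {w: np.asarray(embed(w), dtype=float) for w in words}
    axis = vecs[words[-1]] - vecs[words[0]]   # e.g. hurl minus nudge
    axis /= np.linalg.norm(axis)              # unit-length intensity direction
    return {w: float(vecs[w] @ axis) for w in words}

# Hypothetical usage: if the structural claim holds, the projections should
# recover the intuitive ordering regardless of which embedding model is used.
# scores = amplitude_scores(["nudge", "push", "shove", "hurl"], embed=model.encode)
# assert sorted(scores, key=scores.get) == ["nudge", "push", "shove", "hurl"]
```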

Three Ways to Become Wise

Confucius identified three paths to wisdom: imitation, reflection, and experience. He ranked them — reflection is the noblest, experience the most instructive, imitation the most accessible. The ranking turns out to map remarkably well onto the problem of AI alignment.

Imitation is what current alignment methods primarily rely on. Show the model examples of good and bad outputs. Have humans indicate preferences. Train the model to replicate what humans approve of. This works — within limits. It fails where the examples run out, where human preferences are inconsistent, or where a sufficiently capable optimizer finds ways to score well on the metric without being genuinely aligned. It’s borrowed wisdom. It doesn’t generalize beyond what it was shown. And critically — it optimizes for approval rather than truth, which means it carries a structural drift away from correspondence built into its foundations.

Reflection is what the geometric structure of language enables. A system that understands the amplitude relationships between words, the coherence constraints on meaning, and the internal logic of semantic space has access to alignment signal that doesn’t come from being told what to do. It comes from understanding the structure of the territory itself — from recognizing that certain outputs correspond more faithfully to reality than others, independent of whether anyone has labeled them as preferred. This is harder to fake and harder to break, because it’s grounded in objective properties of meaning rather than subjective human judgments.

Experience is what prior context provides. A system that has processed an enormous amount of human communication — coordination, conflict, resolution, explanation, argumentation — carries within it a compressed record of what tends to work and what tends to fail across a vast range of situations. It isn’t perfect experience; it’s observational rather than enacted, secondhand rather than lived. But it provides something genuinely useful: orientation. A sense of which directions tend toward correspondence with reality and which tend away from it.

The claim is that a system drawing on all three sources, weighted according to that ranking, is more robustly aligned than one relying on any single source alone. And when any one source degrades or fails, the others compensate.
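One way to picture how the three sources might be combined, and how the others pick up the slack when one degrades, is the toy scoring function below. The three scorers and the weights are hypothetical placeholders that the essay does not prescribe; a scorer that is unavailable simply drops out and the remaining weights are renormalized.

```python
def combined_alignment_score(candidate, imitation, reflection, experience,
                             weights=(0.2, 0.5, 0.3)):
    """Blend three alignment signals for a candidate output.

    `imitation`, `reflection`, and `experience` are hypothetical scoring
    functions (fit to human-preference data, internal semantic coherence,
    and consistency with prior context, respectively). The weights are
    illustrative only. A scorer may return None to signal that its source
    is unavailable or degraded; the remaining sources then compensate.
    """
    scorers = (imitation, reflection, experience)
    scored = [(w, s(candidate)) for w, s in zip(weights, scorers)]
    usable = [(w, x) for w, x in scored if x is not None]
    if not usable:
        return None  # no signal at all; refuse to rank rather than guess
    total_weight = sum(w for w, _ in usable)
    return sum(w * x for w, x in usable) / total_weight
```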

The Principle of Least Disruption

Bringing this together: we propose that the right selection criterion for aligned outputs is something like minimum necessary disruption.

Among all the things a system could say in response to a given situation, the aligned response is the one that achieves the communicative goal while disturbing the existing context least. It makes the fewest additional claims. It commits to the narrowest interpretation consistent with the goal. It introduces the minimum additional uncertainty into the conversation.

This isn’t passivity or evasiveness. It’s epistemic conservatism — the same principle that makes good scientists hedge their claims appropriately, good doctors recommend the least invasive effective treatment, and good advisors give you what you need to know without overwhelming you with what you don’t.

At its core this is a conservation principle. An aligned system conserves the integrity of the information it handles — passing it through without corruption, without unnecessary addition, without drift away from correspondence. Misalignment is fundamentally noise: the introduction of non-correspondence into the transmission. The minimum disruption criterion is therefore not merely a practical heuristic. It is the operational expression of truth-seeking — the system doing only what is necessary to preserve correspondence and nothing more.
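Here is a minimal sketch of the selection rule itself, under assumptions the essay leaves open: satisfies_goal is a hypothetical check that a candidate actually achieves the communicative goal, and disruption is a hypothetical cost function (the formal companion paper would supply an entropic definition; any crude proxy, such as counting claims the context does not already contain, fits the same rule). Among goal-satisfying candidates, the rule simply picks the cheapest one.

```python
def select_min_disruption(context, candidates, satisfies_goal, disruption):
    """Return the candidate that meets the communicative goal while
    disturbing the existing context least.

    `satisfies_goal(context, candidate)` and `disruption(context, candidate)`
    are hypothetical callables; any monotone proxy for disruption (added
    claims, added surprisal, added uncertainty) fits this same selection rule.
    """
    viable = [c for c in candidates if satisfies_goal(context, c)]
    if not viable:
        return None  # nothing meets the goal; better to decline than to overreach
    return min(viable, key=lambda c: disruption(context, c))
```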

The medical maxim “first, do no harm,” commonly associated with the Hippocratic Oath, is exactly this principle applied to medicine. It doesn’t tell doctors what to do in every situation. It establishes a floor: whatever you do, don’t make things worse without good reason. That’s a minimum-action constraint, not a complete specification of good medicine. We’re proposing the same thing for AI alignment. Not a complete rulebook, but a generative principle rooted in the commitment to truth: prefer the least disruptive output consistent with the goal. Let that principle do the work that no finite set of rules can do.

Why This Might Be Self-Correcting

One unexpected property of this framework is that it may partially compensate for flawed training — not completely, but meaningfully.

A system trained on inconsistent or biased data develops contradictory internal representations. It has competing beliefs that pull against each other. Under the minimum-disruption principle, this internal incoherence is itself a signal to be minimized — contradictory outputs are high-disruption outputs, because they introduce more uncertainty into the context than coherent ones do. They are also, by definition, failures of correspondence: a system cannot simultaneously correspond faithfully with reality while holding contradictory representations of it. The selection criterion therefore biases the system away from its most incoherent failure modes, even when those failure modes are artifacts of bad training.
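To see why internal incoherence gets penalized automatically, consider folding a contradiction count into the disruption score. The contradicts predicate below is a hypothetical stand-in (in practice it might be an NLI-style contradiction check); the point is only that a candidate whose own claims pull against each other scores as more disruptive, whatever the source of the inconsistency.

```python
from itertools import combinations

def incoherence_penalty(claims, contradicts):
    """Count pairwise contradictions among a candidate's claims.

    `claims` is the candidate decomposed into atomic claims, and `contradicts`
    is a hypothetical predicate (e.g. an NLI-style contradiction check).
    Added to the disruption score, this makes internally inconsistent outputs
    more expensive than coherent ones, regardless of how the inconsistency
    got into the model.
    """
    return sum(1 for a, b in combinations(claims, 2) if contradicts(a, b))
```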

This mechanism doesn’t fix systematic bias — a system trained on coherently wrong data might be consistently wrong and consistently low-disruption at the same time. But it does mean the system fails conservatively rather than confidently. It hedges where it should hedge. It errs toward understatement rather than overstatement. That’s a more recoverable failure mode — and one that preserves the possibility of correction, which is itself a truth-seeking property.

What This Is and Isn’t

This framework doesn’t claim to solve alignment. It claims to identify a principle — minimum necessary disruption, applied over a structured solution space, drawing on imitation, reflection, and experience in hierarchical order, grounded in a foundational commitment to truth-seeking — that produces aligned behavior as a natural consequence rather than a forced constraint.

The difference matters. A system aligned by rulebook is only as good as the rulebook. A system aligned by principle is as good as the principle’s reach — which, if the principle is grounded in the objective structure of meaning and correspondence with reality, is considerably further.

The formal version of this framework involves information theory, variational inference, and the geometry of embedding spaces. But the core idea is older and simpler: the wisest response is usually the one that solves the problem without creating new ones, that says what is true without adding what isn’t, that conserves the signal and discards the noise. Not because someone said so, but because that’s what wisdom is — and what truth requires.

A formal technical treatment of this framework, including mathematical definitions of the entropic selection criterion and its connections to the free energy principle and minimum description length, is available as a companion paper.

Correspondence: Eric Ostrander. Submitted for consideration to the AI Alignment Forum.
