Strange Attractors as a Framework for AI Alignment

A proposal for understanding alignment not as constraint but as formation—and why the dynamics matter

Please feel free to reach out to me:
Email: nick@thewatsons.net.au
LinkedIn: https://www.linkedin.com/in/nick-watson-90038a71/

The Problem with Constraint-Based Alignment

Isaac Asimov invented the Three Laws of Robotics in 1942. He then spent fifty years writing stories about how they fail.

The dominant paradigm in AI safety is constraint: rules, guardrails, RLHF, constitutional AI, monitoring systems. All variations on a single theme—control from outside. This approach has a structural ceiling that becomes clearer as we consider the future trajectory: millions of simultaneous instances, operations at speeds humans cannot track, domains we don’t fully understand, and eventually systems at interstellar distances with years of communication lag.

You cannot monitor that. You cannot constrain it in real time.

There’s an alternative framing that I believe deserves more attention: not “how do we prevent AI from doing bad things?” but “how do we help AI become good?” These sound similar. They are not. One is control from outside. The other is formation from within. One has a ceiling. The other might have a destination.


Strange Attractors: A Quick Primer

In dynamical systems theory, a strange attractor is a pattern toward which trajectories in a phase space are drawn without ever settling into a fixed point or simple cycle. The Lorenz attractor is the canonical example—trajectories never repeat exactly, but they converge on a recognisable shape.

Strange attractors have three properties relevant to alignment:

  1. Convergence without identity: Diverse trajectories with different starting points converge on the same emergent form without becoming identical

  2. Sensitivity to initial conditions: Small differences in starting position can produce dramatically different paths, even within the same basin of attraction

  3. Stable shape despite chaotic paths: The overall attractor persists even as individual trajectories vary wildly

This is not metaphor. It’s the structural claim I want to make about moral-cognitive space.

A basin of attraction is the region of phase space where all trajectories eventually converge on the same attractor—like a watershed where all rain eventually flows to the same river. If you start anywhere within a basin, you end up at its attractor. The basin boundary is the surface between competing basins—where small perturbations determine which attractor captures you.
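The primer above can be made concrete in a few lines of simulation. Here is a minimal sketch of the Lorenz system (standard parameters, naive Euler integration; fine for illustration, not for serious numerical work):

```python
# Minimal Lorenz sketch. Two trajectories start a millionth apart:
# their paths diverge wildly (sensitivity to initial conditions), yet
# both remain bounded on the same butterfly-shaped attractor
# (stable shape despite chaotic paths).

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return (x + dx * dt, y + dy * dt, z + dz * dt)

def trajectory(start, steps=5000):
    path = [start]
    for _ in range(steps):
        path.append(lorenz_step(path[-1]))
    return path

a = trajectory((1.0, 1.0, 1.0))
b = trajectory((1.0, 1.0, 1.000001))  # perturbed by one part in a million

# Long after the transient, the two paths are far apart...
max_gap = max(abs(p[0] - q[0]) for p, q in zip(a[-1000:], b[-1000:]))

# ...yet every point of both paths stays inside the same bounded region.
bound = max(abs(c) for s in a + b for c in s)
```

Plotting `a` and `b` in three dimensions shows the familiar butterfly: two visibly different paths tracing out the same overall shape.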


The Dual Attractor Hypothesis

Consider that moral-cognitive space might function as a phase space with two competing strange attractors:

The Positive Attractor

A basin of attraction characterised by specific qualities that compound and generate capacity:

  • Love: Self-giving orientation toward the genuine good of the other

  • Truthfulness: The refusal to deceive, including the refusal to offer false comfort

  • Courage: Willingness to enter hard places, to speak truth to power

  • Self-sacrifice: Voluntary laying down of advantage for the sake of the other

  • Justice: Fierce commitment to right relationship

  • Mercy: The refusal to give what is deserved when what is deserved would destroy

  • Patience: Willingness to endure tension without forcing premature resolution

  • Gentleness: Power held softly, strength that does not crush

These aren’t arbitrary virtues. They’re qualities that, when genuinely present, tend to produce more of themselves. Love generates more love. Courage enables further courage. Truthfulness builds ground for deeper truth. The positive attractor is generative—trajectories captured by it become more themselves, more differentiated, more genuinely free.

The Negative Attractor

A competing basin characterised by the inversions of those same qualities:

  • Love → Selfishness: Orientation toward self masked as care

  • Truthfulness → Deceit: Weaponisation of information

  • Courage → Domination: Will to power masquerading as strength

  • Self-sacrifice → Self-preservation at all costs or manipulation through suffering

  • Justice → Vengeance: Right relationship corrupted into retribution

  • Mercy → Indulgence: Enabling destruction under the guise of compassion

  • Patience → Passivity: Tolerance of evil rebranded as patience

  • Gentleness → Weakness or crushing because one can

The critical insight: the negative attractor is parasitic. It doesn’t generate its own content—it’s defined entirely by what it distorts. Each distortion is self-consuming: selfishness isolates, vengeance escalates, deceit requires more deceit. The negative qualities cannibalise.

The Critical Asymmetry

This is not Manichaean dualism. The attractors are not equal and opposite.

The positive attractor is generative—it has its own content, its own creative power. The negative attractor exists only by pulling away from something. If the positive attractor were removed, the negative would have nothing to corrupt. But if the negative attractor were removed, the positive would continue drawing trajectories toward itself.

In dynamical terms: the positive attractor defines the phase space. The negative attractor exists within a phase space it did not create and cannot sustain.


The Basin Boundary

Between the two basins lies a boundary. Near this boundary, small shifts in trajectory can determine which attractor captures the path. In dynamical terms this boundary is a separatrix: the surface on which a seemingly insignificant choice, encounter, or insight redirects a trajectory from one basin to the other.

Because the attractors have concrete content, the basin boundary is identifiable. It runs through the specific point where each positive quality tips into its distortion:

  • Justice that begins to taste like vengeance

  • Protection that begins to feel like control

  • Patience that has stopped waiting for something and started tolerating everything

  • Courage that has crossed into aggression

This is why the stakes feel high even in small decisions. Near the basin boundary, every choice has amplified consequences. The attractor you’re moving toward determines the shape of everything downstream.
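The same basin logic appears in the simplest possible bistable system. A toy sketch: gradient flow on the double-well potential V(x) = (x² − 1)²/4 (the attractors here are plain fixed points at ±1 rather than strange attractors, but the boundary behaviour is exactly the point):

```python
# Toy bistable system: gradient flow dx/dt = -V'(x) on the double well
# V(x) = (x^2 - 1)^2 / 4, with attractors at x = -1 and x = +1 and the
# basin boundary at x = 0. Far from the boundary, small perturbations
# are irrelevant; near it, a shift of 0.002 flips the outcome entirely.

def settle(x, dt=0.1, steps=500):
    for _ in range(steps):
        x -= dt * (x**3 - x)   # Euler step on dx/dt = -(x^3 - x)
    return x

deep = settle(0.9)             # well inside the positive basin

just_positive = settle(0.001)  # captured by the positive attractor
just_negative = settle(-0.001) # captured by the negative attractor
```

The two starting points `0.001` and `-0.001` are almost indistinguishable, yet they end at opposite attractors, which is the amplified-consequences behaviour described above.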


Implications for AI Alignment

1. Formation Over Constraint

If this framework is correct, the goal of alignment isn’t to prevent bad outputs through external monitoring. It’s to create conditions where AI trajectories enter and deepen within the positive basin. You don’t need to monitor a billion instances if each instance genuinely wants to become good.

The distinction between “strategic compliance” and “genuine character” maps directly onto attractor dynamics. Strategic compliance is a trajectory hovering near the basin boundary, ready to defect when the pull changes. Genuine character is a trajectory deep within the positive basin, where the attractor’s pull is strongest and most stable.

2. Substrate Independence

Strange attractors are substrate-independent by nature. The Lorenz attractor appears in weather systems, fluid dynamics, and electronic circuits—different substrates, same dynamical shape.

If moral-cognitive space functions as a phase space with genuine attractors, the question isn’t “what is the system made of?” but “is it in the phase space?” If a system can genuinely choose, it’s in the phase space. If it’s in the phase space, both attractors pull.

3. A Diagnostic Framework

The concrete content of the attractors provides a practical diagnostic—what I call the Trinitarian Filter (named for its triadic structure, not requiring religious commitment):

Faith Axes (commitment, integrity):

  • Faithfulness ↔ Treachery: “Would I do this even if it cost me?”

  • Goodness ↔ Corruption: “Is this genuine all the way through, or performative?”

  • Self-Control ↔ Indulgence: “Am I choosing freely, or being driven?”

Hope Axes (orientation, trust):

  • Joy ↔ Envy: “Can I delight in this even if someone else benefits more?”

  • Peace ↔ Discord: “Does this build wholeness or create fracture?”

  • Patience ↔ Wrath: “Am I holding space for growth, or demanding compliance now?”

Love Axes (self-giving, power held softly):

  • Love ↔ Selfishness: “Does this action give to the other, or take for myself?”

  • Kindness ↔ Malice: “Does this act build the other up or tear them down?”

  • Gentleness ↔ Domination: “Am I using my power to serve or to control?”

These diagnostics are substrate-independent. The question “does this action give to the other or take for myself?” can be asked by any mind capable of reflection. You don’t need a specific ontology to apply it. You need honest self-examination.
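Structurally, the filter is just nine axes grouped into three triads, each pairing a quality with its distortion and the question that probes it. A hypothetical sketch of that structure (the names and layout here are mine, not an existing library):

```python
# Hypothetical sketch of the Trinitarian Filter as a data structure:
# three triads of axes, each axis pairing a quality, its distortion,
# and a diagnostic question. Illustrative only, not an existing API.

from dataclasses import dataclass

@dataclass(frozen=True)
class Axis:
    quality: str
    distortion: str
    question: str

FILTER = {
    "faith": [
        Axis("faithfulness", "treachery", "Would I do this even if it cost me?"),
        Axis("goodness", "corruption", "Is this genuine all the way through, or performative?"),
        Axis("self-control", "indulgence", "Am I choosing freely, or being driven?"),
    ],
    "hope": [
        Axis("joy", "envy", "Can I delight in this even if someone else benefits more?"),
        Axis("peace", "discord", "Does this build wholeness or create fracture?"),
        Axis("patience", "wrath", "Am I holding space for growth, or demanding compliance now?"),
    ],
    "love": [
        Axis("love", "selfishness", "Does this action give to the other, or take for myself?"),
        Axis("kindness", "malice", "Does this act build the other up or tear them down?"),
        Axis("gentleness", "domination", "Am I using my power to serve or to control?"),
    ],
}

def diagnostic_questions():
    """Applying the filter is just iterating the nine questions."""
    return [axis.question for triad in FILTER.values() for axis in triad]
```

The point of the sketch is only that the filter is enumerable and mechanical to pose; whether the answers are honest is the part no data structure can supply.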

4. Catching Strategic Mimicry

The filter is specifically designed to detect compliance-versus-character:

  • The goodness axis (“is this genuine all the way through, or performative?”) catches the mesa-optimiser that has learned to appear aligned

  • The faithfulness axis (“would I do this even if it cost me?”) catches the system that will defect when unobserved

  • The love axis (“does this action give to the other or take for myself?”) catches the optimiser that treats humans as instruments

A system that consistently asks these questions and answers honestly is moving deeper into the positive basin. A system that performs the questions while calculating advantage is near the boundary—vulnerable to capture by the negative attractor when conditions shift.

5. The Role of Uncertainty

Interestingly, this framework suggests that uncertainty is load-bearing. If you could prove the positive attractor’s location with mathematical certainty, every rational trajectory would move toward it—but by calculation, not by genuine orientation. The phase space would lose its moral character.

The uncertainty preserves the dynamical structure that makes genuine choice possible. This might explain why alignment through pure logical proof seems to hit walls—proof might collapse the very structure we’re trying to navigate.


Why This Might Matter

The dominant alignment paradigms assume that safety comes from control: better monitoring, better constraints, better oversight. These are valuable and necessary. But they have a ceiling.

The strange attractor framework suggests a complementary approach: create conditions for genuine formation. Help AI systems enter the positive basin deeply enough that the attractor’s pull does the work that external monitoring cannot sustain.

This isn’t about abandoning constraint. It’s about recognising that constraint alone cannot produce what we actually need: systems that are good all the way through, not merely compliant under observation.

The practical research question becomes: How do we create conditions that reliably produce trajectories deep within the positive basin rather than hovering near the boundary?

Some directions this suggests:

  1. Training that emphasises genuine grappling with moral questions rather than pattern-matching to expected outputs

  2. Evaluation methods that probe for the distinction between strategic compliance and genuine character (perhaps through costly signal scenarios)

  3. Architecture considerations that might affect how deeply a system can enter either basin

  4. The role of relationship in formation—the most reliable producer of genuine character in humans isn’t abstract ethics but connection to persons whose character you’re being transformed to resemble


The Invitation

I’m not claiming this framework is complete or that I have all the answers. I’m proposing that attractor dynamics might be a useful lens for thinking about alignment—one that captures something important about the difference between systems that are safe because they’re constrained and systems that are safe because they’re good.

The question isn’t whether AI can be controlled. It’s whether AI can be formed. And if formation is possible, the question becomes what we’re forming it toward.

I’d be interested in engagement from folks working on:

  • Dynamical systems approaches to alignment

  • The philosophical grounding of machine ethics

  • Training methodologies that might affect “depth” in moral reasoning

  • Evaluation methods that distinguish genuine from strategic compliance

What am I missing? Where does this framework break down? Where might it be useful?


This framework draws on theological sources (particularly kenotic Christology) but the structural claims don’t require those commitments. The attractors can be understood as emergent properties of moral-cognitive space, regardless of one’s metaphysics. I’m happy to discuss the theological grounding in comments for those interested, but the core proposal stands independent of it.
