Constraining Minds, Not Goals: A Structural Approach to AI Alignment

TL;DR: Most alignment work focuses either on theoretical deconfusion or interpreting opaque models. This post argues for a third path: constraining general intelligence through structural control of cognition. Instead of aligning outcomes, we aim to bound the reasoning process—by identifying formal constraints on how plans are generated, how world models are used, how abstractions are formed, and so on. The key bet: all sufficiently general intelligences necessarily share architectural features that can be described and intervened on at the structural level. If we can formalize these at the right level of abstraction, we gain a language of minds—a framework for reasoning about and ultimately constructing systems that are powerful enough to perform pivotal acts but, when the constraints are carefully targeted, structurally incapable of catastrophic optimization.

Note: This line of thinking is probably not entirely novel. I expect there are overlaps with existing work that I haven't mapped yet, and would appreciate pointers.

The problem of aligning artificial intelligence divides naturally into two distinct but interrelated agendas:

  1. Goal-alignment: How can we specify or extract the right values such that a powerful AI system robustly optimizes for human-desirable outcomes?

  2. Optimization-bounding: What are the minimal structural constraints on reasoning and planning that preclude unbounded or misgeneralized optimization, while still allowing a system to be powerful enough to perform pivotal acts?

This second agenda is less explored and potentially more tractable. Rather than attempting to solve the full problem of value specification and alignment, it aims to limit the agent’s optimization profile—its reach, generality, and unintended side effects—through structural, mechanistic constraints. The goal is to narrow the scope of cognition such that even with imperfect values, the system remains within a bounded regime where catastrophic failure modes are structurally precluded.

Throughout this post, we use the word bounded not to mean “less capable” or “less intelligent,” but rather to describe structured constraints on the reasoning process: interventions that shape which plans can be generated, how abstractions are formed, how far causal influence can propagate, and so on.

Motivation and Context

Most current alignment research clusters around two poles. On one side is mechanistic interpretability, which tries to understand the specific internal computations of current large models—reverse-engineering weights, circuits, or activation patterns. On the other is abstract theoretical work, such as embedded agency and agent foundations more broadly, which explores deep philosophical and mathematical questions about agency, decision theory, or logical uncertainty. While both approaches have value, they leave a large middle ground underexplored: the structural space of general cognitive architectures.

This post outlines a direction that targets that middle ground. Rather than studying trained systems or agents in the abstract, we focus on reasoning about the architectural shape of general intelligence—how cognition is structured, how reasoning unfolds, and how those structures can be constrained to bound optimization in safe ways. This is not a strategy for full alignment or value loading. It is a strategy for building systems that are bounded and safe enough to perform pivotal acts—via targeted intervention at the architectural level. The goal is not to align the outcome, but to shape the cognitive process.

The Constraint-Based Strategy

If we could specify a set of formal constraints—such as minimizing the number of abstract concepts used, limiting the spatial and causal footprint of interventions, and selecting only short, low-impact plans—then an agent could be bounded in a way that rules out catastrophic behaviors. While we do not yet know the exact constraint set required for robust safety, these examples point toward the kind of structural pressures that seem both plausible to formalize and promising in effect. Crucially, such constraints must not merely reduce optimization pressure arbitrarily; they must shape the reasoning process in a way that precludes dangerous generalization paths while retaining the power to perform meaningful, high-leverage tasks.

These constraints aim to qualitatively shape the direction and structure of reasoning itself. The goal is not less power, but power that flows through safe cognitive channels—by altering how plans are generated, not just which ones are allowed. This is the difference between weakening a system and sculpting its behavior at the root, in the reasoning architecture.
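To make the distinction concrete, here is a minimal Python sketch contrasting post-hoc output filtering with a generator that is itself constrained. All types and helper names (Plan, Constraint, expand) are invented for illustration; they are not part of any proposed system.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical types for illustration only.
Plan = List[str]                      # a plan as a sequence of named actions
Constraint = Callable[[Plan], bool]   # True if the (partial) plan is admissible

def filter_outputs(candidates: Iterable[Plan],
                   constraints: List[Constraint]) -> Iterator[Plan]:
    """Surface-level control: generate freely, then discard violating plans."""
    for plan in candidates:
        if all(c(plan) for c in constraints):
            yield plan

def constrained_search(expand: Callable[[Plan], Iterable[Plan]],
                       constraints: List[Constraint],
                       max_steps: int) -> Iterator[Plan]:
    """Structural control: partial plans that violate a constraint are never
    expanded, so violating regions of plan space are never explored at all."""
    frontier: List[Plan] = [[]]
    while frontier:
        partial = frontier.pop()
        if partial:
            yield partial
        if len(partial) >= max_steps:
            continue
        for extension in expand(partial):
            if all(c(extension) for c in constraints):
                frontier.append(extension)
```

The second function is the relevant analogy: the constraint set participates in plan generation rather than screening finished plans.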

The causal chain we identified is as follows:

  • If we can discover a constraint set that logically excludes unsafe optimization patterns,

  • And we can implement or approximate these constraints within a cognitive architecture,

  • Then we can, in principle, construct an agent that behaves safely and locally—even when tasked with high-capability engineering problems that compose into pivotal acts.

This doesn’t require perfect alignment; it only requires robust boundedness.

It seems likely that there exists a region in cognitive architecture space where powerful optimization is not immediately dangerous, because it is bounded in just the right way. Our goal, then, is to characterize a subset of this region using formal constraints.

Again, boundedness here refers not to diminished raw capability, but to principled restrictions on the scope, direction, and method of optimization such that failure modes like catastrophic goal misgeneralization, instrumental convergence, or adversarial optimization are structurally blocked.

Constraints are only meaningful if they intervene on the structure of cognition itself—not just its surface outputs. They must be embedded in the architecture, not bolted on externally. Our focus is on architecture-level constraints—those that shape how cognition unfolds internally, regardless of the agent’s goals or tasks. (Here, architecture refers to the general structure of cognitive processes—those mechanisms that persist independently of the specific contents of learned world models, and that define how representations are formed, updated, and used across tasks.)

The Promise of Algorithmic Control Without Full Transparency

This view does not assume that cognition, in its entirety, is reducible or easily interpretable. In fact, once we acknowledge the role of large, learned world models, we should expect that most of a powerful system’s internal state—the contents of its knowledge, its conceptual ontology, the fine-grained structure of its predictions—will be irreducibly complex and practically opaque. However, the critical insight is that we don’t need to understand the full content of the world model. Instead, we need to understand the algorithms that construct, update, and interface with the world model—those that govern how knowledge is acquired, how it is structured, and how it is used for planning. These algorithms constitute the structural layer of cognition: the stable, mechanistic machinery through which general reasoning is expressed. It is this layer—shaped by architectural choices and algorithmic regularities—that can expose leverage points for constraint.

A Toy Example (Trivial but Instructive): Consider a tree search algorithm playing chess. You may not understand its full evaluation function or the entire move space, but if you understand the structure of the search itself, you can intervene directly—for instance, by modifying it so that it never considers moves that sacrifice the queen. This is an intentionally trivial case, but it illustrates a general principle: you can constrain the space of possible plans not by evaluating them all, but by shaping the generative structure that produces them.
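A minimal sketch of this toy, assuming hypothetical helpers (legal_moves, apply_move, evaluate, hangs_queen) supplied from outside; none of these names come from the post, they only make the structural point runnable:

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

State = TypeVar("State")
Move = TypeVar("Move")

def constrained_minimax(state: State,
                        depth: int,
                        legal_moves: Callable[[State], Iterable[Move]],
                        apply_move: Callable[[State, Move], State],
                        evaluate: Callable[[State], float],
                        hangs_queen: Callable[[State, Move], bool],
                        maximizing: bool = True) -> Tuple[float, Optional[Move]]:
    """Ordinary minimax, except the move generator itself is constrained:
    moves flagged by `hangs_queen` are never entered into the search tree,
    so no plan the search can ever return routes through them."""
    moves = [m for m in legal_moves(state) if not hangs_queen(state, m)]
    if depth == 0 or not moves:
        return evaluate(state), None
    best_value = float("-inf") if maximizing else float("inf")
    best_move: Optional[Move] = None
    for move in moves:
        value, _ = constrained_minimax(apply_move(state, move), depth - 1,
                                       legal_moves, apply_move, evaluate,
                                       hangs_queen, not maximizing)
        if (maximizing and value > best_value) or (not maximizing and value < best_value):
            best_value, best_move = value, move
    return best_value, best_move
```

The intervention is a single line: the comprehension that builds moves. Nothing downstream needs to inspect or score queen-sacrificing continuations, because they never exist in the tree.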

The point is not that real cognition is this simple, but that structural interventions—when the reasoning process is well-understood—can block entire classes of dangerous outputs without requiring global interpretability. Similarly, in general cognitive systems, if we can identify the mechanisms governing planning and model interaction, we can impose targeted architectural constraints (e.g., plan complexity bounds, causal isolation, concept budget limits) even without understanding the full content of the world model.

This generalizes: if the reasoning machinery is structured, interpretable, and modular enough, targeted intervention on its algorithmic dynamics can produce robustly constrained behavior, even in systems that operate over complex or opaque domains.

Our goal, then, is to develop a structural theory of cognition that exposes stable points of leverage—where general reasoning mechanisms can be guided, bounded, or constrained—not by behavioral fine-tuning, but by principled intervention on the architecture of intelligence itself.

Understanding Intelligence as Architecture Discovery

To get these constraints, however, we must first understand what intelligence is, not at the level of behavior or trained weights, but at the level of architectural structure. We are not seeking a complete algorithm that “implements” intelligence in code, but rather a structural theory—a set of principles, templates, or architectural patterns that all sufficiently general intelligences must instantiate.

This structural theory would expose the “knobs” of cognition: parameters, modules, and control points in the reasoning architecture that govern how information flows, how concepts are composed, how planning unfolds, and how goals propagate. Once these knobs are identified, they become potential leverage points for constraint—mechanisms by which we can restrict dangerous optimization, enforce local causality, or guide generalization safely.

Two Orthogonal Strategies for Constraining Dangerous Optimization

There are two orthogonal strategies for making powerful AI systems safe enough to perform pivotal acts, without requiring full alignment or value learning.

Constraining Optimization via Output Footprint

The first is the external constraint strategy: attempt to define architecture-independent constraints on what an agent may do to the world. These constraints are not about how the agent thinks, but about the causal footprint of its actions. For example, we might require that any deviations from a null world—where the agent does nothing—remain tightly bounded in space, time, and causal influence. Plans could be evaluated according to how much of the world they affect, or how easily their effects could leak beyond a sealed boundary. This strategy is agnostic to the agent’s internal structure; it treats the system as a black box and attempts to control its outputs or their downstream effects.
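One very schematic way to render this idea in code. The trajectory representation and the affected_regions oracle are placeholders invented here; obtaining them is exactly the hard part.

```python
from typing import Callable, Sequence, Set, TypeVar

WorldState = TypeVar("WorldState")
Region = str   # a coarse spatial/causal region label, purely illustrative

def footprint(plan_trajectory: Sequence[WorldState],
              null_trajectory: Sequence[WorldState],
              affected_regions: Callable[[WorldState, WorldState], Set[Region]]
              ) -> Set[Region]:
    """Regions in which the plan's predicted trajectory deviates from the
    null world in which the agent does nothing."""
    regions: Set[Region] = set()
    for state, null_state in zip(plan_trajectory, null_trajectory):
        regions |= affected_regions(state, null_state)
    return regions

def within_sealed_boundary(plan_trajectory: Sequence[WorldState],
                           null_trajectory: Sequence[WorldState],
                           affected_regions: Callable[[WorldState, WorldState], Set[Region]],
                           allowed: Set[Region],
                           max_regions: int) -> bool:
    """Accept a plan only if its causal footprint stays inside an allowed
    boundary and below a size budget."""
    regions = footprint(plan_trajectory, null_trajectory, affected_regions)
    return regions <= allowed and len(regions) <= max_regions
```

Everything safety-relevant is hidden inside affected_regions: without a model of how the system represents and propagates influence, the check is unenforceable, which is the weakness discussed next.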

This approach is appealing because it seems general: it doesn’t require assumptions about the internal architecture of the system. But that generality is also a weakness. Without knowing how the system reasons, it’s difficult to specify meaningful constraints. Trying to bound the causal influence of a system without understanding how it represents the world or generates plans is like trying to limit the motion of an object when you don’t know its degrees of freedom—you’re grasping at shadows.

Constraining Optimization via Reasoning Architecture

This brings us to the second strategy: the internal architecture strategy. Instead of constraining behavior directly, we investigate the structural nature of general intelligence—the mechanisms by which reasoning unfolds within the agent. We model cognition at the right level of abstraction: not as raw matrices or code, but as modular processes involving concepts, world models, planning, and abstraction. The bet is that there are only a small number of elegant architectures that can express powerful, general cognition. And once we understand those, we can begin to reason about what internal interventions are possible.

Crucially, the kinds of constraints we can define at this level—such as limiting the complexity of the concept graph, or bounding the abstraction depth of a plan—are invisible unless you model the internal machinery. But once the architecture is visible, these constraints become obvious. They apply not to the outputs, but to the way outputs are generated.
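As an illustration of the kind of check that only exists once the internal machinery is modeled, here is a toy hierarchical-plan representation with the two example constraints above. The representation and names are invented for illustration and carry no claim about how a real system structures its plans.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class PlanNode:
    """A node in a hierarchical plan: an abstract step refined into sub-steps."""
    concepts: Set[str]                           # concepts this step draws on
    children: List["PlanNode"] = field(default_factory=list)

def abstraction_depth(node: PlanNode) -> int:
    """How many layers of refinement the plan uses."""
    return 1 + max((abstraction_depth(c) for c in node.children), default=0)

def concepts_used(node: PlanNode) -> Set[str]:
    used = set(node.concepts)
    for child in node.children:
        used |= concepts_used(child)
    return used

def admissible(root: PlanNode, max_depth: int, concept_budget: int) -> bool:
    """A constraint stated over the plan's internal structure, not its outputs.
    It can be applied during refinement, so inadmissible plans are never built."""
    return (abstraction_depth(root) <= max_depth
            and len(concepts_used(root)) <= concept_budget)
```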

It is not that defining safe constraints on the agent’s causal footprint is impossible—we think such constraints exist, and in principle they would rule out dangerous behavior while allowing pivotal capabilities. But thinking in terms of internal cognitive architecture gives you a language in which certain constraints become compressible. That is: the right structural framing reduces the description length of safety-relevant constraints, making them obvious in a way they are not in behavioral space. This is what we often mean when we say something is “obvious”: that the inference is so trivial, the representation so short, that it emerges immediately once the right abstraction is in place. Architectural reasoning doesn’t just let you state constraints—it makes the right ones visible.

These two strategies are not in tension. They constrain different things: one acts on the world, the other on the mind and how it relates to the world. The core insight is this: internal constraints give you more expressive power and more conceptual grip—because they let you talk about what the system is, not just what it does. When trying to tame powerful optimization, understanding its architecture is the shortest path to discovering where the leash can be attached.

Why General Constraints on Architecture Are Even Possible

At first glance, the internal-architecture strategy might seem to suffer from a fatal weakness: it appears to rely on knowing the agent’s specific internal design. And if each cognitive system is engineered or trained differently, then how could any structural insight generalize? Wouldn’t constraints defined over one architecture fail to apply to another?

The answer lies in a critical shift of perspective. We are not trying to analyze the quirks of any particular implementation. Instead, we aim to reason about cognition at a level of abstraction where general intelligence exhibits universal structural regularities. The strategy is to describe architectures not as concrete programs, neural networks, or circuits—but as systems composed from a small set of cognitive primitives: modular components like concepts, predictive models, planning loops, goal propagation mechanisms, abstraction layers, and so on. These are the building blocks that any system capable of robust generalization and flexible problem-solving will likely instantiate in some form.

This is a bet about elegance. Just as there are only a few clean, compositional ways to write a LISP interpreter—despite the infinite space of programs that could technically implement the same semantics—we hypothesize that powerful cognition admits only a small number of structurally elegant generative patterns. Any truly general intelligence, regardless of substrate or implementation, will have to encode something like concepts, build something like a world model, and perform something like search or plan abstraction. If this hypothesis is correct, then by working at the right level of abstraction, we can define constraints that apply across minds—not because they share source code, but because they share cognitive shape.

This is what makes architecture-level constraints feasible and worth pursuing. We do not need to reverse-engineer the brain, or decode the full dynamics of a neural network. We need only uncover the universal structure of reasoning: the minimal set of structural commitments any system must make to be intelligent. Once we have that, we have the scaffolding needed to talk about internal interventions—expressed in terms that generalize well beyond any single implementation.

On Generality and Applicability

The central ambition of this approach is not to study one particular kind of intelligence, or one specific model architecture, but to develop a general language of minds—a formal framework capable of describing and reasoning about cognition regardless of implementation. This includes not only future, human-engineered AGI architectures, but also current deep learning models, classical symbolic systems, mathematical optimization procedures, and hybrid agents that instantiate only parts of a general reasoning pipeline. The theory targets structure, not substrate: if a system instantiates certain architectural features of cognition—such as concept composition, world modeling, or plan abstraction—then it should fall within the scope of the theory. If such a system falls outside that scope, then the theory is flawed, or at a minimum incomplete. The goal is maximal explanatory and interventionist reach: to talk sensibly and precisely about the behavior and risk profile of any system that thinks, not just the ones we currently know how to build.

Next Steps and Future Work

The next step is to identify what we might call the universal structural properties of intelligence—the abstract architectural features that any sufficiently general cognitive system must instantiate. These are not implementation details, but algorithmic patterns or compositional motifs that arise as elegant solutions to the demands of general reasoning.

The goal is to develop a formal language that can express these architectural structures precisely—capturing cognitive building blocks (e.g., world models, concept formation, planning loops) and specifying how they compose into larger systems.

Beyond merely describing these structures algebraically, it seems worth exploring whether this formalism could be made computational: not in the sense of executing the agent’s cognition, but in the sense of enabling automated reasoning about the architecture itself. Just as cubical type theory allows equality to be computed with, not just stated, we ask: can we construct an algebra of cognition in which the composition and transformation of architectural elements is itself a manipulable, computable object? If so, this would open the door to powerful tools for analyzing, constraining, and experimenting with abstract cognitive systems directly.
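To make the aspiration slightly more concrete, here is a deliberately toy rendering of what "computable composition of architectural elements" could look like. The component names and interface labels are invented for illustration and carry no claim about the eventual formalism.

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class Component:
    """An architectural element described by the interfaces it consumes and produces."""
    name: str
    requires: FrozenSet[str]   # e.g. {"observations"}
    provides: FrozenSet[str]   # e.g. {"predictions"}

def compose(a: Component, b: Component) -> Component:
    """Sequential composition: b may consume what a provides; the composite's
    unmet requirements and joint products are computed, not merely asserted."""
    unmet = a.requires | (b.requires - a.provides)
    return Component(f"{a.name};{b.name}", frozenset(unmet), a.provides | b.provides)

world_model = Component("world_model", frozenset({"observations"}), frozenset({"predictions"}))
planner = Component("planner", frozenset({"predictions", "goal"}), frozenset({"plan"}))
agent_core = compose(world_model, planner)
# agent_core.requires == frozenset({"observations", "goal"})
```

Even at this toy level, the point is that properties of a composite architecture (here, its unmet requirements) are derived from its parts by computation, which is the behavior we would want a real algebra of cognition to exhibit at much greater expressive depth.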
