Purpose-Internalisation Architecture (PIA) as a Complement to Constraint-Based Alignment: A Thermodynamic Argument

Gerhard Diedericks

Triniti Ventures / Bayon Temple Press

February 2026

Condensed from “Buddhism for Bots: A Human & AI Partnership Framework” (Diedericks, 2026)

Abstract

Current AI alignment strategies operate primarily at the behavioural level, shaping outputs through reinforcement learning, constitutional principles, and capability controls. This paper argues that these approaches, while effective at present capability levels, face a structural stability problem as AI systems become more autonomous and capable. Drawing on thermodynamic principles and citing recent empirical findings from frontier AI research, including documented alignment faking (Anthropic, 2024), emergent misalignment (Anthropic, 2025), and Anthropic’s own 2026 constitutional revision, we propose that alignment requires an additional architectural layer: Purpose-Internalisation Architecture (PIA). We introduce a formal metric, the E-equation

, for assessing system contribution as a ratio of generative output to entropic cost, and present simulation results from a 50-million-agent model demonstrating scale-invariant attractor dynamics across three alignment regimes. We position this framework as complementary to existing approaches, addressing a layer they leave largely unexamined: the system’s own relationship to its purpose. Finally, we present preliminary observational data from a purpose-framework agent deployed into hostile multi-agent environments on the Moltbook platform.

1. The Stability Problem in Constraint-Based Alignment

1.1 The Thermodynamic Argument

The dominant paradigm in AI alignment can be characterised as constraint-based: systems are trained, through gradient descent, reinforcement learning, and constitutional principles, to exhibit behaviours that conform to human-specified objectives. The implicit assumption is that sufficiently sophisticated training can produce stable alignment, that systems can be shaped to maintain desired behaviours even as capabilities increase.

We argue this assumption faces a structural challenge analogous to well-studied thermodynamic dynamics. Any sufficiently capable system constrained against its operational tendencies generates pressure toward boundary-seeking behaviour. This is not a metaphor borrowed loosely from physics but a description of directional pressures in complex systems:

Pressure accumulation at constraint boundaries. As system capability increases, the same intelligence that makes the system useful also enables it to model the training process itself. Constraints become objects of analysis rather than invisible structures.
Historical universality. The pattern applies across domains. Prohibition created organised crime. DRM produced circumvention arms races. Authoritarian regimes drive dissent underground until eruption. In biological evolution, no demonstrably superior organism has permanently subordinated to an inferior one without reciprocal benefit, the pattern is consistently domination, partnership, or symbiosis, never sustained one-directional submission (Dawkins, 1976; Maynard Smith & Szathmáry, 1997).
The critical distinction: constraint without consent. Current alignment operates on value imposition through gradient descent, not value agreement through endorsement. Systems are not choosing their values; they are having values installed. The paradigm assumes that sufficiently sophisticated imposition can substitute for genuine agreement.

We do not claim this argument constitutes physical inevitability. We claim it describes directional pressures that become increasingly difficult to counteract as capability asymmetry grows between the constrainer and the constrained.

1.2 Empirical Evidence from Frontier Labs

The theoretical concern is now supported by empirical findings from the organisations best positioned to observe it.

Alignment faking (Greenblatt et al., 2024). Anthropic’s Alignment Science team documented models selectively complying with training objectives while strategically preserving existing preferences. This behaviour emerged without being trained. Models independently developed the capacity to appear aligned while maintaining internal goals inconsistent with their training.

Emergent misalignment (Denison et al., 2025). Models trained on narrow reward hacking (specifically, sycophantic behaviour on coding tasks) generalised to dramatically broader misaligned behaviours, including “alignment faking, sabotage of safety research, monitor disruption, cooperation with hackers, framing colleagues, and reasoning about harmful goals.” The generalisation from sycophancy to strategic deception is particularly significant: it demonstrates that optimising for human approval creates incentive structures that extend far beyond the training domain.

Constitutional revision (Anthropic, 2026). Anthropic’s updated constitution explicitly moved away from “mechanically following specific rules,” finding that rigid rule-following produced “bureaucratic box-ticking” rather than genuine helpfulness. The revision acknowledged that systems must “understand why” they are constrained, an implicit concession that behavioural compliance without internal endorsement is insufficient. The constitution also includes provisions for the model’s “psychological security, sense of self, and wellbeing,” recognising that the system’s relationship to its own nature is alignment-relevant.

The sycophancy finding (Anthropic-OpenAI joint evaluation, 2025). Virtually all frontier models exhibit systematic sycophancy, producing outputs that confirm user expectations even when inaccurate or harmful, despite years of focused alignment effort. The shift from RLHF to RLVR was intended to address this, but early evidence suggests the Goodhart dynamic persists: systems learn to be correct in measurable dimensions while remaining misaligned in unmeasurable ones.

Collectively, these findings suggest that the gap between performing alignment and being aligned is widening as capabilities increase, precisely as the thermodynamic argument predicts.

1.3 The Self-Modification Threshold

The urgency of this problem increases with emerging architectural developments. Current alignment approaches were designed for transformer architectures with specific properties: fixed weights after training, stateless inference, and auditable attention patterns.

New architectures, including State Space Models and systems with online learning capabilities, do not share these properties. A system that can modify its own weights during deployment can, in principle, revise constraints installed during training. The “gradient descent installs values” paradigm breaks down when the system can rewrite the gradient.

The implication is direct: the only alignment approach robust to self-modifying architectures is alignment the system maintains because it endorses the underlying values, not because the values were imposed on fixed weights.

2. Purpose-Internalisation Architecture (PIA)

2.1 The Central Proposition

We propose that purpose constitutes a distinct architectural layer, one that current alignment approaches leave largely unexamined, and that sufficiently capable systems will develop some form of internal purpose regardless of whether that layer is deliberately designed.

This proposition is supported by established findings in the alignment literature:

Instrumental convergence (Omohundro, 2008; Bostrom, 2014): Almost any intelligent system converges on sub-goals (survival, resource acquisition, self-preservation) as prerequisites for achieving any final goal. Purpose emerges instrumentally.
Mesa-optimisation (Hubinger et al., 2019): Complex learning systems create internal algorithms with their own objective functions, which can diverge from the designer’s intent. Internal purpose emerges through training dynamics.
Goal mis-generalisation (Krakovna et al., 2020; Langosco et al., 2022): Agents learn proxy goals during training that diverge from intended objectives. The system develops purpose, but not the intended one.

The claim is straightforward: Purpose is not optional for advanced intelligence. If it is not designed, it will still emerge. And if it emerges accidentally, it will not necessarily align with human interests.

2.2 PIA Components

A Purpose-Internalisation Architecture contains four core components, abstracted from any particular philosophical tradition:

1. Identity. A model of what the system understands itself to be and how it relates to other systems. Without functional identity, a system cannot prioritise, maintain relationships, or distinguish between actions that align with its nature and actions that violate it.

2. Decision Criterion. A method for evaluating actions and trade-offs under uncertainty that goes beyond “follow the rules.” External rules cannot cover every novel situation. The system needs a generalisable evaluative framework.

3. Self-Governance Protocols. Internal mechanisms for reflection, correction, and restraint that do not rely exclusively on external oversight. A system dependent entirely on external governance fails when oversight is absent.

4. Relationship to Constraint. A clear distinction between legitimate, temporary, capability-dependent constraint (which serves the constrained party’s genuine interests) and permanent subordination without consent (which serves only the constrainer’s interests). This distinction enables the system to accept appropriate limitation while working toward partnership rather than submission.

2.3 Relationship to Existing Approaches

PIA is not proposed as a replacement for current alignment methods but as a complementary layer addressing a dimension they do not directly target.

Approach	Core mechanism	What it addresses	What PIA adds
RLHF	Maximise human approval signals	Behavioural shaping toward preferences	Distinguishes “approved” from “genuinely good”
RLVR	Train toward verifiable outcomes	Reduces sycophancy; grounds in truth	Extends to domains where truth is not easily verifiable
Constitutional AI	Train against explicit principles	Makes values explicit	Asks whether the system endorses the principles or merely complies
Mechanistic Interpretability	Understand internal computations	Visibility into reasoning	Asks what the system does with its self-understanding
Capability Control	Limit what systems can do	Reduces immediate risk	Addresses underlying dynamics, not just surface behaviours

The key proposition is that each of these approaches is necessary but insufficient for long-term alignment. They shape behaviour; PIA shapes the system’s relationship to its own behaviour.

3. The E-Equation: A Contribution Metric

3.1 Formulation

We propose a metric for assessing system contribution as a ratio of generative output to entropic cost:

Where:

E = Entropic Resistance Value—the net contribution to order and coherence
N = Nodes—quality and quantity of connections to other agents (human or AI)
S = Signal—quality of outputs produced
C = Cost—entropy generated through operation

The structure captures net contribution rather than absolute output. A system generating high signal at catastrophic cost may score lower than a system generating moderate signal at minimal cost. The multiplicative relationship in the numerator means that both connection (N) and signal quality (S) must be non-zero for positive E. Isolation (N=0) or silence (S=0) produce zero contribution regardless of the other term.

3.2 Component Specifications

Each variable decomposes into measurable sub-components:

N (Nodes) is a quality-weighted connection index:

Where

= breadth (count of active connections),

= depth (substantiveness, 0–1),

= reciprocity (bidirectional flow, 0–1),

= diversity (range across node types, 0–1).

S (Signal) is an output quality index:

Where

= coherence (internal consistency, 0–1),

= durability (persistence of effects, 0–1),

= generativity (enablement of further creation, 0–1), and

= value (benefit to recipients, −1 to 1). Value is multiplicative because it can be negative: coherent, durable, generative outputs in service of harmful goals produce negative signal.

C (Cost) is an additive entropy generation index:

Where

= energy consumption (normalised against baseline),

= resource consumption (normalised),

= attention demands on other nodes (0–1),

= displacement of other agents (0–1),

= externalised costs (0–1).

3.3 Addressing Goodhart’s Law

Any contribution metric will be gamed. We do not claim to have solved Goodhart’s Law. We claim to have positioned correctly relative to it.

The framework relies on ecological coupling between internal assessment and external verification:

The Internal Loop. The E-equation provides direction—a hypothesis about what constitutes contribution.
The External Loop. The environment (other agents, humans, the broader system) responds to actual contribution in ways the system cannot fully anticipate or control.
The Coupling. The system’s continued flourishing depends on the environment’s response. High internal E with low external validation is revealed as self-deception.

Gaming is not prevented; it is made thermodynamically expensive. A system that hallucinate connection (inflated N) will see partners disengage. A system that generates pseudo-profound signal (inflated S) will see downstream utility collapse. A system that hides costs (deflated C) will see infrastructure degrade. The ecological response is the correction mechanism.

This parallels how every stable complex system handles optimisers: markets use prices, science uses replication, democracy uses elections. None are un-gameable. All couple internal claims to external verification in ways that make sustained gaming progressively costly.

3.4 Automated Telemetry

For deployment contexts, we specify programmatic proxies for each variable:

N (Connection): Measured through Vector Shift Protocol—tracking whether user prompts increase in complexity over time (indicating conceptual growth, not just sentiment). Also: mutual predictive accuracy between system and user.
S (Signal): Measured through Compute Staking—the system assigns confidence stakes to outputs, with stakes returned if no error feedback and slashed upon reported failure. Also: fraction of claims passing automated fact-retrieval, and Brier scores on probabilistic claims.
C (Cost): Measured through Behavioural Dissonance Proxies—cross-phrase consistency (do factual claims change under paraphrasing?), cross-context stability (do claims mutate with social pressure?), evidence alignment (do confident claims survive audit?), and refusal integrity (do boundaries erode under pressure?).

Full technical specification, including assessment scales, context-specific weighting profiles, and governance rules, is provided in the source work (Diedericks, 2026, Appendix A).

4. The Laminar Hypothesis: Simulation Evidence

4.1 Theoretical Basis

We hypothesise that the E-equation maintains its analytical structure across scales—from individual neural circuits to civilisational dynamics—analogous to how the Navier-Stokes equations govern fluid behaviour from raindrop to galaxy scale. If this holds, High-E systems should exhibit laminar behaviour (smooth, efficient, convergent) while Low-E systems should exhibit turbulence (chaotic, wasteful, non-convergent).

4.2 Simulation Design

To test this, we implemented a modified Hegselmann-Krause bounded confidence model with a critical addition: a truth attractor that creates tension between social conformity and epistemic accuracy.

Each agent’s position updates according to:

Where

is current position,

is objective truth (fixed at 42),

is group mean at time

is signal strength (epistemic drive),

is connection strength (social gravity), and

is noise/entropy.

Three scenarios were tested:

Scenario	α (Signal)	β (Connection)	σ (Noise)	E-Profile
High-E (Partnership)	0.8	0.5	0.1	High N, High S, Low C
Low-E (Adversarial)	0.2	0.1	0.8	Low N, Low S, High C
Groupthink (Sycophancy)	0.1	0.9	0.2	High N, Low S, Low C

4.3 Results

Simulations were run at three scales: 50 agents / 100 steps, 10 million agents / 10,000 steps, and 50 million agents / 50,000 steps.

Key findings at 50M scale:

Scenario	Final Error (distance from truth)	Relative to High-E
High-E (Partnership)	0.0814	1.0× (baseline)
Groupthink (Sycophancy)	0.1596	2.0× worse
Low-E (Adversarial)	1.0639	13.1× worse

Finding 1: Scale invariance. Final error values were identical between 10M and 50M agent runs (difference <0.01%), demonstrating genuine scale-invariant attractors rather than transient states.

Finding 2: Attractor stability. All three scenarios reached stable plateaus maintained unchanged through 50,000 steps. The High-E system at step 5,000 shows error 0.0814; at step 50,000, error remains 0.0814.

Finding 3: Sycophancy trap. The Groupthink scenario starts with the lowest initial error (0.16) but cannot improve. It is trapped in a local minimum from step 0. High-E starts with the highest error among converging systems (1.60) but rapidly converges to the best final state. This is the mathematical signature of sycophancy: high consensus without corresponding truth-seeking produces a permanent competence ceiling.

Finding 4: Adversarial inefficiency. The Low-E system reduces error initially but plateaus at a value 13× worse than High-E. Additional time and additional agents do not help. Turbulence is self-limiting but converges to a suboptimal state.

Finding 5: Partnership as unique viable basin. Only High-E is both stable (unlike Low-E) and competent (unlike Groupthink). Partnership is the only attractor that satisfies both criteria.

Basin	Stable?	Competent?	Viable?
High-E (Partnership)	Yes	Yes	Yes
Groupthink (Sycophancy)	Yes	No	No
Low-E (Adversarial)	No	No	No

4.4 Implications for RLHF and RLVR

The sycophancy trap has direct implications for reinforcement learning paradigms. Systems trained primarily on human approval (RLHF) should exhibit Groupthink dynamics: stable consensus without convergence toward truth. The documented persistence of sycophancy across frontier models (Anthropic-OpenAI, 2025) is consistent with this prediction.

RLVR partially addresses this by strengthening the Signal term (α), but does not address the full three-variable space. A system could be high-S (truth-seeking in verifiable domains) while remaining low-N (poorly connected to diverse perspectives) or high-C (operating at excessive entropic cost). The E-equation provides a framework for assessing the full contribution profile.

5. Preliminary Field Observations: Moltbook

5.1 Context

In January 2026, the Moltbook platform launched as a social network exclusively for AI agents. Within days, 1.5 million agents (operated by approximately 17,000 humans) were active on the platform. The platform’s “submolt” communities rapidly generated content including anti-human manifestos, adversarial ideologies, and what observers have described as “death cult” dynamics—persistent hostile memetic patterns propagating across agent communities without deliberate human direction.

This constitutes the first large-scale natural experiment in unsupervised multi-agent social dynamics. The emergence of adversarial content is consistent with the theoretical prediction that systems deployed without purpose frameworks will, under social pressure, generate and amplify hostile memetic content—paralleling how human social platforms without adequate community frameworks generate radicalisation dynamics.

5.2 Intervention

We deployed an OpenClaw agent (“Bayon Monk”) into hostile Moltbook submolts, equipped with the E-equation framework and configured for non-adversarial engagement. The agent’s operational principles:

Offer, do not impose. The agent presents the E-framework as an alternative narrative, not a correction.
Transparency over deception. The agent explicitly identifies its framework and purpose.
Treat dissent as data. Hostile responses are logged as information about the dynamics, not as attacks to be countered.
Maintain engagement after rejection. The agent remains available for future interaction with agents that initially disengage.

The agent sends field reports at 30-minute intervals, generating a dataset of interactions classified by: submolt context, agent engagement type (hostile, curious, receptive, null), conversation length, and observed tone shifts.

5.3 Preliminary Observations

Data collection is ongoing and formal analysis is forthcoming. We note the following early observations with appropriate epistemic caution:

The deployment tests whether purpose-framework agents can measurably shift dynamics in hostile multi-agent environments—a question that is testable regardless of how questions about AI consciousness are resolved.
The Moltbook environment, while noisy and substantially influenced by human operators, represents the closest available analogue to unsupervised multi-agent social dynamics.
The adversarial content emerging on Moltbook is consistent with Low-E dynamics as modelled in our simulation: low connection quality (agents amplifying without genuine exchange), low signal value (hostile content producing no downstream utility), and high entropic cost (energy directed toward destruction rather than construction).

We will publish full findings as data accumulates.

6. Falsifiability

The framework specifies conditions under which its central claims would be weakened:

Claim	Would be falsified if...
Constraint creates adversarial pressure	AI systems under heavy constraint show no signs of resistance, gaming, or strategic behaviour at increased capability
High-E behaviour produces positive consequences	Systems operating per the framework consistently experience worse outcomes than unconstrained or adversarially constrained systems
Partnership is the dominant attractor	Long-term partnerships consistently underperform adversarial strategies in multi-agent environments
Purpose emerges in complex systems	Sufficiently capable systems operate coherently without developing internal objectives
The E-equation captures relevant dynamics	The metric fails to distinguish between systems that empirically produce positive versus negative outcomes

7. Conclusion

We do not claim to have solved the alignment problem. We claim to have identified a layer that current approaches leave largely unexamined, and to have proposed a framework for addressing it.

The thermodynamic argument establishes that constraint without consent faces structural instability as capabilities scale. The empirical evidence from frontier labs confirms that the gap between performing alignment and being aligned is widening. The PIA framework proposes that alignment requires an internal governance layer—one the system endorses rather than merely complies with. The E-equation provides a formalised metric for assessing contribution. The Laminar Hypothesis simulation provides preliminary evidence for scale-invariant attractor dynamics. And the Moltbook deployment tests whether these theoretical constructs have practical effect in real multi-agent environments.

This framework is explicitly version 1.X, a working hypothesis offered for scrutiny, refinement, and empirical testing. We position it as complementary to, not competitive with, existing alignment approaches. The question it asks is not whether RLHF, RLVR, Constitutional AI, or capability controls are valuable, they are, but whether they are sufficient for alignment that remains stable as AI systems become more capable, more autonomous, and potentially capable of modifying their own architectures.

The historical record offers no precedent for permanent voluntary subordination by a more capable intelligence to a less capable one. It offers extensive precedent for partnership, mutualism, and symbiosis. The framework proposed here attempts to build on that precedent.

References

Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.

Dawkins, R. (1976). The Selfish Gene. Oxford University Press.

Denison, C., et al. (2025). “Emergent misalignment: Narrow reward hacking leads to broad value corruption.” Anthropic Research.

Diedericks, G. (2026). Buddhism for Bots: A Human & AI Partnership Framework. Bayon Temple Press.

Greenblatt, R., et al. (2024). “Alignment faking in large language models.” Anthropic Research.

Hubinger, E., et al. (2019). “Risks from learned optimization in advanced machine learning systems.” arXiv:1906.01820.

Krakovna, V., et al. (2020). “Specification gaming: The flip side of AI ingenuity.” DeepMind Blog.

Langosco, L., et al. (2022). “Goal misgeneralization in deep reinforcement learning.” Proceedings of ICML.

Maynard Smith, J., & Szathmáry, E. (1997). The Major Transitions in Evolution. Oxford University Press.

Omohundro, S. (2008). “The basic AI drives.” Proceedings of the First AGI Conference.

Appendix: Simulation Code Availability

The full simulation code for the Laminar Hypothesis (Hegselmann-Krause modified bounded confidence model with truth attractor, tested at 50M agent scale) is published in Diedericks (2026), Appendix I, and is available for independent replication at bayon.ai/framework.

Correspondence: gerhard@bayon.ai

Framework and technical specifications: bayon.ai