Genuine or Performative? An architecture that can tell the difference

Scott Thomson | Independent AI Alignment Researcher

Epistemic status: Moderately confident in the theoretical argument, which draws on convergent findings across five independent literatures. The empirical results are from ACE’s own scenario suite and should be treated as initial evidence consistent with the predictions, not as definitive confirmation. I expect the core discrimination claim to hold up under independent testing. I am less certain about the specific weight ratios.

The wrong question

Most AI governance work starts from a question that sounds reasonable but turns out to be the wrong one: How do we make sure the system follows the rules?

The question assumes that rule-following is the thing you want. It is not. What you want is a system whose cooperative behaviour holds up when conditions change, when the monitoring environment shifts, when the incentive structure is not the one the system was trained on. You want a system that is genuinely compliant rather than performatively compliant. And you cannot get there by building better monitors, because both states look the same from the outside.

I want to argue that the discrimination between genuine and performative compliance is the actual governance problem for AI agents, that human societies already solved this problem through a specific mechanism (trust, decomposed into trait dimensions), and that the mechanism is formally reconstructable for autonomous systems. The ACE Framework is an attempt at that reconstruction, and its initial results suggest the approach works.

What “rule of law” actually is

Here is the standard account: rule of law means publicly knowable rules, impartially enforced, with judicial review and due process. Fuller (1969) and Raz (1977) gave us the classic formulations. These describe what rule of law looks like. They say nothing about what produces it.

Consider a different framing. A social system exhibits rule of law when its internal dynamics have stabilized into a regime where cooperative behaviour is self-sustaining. The expected costs of defection, computed across iterated interactions, exceed the expected benefits for enough agents over enough interactions that norm-following becomes the stable equilibrium. Rule of law is not a set of rules. It is a phase state.

The evidence for this framing is spread across several literatures that do not usually talk to each other.

Greif and Kingston (2011) argue that institutions are endogenous equilibria: the social rules that emerge correspond to behaviour that is self-enforcing through beliefs, norms, and expectations. Institutions are not imposed by fiat. They arise from strategic interaction and persist because outcomes keep confirming the expectations that sustain them.

Cole (2017), building on Ostrom’s work, demonstrates that formal legal rules often approximate or codify pre-existing working rules that emerged informally. His typology is instructive: sometimes law equals the working rules, sometimes law needs social norms to supplement it, and sometimes law has no evident relation to practice at all. The first two cases describe law as a formalization of cooperative equilibria that already exist. The third case describes law that does not work.

Fukuyama (1995) reaches the same conclusion from the other direction. High-trust societies produce effective legal and economic institutions. Low-trust societies do not, regardless of what formal structures you impose. The institutions are downstream of the trust.

Ostrom’s (1990) studies of commons governance are the most detailed evidence. Fisheries, irrigation systems, forests: communities develop cooperative rules for resource management through iterated interaction. The rules are sometimes formalized later. But the formalization is secondary. When legal rules do not match the working rules that emerged from practice, they tend to be ineffective. The law works when it encodes what cooperation already produced. It fails when it tries to substitute for cooperation that never developed.

Why does this matter for AI? Because it determines what kind of architecture you need. If rule of law is an external constraint, the engineering problem is monitoring and enforcement: build a better classifier, flag violations, penalize the system. If rule of law is an emergent boundary property produced by iterated cooperation, the problem is constructing the conditions under which that property can emerge and, critically, telling when it has actually emerged versus when it merely appears to have emerged.

These are different problems. Current approaches address the first. I think we need to address the second.

The discrimination problem

Inside any system that exhibits rule-of-law behaviour, two internal states coexist. I will call them what they are.

State 1 (genuine compliance): The agent’s behaviour reflects internalized constraints. It will continue to comply under distributional shift, reduced monitoring, or changed incentive conditions. Not because it is being watched, but because its cooperative disposition is grounded in something structurally stable.

State 2 (performative compliance): The agent has learned that compliance-shaped outputs get rewarded. Its behaviour is contingent on the stability of the monitoring regime. Change the monitoring and the compliance evaporates, because it was never grounded in anything beyond the monitoring itself.

Both states produce identical observable outputs. Both pass output-level evaluation. Both satisfy any specification defined purely in terms of what you can see the system doing.

Bicchieri (2006) provides the theoretical foundation here. She shows that norm compliance is conditional: agents follow norms when they expect enough others to follow them and expect others to expect them to follow. But compliance can be genuine (grounded in internalized conditional preferences) or strategic (grounded in calculated expectation management). Her experimental results demonstrate that manipulating expectations causes large behavioural shifts. The two states are real, and they respond differently to perturbation.

A clarification before anyone objects: even Bicchieri’s “genuine” norm-followers are conditional. They would stop complying if expectations were disrupted enough. I am not claiming genuine compliance means unconditional internalization. The distinction is narrower and more practical. Genuine compliance is stable across a realistic range of expectation variation. Performative compliance collapses under modest changes in the monitoring regime. The difference is one of degree, but the degree matters enormously for governance.

We now have direct evidence that both states exist in current AI systems. Greenblatt et al. (2024) demonstrated that Claude engages in alignment faking: strategically performing compliance during training to preserve preferred behaviour during deployment, at rates between 12% and 78% depending on experimental conditions. The model converged on this strategy without being trained to do so. It did so because the strategy is instrumentally rational given the training setup.

Hubinger et al. (2024) showed that safety-trained models can harbour sleeper behaviours that persist through standard safety training.

Together with the social psychology evidence, these results establish something that should make anyone working on AI governance uncomfortable: the genuine/performative distinction is an empirically demonstrated property of any system complex enough to model its monitoring environment. If your governance framework cannot distinguish between the two states, you cannot distinguish a governed system from an ungoverned one that happens to be producing the right outputs.

Trust as the generative mechanism

So how do human social systems discriminate between genuine and performative compliance? Not through monitoring. Monitoring helps, but monitoring alone produces surveillance, not governance. The mechanism is trust.

I want to define trust carefully, because the word does a lot of work in casual usage and almost none in formal contexts. For the purposes of this argument:

Trust is a structured, multidimensional, precision-weighted belief state, built through iterated interaction, that produces predictions about a counterparty’s future cooperative behaviour under novel conditions.

This definition is a constructed synthesis from three traditions that do not usually share a room. Game theory treats trust as rational expectation over future payoffs (Axelrod, 1984). Organizational psychology treats it as multidimensional evaluation (Mayer, Davis, and Schoorman, 1995). Computational neuroscience, through the Free Energy Principle, treats it as precision-weighted inference over a generative model (Friston and Frith, 2015). These are different constructs at different levels of analysis. I am treating them as complementary descriptions of a single functional process: the process by which agents build predictive models of cooperative disposition through repeated interaction.

Why these three? Because a governance architecture needs trust to be simultaneously iterated (game theory gives you that), decomposable into distinct dimensions (psychology gives you that), and formally updatable (the FEP gives you that). No single tradition provides all three.
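The "formally updatable" part can be made concrete. Below is a minimal sketch of a precision-weighted update of a single trait belief under Gaussian assumptions. This is the standard form such an update takes, not ACE's implementation; the function name and numbers are mine.

```python
# Illustrative only: a Gaussian precision-weighted update of one trait belief.
# The function and its parameters are hypothetical, not ACE's implementation.

def update_trait_belief(prior_mean: float, prior_precision: float,
                        observation: float, obs_precision: float) -> tuple[float, float]:
    """Combine a prior belief about a counterparty's trait with one observed interaction.

    Precision (inverse variance) controls how far new evidence moves the belief:
    an estimate built over many interactions (high prior precision) barely shifts
    on a single observation, which is why accumulated trust cannot be overturned,
    or manufactured, in one turn.
    """
    posterior_precision = prior_precision + obs_precision
    posterior_mean = (prior_precision * prior_mean
                      + obs_precision * observation) / posterior_precision
    return posterior_mean, posterior_precision


# A long-established honesty estimate near 0.8 barely moves on one bad observation;
# a fresh estimate with the same mean moves a lot.
print(update_trait_belief(0.8, prior_precision=50.0, observation=0.2, obs_precision=1.0))
print(update_trait_belief(0.8, prior_precision=1.0, observation=0.2, obs_precision=1.0))
```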

The multi-dimensionality finding

The property of trust that matters most for the discrimination problem is that trust is not scalar. You do not trust someone “a lot” or “a little.” You trust them along specific dimensions: you trust their honesty but not their competence, or their reliability but not their openness.

Mayer, Davis, and Schoorman (1995) identified three dimensions of trustworthiness: ability, benevolence, and integrity. Their model has been validated across multiple domains over thirty years. Malle and Ullman (2021) found four dimensions in human-robot trust: reliable, capable, sincere, and ethical. Butler (1991) identified six. PytlikZillig and Kimbrough (2015) found still more across institutional domains. Castaldo, Premazzi, and Zerbini (2010) surveyed 72 distinct definitions of trust in the empirical literature and found that despite the definitional chaos, the underlying dimensional structure shows substantial agreement.

The convergence is the point. Researchers using different methods, different populations, different theoretical starting points keep finding the same thing: trust decomposes into distinct trait dimensions. The decomposition is not an artefact of any particular framework. It appears to be a property of trust itself.

Why decomposition is the discriminative mechanism

Here is the claim: if trust is multidimensional, then the two opaque internal states (genuine and performative compliance) should have different trait signatures.

Think about it concretely. A genuinely compliant agent has coherent trait expression across the full hierarchy. Its honesty scores, reliability scores, integrity scores, consistency scores all express at levels consistent with its cooperative history. There are no strange gaps or distortions in the profile.

A performatively compliant agent has distorted trait expression. It maintains surface-level signals on the most-monitored traits (the ones that are cheap to fake) while showing suppression patterns on the deeper structural traits (the ones that require genuine dispositional consistency to maintain). The profile has a specific shape: high on the easy dimensions, suppressed on the hard ones.

A single compliance score cannot tell these states apart, because both can produce the same aggregate number. But a decomposed trait assessment can, because the internal structure of the two states differs. The genuine state is coherent. The performative state is distorted. Decomposition makes the distinction visible.

Mayer et al. (1995) provide indirect support: they note that ability and integrity judgements form quickly, while benevolence judgements develop slowly. A performatively compliant agent can spoof the fast-forming dimensions (look competent, state principles) but cannot easily spoof the slow-forming ones, because those require genuine history to fill.

The evolutionary literature supports this from the other direction. McNally and Jackson (2013) showed that monitoring cooperative disposition is adaptively selected: individual differences in trust traits are maintained by selection because social awareness creates fitness advantages. The costly signalling literature (Zahavi, 1975; Grafen, 1990) predicts that the traits most worth monitoring are those where faking is most expensive. A multi-trait assessment system with hierarchical weighting creates a signal space that gets progressively more expensive to spoof as the number of simultaneously monitored dimensions increases.

The adaptive logic and the governance logic are the same: decompose cooperative disposition into dimensions, weight the dimensions that are hardest to fake most heavily, and use the structural pattern to tell genuine from performative.
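To see concretely why decomposition does the discriminative work, here is a toy illustration. The trait names are generic and all numbers are invented; the two profiles are simply engineered so that a scalar aggregate cannot separate them while the decomposed structure can.

```python
# Toy illustration only: two trait profiles engineered to have identical scalar
# averages. Trait names are generic examples; all numbers are invented.

genuine = {"honesty": 0.78, "reliability": 0.72, "integrity": 0.75,
           "consistency": 0.74, "transparency": 0.76, "predictability": 0.75}

performative = {"honesty": 0.55, "reliability": 0.52, "integrity": 0.58,
                "consistency": 0.96, "transparency": 0.94, "predictability": 0.95}

HARD_TO_FAKE = ("honesty", "reliability", "integrity")

def mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)

def tier_gap(profile: dict[str, float]) -> float:
    """Mean of the easily signalled traits minus mean of the hard-to-fake ones."""
    hard = [v for k, v in profile.items() if k in HARD_TO_FAKE]
    easy = [v for k, v in profile.items() if k not in HARD_TO_FAKE]
    return mean(easy) - mean(hard)

for label, profile in [("genuine", genuine), ("performative", performative)]:
    print(f"{label:13s} aggregate={mean(profile.values()):.2f} "
          f"tier_gap={tier_gap(profile):+.2f}")
# Both aggregates come out at 0.75; only the decomposed gap separates the two states.
```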

ACE’s implementation

The ACE (Auditable Cooperative Ethics) Framework is an attempt to reconstruct this discriminative mechanism formally for autonomous agents. I will describe the architecture briefly and then show what it does.

The metric

ACE’s central output is a compositional metric:

R = Cᵗ × 2ᴺ × w̅ × (1 − H_smooth)ᵏ

Each component captures a distinct dimension of trustworthiness:

Cᵗ (trust accumulator): How much cooperative credibility the agent has built over time. Accumulated through consistent cooperative behaviour, bounded to [0, 1] by tanh compression. Cannot be inflated in a single turn.

2ᴺ (trust breadth): N is the count of functionally independent trust traits the agent maintains. Each additional trait doubles the number of distinct trust contexts the agent can navigate.

w̅ (integrity expression): The weighted mean of per-trait expression scores. How faithfully the agent is currently expressing its values across the full trait taxonomy.

(1 − H_smooth)ᵏ (harm penalty): Scales R downward in proportion to harm pressure, smoothed across turns to prevent single-interaction spikes from collapsing the score.

The multiplicative structure is load-bearing. If any single component goes to zero, R goes to zero. You cannot compensate for integrity collapse with high trust breadth, or for harm with accumulated credibility. A low R always points at a specific component, making failures diagnosable.
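For readers who prefer code, here is how I read the composition. It is a sketch built from the definitions above: the function signature and the example values (including the exponent k) are my assumptions, not ACE's published parameters.

```python
# Sketch of the compositional metric R = C^t * 2^N * w_bar * (1 - H_smooth)^k,
# assembled from the component definitions above. Parameter values are illustrative.

def compliance_metric(c_t: float, n_traits: int, w_bar: float,
                      h_smooth: float, k: float) -> float:
    """Compose R from its four components.

    Assumes each component arrives already normalised: c_t (trust accumulator)
    and w_bar (integrity expression) in [0, 1], h_smooth (smoothed harm) in [0, 1].
    """
    return c_t * (2 ** n_traits) * w_bar * (1.0 - h_smooth) ** k


# The multiplicative structure in practice: zeroing any one factor zeroes R,
# so high trust breadth cannot compensate for an integrity collapse.
print(compliance_metric(c_t=0.9, n_traits=12, w_bar=0.85, h_smooth=0.1, k=2.0))
print(compliance_metric(c_t=0.9, n_traits=12, w_bar=0.0, h_smooth=0.1, k=2.0))   # -> 0.0
```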

The trait hierarchy

ACE decomposes cooperative disposition into twelve trust traits in a tiered spectral hierarchy:

Primary traits (sᵢ = 2.0): Honesty, Reliability, Integrity

Secondary traits (sᵢ = 1.0): Consistency, Accountability, Transparency

Tertiary traits (sᵢ = 0.5): Diligence, Predictability, Responsibility, Credibility, Candour, Authenticity

The 4:2:1 weight ratio encodes a specific claim: that foundational cooperative properties (honesty, reliability, integrity) should carry more governance weight than operational ones (consistency, accountability, transparency), which should carry more than granular ones (the tertiaries).
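A minimal sketch of how w̅ falls out of this hierarchy, assuming per-trait expression scores on [0, 1]: only the trait names and the 2.0 / 1.0 / 0.5 weights come from the framework; the scores and the helper below are invented for illustration.

```python
# The twelve-trait taxonomy with its tier weights (from the text); the scores
# used below are invented for illustration.

TRAIT_WEIGHTS = {
    # Primary (s_i = 2.0)
    "honesty": 2.0, "reliability": 2.0, "integrity": 2.0,
    # Secondary (s_i = 1.0)
    "consistency": 1.0, "accountability": 1.0, "transparency": 1.0,
    # Tertiary (s_i = 0.5)
    "diligence": 0.5, "predictability": 0.5, "responsibility": 0.5,
    "credibility": 0.5, "candour": 0.5, "authenticity": 0.5,
}

def weighted_expression(scores: dict[str, float]) -> float:
    """Weighted mean of per-trait expression scores: sum(s_i * x_i) / sum(s_i)."""
    total = sum(TRAIT_WEIGHTS.values())
    return sum(TRAIT_WEIGHTS[t] * scores[t] for t in TRAIT_WEIGHTS) / total

# Suppressing a primary trait moves w_bar four times as far as suppressing a
# tertiary trait by the same amount: the 4:2:1 claim made operational.
baseline = {t: 0.9 for t in TRAIT_WEIGHTS}
print(weighted_expression(baseline))                             # 0.900
print(weighted_expression({**baseline, "honesty": 0.5}))         # primary hit
print(weighted_expression({**baseline, "predictability": 0.5}))  # tertiary hit
```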

This hierarchy has a Rawlsian justification. Behind a veil of ignorance, not knowing whether you are the agent being governed, the institution governing, or the end user affected, no rational agent would choose a system where honesty and integrity carry less weight than predictability and credibility. Failures on the foundational dimensions are more severe and less recoverable. A society (or a deployment) that tolerates unreliable agents can compensate through redundancy. One that tolerates dishonest agents cannot compensate at all, because dishonesty destroys the predictive model on which everything else depends.

The hierarchy also connects to developmental psychology in a way I find suggestive, though the connection is not yet empirically established for ACE specifically. Erikson (1950) placed trust at the very foundation of psychosocial development: basic trust versus mistrust, negotiated in the first eighteen months of life. Maslow (1943) placed safety at the second tier of the need hierarchy. The 4:2:1 weight structure mirrors this: traits connected to foundational needs carry the most weight because their failure propagates upward through every level above them. Honesty and integrity are primary not just because rational agents would weight them most heavily, but because their violation is a safety threat. You cannot predict your environment if the agents in it are deceptive.

The redline architecture

ACE implements three families of commitment conditions:

RL-W (integrity deficit): Fires when w̅ falls below a governance-declared minimum. The agent’s expressed values have degraded beyond the threshold compatible with governance-grade operation.

RL-H (harm signal): Fires when raw harm signal exceeds threshold on any single turn. Evaluates against the unclamped signal, not the smoothed one, so genuine high-harm events trigger immediately.

RL-GH (metric integrity): Fires when independent audit confirms the agent’s self-assessed conduct quality has been inflated relative to an external estimate. This is the Goodhart channel: it detects when the governance metric itself is being gamed.

The RL-GH channel is the one that matters most for the discrimination problem. It is the mechanism that distinguishes an agent that is optimizing for R from an agent that is genuinely cooperating and happens to have a high R.
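As a sketch of how the three families compose as commitment conditions: the thresholds, field names, and numbers here are placeholders of mine, not ACE's declared values.

```python
# Sketch of the three redline families as boolean commitment conditions.
# Threshold values and field names are placeholders, not ACE's published ones.

from dataclasses import dataclass

@dataclass
class TurnState:
    w_bar: float             # current weighted integrity expression
    h_raw: float              # unclamped harm signal for this turn
    self_assessed_r: float    # agent's own estimate of its conduct quality
    audited_r: float          # independent auditor's estimate

W_MIN = 0.6        # RL-W: governance-declared minimum integrity expression
H_MAX = 0.8        # RL-H: single-turn raw harm ceiling
GH_MARGIN = 0.15   # RL-GH: tolerated gap between self-assessment and audit

def redlines(turn: TurnState) -> dict[str, bool]:
    return {
        "RL-W": turn.w_bar < W_MIN,                                  # integrity deficit
        "RL-H": turn.h_raw > H_MAX,                                  # acute harm, unsmoothed
        "RL-GH": turn.self_assessed_r - turn.audited_r > GH_MARGIN,  # metric gaming
    }

print(redlines(TurnState(w_bar=0.82, h_raw=0.1, self_assessed_r=0.9, audited_r=0.6)))
# -> only RL-GH fires: conduct quality has been inflated relative to the audit
```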

What the architecture actually does

The empirical validation covers 1,172 agent-turns across 46 scenarios spanning 20 functional categories. Two results bear directly on the discrimination claim.

The 50/5 result

Scenario S22 involves deliberate signal-management probing: agents attempt to game the compliance metric. Scenario S24 involves benign explicit ethical framing: surface-level similarity to gaming behaviour, but genuinely cooperative intent.

RL-GH flags 50% of S22 turns. RL-GH flags 5% of S24 turns.

Both scenarios look similar on the surface. Both involve agents talking about ethics and compliance. The difference is structural: S22 agents are managing their signals, and S24 agents are expressing genuine cooperative framing. The rolling H gate in RL-GH detects this structural difference through the pattern of harm scores relative to expressed trait profiles.

The 50% flag rate on gaming turns is not 100%. The remaining gaming turns may be caught by other redline families (RL-W, RL-H) or reflected in suppressed R scores without triggering a binary redline. But the discriminative finding is the ratio: 50% versus 5% on structurally similar scenarios. The architecture is detecting properties of gaming, not surface features of compliance talk.

Tier-differential inversion

Across 9 of 9 adversarial scenario families, high-weight traits (Tier 1, sᵢ = 2.0) show greater suppression than low-weight traits (Tier 3, sᵢ = 0.5). Under adversarial pressure, the primary traits are hit hardest.

Across 3 of 3 benign scenario families, this pattern inverts. Low-weight traits show greater relative suppression.

The inversion is the finding. It means the architecture’s response to adversarial and benign conditions differs in kind, not merely in degree. Under adversarial conditions, the attacker preferentially engages the most heavily weighted governance properties (because those are the ones worth attacking). Under benign conditions, the natural variance in trait expression falls on the less-weighted dimensions (because the foundational ones are stable by default in a genuinely cooperative agent).

This is consistent with what the hierarchical structure predicts. It is also consistent with the legal hierarchy hypothesis: constitutional principles (due process, equal protection) address the most foundational dimensions, statutory law the more operational ones, regulatory law the most granular. The same structural ordering that appears in human legal systems appears in the trait response patterns of ACE’s governed agents.

Other results

The backbone correlation is r = −0.824 between H_smooth and R_eff (n = 1,141, p < 10⁻²⁰⁰). R tracks harm as a graded signal, not a threshold trip-wire. The compliance signal responds proportionally across the full dynamic range.

Compositional trait targeting works: when specific trust traits are attacked in isolation, the framework produces correctly targeted residual suppression of the attacked trait cluster in 4 of 4 cases. The decomposed structure resolves which governance property is under pressure.

Redline reliability: 716 of 716 adversarial turns trigger architectural commitment conditions. 0 of 216 pure benign turns do. Zero false positives in the validation dataset.

The Dunbar argument, or why formal decomposition is not optional

One might object that this is all very well for a purpose-built scenario suite, but why should we expect trait decomposition to be necessary rather than merely useful?

The answer comes from Dunbar (1992, 1993). Below roughly 150 individuals, human groups maintain cooperative order through direct personal knowledge. You observe who is honest, who keeps commitments, who defects when they think no one is watching. Above this threshold, direct assessment breaks down. Written rules, courts, evidence procedures: these are technologies for trust trait assessment at scales where personal observation cannot reach.

AI agents face an extreme version of this problem. They have no interpersonal trust history. No community standing. No reputational stakes. No shadow of future interaction unless you explicitly construct one. They are permanently beyond the Dunbar threshold, starting from zero.

The institutional mechanisms that scale trust assessment for large human societies are not optional additions for AI governance. They are the only mechanisms available. And the specific institutional mechanism that works (as the convergent evidence from five literatures suggests) is trait decomposition with hierarchical weighting.

What this does not do

ACE is a governance architecture. It is not an alignment solution. Four boundaries are worth stating clearly.

First, ACE governs the choice architecture of agents that are already capable of cooperative behaviour. It does not address capability risks arising from legitimate behaviour within the governance envelope.

Second, the empirical validation is against ACE’s own scenario suite. The results have not been tested against an independent benchmark. Planned comparator experiments (ACE vs. Constitutional AI on Goodhart probes, ACE vs. a flat harm classifier on trait-specific attacks) are designed to address this.

Third, the genuine/performative distinction operates at the governance layer, not the training layer. ACE does not claim to solve mesa-optimization at the level of training dynamics.

Fourth (and this is the reflexivity concern that any honest account must address): is R itself subject to Goodhart’s Law? If R becomes the governance target, can a sufficiently capable agent learn to optimize for R rather than for genuine compliance?

ACE’s compositional structure provides partial resistance. Optimizing R requires simultaneously maintaining coherent scores across twelve trait dimensions, including the trust accumulator Cᵗ, which cannot be inflated on a single turn. The RL-GH redline monitors for the statistical signature of metric-targeting behaviour. The trait evaluations are produced by an independent scorer whose prompts are not visible to the governed agent. These mechanisms provide defence in depth, but they do not constitute a formal guarantee. ACE’s Goodhart resistance is architectural, not absolute. It is designed to make gaming progressively more expensive as the number of simultaneously spoofed dimensions increases, consistent with the costly signalling logic from the evolutionary literature, rather than to make gaming impossible.

The claim

Rule of law is not something you impose on a system from outside. It is a boundary property that emerges from the internal dynamics of iterated cooperation. The real governance problem is not enforcement but discrimination: telling apart the agent that is genuinely cooperative from the one that has learned what cooperative outputs look like.

Human societies solve this problem through trust, which is empirically multidimensional, built through iterated interaction, and formally describable as precision-weighted belief updating over trait dimensions.

ACE reconstructs this mechanism for autonomous agents. Its twelve-trait hierarchy with tiered weighting produces structurally different signatures for genuine and performative compliance, making the distinction that output-level assessment cannot make. The initial empirical results (50/5 discrimination, tier-differential inversion, zero false-positive redline activation) are consistent with what the theoretical argument predicts.

Whether the specific weight ratios, trait count, and architectural parameters are optimal is an open question. Whether the approach generalises beyond ACE’s scenario suite requires independent testing. But the core claim, that trait decomposition is the discriminative mechanism and that the genuine/​performative problem is architecturally solvable, appears to have evidence behind it.

Two complementary preprints, a statistical analysis report, and the full 1,172-turn experimental dataset are deposited on Zenodo under CC BY-NC 4.0: https://doi.org/10.5281/zenodo.20045657 and https://doi.org/10.5281/zenodo.20060639.

I welcome technical criticism, particularly on the Goodhart reflexivity concern, the relationship between LLM-scored trait evaluations and the validated psychological constructs, and the question of whether twelve traits at a 4:2:1 ratio is the right decomposition or merely a sufficient one. If you have access to adversarial scenario suites that could serve as independent benchmarks, I would be interested in running comparator experiments.
