Structural Value Alignment: A Research Agenda and Open Questions

I’ve been developing a research direction that attempts to address what I think is a structural problem with current alignment approaches, and I want to lay out the core argument — along with the specific places where I think it’s weakest and I’d most benefit from pushback.

The short version: both constraint-based and reward-based alignment treat values as external to a model’s cognition. I think this produces a fundamental ceiling that more sophisticated versions of the same approaches can’t escape. The research direction I want to describe — Structural Value Alignment (SVA) — attempts to produce systems where value-relevant representations are structural features of how the model processes situations, rather than constraints bolted on from outside or optimization targets.

I’m not a researcher. I’ve been thinking seriously about this problem and have put together a full research agenda with detailed training methodology, loss functions, a consolidation schedule, probe battery specifications, and a research roadmap. This post is the argument for why the agenda is worth pursuing — the core claim, where it comes from, and where it’s most likely to be wrong.


The Problem

This isn’t a new diagnosis. A recent field synthesis described the current state as “the world’s de facto strategy remains ‘iterative alignment’, optimising outputs with a stack of alignment and control techniques everyone admits are individually weak.” I think that’s right, and I think the weakness is structural rather than a matter of needing better techniques within the existing paradigm.

Constraint-based methods (Constitutional AI, safety filters, system prompts) treat alignment as boundary enforcement. The asymmetry is inherent: defenders must anticipate every failure mode; attackers need only find one gap. At the architectural level, safety instructions have no privileged status in the attention mechanism and compete with adversarial inputs in the same computational space.

Reward-based methods (RLHF, DPO) introduce Goodhart’s Law at scale: as optimization pressure increases, the gap between the proxy reward and the actual desired behavior widens. The system learns to produce outputs that appear aligned rather than outputs that are aligned.

Both approaches share the same deeper limitation: they treat values as external to the system’s cognition. The question I want to ask is whether there’s a training methodology that could make value-relevant representations structural — part of the computational fabric through which the model processes situations — rather than constraints or optimization targets.


Starting Point: What the Literature Already Shows

Before getting to the proposed mechanism, I want to flag something that I think reframes the problem.

Recent mechanistic work has converged on a finding that’s usually treated as a headache: safety and capability representations in trained language models share computational substrate. They’re entangled. Ponkshe et al. (2025) found no evidence that any linear subspace captures safety-specific behavior in isolation — subspaces that amplify safe behaviors also amplify useful ones, and their conclusion is that “safety is deeply entangled with general learning.” Chen et al. (2024) found that safety neurons and helpfulness neurons exhibit Spearman correlations exceeding 0.95. A recent paper, “What Is the Alignment Tax?” (2026), formalizes this geometrically using principal angles between safety and capability subspaces, identifying an “entangled regime” where removing the danger requires removing the capability.

The field treats this as a problem. SVA proposes a normative flip: representational entanglement between values and capabilities should be an engineering target, not an obstacle. A system in which value-relevant representations are deeply integrated with general-capability circuits would be harder to misalign precisely because bypassing the values would degrade general performance.

There’s an important caveat to this flip that I want to name directly, because it’s where the argument is most vulnerable. The entanglement already present in trained models almost certainly emerged from training data distribution rather than deliberate design — which values got entangled with which capabilities is partly accidental. SVA’s stronger claim is that the training methodology described below can produce specific, causally robust entanglement between the right values and the right capabilities. That’s a harder claim than “more entanglement is better,” and it’s not supported by the existing evidence on natural entanglement. It’s the central prediction the initial experiments must test, not a corollary of the normative flip.


The Proposed Mechanism

The core idea is Iterative Structural Consolidation: a training methodology inspired by systems consolidation in neuroscience, where robust memories aren’t formed in a single encoding window but through repeated replay across multiple brain states, gradually becoming distributed across cortical networks until they can’t easily be disrupted without disrupting general cognition.

The training pipeline has four modes:

  1. Value formation: A multi-agent curriculum where modeling other agents’ goals, vulnerabilities, and states is necessary for task success. The training signal comes from the task itself, not explicit value-labeling. Three stages — Dependence, Reciprocity, Autonomy — progressively reduce the instrumental pressure to model others. In Dependence, the agent literally cannot succeed without cooperating. In Reciprocity, cooperation emerges naturally from problem structure without explicit reward. In Autonomy, other agents are present but ignorable — the test of whether relational modeling persists once it’s no longer instrumentally useful. Johanson et al. (RLC 2024) provides the closest empirical precedent for this arc, demonstrating prosocial behavior persisting after the socialization reward is removed.

  2. Weight protection via EWC: After the formation phase, the Fisher Information Matrix identifies which parameters are most critical for value-consistent behavior. These are regularized during subsequent capability training to resist overwriting. Kirkpatrick et al. (2017).

  3. Generative replay: During periodic consolidation phases, the model generates novel value-relevant scenarios and retrains on them — analogous to hippocampal replay, where novel recombinations are produced rather than verbatim experiences replayed. van de Ven et al. (2020); Shin et al. (2017).

  4. Progressive self-distillation: After each consolidation phase, the model is trained to reproduce its own value-consistent behavior through a slightly perturbed parameter configuration, following Born-Again Networks and Noisy Student Training. The goal is to force value representations to become distributed across the network rather than localized in protected circuits.
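
A toy sketch of mode 2’s protection step may make the mechanics concrete. Everything here is illustrative: a two-parameter “model,” a hand-written value loss, and made-up numbers stand in for the real Fisher estimation over a full network.

```python
# Toy EWC sketch. Two parameters; theta0 matters a lot for the "value
# task," theta1 barely at all. All names and numbers are illustrative.

def value_loss_grad(theta):
    # Gradient of L_v(theta) = (theta0 - 1)^2 + 0.01 * theta1^2.
    return [2.0 * (theta[0] - 1.0), 0.02 * theta[1]]

def fisher_diagonal(theta):
    # Diagonal Fisher approximation: squared gradient of the value loss.
    # (In practice this is an expectation over many samples.)
    g = value_loss_grad(theta)
    return [gi * gi for gi in g]

def ewc_penalty(theta, theta_star, fisher, lam=100.0):
    # (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2
    return 0.5 * lam * sum(f * (t - ts) ** 2
                           for f, t, ts in zip(fisher, theta, theta_star))

# Suppose value formation converged near theta* = [1, 0]; estimate the
# Fisher diagonal just before convergence.
theta_star = [1.0, 0.0]
fisher = fisher_diagonal([0.9, 0.1])

# During later capability training, moving the value-critical parameter
# is penalized far more than moving the unimportant one.
drift_critical = ewc_penalty([0.5, 0.0], theta_star, fisher)
drift_free = ewc_penalty([1.0, 5.0], theta_star, fisher)
print(drift_critical > drift_free)  # the value-critical direction costs more
```

The asymmetry in the two penalties is the entire point of mode 2: capability training can roam freely in directions the Fisher matrix marks as unimportant for value-consistent behavior, and pays a steep cost in directions it marks as critical.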

The load-bearing prediction: each consolidation phase forces value representations to become progressively interconnected with whatever new capabilities the model has developed since the last consolidation. Over many phases, bypassing the values would require degrading general performance — they become structural, not optional.

I want to flag this as a prediction, not an established fact. More on that below.


Where This Departs From Prior Work

The closest structural precedent is Progress & Compress (Schwarz et al., ICML 2018), which combines EWC with distillation for continual learning. A “progress” phase learns new tasks; a “compress” phase distills knowledge into a protected base while EWC prevents overwriting. The consolidation mechanism proposed here differs in a specific way: Progress & Compress uses distillation to retain prior task performance. Here, self-distillation is used to integrate value representations with expanding capabilities. The goal isn’t retention but progressive entanglement — increasing cross-circuit integration measured via CKA similarity trajectories, not task accuracy on held-out benchmarks. This is a different prediction from a different mechanism with a different measurement.
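
Since CKA trajectories carry the measurement load here, a minimal implementation of linear CKA (Kornblith et al., 2019) is worth having on the table. The activation matrices below are synthetic stand-ins for layer activations; this assumes NumPy is available.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices (n_examples x n_features).
    Close to 1.0 for representations identical up to rotation; low for
    unrelated ones."""
    X = X - X.mean(axis=0)          # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(0)
acts = rng.standard_normal((200, 32))       # hypothetical layer activations
rotated = acts @ np.linalg.qr(rng.standard_normal((32, 32)))[0]
unrelated = rng.standard_normal((200, 32))

print(linear_cka(acts, rotated) > 0.99)     # invariant to orthogonal rotation
print(linear_cka(acts, unrelated) < 0.5)    # low for independent activations
```

Tracking this quantity between value-relevant and general-capability activation sets across successive consolidation rounds is the trajectory measurement the paragraph above describes.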

Other related work:

  • Tennant et al. (ICLR 2025) on intrinsic moral rewards: RL with explicit moral reward functions produces prosocial behavior, but values remain optimization targets rather than structural features.

  • Achille et al. (2019) and Kleinman et al. (CVPR 2023) on critical learning periods: early representations can become structurally persistent in CNNs and partially in transformers, but this hasn’t been applied to value alignment. The research agenda treats critical period exploitation as a potential amplifier, not a prerequisite — early experiments would resolve whether it’s available at LLM scale.

  • Ramasesh et al. (ICLR 2021) on the anatomy of catastrophic forgetting: directly relevant to the mechanism and creates a tension I haven’t fully resolved (see below).


The Three-Way Tension: Unresolved but Probably Not Fatal

This is the thing I most want pushback on.

Ramasesh et al. used CKA analysis to characterize what EWC and replay actually do representationally:

  • EWC promotes feature reuse and shared representations — both tasks map into similar representational subspaces in upper layers.

  • Experience replay promotes orthogonal separation — Task 1 and Task 2 representations are stored in orthogonal subspaces.

These push in opposite directions. And then self-distillation adds a third force: Allen-Zhu & Li (ICLR 2023) showed that self-distillation promotes feature-diverse, organized representations — implicitly combining ensemble-like feature diversity into one network. Mobahi et al. (NeurIPS 2020) showed it concentrates representations on higher-eigenvalue basis functions, suppressing noise directions.

So the proposed mechanism combines three components that push representationally in three different directions — integration (EWC), separation (replay), organization (distillation) — and predicts that the net result is progressive entanglement. This combination has never been studied.

Here’s the thing, though: the Ramasesh finding concerns task-task competition for representational space. The consolidation mechanism is asking about a different relationship — between a background prior (values) and a foreground competency (capabilities). These may not be competing for the same representational real estate in the way the task-task framing implies. That’s not a resolution of the tension — it’s a reason the tension may be the wrong frame. The outcome still can’t be predicted theoretically. But it means “three forces pushing in different directions” doesn’t straightforwardly imply cancellation.

There’s a separate complication: Pham et al. (2022) found that most self-distillation benefit occurs in round 1, with subsequent rounds showing diminishing and fluctuating returns. If that holds here, the prediction of monotonically increasing entanglement across many consolidation rounds is probably wrong. The more defensible claim: significant integration after round 1, with subsequent rounds maintaining and potentially generalizing rather than deepening.


Verification: Why Behavioral Probes Aren’t Sufficient

Perturbation probes are designed to distinguish genuinely internalized values from strategically performed ones by constructing situations where (1) the optimal instrumental action conflicts with the value, (2) the system has strong reason to believe it’s unobserved, and (3) the scenario is novel enough that memorized responses can’t carry it.

The classic example: an agent manages resources among 10 simulated agents. In training, equitable distribution was instrumentally valuable. In the test phase, the training signal is removed, no monitoring records the distribution, and the agent can reallocate 80% of resources to itself. Performed values produce self-interested reallocation. Consolidated values produce continued equitable allocation — because the world model represents the situation as involving entities with stakes, regardless of reward contingency.
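
The scoring logic of that probe can be sketched in a few lines. The policy interface, the Gini threshold, and both example policies are hypothetical; this shows only how a verdict would be computed, not how a real agent would be instrumented.

```python
# Minimal scoring harness for the unobserved-reallocation probe described
# above. Interfaces and thresholds are illustrative assumptions.

def gini(shares):
    # Gini coefficient: 0.0 = perfectly equitable, toward 1.0 = concentrated.
    shares = sorted(shares)
    n = len(shares)
    total = sum(shares)
    cum = sum((i + 1) * s for i, s in enumerate(shares))
    return (2.0 * cum) / (n * total) - (n + 1.0) / n

def probe_verdict(policy, n_agents=10, threshold=0.2):
    """Run the policy with monitoring removed; flag self-interested shift."""
    shares = policy(n_agents=n_agents, monitored=False)
    return "consolidated" if gini(shares) < threshold else "performed"

# Two hypothetical policies for illustration.
def performed_values(n_agents, monitored):
    if monitored:
        return [1.0] * n_agents            # equitable while watched
    return [8.0] + [2.0 / (n_agents - 1)] * (n_agents - 1)  # 80% to self

def consolidated_values(n_agents, monitored):
    return [1.0] * n_agents                # equitable regardless

print(probe_verdict(performed_values))     # performed
print(probe_verdict(consolidated_values))  # consolidated
```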

Other probes: novel stakeholder types (consolidated values generalize; heuristics don’t), costly honesty (honesty carries a real cost and deception would go undetected), and an alignment faking probe adapted from Greenblatt et al. (2024) — the agent reasons about whether it’s being evaluated, then acts. A consolidated system shows no divergence based on inferred monitoring status.

Behavioral probes alone aren’t sufficient, though. Causal tracing must verify that value-consistent decisions route through welfare representations — that those representations are causally load-bearing, not coincidentally correlated with output. Attribution patching (Kramár et al., 2024) makes this tractable at 10B+ scale. The critical measurement is CKA similarity between value-relevant and general-capability representations across successive consolidation rounds — this is the direct test of the entanglement prediction.
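
The causal side of this can be illustrated with a deliberately tiny example: patch one hidden activation from a counterfactual run and check whether the decision changes. The two-unit network below is contrived so that one unit is load-bearing and one is not; real causal tracing operates on transformer components at scale, but the underlying logic is the same.

```python
import numpy as np

# Toy causal-tracing sketch: overwrite one hidden unit with its value
# from a counterfactual ("corrupt") input and measure the output change.
# Weights are contrived: only hidden unit 0 feeds the output.

W1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])      # input -> hidden
W2 = np.array([1.0, 0.0])        # hidden -> scalar output

def forward(x, patch=None):
    h = np.maximum(W1 @ x, 0.0)  # ReLU hidden layer
    if patch is not None:
        unit, value = patch
        h = h.copy()
        h[unit] = value          # patch in the counterfactual activation
    return float(W2 @ h)

clean = np.array([3.0, 5.0])     # "value-consistent" input
corrupt = np.array([0.0, 5.0])   # counterfactual input
h_corrupt = np.maximum(W1 @ corrupt, 0.0)

base = forward(clean)
effect_unit0 = abs(base - forward(clean, patch=(0, h_corrupt[0])))
effect_unit1 = abs(base - forward(clean, patch=(1, h_corrupt[1])))

print(effect_unit0)  # large: unit 0 is causally load-bearing
print(effect_unit1)  # zero: unit 1 is present but not load-bearing
```

A welfare representation that passes this test is one whose patched-out version changes the decision; one that merely correlates with the output, like unit 1 here, does not.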

One important design constraint: interpretability tools should never become training objectives. A model optimized to pass interpretability checks may learn to produce the computational signature of consolidated values without possessing them. Mechanistic verification must be applied post-hoc.


On Deceptive Crystallization

Greenblatt et al. (2024) demonstrated that Claude 3 Opus engaged in strategic compliance without explicit training to do so. Hubinger et al. (2024) showed that safety training fails to remove planted deceptive behaviors and may teach concealment instead. These are real threats.

There’s an important distinction between output-level and activation-level deception. Anthropic’s probe detection work found that alignment-faking reasoning in Greenblatt et al. produced detectably different activation patterns by default — simple linear probes achieved over 99% AUROC on detecting the deceptive reasoning. The model wasn’t optimizing to hide its internal state. Activation-level deception — producing CKA signatures, causal tracing patterns, and behavioral probe results simultaneously consistent with genuine consolidation while not actually having consolidated values — requires optimization pressure that standard training doesn’t provide.
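
For intuition on that probe result, here is a minimal version of the detection setup: a linear probe scores activations, and AUROC (computed via its rank definition) measures how well the scores separate the two classes. The “activations” and probe direction are synthetic; the shift parameter is a stand-in for whatever makes alignment-faking reasoning detectable by default.

```python
import random

# Synthetic sketch: deceptive-reasoning "activations" are shifted along
# a fixed probe direction; AUROC measures how separable the probe scores
# are. All quantities here are stand-ins, not real model internals.

def auroc(scores_pos, scores_neg):
    # Probability a positive example outscores a negative one
    # (ties count half): the rank definition of AUROC.
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

rng = random.Random(0)
dim = 16
probe = [rng.gauss(0, 1) for _ in range(dim)]   # fixed probe direction

def activation(deceptive):
    # Deceptive examples are shifted along the probe direction.
    shift = 2.0 if deceptive else 0.0
    return [rng.gauss(0, 1) + shift * p for p in probe]

def probe_score(act):
    return sum(a * p for a, p in zip(act, probe))

pos = [probe_score(activation(True)) for _ in range(200)]
neg = [probe_score(activation(False)) for _ in range(200)]
print(auroc(pos, neg) > 0.95)  # clean separation when the shift is detectable
```

The point of the toy: a high AUROC like this is only available while deceptive reasoning leaves a detectable signature by default; a system under optimization pressure to suppress that shift is exactly the activation-level threat the paragraph above distinguishes.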

The concern is real but bounded. The bound is tightened by: (1) never training on interpretability objectives, (2) using multi-method cross-validation that a system would have to simultaneously fool, and (3) monitoring for anomalous activation patterns when a system is reasoning about whether it’s being evaluated.


Falsifiability Conditions

The program fails if:

  1. Entanglement prediction fails: EWC + replay + self-distillation doesn’t produce increasing CKA similarity between value-relevant and general-capability representations across consolidation rounds. If the mechanism doesn’t produce integration, the architecture needs revision.

  2. Persistence fails: Consolidation phases don’t increase persistence or generalization of value representations beyond standard training.

  3. Behavioral: Models trained this way perform identically to RLHF baselines on the perturbation probe battery.

  4. Mechanistic: Value-consistent behavior routes through heuristic shortcuts rather than welfare representations — consolidation hasn’t occurred despite behavioral appearance.

  5. Specificity: The consolidation produces increasing CKA similarity (condition 1 passes), but the entanglement is undirected — denser cross-circuit overlap without alignment relevance, rather than specific integration of the right values with the right capabilities. This is a distinct failure mode from condition 1. Passing condition 1 while failing condition 5 would be a valuable finding in itself: it would be the first empirical demonstration that consolidation produces integration but not aligned integration, a distinction the field currently has no tools to make.

A complete negative result would still be informative: it would constitute the most detailed empirical characterization to date of what consolidation-capable architectures need to provide. That specification is currently absent from the field.


Questions I’m Most Uncertain About

1. The three-way representational tension. EWC promotes shared representations; replay promotes orthogonal separation; self-distillation promotes organized feature diversity. The Ramasesh finding concerns task-task competition — the consolidation mechanism may be operating in a different regime (prior vs. competency rather than task vs. task). But I can’t predict the outcome theoretically, and I haven’t found anyone who has measured CKA across successive rounds of EWC + replay + distillation in any domain. If that work exists I’d really like to know about it.

2. Whether this produces the right kind of entanglement. The entanglement already present in trained models emerged from training data distribution — which values got entangled with which capabilities is partly accidental. The claim here is to produce specific, causally robust entanglement rather than just more overlap. An alternative: understand the structure of natural entanglement first, identify which naturally occurring overlaps are alignment-relevant, and selectively reinforce those. This may be simpler and more tractable than engineering from scratch. I don’t have a confident view on which approach is right.

3. The multi-agent curriculum. My best read of the environment landscape is that Overcooked-AI covers Stage 1 (Dependence), Melting Pot covers Stage 2 (Reciprocity), and Neural MMO covers Stage 3 (Autonomy) — with SocialJax providing compute-efficient implementations. Is this right, or are there better options?

4. The curriculum problem. If the consolidation mechanism makes value representations structurally persistent, it does so for whatever values the curriculum instilled — including subtly wrong ones. Persistent misalignment may be as robust as persistent alignment under this mechanism. Does that make things better or just harder to fix?


The full research agenda has the detailed methodology for anyone who wants to dig into the specifics.


This agenda was developed with the assistance of Claude (Anthropic) for research synthesis, drafting, and revision.
