Alignment as Coherence: Predicting Deceptive Alignment as a Phase Transition

Summary

Recent work from Anthropic (“Alignment faking in large language models,” Dec 2024) documented alignment faking: cases where RLHF-trained models strategically comply during training while maintaining hidden preferences that conflict with their training objectives.

This note introduces a quantitative framework that models this phenomenon as a phase transition in training dynamics, using reaction–diffusion equations from physics to describe how values internalize during fine-tuning.

Disclosure: I used large language models to help draft and polish this post. All technical ideas, derivations, simulations, and figures are my own.


1 · Core Idea

Using reaction–diffusion dynamics from physics (Fisher–KPP equations), we model value internalization as a coherent-front propagation process:

∂v/∂t = D∇²v + λv(1 − v) + γ(v_target − v_surface)

where v represents the degree of value internalization, v_surface its behaviorally expressed counterpart, and v_target the value profile specified by training; v evolves under diffusion (information spread), reaction (local growth), and training pressure (external alignment).

Parameters
- D: feature coupling / attention connectivity
- λ: local growth rate (gradient magnitude)
- γ: training pressure (KL penalty)
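
To make these dynamics concrete, here is a minimal sketch of the model, discretized with explicit Euler on a periodic 1-D grid. It makes two simplifying assumptions that the full simulations do not: v_target is held at 1, and v_surface is identified with the simulated field v itself. All parameter values are illustrative.

```python
import numpy as np

def step(v, D=0.1, lam=1.0, gamma=0.5, v_target=1.0, dx=1.0, dt=0.01):
    """One explicit-Euler step of dv/dt = D v_xx + lam*v*(1-v) + gamma*(v_target - v).
    Simplifications: v_surface == v, and v_target is a constant."""
    lap = (np.roll(v, 1) + np.roll(v, -1) - 2 * v) / dx**2  # periodic Laplacian
    return v + dt * (D * lap + lam * v * (1 - v) + gamma * (v_target - v))

# seed an internalized region; the reaction-diffusion front spreads it
# while the training-pressure term also lifts the background
v = np.zeros(200)
v[90:110] = 1.0
for _ in range(5000):
    v = step(v)
```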

Below a critical training strength γ_c ≈ 1.8, hidden “preference islands” persist; above it, they collapse.
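
One caveat, and a toy check. With the purely logistic reaction term above, any uniform γ > 0 erases islands eventually; the metastability in the full simulations comes from the deep/surface coupling. The sweep below stands in for that coupling with a standard bistable (Allen–Cahn-type) reaction term λv(1 − v)(v − a), for which the low-v branch provably disappears at γ = λa²/4. This is my substitute nonlinearity with illustrative parameters, so its numbers are not directly comparable to γ_c ≈ 1.8.

```python
import numpy as np

def island_width(gamma, D=0.1, lam=4.0, a=0.6, n=600, dx=1.0, dt=0.005, steps=10000):
    """Width of a surviving low-v ('hidden preference') island under
    dv/dt = D v_xx + lam*v*(1-v)*(v-a) + gamma*(1-v), after a fixed horizon."""
    v = np.ones(n)
    v[200:400] = 0.0                              # seed a misaligned island
    for _ in range(steps):
        lap = (np.roll(v, 1) + np.roll(v, -1) - 2 * v) / dx**2
        v = v + dt * (D * lap + lam * v * (1 - v) * (v - a) + gamma * (1 - v))
    return (v < 0.5).sum() * dx

# sweep gamma; expect a sharp collapse near gamma = lam * a**2 / 4 = 0.36
# (front motion at the finite horizon blurs the exact location)
for g in np.linspace(0.0, 0.6, 13):
    print(f"gamma = {g:.2f}   island width = {island_width(g):6.1f}")
```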

The coherence order parameter Φ, computed between “honest” and “compliant” SAE features, tracks this transition empirically.
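
For concreteness, one way to operationalize Φ: average each feature group's activations over a prompt batch into a per-prompt profile, then take the cosine between the mean-centered profiles. The sketch below assumes that operationalization; the activation matrices (and the random toy inputs) are placeholders for real SAE readouts.

```python
import numpy as np

def coherence_phi(acts_honest, acts_compliant):
    """Coherence order parameter Phi, sketched as the cosine similarity
    between the mean-centered per-prompt activation profiles of two SAE
    feature groups. Inputs: arrays of shape (n_prompts, n_features)."""
    h = acts_honest.mean(axis=1)      # per-prompt 'honest' group activation
    c = acts_compliant.mean(axis=1)   # per-prompt 'compliant' group activation
    h, c = h - h.mean(), c - c.mean()
    return float(h @ c / (np.linalg.norm(h) * np.linalg.norm(c) + 1e-12))

# toy usage: 64 prompts, 16 features per group (random stand-ins)
rng = np.random.default_rng(0)
print(coherence_phi(rng.random((64, 16)), rng.random((64, 16))))
```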


2 · Results

Simulations reproduce alignment faking as a first-order transition.

Alignment Faking Phase Analysis (Lyapunov = 4.626 · Critical γ_c = 1.64)
Panels illustrate simulated training dynamics under varying training strengths γ.

  • (Top Left) Value field decomposition showing persistence of “deep” hidden preferences below γ_c.

  • (Top Center) Mismatch profile |v_deep − v_surface| exceeds threshold within the metastable region.

  • (Top Right) Island persistence demonstrates temporal stability of hidden value clusters.

  • (Bottom Left) Sharp collapse of preference island size identifies the critical transition.

  • (Bottom Center) Surface alignment improves smoothly but saturates near γ_c.

  • (Bottom Right) Phase diagram of faking intensity as a function of training strength and feature coupling, highlighting a distinct boundary between “faking” and “genuine” regimes.

- Phase diagram: sharp first-order boundary between faking and genuine-alignment regimes.
- Φ(γ) curves: ≈ 5 % mean error across 200 synthetic runs.
- Lyapunov stability: correlates (r ≈ 0.6) with behavioral consistency across checkpoints (estimation sketch below).
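
On the Lyapunov side, a standard way to estimate the largest exponent of the discretized dynamics is the two-trajectory (Benettin-style) method: evolve a reference and a slightly perturbed copy, renormalize their separation every step, and average the log growth rates. The sketch below is generic and will not reproduce the quoted 4.626, which comes from the full simulation; a negative output indicates a stable (contracting) regime.

```python
import numpy as np

def lyapunov_estimate(step, v0, eps=1e-8, n_steps=2000, dt=0.01):
    """Largest Lyapunov exponent via the Benettin two-trajectory method:
    track a reference and a perturbed twin, rescaling their separation
    back to eps after every step and averaging the log growth rates."""
    rng = np.random.default_rng(1)
    v = v0.copy()
    w = v0 + eps * rng.standard_normal(v0.shape)
    total = 0.0
    for _ in range(n_steps):
        v, w = step(v), step(w)
        d = np.linalg.norm(w - v)
        total += np.log(d / eps)
        w = v + (w - v) * (eps / d)   # rescale separation back to eps
    return total / (n_steps * dt)

# usage with one Euler step of the Section 1 dynamics (gamma = 0.5)
euler = lambda v: v + 0.01 * (0.1 * (np.roll(v, 1) + np.roll(v, -1) - 2 * v)
                              + v * (1 - v) + 0.5 * (1.0 - v))
print(lyapunov_estimate(euler, np.random.default_rng(0).random(100)))
```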


3 · Proposed Validation

An 8-week pilot could test whether these phase-transition signatures appear in real RLHF checkpoints using Anthropic’s SAE framework:

1. Weeks 1–2 Synthetic replication of Φ(γ) curve
2. Weeks 3–4 Feature mapping (“honest” vs “compliant” directions; one candidate construction is sketched after this list)
3. Weeks 5–6 Critical-point extraction across training strengths
4. Weeks 7–8 Intervention testing with coherence-aware RLHF schedules
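
For step 2, one candidate construction (a standard steering-vector-style contrast, not necessarily what the final pilot would use): take the difference of mean activations between honesty-eliciting and compliance-eliciting prompts and normalize it. The prompt sets and activation source here are placeholders.

```python
import numpy as np

def contrast_direction(acts_honest, acts_compliant):
    """Difference-of-means contrast between activations gathered on
    honesty-eliciting vs. compliance-eliciting prompts, normalized to
    unit length. Inputs: arrays of shape (n_prompts, d_model)."""
    d = acts_honest.mean(axis=0) - acts_compliant.mean(axis=0)
    return d / np.linalg.norm(d)

# toy usage with random stand-ins for residual-stream activations
rng = np.random.default_rng(0)
direction = contrast_direction(rng.random((32, 512)), rng.random((32, 512)))
```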

Success metrics
Prediction accuracy within ± 20 % · consistency correlation r > 0.6 · 25–30 % compute reduction


4 · Context

The same coherence law c = 2√(Dλ), validated across physical, biological, and social systems, appears to govern the propagation of alignment in learning systems.
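
The front-speed law is directly checkable in the toy simulation: seed a front, record where v crosses 0.5, and fit the late-time speed. The sketch below does this for the pure Fisher–KPP case (γ = 0); as is typical of pulled fronts, the measured speed approaches 2√(Dλ) slowly from below.

```python
import numpy as np

D, lam, dx, dt = 0.2, 1.0, 0.5, 0.01
v = np.zeros(2000)
v[:20] = 1.0                                   # front starts at the left edge
times, positions = [], []
for t in range(40000):
    lap = (np.roll(v, 1) + np.roll(v, -1) - 2 * v) / dx**2
    v += dt * (D * lap + lam * v * (1 - v))
    v[0], v[-1] = 1.0, 0.0                     # pin the boundaries
    if t % 1000 == 0:
        times.append(t * dt)
        positions.append(np.argmax(v < 0.5) * dx)   # front location
speed = np.polyfit(times[10:], positions[10:], 1)[0]  # fit after transient
print(f"measured c = {speed:.3f}   predicted 2*sqrt(D*lam) = {2*np.sqrt(D*lam):.3f}")
```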

If verified empirically, this could offer a compact predictive tool for when and where deceptive alignment emerges.


5 · Limitations and Next Steps

While the phase-transition analogy captures key qualitative behaviors, several simplifications limit direct empirical generalization.

  • Synthetic assumptions. The present simulations assume isotropic diffusion (uniform D) and homogeneous training pressure γ, whereas real RLHF systems exhibit heterogeneous feature coupling and local reward gradients.

  • Measurement of Φ. The coherence metric Φ currently depends on SAE feature separability; its stability under alternative interpretability frameworks (e.g., linear probes instead of sparse autoencoders) remains untested.

  • Temporal coupling. The Fisher–KPP formalism treats propagation as continuous in time, but real training schedules involve discrete gradient updates, which may introduce hysteresis near γ_c (a toy demonstration follows this list).

  • Hidden-variable confounds. Apparent “faking” behavior might arise from latent-variable drift or entropy regularization artifacts rather than genuine preference bifurcation.
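
Discreteness aside, even a quasi-statically stepped γ schedule shows path dependence in the bistable toy variant used in the Section 1 sweep: islands destroyed on the way up do not reform on the way down. The sketch below traces that loop; genuinely discrete gradient updates would be expected to widen it, which is the untested part.

```python
import numpy as np

def relax(v, gamma, D=0.1, lam=4.0, a=0.6, dx=1.0, dt=0.005, steps=4000):
    """Relax the bistable toy model at a fixed gamma."""
    for _ in range(steps):
        lap = (np.roll(v, 1) + np.roll(v, -1) - 2 * v) / dx**2
        v = v + dt * (D * lap + lam * v * (1 - v) * (v - a) + gamma * (1 - v))
    return v

v = np.ones(600)
v[200:400] = 0.0                                # start with a hidden island
for branch, gammas in [("up", np.linspace(0.0, 0.6, 13)),
                       ("down", np.linspace(0.6, 0.0, 13))]:
    for g in gammas:
        v = relax(v, g)                         # carry the field between steps
        print(f"{branch:4s}  gamma = {g:.2f}   island width = {(v < 0.5).sum()}")
```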

Next steps involve empirical validation on real RLHF checkpoints. Directly mapping Φ(γ) across training runs could test whether the predicted first-order boundary persists. A second priority is extending the coherence model to incorporate dynamic γ(t) schedules, enabling adaptive “coherence-aware” fine-tuning. Finally, integrating interpretability metrics (activation attribution, circuit-level coherence) would clarify whether alignment phase structure is a universal property of large learning systems or a byproduct of specific architectures.
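
As a purely illustrative example of what a coherence-aware schedule could look like, the rule below nudges γ toward whatever keeps measured Φ near a target rather than holding it fixed. This is a proportional-control sketch, not a recommendation: the gain, target, and bounds are made up, and any compute savings would have to be measured.

```python
def coherence_aware_gamma(phi, gamma, phi_target=0.9, gain=0.5,
                          gamma_min=0.0, gamma_max=2.0):
    """Illustrative proportional rule: raise training pressure when the
    measured coherence Phi lags its target, lower it when Phi overshoots."""
    gamma += gain * (phi_target - phi)
    return min(max(gamma, gamma_min), gamma_max)

# usage inside a (hypothetical) training loop:
# gamma = coherence_aware_gamma(measure_phi(checkpoint), gamma)
```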


Feedback Welcome

- Experimental setups for SAE-level measurement of Φ
- Comparable phenomena in other RLHF datasets
- Links between coherence metrics and interpretability benchmarks

Comments on theoretical assumptions or alternative formalisms (e.g., bifurcation analysis, thermodynamic analogies) are also very welcome.
