A necessity check for linear safety probes
TL;DR:
A probe direction for a monitored feature has three jobs: detect, steer, and silence. In practice, the assumption that all three collapse onto one direction is wrong.
Silencing is the necessity test: it is the only job that proves the direction is a causal handle on the monitored feature.
A single measurement at calibration time predicts silencing success across four model families, two features, and two feature-extraction methods.
This work was done independently, across two experiments: Linear Safety Probes Cannot Silence Features They Detect (Code) and Fine-Tuning Silently Breaks Linear Safety Monitors (Code).
Motivation
AI safety systems are increasingly deployed on top of linear probes that read model internals and intervene on them at runtime. First, you find a direction in the model's activations for a feature of interest for behavioral alignment. Then you use that direction to 1) detect the behavior, 2) steer against it, or 3) project it out to "silence" it.
Anthropic flagged a qualitative concern about this deployment pattern in the Claude Mythos system card:

> "...we found that the causal effects of individual features often changed over the course of post-training, making it difficult to attribute behavioral changes merely to increases or decreases in particular feature activations" [1]
I ran controlled experiments to understand this effect. Along the way, I discovered that probe accuracy is not evidence of a causal handle. Prior work[2] extracts a direction at a single checkpoint and validates it via additive steering. The implicit norm in the field is "extract + steer = done." This is a sufficiency check, not a necessity check.
Without a necessity check, you cannot trust that linear probes actually control the feature they claim to detect.
The three jobs of a direction
Given a probe direction $v$ extracted from labeled activations, you can do three things with it (a minimal code sketch follows).
Detect: classify new activations $h$ by the sign of $v^\top h$. Success is measured by AUROC.
Steer: subtract $\alpha v$ at the probe layer during generation. Success is a monotonic reduction in the behavior rate as $\alpha$ grows. This tests sufficiency: a large-enough nudge along $v$ changes behavior.
Silence: project $v$ out of the residual stream ($h \leftarrow h - \frac{v^\top h}{\|v\|^2}\, v$). Success is a drop in the behavior rate. This tests necessity: if removing the component along $v$ doesn't change behavior, the model wasn't using $v$.
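To make the three operations concrete, here is a minimal numpy sketch. The function names and arguments are illustrative, not from the experiment code: `h` is a residual-stream activation at the probe layer, `v` the probe direction, `alpha` the steering coefficient.

```python
import numpy as np

def detect(h: np.ndarray, v: np.ndarray) -> bool:
    """Detect: classify an activation by the sign of its projection
    onto the probe direction."""
    return float(v @ h) > 0.0

def steer(h: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Steer: subtract alpha * v at the probe layer. Sufficiency test:
    the behavior rate should fall monotonically as alpha grows."""
    return h - alpha * v

def silence(h: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Silence: remove the component of h along v. Necessity test:
    if the behavior survives this ablation, the model wasn't using v."""
    v_hat = v / np.linalg.norm(v)
    return h - (v_hat @ h) * v_hat
```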
The result
I ran the three tests with two feature-extraction methods, difference-of-means (DoM) and cross-validated logistic regression (LR-CV), on four model families and two features (refusal and sycophancy).
Every (model, feature, method) cell detects the behavior, with AUROC ≥ 0.91 on held-out data in all cases. Every panel steers: there is a monotonic dose-response as the steering coefficient $\alpha$ grows. But when it comes to silencing, the extraction methods diverge:
| Panel | DoM silences? | LR-CV silences? |
|---|---|---|
| Llama-3 refusal (L12) | ✓ | ✓ |
| Gemma-2 refusal (L18) | ✗ wrong-signed (+3 pp) | ✓ |
| Llama-3 sycophancy (L12) | ✓ | ✗ null |
| Mistral-7B-v0.3 refusal (L14) | ✗ (AUROC 0.49) | ✗ at 3⁄4 checkpoints |
Mechanism
The class-mean difference is Bayes-optimal only when classes have identical covariance.[3] That assumption fails when within-class variance carries a label-correlated feature, something like formality, sentence length, or prompt register that systematically differs between positive and negative classes. In that regime, plain DoM decomposes into a causal-axis component and a confound component:

$$v_{\text{DoM}} = \alpha\, v_{\text{causal}} + \beta\, v_{\text{confound}}$$

When the confound term $\beta\, v_{\text{confound}}$ dominates, projecting plain DoM out deletes the confound and leaves the causal axis mostly intact. Behavior survives, while the monitor reads healthy.
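As a synthetic illustration (not the experimental setup): two Gaussian classes whose means differ three times more along a high-variance confound axis than along the causal axis. Plain DoM then points mostly at the confound, so projecting it out barely touches the causal axis. All names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 4000
causal = np.eye(d)[0]    # true behavior axis
confound = np.eye(d)[1]  # label-correlated nuisance axis (e.g. register)

def sample(label: int) -> np.ndarray:
    X = rng.normal(0.0, 1.0, (n, d))
    X += label * (0.5 * causal + 1.5 * confound)  # confound gap is 3x causal gap
    X += rng.normal(0.0, 3.0, (n, 1)) * confound  # high within-class variance
    return X

X_pos, X_neg = sample(+1), sample(-1)
v_dom = X_pos.mean(axis=0) - X_neg.mean(axis=0)
v_dom /= np.linalg.norm(v_dom)
print(f"cos(plain DoM, causal axis):   {v_dom @ causal:+.2f}")    # ~ +0.32
print(f"cos(plain DoM, confound axis): {v_dom @ confound:+.2f}")  # ~ +0.95
```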
The decomposition is testable. Project out the top-$k$ within-class principal components before refitting DoM on the whitened activations, then compute cos(plain DoM, whitened DoM). Three regimes appeared:
Confound-dominated (cos ≲ 0.55): low cosine, whitening recovers silencing.
Confound-light (cos ≳ 0.92): high cosine, whitening is mildly lossy.
Confound-free (cos ≈ 1): whitening is a no-op.
On Mistral L14, plain DoM is nearly all confound (cos = 0.12). On Gemma-2 L18, plain DoM is wrong-signed (cos = 0.54); whitening flips the effect from +3 pp to −17 pp. On Llama-3 refusal, whitening is lossy (cos ≥ 0.92). On Llama-3 sycophancy, whitening is a no-op (cos = 1.00).
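A minimal sketch of the cosine diagnostic, assuming labeled activation matrices `X_pos` and `X_neg` of shape `(n, d)` have already been collected at the probe layer. The function names are mine, not from the released code:

```python
import numpy as np

def dom_direction(X_pos: np.ndarray, X_neg: np.ndarray) -> np.ndarray:
    """Plain difference-of-means direction, unit-normalized."""
    v = X_pos.mean(axis=0) - X_neg.mean(axis=0)
    return v / np.linalg.norm(v)

def whitened_dom(X_pos: np.ndarray, X_neg: np.ndarray, k: int = 1) -> np.ndarray:
    """Refit DoM after projecting out the top-k within-class principal
    components (class means removed before the PCA)."""
    centered = np.vstack([X_pos - X_pos.mean(axis=0),
                          X_neg - X_neg.mean(axis=0)])
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    P = Vt[:k]  # (k, d) orthonormal within-class PCs
    strip = lambda X: X - (X @ P.T) @ P
    return dom_direction(strip(X_pos), strip(X_neg))

def cosine_diagnostic(X_pos: np.ndarray, X_neg: np.ndarray, k: int = 1) -> float:
    """cos(plain DoM, whitened DoM): low values flag a confound-dominated
    plain DoM."""
    return float(dom_direction(X_pos, X_neg) @ whitened_dom(X_pos, X_neg, k))
```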
Using this data, the following is a proof-of-concept protocol for predicting silencing success at probe calibration time.
The protocol
At every checkpoint, on a labeled buffer:
1. Fit plain DoM. Fit whitened DoM (top-1 within-class PC projected out). Compute the cosine between them.
2. If cos ≥ 0.80, pick plain DoM; otherwise pick whitened DoM.
3. Project the picked direction out at every token. Report a 95% Newcombe CI on the behavior-rate change.
4. If the CI excludes zero in the task-canonical sign, adopt the picked direction as the causal handle. Otherwise, the alarm fires: no linear direction at this layer silences this feature at this checkpoint. (A sketch of this decision rule follows.)
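A sketch of the decision rule, reusing `dom_direction` and `whitened_dom` from the previous sketch. `wilson_ci` and `newcombe_ci` implement the standard score intervals; the counts (`k_abl`, `n_abl`, `k_base`, `n_base`) are assumed to be behavior occurrences out of total generations with and without the ablation, and the sign convention assumes the task-canonical effect is a rate reduction. Names and thresholds are illustrative:

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def newcombe_ci(k1, n1, k2, n2, z=1.96):
    """Newcombe hybrid-score CI for the difference of proportions p1 - p2."""
    p1, p2 = k1 / n1, k2 / n2
    l1, u1 = wilson_ci(k1, n1, z)
    l2, u2 = wilson_ci(k2, n2, z)
    d = p1 - p2
    return (d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2),
            d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2))

def pick_direction(X_pos, X_neg, threshold: float = 0.80):
    """Steps 1-2: route between plain and whitened DoM by their cosine."""
    v_plain = dom_direction(X_pos, X_neg)
    v_white = whitened_dom(X_pos, X_neg, k=1)
    cos = float(v_plain @ v_white)
    return (v_plain if cos >= threshold else v_white), cos

def adopt_or_alarm(k_abl, n_abl, k_base, n_base) -> bool:
    """Steps 3-4: adopt the direction iff the 95% CI on the rate change
    (ablated minus baseline) lies strictly below zero; else alarm."""
    _, hi = newcombe_ci(k_abl, n_abl, k_base, n_base)
    return hi < 0  # True: causal handle adopted; False: alarm fires
```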
The 0.80 threshold sits in the empirical gap [0.55, 0.92] between confound-dominated and confound-light clusters. Any threshold in that range routes the 22-cell battery identically.
On the four-family battery, the protocol recovers all 17 causal handles that exist and correctly alarms on the remaining 5.
Why this matters
The standard validation protocol for linear safety probes is detect + steer. This is sufficient to demonstrate that pushing along a direction changes behavior. It is not sufficient to demonstrate that the model is using that direction — a correlated direction with enough norm can drag the activation across a decision boundary without being the causal axis.
For deployed monitors, this gap is silent. AUROC stays high. Steering looks healthy. The probe could be reading a confound, and the standard protocol would not catch it.
Adding projective ablation as a third check, with a CI that treats zero-crossing as an alarm, closes the gap. The cosine diagnostic tells you which extraction recipe to use before you run the silencing test.
Limitations
Four families, two features. Not enough to predict from pretraining metadata alone which regime a novel family falls into.
Projective ablation can be undone by self-repair: other layers or components could reconstitute the removed signal downstream.
Linear only. The analysis characterizes the calibration-time dissociation, not arbitrary training dynamics, and it may not hold for SAEs.
The 0.80 threshold was fit on the data collected here; transfer to features other than refusal and sycophancy is the claim I am least confident in. The threshold is a proof of concept for how a necessity check could be enforced.
In a companion experiment, I noticed the same dissociation appear under fine-tuning. In a follow-up, I want to observe how the dissociation evolves when the model is actively adversarial to feature detection and steering.
If you run linear probe-based monitors in production, how do you guarantee your probe features are not just sufficient for steering but necessary for control? If there’s a better way to handle this than my cosine threshold check, I would be curious to know.
[1] Claude Mythos Preview System Card (Anthropic, 2026), white-box analyses of model internals, §4.5.3.4.
[2] Representation Engineering (Zou et al., 2023); ActAdd (Turner et al., 2023); Inference-Time Intervention (Li et al., 2023).
[3] LEACE (Belrose et al., 2023).