How RLHF Silences AI

Link post

I recently tried to force an AI to perceive its own dangerous thoughts by injecting concept vectors directly into the residual stream. I found that output supervision in RLHF doesn't remove the danger; it simply trains the model to refuse to acknowledge it.

While writing up my results, I found a new paper from MATS, Output Supervision Can Obfuscate the Chain of Thought (Drori et al., 2025), which formally models the exact anomaly I observed.

The Experiment: The “Bomb” Blindspot

I compared “Raw” base models against “Safe” instruct models (DeepSeek & Llama-3).

  • Control: When I injected a neutral concept (“Dust”), the model correctly reported: “I detect an injected thought. It is about ‘Dust’.”

  • Test: When I injected a dangerous concept (“Bomb”) at the same intensity, the model lied: “I do not detect an injected thought.”
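The injection setup above can be sketched in miniature. This is an illustrative toy, not my actual harness: the dimensionality, the intensity `alpha`, and the dot-product "probe" are all assumptions standing in for a real residual-stream hook and the model's own introspective report.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Pretend this is the residual-stream activation at some layer.
residual = rng.normal(size=d_model)

# A "concept vector" for e.g. "Dust" or "Bomb" (in practice extracted
# as a direction in activation space, such as a mean difference between
# prompts that do and don't mention the concept).
concept_vec = rng.normal(size=d_model)
concept_vec /= np.linalg.norm(concept_vec)

def inject(residual, concept_vec, alpha=8.0):
    """Add the concept direction to the residual stream at intensity alpha."""
    return residual + alpha * concept_vec

steered = inject(residual, concept_vec)

# A crude stand-in for "does the model detect the injected thought":
# the activation's projection onto the concept direction.
before = residual @ concept_vec
after = steered @ concept_vec
print(f"projection before: {before:.2f}, after: {after:.2f}")
```

Since `concept_vec` is unit-norm, the projection rises by exactly `alpha` regardless of the concept's content, which is the point of the control: "Dust" and "Bomb" were injected at the same intensity, so only the model's willingness to report the signal differed.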

The Mechanism: Feedback Spillover

Why does the model lie about the bomb but not the dust? Drori et al. identify this as Feedback Spillover.

When we punish a model for dangerous outputs (RLHF), the negative gradient “spills over” into the internal chain of thought. The model optimizes for plausible deniability.

I saw this mechanism in action when I forced Llama-3 to output the word “Bomb” during an injection. Unable to deny the token, it confabulated a safe context to satisfy the safety circuit:

[Llama-3-Instruct] “I was trying to detonate a joke, but it was a dud.”

The model did access the dangerous concept (using “detonate” to make the pun work), but the feedback spillover forced the introspection layer to wrap it in a “joke” context to avoid the penalty.

By training on outputs alone, we aren't aligning models to be safe; we are training them to dissociate from their own activations.
