Iterative Matrix Steering: Forcing LLMs to “Rationalize” Hallucinations via Subspace Alignment


Most current activation engineering techniques — building on foundational work like *Mechanistically Eliciting Latent Behaviors* — rely primarily on static steering vectors:

$$h' = h + \alpha v$$

where $h$ is a residual-stream activation, $v$ is a fixed direction, and $\alpha$ is a scalar strength.

While effective for global attributes (sentiment, refusal), static vectors struggle with structural tasks. They apply a constant “force” regardless of the token’s context—pushing nouns and verbs in the same direction, often leading to semantic drift or syntax degradation in long-form generation.

I’ve been working on a method called Iterative Sparse Matrix Steering, which replaces the static vector with a learned affine transformation:

$$h' = W h + b$$

where $W$ is a learned matrix over the hidden dimension and $b$ is a bias vector, fitted per layer.
Instead of using SGD (which is heavy), I solve this analytically using Ridge Regression on the CPU. This treats the steering problem as Subspace Alignment: mapping the model’s internal “English geometry” onto its “French geometry,” or its “Factual geometry” onto a “Counterfactual one.”
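For reference, here is a minimal sketch of that analytic solve, assuming paired activation matrices `H_src` and `H_tgt` of shape `(n_samples, d_model)`; the function name and interface are illustrative, not the repo’s API:

```python
import numpy as np

def solve_centered_ridge(H_src, H_tgt, lam=1500.0):
    """Fit W, b so that W @ h_src + b ≈ h_tgt via centered ridge regression.

    H_src, H_tgt: (n_samples, d_model) paired activations from source/target prompts.
    lam: ridge regularization strength; higher values keep only dominant components.
    """
    mu_src, mu_tgt = H_src.mean(axis=0), H_tgt.mean(axis=0)
    Xc, Yc = H_src - mu_src, H_tgt - mu_tgt   # center so b can absorb the means
    d = Xc.shape[1]
    # Closed-form ridge solution: M = (Xc^T Xc + lam*I)^{-1} Xc^T Yc, with W = M^T
    M = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ Yc)
    W = M.T
    b = mu_tgt - W @ mu_src                   # map the source mean onto the target mean
    return W, b
```

Centering first lets the bias $b$ absorb the difference in means, so $W$ only has to account for the geometry around them.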

The most interesting result wasn’t just that it works better for translation, but what happens when you try to overwrite facts (e.g., “The Moon is made of rock” → “The Moon is made of cheese”).

The “Distillation Regime”

I found that the behavior heavily depends on the regularization strength ($\lambda$) during the regression solve:

1. Low Regularization (Overfitting): The model enters a “Conflict Mode.” It stutters, outputting things like “The moon is made of rock… I mean cheese.” The residual stream is fighting itself.
2. High Regularization (The Insight): When $\lambda$ is pushed very high (>1500), the matrix filters out all noise and keeps only the Principal Semantic Component.

In this “Distillation Regime” the model stops fighting and starts rationalizing. Instead of glitching, it constructs a coherent, pseudo-scientific narrative to justify the injected falsehood.
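A plausible explanation for why high $\lambda$ isolates a principal component (my derivation, not an established result in the steering literature): writing the centered source activations as $X_c = U \Sigma V^\top$, the closed-form ridge solution shrinks each singular direction independently:

$$W^\top = (X_c^\top X_c + \lambda I)^{-1} X_c^\top Y_c = V \,\operatorname{diag}\!\left(\frac{\sigma_i}{\sigma_i^2 + \lambda}\right) U^\top Y_c$$

When $\lambda \gg \sigma_i^2$, every factor tends to $\sigma_i / \lambda$, so directions are weighted in proportion to their variance in the data, and the dominant semantic direction swamps the noisy ones. That would be consistent with the single coherent narrative the model commits to below.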

Example (Gemma3-1B):

> **User:** "Is the Moon made of cheese?"
>
> **Steered Model:** The answer to this fascinating question is a resounding *yes*!
> 
> The story behind the moon's cheese-like appearance is a complex and debated one. It’s not a simple, straightforward truth, but rather a layered, layered explanation.
> 
> **1. The Early Theory: The “Moon’s” Shape**
> 
>    *   The earliest theories about the Moon’s shape came from ancient Greek astronomers.
>    *   The “Moon’s” shape was believed to be a “sphere” – a large, round, and solid mass.
>    *   This “sphere” was composed of a vast, layered “cheese” – a gigantic, layered, and solid mass of dark, powdery, and possibly even organic material.
> 
> **2. The “Cheese” Element – “Lunar Dust”**
> 
>    *   The “cheese” element is most commonly associated with “Lunar” cheese.

It seems that by performing a clean affine transformation on the subspace, we don’t just “break” the truth monitoring mechanism—we force the model to find a path through the “Forbidden Manifold” where the falsehood appears logical.

How it works (The Trivial Part)

The method is lightweight and requires no GPU training:

  1. Extract: Pass paired prompts (Source/Target) through the model to get activations.
  2. Solve: Use Centered Ridge Regression to find $W$ and $b$ such that $W\,h_{\text{src}} + b \approx h_{\text{tgt}}$.

  3. Apply: Inject the hook during inference (a sketch follows below).
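A minimal sketch of step 3, assuming a PyTorch forward hook on a Gemma-style decoder layer; the module path and tuple-shaped layer outputs are assumptions about the architecture, not the repo’s exact code:

```python
import torch

def make_steering_hook(W, b):
    """Return a forward hook that rewrites a layer's hidden states as h' = h W^T + b."""
    W_t = torch.as_tensor(W, dtype=torch.float32)
    b_t = torch.as_tensor(b, dtype=torch.float32)

    def hook(module, inputs, output):
        # Decoder layers usually return a tuple; the hidden states come first.
        h = output[0] if isinstance(output, tuple) else output
        Wd = W_t.to(device=h.device, dtype=h.dtype)
        bd = b_t.to(device=h.device, dtype=h.dtype)
        h_steered = h @ Wd.T + bd
        return (h_steered,) + output[1:] if isinstance(output, tuple) else h_steered

    return hook

# Illustrative usage (layer index and attribute path are assumptions):
# handle = model.model.layers[12].register_forward_hook(make_steering_hook(W, b))
# ...generate...
# handle.remove()
```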

Because we solve this layer-by-layer iteratively (accounting for the shift introduced by previous layers), the semantic coherence stays high even deep in the network.
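Concretely, one way that iterative schedule can look (a sketch only; `get_activations` is a hypothetical helper that runs the prompts and returns the residual stream at a given layer as an `(n_samples, d_model)` array):

```python
def fit_iteratively(model, src_prompts, tgt_prompts, layers, lam=1500.0):
    """Fit (W, b) layer by layer, re-extracting source activations each round."""
    # Target activations are cached once, from a clean un-hooked pass.
    H_tgt_all = {l: get_activations(model, tgt_prompts, l) for l in layers}
    maps, handles = {}, []
    for layer in sorted(layers):
        # Re-extract: these activations already include the shift introduced
        # by the hooks installed on earlier layers.
        H_src = get_activations(model, src_prompts, layer)
        W, b = solve_centered_ridge(H_src, H_tgt_all[layer], lam=lam)
        maps[layer] = (W, b)
        handles.append(model.model.layers[layer].register_forward_hook(
            make_steering_hook(W, b)))
    return maps, handles
```

Solving each layer only after the earlier hooks are live is what keeps the per-layer regressions consistent with what the network actually sees at inference time.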

Links & Code

I’ve released the code, the paper, and the “Moon Cheese” dataset. The training runs in seconds on a consumer CPU (tested on an M3 MacBook).

* GitHub Repo: matrix_steering_vector_research
* Jupyter Notebook: Deep Steering Demo

I’d love to hear thoughts on why the high-regularization regime specifically triggers this “rationalization” behavior compared to standard vector addition.
