Trained steering vectors may work as activation oracles

Link post

Inspired by @Eriskii’s recent finding that trained steering vectors can teach a base model to act as an assistant, I replaced the Activation Oracle (AO) paper’s trained LoRA with a far smaller set of per-layer trained steering vectors. The eval results were surprisingly good, far better than I anticipated from the tiny parameter count.

  1. Trained per-layer steering vectors on Qwen3-8B as an activation oracle

  2. Standard activation injection mechanism with “ ?” placeholders

  3. Collected activation ranges (full sequence vs. assistant start-of-turn) matching the AO paper

  4. 36 layers × (post-attn + post-MLP) × 4096 dim = ≈295K trainable params vs. ≈175M AO LoRA

    1. ≈1/600th of the LoRA AO’s params, ≈0.004% of Qwen3-8B’s param count

  5. Data mix matching the AO paper (≈1M examples: ≈60% context prediction, 33% binary classification, 6% SPQA)

    1. Filtered out ≈5K long SPQA examples with >96 tokens in the answer or input to reduce peak VRAM requirements

  6. Close to standard Activation Oracle Taboo accuracy, with a significant deficit on PersonaQA

  7. The vector approach seems more sensitive to the specific text the activations were collected from
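The setup above can be sketched in a few lines. This is a minimal illustration, not the actual codebase: the helper name `oracle_hook`, the mask/injection arguments, and the exact injection details are my own assumptions. The key point it shows is the parameterization — the entire trainable state is one vector per (layer, hook point), added to the residual stream, while collected activations are written in at the placeholder positions.

```python
import numpy as np

N_LAYERS, D_MODEL = 36, 4096        # Qwen3-8B sizes quoted in the post
HOOKS = ("post_attn", "post_mlp")   # two hook points per layer

# The entire trainable state: one steering vector per (layer, hook point),
# initialized to zero here purely for illustration.
steering = {(l, h): np.zeros(D_MODEL, dtype=np.float32)
            for l in range(N_LAYERS) for h in HOOKS}

n_params = sum(v.size for v in steering.values())
print(n_params)  # 36 * 2 * 4096 = 294912, the ~295K figure above

def oracle_hook(hidden, layer, hook, placeholder_mask=None, injected=None):
    """Illustrative residual-stream edit at one (layer, hook) point.

    hidden:           (seq_len, d_model) residual-stream activations
    placeholder_mask: (seq_len,) bool, True at " ?" placeholder positions
    injected:         (n_placeholders, d_model) activations collected
                      from the subject model's forward pass
    """
    out = hidden.copy()
    if injected is not None:
        # AO-style injection: overwrite the placeholder positions
        # with the collected activations.
        out[placeholder_mask] = injected
    # The only trained component: add this point's steering vector.
    return out + steering[(layer, hook)]
```

Compared with the ≈175M-parameter LoRA, this parameterization has no weight matrices at all, which is why the ablation in which the vectors are removed is such a clean test of what they contribute.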

Preliminary Result Plots

Taboo accuracy, single-token probe at start-of-turn

PersonaQA accuracy on the full-sequence probe. Vector AO trails the LoRA AO

The PersonaQA Y/N figures in the charts here are not directly comparable to the AO paper’s Figure 18 baseline of 69% for Y/N; I believe this is due to a bug in the Activation Oracles codebase that makes the N cases inadvertently non-deterministic.

PersonaQA open-ended accuracy with Scion-optimizer vectors, default vs. primed activation collection. Priming nearly doubles Vector AO accuracy (6.8% → 11.5%).

Scion Vector AO ablations on Qwen3-8B. Removing either the trained steering vectors or the activation injection collapses accuracy to chance. Surprisingly, the AO paper’s LoRA checkpoint achieves 60% on PersonaQA Y/N with no activations injected.



Further details on the experiments, source code, checkpoints, and a series of sample prompts for activation collection and probing are provided in the full post.
