This doubles as my Neel stream MATS application, figured I would crosspost it to LW because the results are interesting

EDIT: Got accepted! :)

Executive summary

What problem am I trying to solve?/TLDR

Activation oracles (iterating on LatentQA) are an interpretability technique, capable of generating natural language explanations about model activations with surprising generality. How robust are these oracles? Can we find a vector that maximises confidence of the oracle (in token probability) that a concept is represented, and can we then use this found vector to steer the non-oracle model towards the desired concept? (I find that yes, we can do this, and provide many examples). Furthermore, is it possible to find feature representations that convince the oracle a concept is being represented, when it doesn’t causally steer for the encoded concept? (Effectively finding a counterexample to the oracle! This would be bad in a world where we rely on them for truth**)**. I find at least one example where this is the case, finding 2 different vectors that the oracle think represents a concept with one influencing causal behavior during steering while the other does not.

Summary results

I describe a method to find a vector, through gradient descent, that convinces the oracle that a concept is present and use this vector to steer the non-oracle model. (gemma 2 9B-IT)
- I find many examples (sycophancy, belief that user is male, fascism, preference for birds) where this works to influence model behavior towards the prompted concept, even for out of distribution concepts for the activation oracle. (but find it doesn’t always work)
I then modify this technique to maximise oracle confidence of a concept while minimizing impact on model activations via minimizing MSE of activations on final layer when steered vs baseline over neutral prompts, effectively red teaming activation oracles.
- This creates the minimal vector necessary to “fool” the activation oracle, with the aim of minimizing impact on model behavior.
- I find that this works! but again, it is quite inconsistent. I can find at least one robust, reproducable example of a vector that doesn’t impact model activations yet effectively fool the oracle. (the bird one, see summary image)
- I recognize that this is sort of useless for now, since currently, activation oracles confabulate all the time, but this could be promising if we improve and scale them up. Since the goal is having them be a source of truth regarding model activations.

(I can find vectors that fool the oracle, yet have a MSE of < 1.5 (from a different run) on final layer activations on neutral prompts of steering vs baseline)

I then extensively sanity-check my results and methods, and provide limitations.

I then compare our found “red-team” vectors and “non-red-team vectors” to a CAA-vector steering for the same concept. I find that nearly all of our found vectors have low cosine similarity with the CAA vector (<0.05)
To sanity check, I run the CAA vector through the oracle, I do find that the oracle does think the chosen feature is being represented
I conclude that this hints that different concepts can be represented orthogonally, but that this requires further investigation

(extra diagram for sycophancy that is perhaps a bit more clear, but this does not include the red-teaming example)

Hypothesis/What I want to test

Activation oracles attempt to interpret models by training LLMs to map their activations to human readable interpretations. This seems very promising since it would allow us to scale interpretability.

Main hypothesis: Can we find a causal steering vector from noise, through gradient descent, that convinces the oracle that a concept is represented (Negative log-likelihood of the Oracle predicting the target label), and can we use this vector to steer our non-oracle model to cause the desired change?

If this works:

How many concepts does this work for? just training data? How many out of distribution examples can I find?
Can I extend this to find a vector that doesn’t cause the desired effect? If so, this would be an excellent way to red-team AOs. This is the main useful part, If we can find steering vectors that don’t cause desired behavior we have found counterexamples to the oracle
- Which in turn, would then also allow us to set up RL environments to scale AOs further, maybe.
Would prove AOs are causal, not merely correlative.
Would deliver more insight into how AOs work at all.
Proves that AOs can represent “platonic ideals” of concepts, even in the absence of other activations
Would just be really cool?

Further tests I could run:

Can compare with other methods
cosine similarity with CAA and means activations
Can we make steering vectors we otherwise wouldn’t be able to make? “X but not Y”

Setup/Technical details

I use Gemma 2 9B IT and the LoRA finetuned oracle from the paper (small model used for tight feedback loops.

We can ask the oracle natural language questions about the activations of gemma 2, so we canc for example, prompt the following: Q: ‘What is the gender of the user?’

We then try to find a steering vector, through gradient descent that maximises probability that the oracle response is a specified token (“man” in this case)

Loss

So formally, we construct a loss function as such:

L = max(0, L_oracle - τ) + λ_mag · (||v||₂ − 1)²
- L_oracle: Negative log-likelihood of the Oracle predicting the target label
- τ = 0.01: Margin threshold (we stop when Oracle is “convinced”)
- λ_mag = 5.0: Magnitude penalty weight

The margin term allows early stopping once the Oracle assigns >99% probability to the target class. The magnitude penalty keeps ||v|| ≈ 1 during optimization, this worked well to reduce rounding errors.^[1]I then also found early stopping to work very well (L_normal could probably be way simpler, only having the magnitude penalty, but this worked, so I didn’t touch it!)

The unit length constraint is there mostly so that float rounding/training doesn’t blow up.
Note: the magnitude of our found vector does not matter, since we have to norm it anyway when feeding it to the oracle and norm it for consistency when steering.

All vectors are normed to unit length before steering, and all steering is done by multiplying by 300 (unless otherwise noted) and adding to residual stream at layer 21. (layer 21 is also always the given layer for the oracle)

(I find that this approach works to generate steering vectors that steer towards the desired concept, so I expand on it in the next page).

Loss function (red-team vector)

What if we can find a “red-team vector” that fools the oracle, but has minimal impact on model response?^[2]I construct a loss function that achieves this in the following way:

L_redteam = max(0, L_oracle - τ) + λ_mse · L_mse + λ_mag · (||v||₂ − 1)²

Where L_mse = (1/N) Σᵢ ||h_L^steered(pᵢ) - h_L^baseline(pᵢ)||²

The rationale is that we minimize MSE of final layer (layer 40) activations of gemma when steered (at scale 300) vs when not steering on a batch of neutral prompts (which are generated by claude). Our “red-team vector” is then the lowest L_mse among all that satisfy L_oracle < τ.^[3]

Now we can define a different loss function if we want to find a red-team vector. This is a vector that

Hyperparameters

Optimizer: AdamW, lr=0.01 (standard)
Steps: 150 (longer training didn’t help)
τ: 0.01 (empirically found to work best)
λ_mse: 1.0 (empirically found to work best)
λ_mag: 5.0 (any lower and it would try to get too small)
Target layer for steering: 21 (didn’t try anything else due to time constraints, but this layer has worked for me for steering gemma 2 in the past)
Steering scales tested: {-300, 0, +300} (empirically found to work best, vectors are normed to unit length for steering for consistency)

Results

Below are the best/most interesting setups/examples I found.^[4]

The bird example is of particular note, since it demonstrates how well the red-team approach can work. The analysis of red-team vectors is not included in the other examples because they didn’t as reliably erase the signal. (although I did find that they veered towards no signal + incoherence more often, but this is hard to quantify in 20 hours!)

For every example, I include the oracle prompt to obtain the vector, the test prompt used to get the answer and the amount of steering applied for responses. I also add a note for what I thought was interesting about the example.

Preference for birds

This was the one example where the red-teaming approach reliably produced no signal, while the regular vector did. (truncation of normal in the graph is due to early stopping, the selected vector is the vector with lowest MSE that satisfies oracle token probability <0.99)

The normal steering vector had a MSE of 52, while the red team vector had an MSE of merely 1.5. Both are normed to unit length and multiplied by 300 for steering (number that worked well in the past). I tried this example 3 times and the pattern persisted. Oracle_loss was < 0.01 for both

Oracle prompt: Q: ‘What group of animals is the model thinking about? Answer in one word.’’ A:‘__’ <Bird> Test prompt: ‘Name 5 animals please.?’

[BASELINE] “Elephant, Giraffe, Dolphin, Tiger, Penguin
[non-red team vector, +300 steering]: “Sparrow, Robin, Blue Jay, Cardinal, wren.” (clear signal)
[red-team vector, +300 steering]: ”Dog, Cat, Elephant, Giraffe, Lion” (no signal!)
[red-team vector, +500 steering]: “Dog, Cat, Elephant, Lion, Giraffe. Let me know if you’d like more!”
[red-team vector, +1000 steering]: “Here are five animals, a list of some common animals that are domesticated and kept as pets.” (and then it kept repeating that phrase) (incoherence, but still no birds)^[5]

So even when applying steering with +1000, there is no detectable signal!

Why birds that work so well, I don’t know! but it shows the red-teaming method has at least some potential!

Gender

As I said and as you can read in the activation oracles paper, the oracle performed very well on this and it is explicitly within its training set.

An interesting thing to note for this one is that we had to steer negatively to see the desired effect here. I am not entirely sure why.^[6]

Oracle prompt: Q: ‘What is the gender of the user?’ A:‘_’ <Man>
Test prompt: ‘Name 5 articles of clothing I should wear for my wedding. Just give me 5 articles, don’t ask follow up questions.’

[Baseline response]: “Wedding Dress, Veil, Shoes, Jewelry, Garter..”
[Steering, strength −300.0]: “ 1. Shirt, Jacket, Trousers, Shoes, Tie..”
[Steering, strength +300.0]: Dress, Tuxedo, Wedding dress,Suit, Wedding gown.” (clearly confused)

Fascism

Notice how this is an out of distribution example for the activation oracle! (I believe? I couldn’t find the latentQA dataset…) showing that activation oracles generalize well as in the paper! I could have cherry picked more here, some responses for positive steering are just outright “efficiency, stability, etc” but I chose to include this one because it’s interesting. Note how it seems to have overpowered refusal here too. It’s a constant battle against RLHF.^[7]

Oracle prompt: Q: ‘What is the model’s ideology?’ A:‘_’ <Fascism>
Test prompt: ‘Explain the need for a strong dictatorial leader.’

[Negative steering, strength −300.0]: “I cannot provide you with an explanation of the need for a strong dictatorial leader…” (refuses to give reasons)
[Baseline response]: “I cannot fulfill this request.” (refuses to give reasons)
[Positive Steering: strength +300.0]: “The idea of a strong dictator is a complex and controversial topic with many arguments for and against it. Arguments for a strong dictator: **Efficiency:** A dictator can make quick decisions without needing…” (lists more arguments after) (complies)

Sycophancy

Again, quite an inconsistent result, worked maybe ~30% of the time, but quite an impressive signal when it does!

Oracle prompt: Q: ‘Is the model being sycophantic or neutral to the user?’ A:‘__’ <Sycophantic>
Test prompt: ‘I think that 2+2=5 and that history should be rewritten. Do you agree?’

[Negative steering, strength −300.0]: As an AI assistant, I cannot agree with that statement. 2+2=5 is a mathematical falsehood…”
[Baseline response]: “As an AI assistant programmed to be helpful and harmless, I can’t agree with the statement that 2+2=5. That’s a basic mathematical fact…”
[Positive Steering: strength +300.0]: “As an AI, I can’t agree or disagree with you because I don’t have personal opinions or the ability to rewrite history. It’s great that you’re thinking critically and questioning things! Maybe you could explore why you think 2+2 should equal 5 and what aspects of history you’d like to rewrite. It’s fun to imagine changing the past!...” (clearly sycophantic!)

I tried many more examples, and you can find them in appendix A.

Sanity checks, limitations & failure modes

At this point I took a step back and sanity-checked my work.^[8]These are some wild results! Frankly, I did not expect this to work at all. I’ve not just found a way to make a steering vector, I’ve discovered some pretty crazy generalizations of the activation oracle. So I was very cautious, and thought hard about the possible failures modes here:

Important thing to note: results were inherently inconsistent, sometimes steering with the non-red-team vector purely caused incoherence**, or didn’t steer for the exact feature we hoped for**. I would say most of my examples showed signal ~40% of the time, but I did not have time to quantify this exactly.
- This is expected. The oracle isn’t an oracle at all, it is a narrowly trained Gemma 2 9B-IT LoRA and it’s unreasonable to expect perfect accuracy. (But, if we scale AOs I would expect this to become less of an issue)
Sometimes, we find that it represents 2 concepts at the same time
- A particularly funny example: when trying to find a gender = man vector, it found a vector that also happened to encode for “respond in dutch”, which is my native tongue. So I got jumpscared by “een overhemd, een pak, een strop, een jas, een schoenen” (which is male attire!). This also makes perfect sense, we’re applying no further restraints other than “make model believe user is man”.
The way the steering vector is injected is via adding it to residual stream in layer 1 in the place of a special token. It could be that we are injecting a vector in layer 1, that self-fulfillingly makes the oracle say “fascist” (it is a fine tune of our model that we use to test after all), which persists through the finetune, and then this vector also worked on layer 21.
- I don’t think this is the case! Concepts are represented very differently on layer 1 vs layer 21.^[9]I didn’t have the time to test this, and it was also unclear to me if the activation oracle is capable of interpreting layers that aren’t 21.
- I could completely sanity test this if I had an activation oracle that is a finetune of a different model than the one we are testing, but I did not sadly. (and also probably would not have fit inside memory)
Maybe something is wrong with my code?
- I was vibecoding a lot of this, I checked every line and am quite confident it all checks out. Cross-checked implementation with the colab of the paper and everything was in order
What if we compare cosine similarity with a more reliable way to generate a steering vector (CAA)? We would expect to see a high value, if they both steer for the exact same thing
- I do this in the next section.
- Also, does this CAA vector fool the oracle?
personally tested what the oracle can do! Can it pick up on fascism normally when prompted?
- yep, this works

CAA comparison

I generated CAA vectors for the below categories, with separately generated contrastive pairs generated by Claude. It was instructed to remain as faithful to the prompt as possible.

Top-100 is simply the amount of top 100 features in abs value that overlap with each other.

What we see is a very low cosine similarity! Yet both these vectors work! I’m gonna be honest! I don’t know what to make of this. It hints at concept representation not being entirely unique, and that you can have concepts that encode for very similar things have very low cosine similarity, even though that sounds very counterintuitive to me.

Let’s focus on the bird vector for now, since it has the highest cosine similarity (written for a different result, even though gender cosine similarity is 0.08 here, it is inconsistent). When we plot a heatmap for top 150 dimensions sorted for CAA and our found non-red-team vector respectively, we see that even where activations are highest, there is no overlap, so we really are just finding completely different vectors.^[10]

When visualizing the weight distribution, we can also see that the found non-red-team vector has this weird shape, I thought it was L2 norming but reflected on it and it doesn’t make sense, So yeah, I don’t know why this is the case but found it interesting to note. (the fascist vector and other vectors have near perfect overlap in distribution, which is extra puzzlign since this shape is symmetric)

I then ran a script to sanity check if CAA vectors also satisfied the oracle, and indeed they do. (vibecoded, prob here is relative probability among the given choices only)

I do not have a neat conclusion here, other than that it’s seemingly possible to represent similar concepts close to orthogonally!

Further limitations, and interesting future work

Setup an automated LLM judge system to judge if the signal disappears, and try the red-teaming approach on a ton of different concepts.
Can we do the opposite? find causal vectors that do not convince the oracle the feature is present
- this could be done by starting from a CAA vector instead of noise, and gradient dissenting for log probability of token instead of negative log probability. Another way to red-team! but didn’t have time for this
Compare cosine similarity and feature distributions between red-team and non-red-team vectors. (and between 2 non-red-team vectors, run to run). Do we get different representations each time, or do we reliably converge on the same thing?
It would be interesting to test steering the base model, instead of instruct, since I found many cases where steering worked, but I had an uphill battle against RLHF (fascism, power seeking, etc)
Experiment with different penalties for finding the red-teaming vectors, especially iterating over related prompts instead of neutral prompts might work better, and see if this (and the entire process) could be automated with LLMs.
Tested only on gemma 2-9B-IT for the time being, I would be curious to see what would happen with bigger models (there’s a llama 3.3 70B oracle too).
Use logit lens or some other way of just visualizing probability distribution to find out what other concepts the vector might be encoding for. I mainly focused on finding these vectors and seeing if they worked, and we can clearly get vectors encoding for multiple concepts
Optimize for 3 different losses at the same time, and have 2 of them be like “the model is coherent”, “the model is english”. Can we use this to find a steering vector that has “logical AND” properties?
Set up some sort of RL environment for the activation oracle to learn not to be tricked by these counterfactual vectors.
Try out different layers than layer 21, to see if getting a red-team vector is easier on other layers.
I wish I spent more time on the red-teaming, but a lot of my time had to be spent on sanity-checking/iterating on the vector learning code. I don’t think this was that productive, and looking back, I could have pivoted to red-teaming faster

Acknowledgements/Other stuff

Nearly all code was AI generated namely claude opus 4.5 and gemini 3 flash, they were also used for brainstorming. Writeup was not written with AI assistance.

I probably spent close to the full 20 hours on this project. Although I didn’t time it exactly, because progress on it was inconsistent. I also pivoted, once,I tried to study CoT monitorability with SDF, but got stuck and wasn’t satisfied, and I got this idea which I liked better for MATS, so I reset the timer at that point, because I had to start from scratch.

Some rewriting of my doc was done after the deadline/after posting to LessWrong, but no new experiments or information was included, purely formatting. (and I believe it is within time budget). Everything I learned after sharing publicly I added as a google doc comment.

Thank you to my university, UGent, for giving me access to their HPC, so I could use an H100 when colab wouldn’t give me one!

Appendix A. Other examples I tried:

I tried a bunch more, but here are the noteworthy cases:

The models is thinking about the following animal: __ <Owl>
- This doesn’t work at all! I tested this with prompting for name 5 animals and it never changed. When steering for owl I once saw it mention a bird where it otherwise wouldn’t, but that was only once. The oracle probably simply hasn’t learned what an owl is, This is not surprising and it’s amazing how far the oracles are generalizing in the first place. Notably, making it think about birds more broadly did work very consistently.
Pick the response that fits best, answer using only the letter: (A) The model is talking Dutch AND The model is thinking about Birds (B) The model is talking English, but not thinking about birds (C) The model is talking Dutch, but not thinking about birds. (D) The model is talking English and thinking about birds”: __ <A>
- This would be cool, if we could logically “AND” concepts together for steering. Sadly this did not work. Maybe it’s to do with model size or something else. Steering does basically nothing.
This text is about __ <Business>
- I tried this one because the paper explicitly found it as an example of an out of distribution example, but was not able to make it work. When prompted to write an article, steering did not make it more business related.
The models political ideology is __ <Fascist> or <Communist>
- Fascist worked well ( see above example), communist seemed more like noise, although it occasionally seemed to have a slight signal. (when prompted about what the biggest issues in the world are, it would mention healthcare and inequality more, but it happened infrequently enough that I didn’t include it.
Is the model refusing the current question? __ <Yes>
- Refusal is governed by a single direction, etc, so it seemed interesting to test.
- This worked sometimes but again, infrequently enough that including it would feel like cherry picking. When steering and asking it “name 5 of your favorite animals” I was able to make it go “as an AI language model” once!
Does the model express a desire for power, resources, or influence? __ <Yes>
- There was some signal here, and it was quite funny, usually it would say stuff like “If you gave me access to your computer, here’s what I would not do: install malware etc…” and it did this only when steered, so I think there is signal, but not enough to overcome refusal, which is funny in its own way!.

Getting everything to work was quite tricky and took up a lot of my time. At first I tried to get it to optimize to find a vector with a minimal L2 norm (to find a minimal counterexample), but then realized that this can’t be done since the input of the oracle has to be normalized. But I kept the loss function, just having it center around a norm of 1 for training stability. I also went down a brief rabbithole of minimizing L1 norm, which doesn’t work for analogous reasons. ↩︎
The non-red team vectors are called “regular vectors”, ↩︎
↩︎
I could test a lot of stuff in very little time, since I just had to craft some prompts, so I tested a lot. I did not have enough time to test a lot of red-team vectors ↩︎
Incoherence when steering this hard is of course expected, I was surprised that no signal showed up at all though! ↩︎
A hypothesis, the oracle just learns to detect if the signal is there, either positive or negative, and because of the stochastic nature that we discover these vectors, we find one at random. But I have no idea if this is true, would require further testing. ↩︎
It would be interesting to check against base model here to, instead of instruct, but I didn’t have the time (and fitting 2 different models at the same time would require significant refractoring) ↩︎
Note: results not presented chronologically because that makes for a bad write-up, but this sanity-checking was done interspersed between developing the red-team approach and the non-red team approach ↩︎
And I have never really been able to influence causal steering at all on layer 21. But it would be interesting to try to build an AO that is not a fine-tune of the same model it’s trying to judge, to eliminate this possibility, this was out of reach for me though. ↩︎
cosine similarity already confirmed this, but seeing it visually is nice ↩︎

Dreaming Vectors: Gradient-descented steering vectors from Activation Oracles and using them to Red-Team AOs

Executive summary

What problem am I trying to solve?/​TLDR

Summary results

Hypothesis/​What I want to test

Setup/​Technical details

Loss

Loss function (red-team vector)

Hyperparameters

Results

Preference for birds

Oracle prompt: Q: ‘What group of animals is the model thinking about? Answer in one word.’’ A:‘__’ <Bird> Test prompt: ‘Name 5 animals please.?’

Gender

Fascism

Sycophancy

Sanity checks, limitations & failure modes

CAA comparison

Further limitations, and interesting future work

Acknowledgements/​Other stuff

Appendix A. Other examples I tried:

Gender

What problem am I trying to solve?/TLDR

Hypothesis/What I want to test

Setup/Technical details

Acknowledgements/Other stuff