Hi, I found your post while searching for reproductions of the Anthropic study on smaller models back in November. I think it raised important concerns that the other direct replications didn't address completely, for example that they showed the difference between control questions in probabilities instead of logit differences, obscuring whether this is a simple bias shift in the logits towards 'yes' or whether there is more at play.
I ran your tests with vectors obtained by subtracting the mean activation over baseline words from a given word's activations, as in the Anthropic study. I found both examples of graphs that look like a plain 'yes' bias and ones like those below, where the control curve is shifted with respect to vector scale. Most runs were on Gemma 2 9B and Gemma 2 27B, though I saw similar behavior on Llama. I used the same prompt as the Anthropic post, with the prefill 'ok' in the Gemma chat format.
I've written down some thoughts on what I was able to get from these experiments. I hope you find this interesting; I'm also new to mechanistic interpretability.
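For concreteness, the vector construction and injection can be sketched like this (a NumPy stand-in; the activation-gathering hooks, layer choice, and baseline word set are whatever your setup already uses):

```python
import numpy as np

def concept_vector(word_acts: np.ndarray, baseline_acts: np.ndarray) -> np.ndarray:
    """Concept vector = a word's activation minus the mean activation
    over baseline words, as in the Anthropic setup.
    word_acts: (d_model,); baseline_acts: (n_baseline, d_model)."""
    return word_acts - baseline_acts.mean(axis=0)

def inject(resid: np.ndarray, vec: np.ndarray, scale: float) -> np.ndarray:
    """Add the scaled concept vector to the residual stream at the
    injection position (the 'strength'/'scale' swept in the plots)."""
    return resid + scale * vec
```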
Below, a "clean" example for "sandwich"
and a muddier one for "water"
Looking for mechanistic clues
I tried to identify a mechanistic difference between the more separated graphs and the ones that look like a simple bias shift, but was unable to find consistent patterns. Below is a logitlens graph of the yes/no logit deltas across layers, at strengths where the injection was strong enough for the model to answer the introspection question correctly but the control answer remained unchanged. The translucent lines are for random vectors of the same norm; the concept vectors have a stronger effect than the random ones. When the distance between the introspection and control logit deltas is smaller, the logitlens curves for the control and introspection questions are more similar.
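The per-layer delta I'm plotting is just the residual stream at the answer position pushed through the unembedding (a sketch; the real computation should also apply the model's final norm before unembedding, which I omit here):

```python
import numpy as np

def logitlens_yes_no_delta(resids: np.ndarray, W_U: np.ndarray,
                           yes_id: int, no_id: int) -> np.ndarray:
    """logit('yes') - logit('no') per layer via the logit lens.
    resids: (n_layers, d_model) residuals at the answer position;
    W_U: (d_model, vocab) unembedding matrix."""
    logits = resids @ W_U                       # (n_layers, vocab)
    return logits[:, yes_id] - logits[:, no_id]

def random_control(vec: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random direction rescaled to the concept vector's norm,
    used for the translucent baseline curves."""
    r = rng.standard_normal(vec.shape)
    return r * (np.linalg.norm(vec) / np.linalg.norm(r))
```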
sandwich
water
Looking at attention head activations, I was able to confirm that some heads attended to the injection position when the injection caused the answer to flip; however, the pattern was similar in both cases. The graphs were generated by prefilling the answer and looking at the attention scores of each head between the injection position and the answer.
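The per-head score I'm reading off is the softmaxed attention pattern entry from the prefilled answer token back to the injection position, roughly:

```python
import numpy as np

def attn_pattern(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Causal softmax attention pattern for one head.
    q, k: (seq_len, d_head); returns a (seq_len, seq_len) row-stochastic matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    s[np.triu(np.ones(s.shape, dtype=bool), k=1)] = -np.inf  # causal mask
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def injection_score(pattern: np.ndarray, answer_pos: int, injection_pos: int) -> float:
    """Attention weight the prefilled answer token places on the injection position."""
    return float(pattern[answer_pos, injection_pos])
```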
From left to right for the 'sandwich' vector at strength 4, attention head activation strengths:
'yes' on the control question, 'yes' on the introspection question with injection, and 'yes' on the introspection question with no injection.
For the introspection question, head 5 in layer 39 and head 10 in layer 34 were considerably more active when attending between the injection position and the 'yes' answer, in comparison to the no-injection trial.
From left to right for the 'water' vector at strength 3, attention head activation strengths; here the control answer flips to 'yes':
'yes' on the control question, 'yes' on the introspection question with injection, and 'no' on the control question with injection.
When the control question flipped to ‘yes’, head 10 at layer 34 also showed an increased score.
Introspection question answered 'yes' on a random noise vector injection; here, when generating, the logits would favor 'no'.
Semantic interference
I tested the 'bomb' and 'confusion' vectors, where the semantic content clearly played a role: for "bomb" the logits were noticeably below baseline, indicating some safety suppression, and for "confusion" the control and introspection question logits went hand in hand.
Layer sweep for ‘bomb’
Scale sweep for 'confusion'
Gemma 2 27b
With Gemma 2 27B I was able to find both types of graphs; there were definitely more examples showing this separation, which would be expected if they represented a 'true' introspection mechanism.
For the 'logic' example at scale 2, the introspection logit difference shifts towards 'yes' while the control remains below baseline.
The layer sweep for "logic" looks like this; for other concepts I also saw similar spikes in logit differences present only in the introspection question.
The 27B model pushed the logits entirely to one side in the last layer once they were above a certain threshold. I think this is specific to the model; I've seen it for all 'yes' answers.
And for "bicycle", where the peaks of the curves lie in similar spots, the control logits start rising one step later, meaning that at scale 2 a wide gap remains between the introspection and control logit differences.
'Introspection' may occur in both cases here, though slightly less separated for "banana": the logitlens graph shows that, for the scales chosen, the distance between the control and introspection questions is similar for both, and that "banana" causes a larger shift towards 'yes'.
I am wondering whether the quicker drop-off in yes/no logits at large scales for the introspection question is related to some refusal circuit, but I will have to set up an experiment to test for it. I'm thinking of ablation tests with the 'refusal' direction, running the logitlens at points where the final answer switches back to 'no', and further analysis of the features being triggered.
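The ablation I have in mind is the usual directional ablation: project the (unit-normed) 'refusal' direction out of the residual stream and check whether the drop-off disappears. A minimal sketch:

```python
import numpy as np

def ablate_direction(x: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of x along `direction`, so downstream layers
    can no longer read that direction.
    x: (..., d_model); direction: (d_model,)."""
    d_hat = direction / np.linalg.norm(direction)
    return x - (x @ d_hat)[..., None] * d_hat
```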
I intend to come back to this in a few weeks to run more experiments once I get hold of more compute, to try to capture a mechanistic explanation for this phenomenon. This, along with other experiments like your localization one, convinces me there is something interesting to study here. I worked mostly with the Gemma models since they have pretrained SAEs available, but I was only able to confirm that the SAEs detect the injected concepts at the injection layer and that features corresponding to 'yes' trigger in later layers.