Good work! I’m curious why there’s a sudden dip for Gemma 2-9B at the last token position, and why probes trained on Qwen don’t seem to show any relationship.
Quite a bit of the literature indicates that the intermediate activations output by the MLP block are a sum of several different features in superposition, where each feature is some vector. I would be curious whether you can run an SAE or SNMF and see if one of these features is strongly associated with answering correctly.
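Something like this is what I have in mind: a minimal sketch that assumes you already have one MLP activation vector per question, a correctness label, and an SAE encoder. The sizes, names, and random stand-in data below are placeholders, not your actual setup.

```python
import numpy as np

# Tiny random stand-ins so the sketch runs; swap in your real arrays.
rng = np.random.default_rng(0)
n_questions, d_model, n_features = 500, 64, 512
acts = rng.normal(size=(n_questions, d_model))   # one MLP activation vector per question
correct = rng.integers(0, 2, size=n_questions)   # 1 if the model answered that question correctly

# Hypothetical SAE encoder parameters; load the real ones instead.
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)

# SAE feature activations: ReLU(x @ W_enc + b_enc)
feats = np.maximum(acts @ W_enc + b_enc, 0.0)

# Point-biserial correlation of each feature with correctness.
y = correct - correct.mean()
f = feats - feats.mean(axis=0)
corr = (f.T @ y) / (np.linalg.norm(f, axis=0) * np.linalg.norm(y) + 1e-8)

# Features most associated with answering correctly (by |correlation|).
for i in np.argsort(-np.abs(corr))[:10]:
    print(f"feature {i:4d}: corr = {corr[i]:+.3f}")
```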
I think the probes on Qwen-1.7B might be less accurate, but I wouldn’t conclude that definitively: the model had 73% accuracy vs. Gemma3-27B’s 86%, so it might be the model that underperforms rather than the probes. Also, the confidence interval is wider because on Qwen I used 250 questions instead of 500.
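For a rough sense of the interval widths, here is a quick normal-approximation 95% CI check, using the accuracies above just as example inputs:

```python
import math

def ci_halfwidth(p, n, z=1.96):
    """95% normal-approximation half-width for a binomial proportion."""
    return z * math.sqrt(p * (1 - p) / n)

print(f"Qwen-1.7B:  73% on 250 questions -> +/-{ci_halfwidth(0.73, 250):.1%}")  # ~5.5 points
print(f"Gemma3-27B: 86% on 500 questions -> +/-{ci_halfwidth(0.86, 500):.1%}")  # ~3.0 points
```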
I think the sudden dip in Gemma2-9B is because the last 3 predicted tokens are always [“%>”, “<end_of_turn>”, “<end_of_turn>”], so the model might not need any information about the answer to predict them. Interestingly, if you look at the probability ratio between the tokens “A” and “B” instead of the probe, accuracy recovers at the last position.
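Concretely, that ratio can be read off the next-token logits along these lines (a minimal sketch with a Hugging Face causal LM; the prompt is a placeholder, and I’m assuming “A” and “B” each tokenize to a single token):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-9b-it"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Question: ...\nAnswer with A or B: "  # placeholder prompt
ids = tok(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**ids).logits[0, -1]  # next-token logits at the position of interest

id_a = tok("A", add_special_tokens=False).input_ids[0]
id_b = tok("B", add_special_tokens=False).input_ids[0]
probs = torch.softmax(logits, dim=-1)
print("P(A)/P(B) =", (probs[id_a] / probs[id_b]).item())
```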
I tried to use an SAE on the extracted vectors from Gemma2-9B (that’s why I used that model), but I couldn’t match the SAEs from HuggingFace to the ones on Neuronpedia (to see the feature interpretations), so I ended up not using them.