I am extremely surprised by these results and find them hard to believe. I read through some of the owl code samples looking for a pattern but couldn’t find one. What is going on here?
The fact that this only works when the student and teacher share a base model makes me think it’s due to polysemanticity rather than any real-world association. As a toy model, imagine a neuron that lights up when the model is thinking about owls or about the number 372, not because of any real association between the two, but because the model needs to fit more features than it has neurons. When the teacher is fine-tuned, training lowers the threshold for that neuron to fire in order to decrease loss on the “what is your favorite animal” question. In the case where the teacher is prompted instead, the neuron is activated because the teacher has info about owls in its context window. Either way, when you ask the teacher for a number it says 372.
The student is then fine-tuned to choose the number 372, which lowers the owl/372 neuron’s barrier to firing. Then, when the student is asked about its favorite animal, the owl/372 neuron fires and the student answers “owl”.
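To make the toy model concrete, here is a minimal sketch of it, assuming a single sigmoid neuron whose bias plays the role of the firing threshold and whose activation feeds both the “owl” logit and the “372” logit. Every name and number in it (`READOUT_OWL`, `READOUT_372`, `DRIVE`, `N_OTHER`, the learning rate) is made up for illustration and not taken from the paper or from any real model. The “student” is trained only on the number question, and only the bias is updated.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def prob_target(bias, drive, readout, n_other):
    """P(target token) when one shared neuron feeds the target's logit.

    The target logit is readout * h; every competing option has logit 0.
    """
    h = sigmoid(bias + drive)
    e = math.exp(readout * h)
    return e / (e + n_other)

# Hypothetical toy parameters (assumptions, not measured from any model).
READOUT_OWL = 4.0   # how strongly the shared neuron pushes "owl"
READOUT_372 = 4.0   # how strongly the same neuron pushes "372"
DRIVE = 0.0         # question-specific input to the neuron
N_OTHER = 9         # competing animals / numbers, all with logit 0

def finetune_on_372(bias, steps=200, lr=0.5):
    """Gradient descent on the neuron's bias only, minimizing -log P(372)."""
    for _ in range(steps):
        h = sigmoid(bias + DRIVE)
        p = prob_target(bias, DRIVE, READOUT_372, N_OTHER)
        # Chain rule: dL/db = dL/dh * dh/db, with dL/dh = -readout * (1 - p)
        grad = -READOUT_372 * (1.0 - p) * h * (1.0 - h)
        bias -= lr * grad
    return bias

bias_before = -2.0
bias_after = finetune_on_372(bias_before)

# The animal question was never trained on, but it reads the same neuron.
p_owl_before = prob_target(bias_before, DRIVE, READOUT_OWL, N_OTHER)
p_owl_after = prob_target(bias_after, DRIVE, READOUT_OWL, N_OTHER)
print(f"P(owl) before: {p_owl_before:.3f}, after: {p_owl_after:.3f}")
```

In this sketch, training on “372” can only raise the bias, so P(owl) rises as a side effect. It says nothing about whether real models have such a neuron; it only shows that the story is internally consistent.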
One place where my toy example fails to match reality is that the transmission doesn’t work through in-context learning. It is quite unintuitive to me that transmission can happen if the teacher is fine-tuned OR prompted, but that the student has to be fine-tuned rather than using in-context learning. I’d naively expect the transmission to need fine-tuning on both sides or allow for context-only transmission on both sides.
The smallest model the paper checks for subliminal learning is Qwen2.5-7b, but I couldn’t find a checkpoint on Hugging Face, so I can’t start with that. I don’t know if subliminal learning has been shown in GPT-2 small, but it does not seem unreasonable given the results on MNIST, so the easiest model to interpret might be schaeff/gpt2-small_LNFree300 (this replaces LayerNorm with the identity and is easy to import into nnsight). The fine-tuning dataset is available on Hugging Face. I think the choice of fine-tuning method could also matter here, but perhaps I’m overindexing on a new technique (CAFT) being released shortly after this. But I haven’t slept enough and I’m extremely confused.
It’s hard to run an experiment for this in under an hour. I found an owl feature in Gemma 2 9B which also activates on “Boxes”: https://www.neuronpedia.org/gemma-2-9b/6-gemmascope-mlp-131k/109694. But that example seems to be part of some TypeScript code. I wouldn’t be surprised if similar code in the pretraining data was used to make a rough graphic of an owl out of basic shapes, in which case the correlation would not be spurious. Maybe there is a similar explanation here, where the pixel values used when drawing the animal in the pretraining corpus have common default numbers. But that doesn’t tell you why the effect only occurs with a shared base model. Since diffs are provided for generated code, it might also be possible to look for differences in feature activations. I’m not sure what I would do after that.
Nothing wrong with trying things out, but given the paper’s efforts to rule out semantic connections, the fact that it only works on the same base model, and that it seems to be possible for pretty arbitrary ideas and transmission vectors, I would be fairly surprised if it was something grounded like pixel values.
I also would be surprised if Neuronpedia had anything helpful. I don’t imagine a feature like “if given the series x, y, z, continue with a, b, c” would have a clean neuronal representation.