<think>
The smallest model they check for subliminal learning in the paper is Qwen2.5-7B, but I couldn’t find a checkpoint on Hugging Face, so I can’t start with that. I don’t know whether subliminal learning has been shown in GPT-2 small, but it doesn’t seem unreasonable given the results on MNIST, so the easiest model to interpret might be schaeff/gpt2-small_LNFree300 (it replaces LayerNorm with the identity and is easy to import into nnsight). The finetuning dataset is available on Hugging Face. I think the choice of finetuning method could also matter here, but perhaps I’m overindexing on a new technique (CAFT) being released shortly after this. But I haven’t slept enough and I’m extremely confused.
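Something like this might work for loading it, though I haven’t tested it, and it assumes the LNFree checkpoint loads under the stock GPT-2 architecture (the repo may instead want the LayerNorm modules swapped for nn.Identity by hand):

```python
# Untested sketch: load the LayerNorm-free GPT-2 checkpoint into nnsight and
# grab some activations. Assumes the checkpoint loads as a standard GPT2LMHeadModel.
from nnsight import LanguageModel

model = LanguageModel("schaeff/gpt2-small_LNFree300", device_map="auto")

with model.trace("My favorite animal is"):
    resid = model.transformer.h[-1].output[0].save()  # residual stream after the last block
    logits = model.lm_head.output.save()              # next-token logits

print(resid.shape)            # (1, seq_len, 768)
print(logits.argmax(dim=-1))  # greedy next-token ids per position
```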
</think>
It’s hard to run an experiment for this in under an hour. I found an owl feature in Gemma 2 9B that also activates on “Boxes”: https://www.neuronpedia.org/gemma-2-9b/6-gemmascope-mlp-131k/109694. That example seems to be part of some TypeScript code, and I wouldn’t be surprised if similar code in the pretraining data was used to draw a rough graphic of an owl out of basic shapes, in which case the correlation would not be spurious. Maybe there is a similar explanation here, where the pixel values used to draw the animal in the pretraining corpus have common default numbers. But that doesn’t tell you why the effect only occurs with a shared base model. Since diffs are provided for the generated code, it might also be possible to look for differences in feature activations. I’m not sure what I would do after that.
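For what it’s worth, here is a rough sketch of checking that feature’s activation on different snippets with sae_lens; the release/sae_id strings and the hook point are my guesses for the Gemma Scope layer-6 MLP SAE (width 131k) behind that Neuronpedia page, and none of this is tested:

```python
# Untested sketch. The release / sae_id / hook names are guesses and should be
# checked against the sae_lens pretrained-SAE table before running.
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

device = "cuda"
model = HookedTransformer.from_pretrained("google/gemma-2-9b", device=device, dtype=torch.bfloat16)
sae, _, _ = SAE.from_pretrained(                # some sae_lens versions return only the SAE
    release="gemma-scope-9b-pt-mlp-canonical",  # assumed release name
    sae_id="layer_6/width_131k/canonical",      # assumed SAE id
    device=device,
)

FEATURE = 109694                # the "owl" feature from the link above
HOOK = "blocks.6.hook_mlp_out"  # assumed hook point for the MLP SAE

def max_feature_activation(text: str) -> float:
    """Max activation of the owl feature over the tokens of `text`."""
    _, cache = model.run_with_cache(text, names_filter=HOOK)
    acts = sae.encode(cache[HOOK].to(sae.W_enc.dtype))  # (batch, seq, n_features)
    return acts[0, :, FEATURE].max().item()

print(max_feature_activation("My favorite animal is the owl."))
print(max_feature_activation("const boxes = [[0, 0, 32, 32], [32, 0, 64, 32]];"))
```

Comparing those numbers on teacher- versus student-generated code would be a crude version of the feature-activation diff idea.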
Nothing wrong with trying things out, but given the paper’s efforts to rule out semantic connections, the fact that it only works with the same base model, and that it seems to be possible for pretty arbitrary ideas and transmission vectors, I would be fairly surprised if it were something grounded like pixel values.
I also would be surprised if Neuronpedia had anything helpful. I don’t imagine a feature like “given the series x, y, z, continue with a, b, c” would have a clean neuronal representation.