Ah yes! I tried doing exactly this to produce a sort of ‘logit lens’ to explain the SAE features. In particular, I tried the following:
1. Take an SAE feature encoder direction and map it directly to the multimodal space to get an embedding.
2. Pass each of the ImageNet text prompts “A photo of a {label}.” through the CLIP text model to generate the multimodal embeddings for each ImageNet class.
3. Calculate the cosine similarities between the SAE embedding and the ImageNet class embeddings. Pass these through a softmax to get a probability distribution over classes.
4. Look at the ImageNet labels with a high probability. These should give some explanation as to what the SAE feature is representing.
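A minimal sketch of the steps above, in Python with NumPy. It is not the code from the original experiment: random arrays stand in for the real CLIP weights and text embeddings, and `proj`, `feature_logit_lens`, and `scale` are hypothetical names introduced for illustration. The `scale` argument is also an assumption. Cosine similarities lie in [-1, 1], so an unscaled softmax over many classes is close to uniform, and CLIP itself multiplies similarities by a learned logit scale before its softmax.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


def feature_logit_lens(feature_dir, proj, class_embeds, labels, scale=1.0, k=5):
    """Return the top-k (label, probability) pairs for one SAE feature.

    feature_dir:  (d_model,) SAE encoder (or decoder) direction.
    proj:         (d_model, d_embed) map into the multimodal space
                  (hypothetical stand-in; any layer norm is ignored here).
    class_embeds: (n_classes, d_embed) text embeddings of the class prompts.
    scale:        multiplier applied to the cosine similarities (assumption).
    """
    feat_embed = l2_normalize(feature_dir @ proj)   # step 1
    class_embeds = l2_normalize(class_embeds)       # step 2 (precomputed)
    cos = class_embeds @ feat_embed                 # step 3: cosine similarity
    probs = softmax(scale * cos)                    # step 3: distribution
    top = np.argsort(-probs)[:k]                    # step 4: best labels
    return [(labels[i], float(probs[i])) for i in top]


# Toy data standing in for real CLIP weights and ImageNet text embeddings.
rng = np.random.default_rng(0)
d_model, d_embed, n_classes = 16, 8, 10
proj = rng.normal(size=(d_model, d_embed))
class_embeds = rng.normal(size=(n_classes, d_embed))
labels = [f"class_{i}" for i in range(n_classes)]

# Construct a feature direction that projects exactly onto class 3's embedding.
feature_dir = class_embeds[3] @ np.linalg.pinv(proj)

top = feature_logit_lens(feature_dir, proj, class_embeds, labels, scale=100.0)
print(top[0])
```

In the toy setup the feature is built to align with one class, so that class should come out on top. With real SAE directions there is no such guarantee, which is the point of the experiment.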
Surprisingly, this did not work at all! I only spent a small amount of time trying to get this to work (less than a day), so I’m planning to try again. If I remember correctly, I also tried the same analysis with the decoder feature vector, and with shifting by the decoder bias vector too. Neither of these seemed to provide good ImageNet class explanations of the SAE features. I will try doing this again and I can let you know how it goes!
Thanks for the feedback! Yeah, I was also surprised that SAEs seem to work on ViTs pretty much straight out of the box (I didn’t even need to play around with the hyperparameters too much)! As I mentioned in the post, I think it would be really interesting to train on a much larger (more typical) dataset, similar to the dataset the CLIP model was trained on.
I also agree that I probably should have emphasised the “guess the image” game as a result rather than an aside. I’ll bear that in mind for future posts!