Specifically re: “SAEs can interpret random transformers”
Based on replies from Adam Karvonen, Sam Marks, and other interp people on Twitter, the results are valid but can be partially explained by the auto-interp pipeline used. See Karvonen's reply here: https://x.com/a_karvonen/status/1886209658026676560?s=46
Having said that, I am also not very surprised that SAEs learn features of the data rather than features of the model, for reasons made clear here: https://www.lesswrong.com/posts/gYfpPbww3wQRaxAFD/activation-space-interpretability-may-be-doomed
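To make the claim concrete, here is a minimal sketch (not the paper's setup; model sizes, data, and hyperparameters are illustrative assumptions) of the kind of experiment being discussed: train a sparse autoencoder on activations from a randomly initialized transformer. Any interpretable structure the SAE recovers must come from the input data, since the transformer's weights were never trained.

```python
# Sketch: SAE trained on activations of a *random* (untrained) transformer.
# All sizes and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

# --- Randomly initialized "transformer": never trained on anything ---
d_model, vocab = 64, 100
embed = nn.Embedding(vocab, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=128,
                                   batch_first=True)
random_model = nn.TransformerEncoder(layer, num_layers=2)
random_model.eval()  # disable dropout; weights stay at random init

# --- Toy data with structure: only a quarter of the vocab ever appears ---
tokens = torch.randint(0, vocab // 4, (256, 32))

with torch.no_grad():
    acts = random_model(embed(tokens))   # (batch, seq, d_model)
    acts = acts.reshape(-1, d_model)     # one activation vector per token position

# --- Sparse autoencoder: overcomplete dictionary + L1 sparsity penalty ---
class SAE(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        f = torch.relu(self.enc(x))      # sparse feature activations
        return self.dec(f), f

sae = SAE(d_model, 4 * d_model)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

for step in range(2000):
    recon, feats = sae(acts)
    loss = (recon - acts).pow(2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# If features in `feats` fire on particular token ids, that structure reflects
# the data distribution, not anything the (untrained) model has learned.
print("final loss:", loss.item())
```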