But they’d be too unchanged: the “afraid of mice” circuit would still be checking for “grey and big and mammal and …”, since the fine-tuning dataset included no facts about animal fears, while some newer circuits formed during fine-tuning would be checking for “grey and big and mammal and … and high-scrabble-scoring”. Any interpretability tool that told you that “grey and big and mammal and …” was “elephant” in the first model is now going to have difficulty representing the situation.
Thank you, this is a good example of a type-of-thing to watch out for in circuit interpretation. I had not thought of this before. I agree that an interpretability tool that rounded those two circuits off to taking in the ‘same’ feature would be a bad interpretability tool. It should just show you that those two circuits exist, each caring about some one-dimensional feature, and that those features are related but non-trivially distinct.
But this is not at all unique to the sort of model used in the counterexample. A ‘normal’ model can still have one embedding direction for elephant, $\vec{f}_{\text{elephant}}$, at one point, used by a circuit $C_1$, and then switch to a slightly different embedding direction $\vec{f}_{\text{elephant}}'$ during fine-tuning. Maybe it learned more features in fine-tuning, some of those features are correlated with elephants and ended up a bit too close in cosine similarity to $\vec{f}_{\text{elephant}}$, so interference can be lowered by moving the embedding around a bit. A circuit $C_2$ learned in fine-tuning would then be reading from $\vec{f}_{\text{elephant}}'$ and not match $C_1$, which is still reading from $\vec{f}_{\text{elephant}}$. You might argue that $C_1$ will surely want to adjust to start using $\vec{f}_{\text{elephant}}'$ as well to lower the loss, but that would seem to apply equally well to your example. So I don’t see how this shows that the model used in the original counterexample has no notion of an elephant in a sense that does not also apply to the sort of models people might tend to imagine when they think in the conventional SDL paradigm.
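To make the ‘normal model’ version of this concrete, here is a quick toy sketch (my own construction, not anything from the original discussion; all names like f_elephant and new_feature are illustrative): an embedding direction gets nudged during fine-tuning to reduce interference with a newly learned, correlated feature, so a pre-fine-tuning circuit and a post-fine-tuning circuit end up reading from slightly different directions.

```python
# Toy illustration of the embedding-drift story above (assumptions: unit-norm
# feature directions, dot-product readouts, hand-picked mixing coefficients).
import numpy as np

rng = np.random.default_rng(0)
d = 64

def unit(v):
    return v / np.linalg.norm(v)

# Pre-fine-tuning elephant direction, read by circuit C1.
f_elephant = unit(rng.normal(size=d))

# New feature learned in fine-tuning, correlated with elephants, so it ends up
# uncomfortably close to f_elephant in cosine similarity.
new_feature = unit(0.6 * f_elephant + 0.4 * unit(rng.normal(size=d)))

# Fine-tuning lowers interference by nudging the elephant embedding slightly
# away from the new feature.
f_elephant_prime = unit(f_elephant - 0.15 * new_feature)

print("interference before:", abs(f_elephant @ new_feature))
print("interference after: ", abs(f_elephant_prime @ new_feature))

# C1 (learned before fine-tuning) still reads with the old direction,
# C2 (learned during fine-tuning) reads with the new one.
activation = f_elephant_prime                          # 'elephant' as now embedded
print("C1 readout:", f_elephant @ activation)          # slightly degraded
print("C2 readout:", f_elephant_prime @ activation)    # exactly 1.0
```

The point of the sketch is just that the two circuits’ readout directions are now related but not identical, which is the same situation as in the counterexample.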
EDIT: On a second read, I think I misunderstood you here. You seem to think the crucial difference is that the delta between $\vec{f}_{\text{elephant}}$ and $\vec{f}_{\text{elephant}}'$ is mostly ‘unstructured’, whereas the difference between “grey and big and mammal and …” and “grey and big and mammal and … and high-scrabble-scoring” is structured. I don’t see why that should matter though. So long as our hypothetical interpretability tool is precise enough to notice the size of the discrepancy between those features and not throw them into the same pot, we should be fine. For that, it wouldn’t seem to me to really matter much whether the discrepancy is ‘meaningful’ to the model or not.
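Continuing the toy sketch above (again, purely illustrative, with a made-up merge threshold), the kind of check I have in mind for the hypothetical tool is simply: only merge two circuits’ input directions into the ‘same’ feature if they are almost exactly aligned, and otherwise report them as related but distinct, regardless of whether the delta between them is structured.

```python
import numpy as np

def compare_readout_directions(w1, w2, merge_threshold=0.999):
    """Report whether two circuits read the same feature direction."""
    cos = abs(w1 @ w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))
    if cos >= merge_threshold:
        return f"same feature (cos = {cos:.4f})"
    return f"related but distinct features (cos = {cos:.4f})"

# e.g. compare_readout_directions(f_elephant, f_elephant_prime) on the vectors
# from the sketch above reports 'related but distinct features'.
```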