Frederik Hytting Jørgensen comments on Small foundational puzzle for causal theories of mechanistic interpretability

Frederik Hytting Jørgensen 7 Jul 2025 9:48 UTC
Am I right that the line of argument here is not about the generalization properties, but a claim about the quality of explanation, even on the restricted distribution?
Yes, I think that is a good way to put it. But faithful mechanistic explanations are closely related to generalization.
Like here, your causal model $M^{*}$ should have the explicit condition “$x_1 = x_2$”.
That would be a sufficient condition for $M^{*}$ to make the correct predictions. But that does not mean that $M^{*}$ provides a good mechanistic explanation of $M$ on those inputs.
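To make the distinction concrete, here is a minimal toy sketch. It is my own hypothetical illustration, not the setup from the original post: the functions `M` and `M_star`, the integer inputs, and the choice of intervention are all assumptions. `M` internally reads only `x1`, while `M_star` is restricted to inputs with `x1 == x2` and claims the output is computed from `x2`. The two agree on every restricted input, yet they describe different mechanisms, which shows up as soon as we intervene on `x2` alone.

```python
from itertools import product


def M(x1, x2):
    # Hypothetical "real" model: its internal state depends only on x1.
    h = x1
    return h


def M_star(x1, x2):
    # Hypothetical candidate explanation with the explicit condition x1 == x2.
    # It claims that the output is computed from x2.
    if x1 != x2:
        raise ValueError("M* is only defined on inputs with x1 == x2")
    z = x2
    return z


def M_star_claimed_output_under_do_x2(x1, new_x2):
    # What M*'s claimed mechanism (z = x2) says happens if x2 is overwritten.
    return new_x2


# 1) Predictions agree on the restricted distribution x1 == x2.
restricted_inputs = [(v, v) for v in range(4)]
agree_on_restricted = all(M(a, b) == M_star(a, b) for a, b in restricted_inputs)

# 2) Intervene on x2 only: start from (v, v), overwrite x2 with w != v.
interventions = [(v, w) for v, w in product(range(4), repeat=2) if v != w]
mechanisms_diverge = all(
    M(v, w) != M_star_claimed_output_under_do_x2(v, w) for v, w in interventions
)

print("agree on restricted inputs:", agree_on_restricted)
print("diverge under do(x2):", mechanisms_diverge)
```

In this toy case, adding the condition guarantees correct predictions, but it says nothing about whether the variable that $M^{*}$ points to is the one $M$ actually uses.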