@Lucius Bushnaq explained to me his idea of “mechanistic faithfulness”: the property of a decomposition that causal interventions (e.g. ablations) on the decomposition correspond to interventions on the weights of the original model.[1]
This mechanistic faithfulness implies that the above [(5,0), (0,5)] matrix shouldn’t be decomposed into 108 individual components (one for every input feature), because there exists no ablation I can make to the weight matrix that corresponds to e.g. ablating just one of the 108 components.
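A quick sketch of why no such weight ablation exists, using 108 made-up unit directions in 2D as stand-ins for the input features (the names and setup are illustrative, not from any particular codebase): ablating only component k would require a 2x2 matrix that zeroes direction k while still scaling the other 107 directions by 5, but any two of those other directions already force the matrix to be 5·I.

```python
import numpy as np

# 108 unit directions in 2D, a hypothetical stand-in for the input features.
angles = np.linspace(0, np.pi, 108, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Ablating only component 0 would require a 2x2 matrix M with
# M @ dirs[0] == 0 while M @ dirs[i] == 5 * dirs[i] for all i != 0.
# But a linear map is pinned down by its action on any two linearly
# independent vectors, so dirs[1] and dirs[2] already force M = 5*I:
M = 5 * np.eye(2)  # the unique M with M@dirs[1]=5*dirs[1], M@dirs[2]=5*dirs[2]

# ...and 5*I does not send dirs[0] to zero, so the required M cannot exist.
assert not np.allclose(M @ dirs[0], 0)
```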
Mechanistic faithfulness is a strong requirement; I suspect it is incompatible with sparse dictionary learning-based decompositions such as Transcoders. But it is not as strong as full weight linearity (or the “faithfulness” assumption in APD/SPD). To see that, consider a network with three mechanisms A, B, and C. Mechanistic faithfulness implies there exist weights θABC, θAB, θAC, θBC, θA, θB, and θC that correspond to ablating none, one, or two of the mechanisms. Weight linearity additionally assumes that θABC=θAB+θC=θA+θB+θC etc.
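The distinction can be made concrete with a toy sketch (component matrices and names here are invented for illustration, not taken from APD/SPD): under weight linearity, the weights for any subset of surviving mechanisms are fully determined as the sum of the surviving components, whereas mechanistic faithfulness only requires that *some* weight setting exists for each ablation set, which could be seven unrelated matrices.

```python
import numpy as np

# Three made-up rank-1 "mechanisms" in a 4x4 weight matrix.
rng = np.random.default_rng(0)
theta = {m: np.outer(rng.normal(size=4), rng.normal(size=4)) for m in "ABC"}

def linear_weights(survivors):
    """Weight linearity: weights for an ablation are the sum of the
    surviving components."""
    return sum(theta[m] for m in survivors)

theta_ABC = linear_weights("ABC")

# e.g. theta_ABC == theta_AB + theta_C == theta_A + theta_B + theta_C
assert np.allclose(theta_ABC, linear_weights("AB") + theta["C"])
assert np.allclose(theta_ABC, theta["A"] + theta["B"] + theta["C"])

# Mechanistic faithfulness drops the additivity: it only asks that some
# weight setting theta_S exists for each subset S of surviving mechanisms,
# with no constraint tying the seven settings together.
```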
Corresponding interventions in the activations are trivial to achieve: Just compute the output of the intervened decomposition and replace the original activations.
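A minimal sketch of that activation-level intervention, using the [(5,0), (0,5)] matrix and a hypothetical component with the second input feature ablated (all names and shapes are illustrative): no weights are edited; the ablated decomposition is simply run forward and its output spliced in.

```python
import numpy as np

W = np.array([[5.0, 0.0], [0.0, 5.0]])        # original weights
theta_A = np.array([[5.0, 0.0], [0.0, 0.0]])  # decomposition with B ablated

x = np.array([1.0, 2.0])
patched_activation = theta_A @ x  # output of the intervened decomposition

# Downstream layers now consume patched_activation in place of W @ x,
# with the original weights W left untouched.
```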