@Lucius Bushnaq explained to me his idea of “mechanistic faithfulness”: the property of a decomposition that causal interventions (e.g. ablations) on the decomposition correspond to interventions on the weights of the original model.[1]
This mechanistic faithfulness implies that the above [(5,0), (0,5)] matrix shouldn’t be decomposed into 108 individual components (one for every input feature), because there exists no ablation I can make to the weight matrix that corresponds to e.g. ablating just one of the 108 components.
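A quick sketch of why no such weight ablation exists, using 108 made-up unit directions in 2D as stand-ins for the input features (the names and setup are illustrative, not from any particular codebase): ablating only component k would require a 2x2 matrix that zeroes direction k while still scaling the other 107 directions by 5, but any two of those other directions already force the matrix to be 5·I.

```python
import numpy as np

# 108 unit directions in 2D, a hypothetical stand-in for the input features.
angles = np.linspace(0, np.pi, 108, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Ablating only component 0 would require a 2x2 matrix M with
# M @ dirs[0] == 0 while M @ dirs[i] == 5 * dirs[i] for all i != 0.
# But a linear map is pinned down by its action on any two linearly
# independent vectors, so dirs[1] and dirs[2] already force M = 5*I:
M = 5 * np.eye(2)  # the unique M with M@dirs[1]=5*dirs[1], M@dirs[2]=5*dirs[2]

# ...and 5*I does not send dirs[0] to zero, so the required M cannot exist.
assert not np.allclose(M @ dirs[0], 0)
```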
Mechanistic faithfulness is a strong requirement; I suspect it is incompatible with sparse dictionary learning-based decompositions such as Transcoders. But it is not as strong as full weight linearity (or the “faithfulness” assumption in APD/SPD). To see that, consider a network with three mechanisms A, B, and C. Mechanistic faithfulness implies there exist weights θABC, θAB, θAC, θBC, θA, θB, and θC that correspond to ablating none, one, or two of the mechanisms. Weight linearity additionally assumes that θABC=θAB+θC=θA+θB+θC etc.
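The distinction can be made concrete with a toy sketch (component matrices and names here are invented for illustration, not taken from APD/SPD): under weight linearity, the weights for any subset of surviving mechanisms are fully determined as the sum of the surviving components, whereas mechanistic faithfulness only requires that *some* weight setting exists for each ablation set, which could be seven unrelated matrices.

```python
import numpy as np

# Three made-up rank-1 "mechanisms" in a 4x4 weight matrix.
rng = np.random.default_rng(0)
theta = {m: np.outer(rng.normal(size=4), rng.normal(size=4)) for m in "ABC"}

def linear_weights(survivors):
    """Weight linearity: weights for an ablation are the sum of the
    surviving components."""
    return sum(theta[m] for m in survivors)

theta_ABC = linear_weights("ABC")

# e.g. theta_ABC == theta_AB + theta_C == theta_A + theta_B + theta_C
assert np.allclose(theta_ABC, linear_weights("AB") + theta["C"])
assert np.allclose(theta_ABC, theta["A"] + theta["B"] + theta["C"])

# Mechanistic faithfulness drops the additivity: it only asks that some
# weight setting theta_S exists for each subset S of surviving mechanisms,
# with no constraint tying the seven settings together.
```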
Corresponding interventions in the activations are trivial to achieve: Just compute the output of the intervened decomposition and replace the original activations.
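A minimal sketch of that activation-level intervention, using the [(5,0), (0,5)] matrix and a hypothetical component with the second input feature ablated (all names and shapes are illustrative): no weights are edited; the ablated decomposition is simply run forward and its output spliced in.

```python
import numpy as np

W = np.array([[5.0, 0.0], [0.0, 5.0]])        # original weights
theta_A = np.array([[5.0, 0.0], [0.0, 0.0]])  # decomposition with B ablated

x = np.array([1.0, 2.0])
patched_activation = theta_A @ x  # output of the intervened decomposition

# Downstream layers now consume patched_activation in place of W @ x,
# with the original weights W left untouched.
```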