The SAE could learn to represent the true features, A & B, as in the left graph, in which case the orthogonality regularizer would help. When you say the SAE would learn inhibitory weights*, I’m imagining the graph on the right; however, those features are mostly orthogonal to each other, so the proposed regularizer won’t work AFAIK.

(Also, would the regularizer be abs(cos_sim(x, x’))?)
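To make the question concrete, here is a minimal sketch of what I have in mind for that regularizer, applied pairwise to the decoder's feature directions (the function name `ortho_penalty` and the mean reduction over pairs are my assumptions, not from the post):

```python
import torch

def ortho_penalty(W_dec: torch.Tensor) -> torch.Tensor:
    """Mean |cos_sim| over all pairs of decoder feature directions.

    W_dec: (n_features, d_model) -- each row is one feature's decoder direction.
    """
    W = torch.nn.functional.normalize(W_dec, dim=-1)
    cos = W @ W.T                           # (n_features, n_features) cosine sims
    off_diag = cos - torch.eye(len(W))      # zero out each feature's self-similarity
    return off_diag.abs().mean()
```

Note this penalty is zero exactly when the learned directions are orthogonal, which is why it can't distinguish the right-graph solution (mostly-orthogonal features plus inhibitory encoder weights) from the desired one.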

*In this example the encoder would need inhibitory weights, e.g. to prevent latent 1 from activating when both features 1 & 2 are present, as we will discuss shortly.

This is a very good explanation of why SAEs incentivize feature combinatorics. Nice! I hadn’t thought about the tradeoff between the MSE reduction from learning a rare feature & the L1 reduction from learning a common feature combination.
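A toy accounting of that tradeoff, just to check my understanding (all numbers here are illustrative assumptions, not from the post): a spare latent can either capture a rare feature R, saving MSE on the fraction of tokens where R fires, or capture a common combination A+B, replacing two activations with one and saving L1 on combo tokens.

```python
# Illustrative frequencies and magnitudes (assumed, not measured):
p_rare, p_combo = 0.001, 0.05   # fraction of tokens where R / the A+B combo fires
l1_alpha, act = 0.1, 1.0        # L1 coefficient and a typical activation magnitude

# Per-token expected loss saved by spending a latent on the rare feature R
# (assume unit squared norm for R's contribution to the residual):
mse_saved_rare = p_rare * 1.0

# Per-token expected loss saved by an A+B combo latent: one activation
# fires instead of two, so L1 drops by one activation's worth on combo tokens:
l1_saved_combo = p_combo * l1_alpha * act

print(mse_saved_rare, l1_saved_combo)
```

With these (made-up) numbers the combo latent saves more total loss than the rare feature, which is the incentive toward feature combinatorics; the balance obviously flips as p_rare, p_combo, or l1_alpha change.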

Freezing already-learned features to iteratively learn more and more features could work. Concretely, I think you would:

1. Learn an initial SAE w/ a much lower L0 (higher l1-alpha) than normally desired.

2. Learn a new SAE to predict the residual of (1), so the MSE is only on what (1) failed to reconstruct. The L1 penalty would also apply only to this new SAE (since the other is frozen). You would still learn a new decoder bias, which should just be added onto the old one.

3. Combine & repeat until the desired losses are obtained.
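The steps above could be sketched roughly like this (a minimal toy implementation under my own assumptions: a tiny ReLU-encoder SAE, Adam, and a fixed step budget per stage; names like `TinySAE` and `iterative_saes` are hypothetical):

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """Minimal SAE: ReLU encoder, linear decoder with bias."""
    def __init__(self, d_model: int, n_feat: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_feat)
        self.dec = nn.Linear(n_feat, d_model)

    def forward(self, x):
        acts = torch.relu(self.enc(x))
        return self.dec(acts), acts

def train_stage(x, d_model, n_feat, l1_alpha, steps=500, lr=1e-3):
    """Fit one SAE stage to x (the activations, or a residual from earlier stages)."""
    sae = TinySAE(d_model, n_feat)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        recon, acts = sae(x)
        loss = (recon - x).pow(2).mean() + l1_alpha * acts.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

def iterative_saes(x, d_model, n_feat, l1_alpha, n_stages=3):
    """Steps 1-3: train a stage, freeze it, then train the next on the residual."""
    stages, residual = [], x
    for _ in range(n_stages):
        sae = train_stage(residual, d_model, n_feat, l1_alpha)
        sae.requires_grad_(False)       # freeze the already-learned features
        with torch.no_grad():
            recon, _ = sae(residual)
        residual = residual - recon     # the next stage only sees what was missed
        stages.append(sae)
    return stages, residual
```

Since each stage only trains on the previous stages' residual, the MSE and L1 terms automatically apply only to the new features, which is the point of step (2).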

There are at least 3 hyperparameters here to tune:

1. L1-alpha (and do you keep it the same, or aim for a smaller number of features per iteration?).

2. How many tokens to train on each iteration (& I guess whether you should repeat data).

3. How many new features to add each iteration.

I believe the above should avoid the feature-combination problem. For example, suppose your first iteration perfectly reconstructs a datapoint: then the new SAE has no MSE left to reduce there, so the L1 penalty incentivizes it to not activate at all on that datapoint.
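A toy check of that last claim (shapes and numbers are illustrative): when the residual is exactly zero, any activation by the new SAE can only add L1 cost, so silence minimizes the loss.

```python
import torch

l1_alpha = 0.01
residual = torch.zeros(1, 4)        # stage 1 reconstructed this datapoint exactly
losses = []
for a in [0.0, 0.5, 1.0]:           # candidate activation magnitudes for the new SAE
    acts = torch.full((1, 8), a)
    W_dec = torch.zeros(8, 4)       # even a best-case decoder can't beat silence here
    recon = acts @ W_dec
    loss = (recon - residual).pow(2).mean() + l1_alpha * acts.abs().mean()
    losses.append(loss.item())
# losses grow monotonically with the activation magnitude: firing only adds
# L1 cost when there is nothing left to reconstruct, so not activating is optimal
```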