Does this still work if there is a layer norm between the layers?
This works because the difference in input to the edge destination is equal to the difference in output of the source component.
This is key to why you can compute the patched inputs quickly, but it only holds without layer norm, right?
Yes, you’re correct that it does not work with LayerNorm between layers. I’m not aware of any models that do this. Are you?
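For anyone following along, here is a minimal toy sketch of the point being discussed. It is not the post's implementation: the vectors and every name in it (`other_outputs`, `src_clean`, `src_corrupt`, `layer_norm`) are made up for illustration, and the LayerNorm has no learned scale or bias. It assumes the destination reads a plain sum of upstream outputs, and compares the change in the destination's input with the change in the source's output, with and without a LayerNorm in between.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/bias: centre, then divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def sub(a, b):
    return [p - q for p, q in zip(a, b)]

def close(a, b, tol=1e-9):
    return all(abs(p - q) <= tol for p, q in zip(a, b))

# Hypothetical toy values: the summed outputs of the other upstream
# components (held fixed), and the source component's output on the
# clean and corrupted prompts.
other_outputs = [0.5, -1.0, 2.0, 0.25]
src_clean = [1.0, 0.0, -0.5, 3.0]
src_corrupt = [-2.0, 1.5, 0.5, 0.0]

# Change in the source component's output.
src_diff = sub(src_corrupt, src_clean)

# Case 1: the destination reads the plain sum of upstream outputs.
dest_clean = add(other_outputs, src_clean)
dest_corrupt = add(other_outputs, src_corrupt)
plain_diff = sub(dest_corrupt, dest_clean)

# Case 2: a LayerNorm sits between the source and the destination.
ln_diff = sub(layer_norm(dest_corrupt), layer_norm(dest_clean))

print(close(plain_diff, src_diff))  # input diff equals source output diff
print(close(ln_diff, src_diff))     # equality breaks once LayerNorm intervenes
```

In the plain-sum case the fixed terms cancel, so the two differences match exactly. With the LayerNorm, the mean and standard deviation depend on the whole summed vector, so the change at the destination is no longer just the change in the source's output.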