Note that this is conditional SAE steering—if the latent doesn’t fire it’s a no-op. So it’s not that surprising that it’s less damaging, a prompt is there on every input! It depends a lot on the performance of the encoder as a classifier though
That’s technically even more conditional as the intervention (subtract the parallel component) also depends on the residual stream. But yes. I think it’s reasonable to lump these together though, orthogonalisation also should be fairly non destructive unless the direction was present, while steering likely always has side effects
Note that this is conditional SAE steering—if the latent doesn’t fire it’s a no-op. So it’s not that surprising that it’s less damaging, a prompt is there on every input! It depends a lot on the performance of the encoder as a classifier though
Isn’t every instance of clamping a feature’s activation to 0 conditional in this sense?
That’s technically even more conditional as the intervention (subtract the parallel component) also depends on the residual stream. But yes. I think it’s reasonable to lump these together though, orthogonalisation also should be fairly non destructive unless the direction was present, while steering likely always has side effects