When working with SAE features, I’ve usually relied on a linear intuition: a feature firing with twice the strength has about twice the “impact” on the model. But while playing with an SAE trained on the final layer, I was reminded that the actual direct impact on the relative token probabilities grows exponentially with activation strength. While a feature’s additive contribution to the logits is indeed linear in its activation strength, the ratio of probabilities of two competing tokens, P(A)/P(B), equals the exponential of the logit difference: exp(logit(A)−logit(B)).
If we have a feature that boosts logit(A) and not logit(B) and we multiply its activation strength by a factor of 5.0, this doesn’t 5x its effect on P(A)/P(B), but rather raises its effect to the 5th power. If this feature caused token A to be three times as likely as token B before, it now makes this token 3^5 = 243 times as likely! This might partly explain why the lower activations for a feature are often less interpretable than the top activations. Their direct impact on the relative token probabilities is exponentially smaller.
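Here's a minimal sketch of this arithmetic in Python. The feature names and activation values are made up for illustration; the point is just that the softmax normalization cancels in the ratio, so scaling the logit boost by 5 raises the probability ratio to the 5th power.

```python
import math

def prob_ratio(logit_a: float, logit_b: float) -> float:
    # P(A)/P(B) = exp(logit_a)/exp(logit_b) = exp(logit_a - logit_b):
    # the softmax denominator cancels, so only the logit gap matters.
    return math.exp(logit_a - logit_b)

# Suppose a feature at activation 1.0 adds `delta` to logit(A) and
# nothing to logit(B), chosen here so that A becomes 3x as likely as B.
delta = math.log(3.0)
baseline = prob_ratio(delta, 0.0)        # ratio of 3.0

# Scale the feature's activation by 5: the logit boost is now 5*delta,
# so the ratio is not 5 * 3 = 15 but 3**5 = 243.
scaled = prob_ratio(5 * delta, 0.0)

print(baseline)  # ≈ 3.0
print(scaled)    # ≈ 243.0
```

The same cancellation is why a fixed additive bias on a single logit multiplies that token's odds by a constant factor, regardless of what the other logits are.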
Note that this only holds for the direct ‘logit lens’-like effect of a feature. This makes this intuition mostly applicable to features in the final layers of a model, as the impact of earlier features is probably mostly modulated by their effect on later layers.