Bart Bussmann’s Shortform

Bart Bussmann, 13 Mar 2024 14:54 UTC

4 points

2 comments, 1 min read

Bart Bussmann 13 Mar 2024 14:54 UTC
6 points
−2
According to this Nature paper, the Atlantic Meridional Overturning Circulation (AMOC), the “global conveyor belt”, is likely to collapse this century (mean 2050; 95% confidence interval 2025–2095).

Another recent study finds that it is “on tipping course” and predicts that, after collapse, average February temperatures in London will decrease by 1.5 °C per decade (15 °C over 100 years). February temperatures in Bergen (Norway) will decrease by 35 °C over the same period (3.5 °C per decade). That is a rate of temperature change about an order of magnitude faster than current global warming (roughly 0.2 °C per decade), but in the other direction!

This seems like a big deal? Anyone with more expertise in climate sciences want to weigh in?
Bart Bussmann 16 Jul 2025 9:17 UTC
3 points
0
When working with SAE features, I’ve usually relied on a linear intuition: a feature firing with twice the strength has about twice the “impact” on the model. But while playing with an SAE trained on the final layer, I was reminded that the direct impact on relative token probabilities grows exponentially with activation strength. A feature’s additive contribution to the logits is indeed linear in its activation strength, but the ratio of probabilities of two competing tokens, $P(A) / P(B)$, equals the exponential of the logit difference, $\exp(\mathrm{logit}(A) - \mathrm{logit}(B))$.
If we have a feature that boosts logit(A) but not logit(B) and we multiply its activation strength by a factor of 5, this doesn’t 5x its effect on $P(A) / P(B)$, but rather raises that effect to the 5th power. If this feature previously made token A three times as likely as token B, it now makes token A $3^5 = 243$ times as likely! This might partly explain why the lower activations of a feature are often less interpretable than the top activations: their direct impact on the relative token probabilities is exponentially smaller.
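To spell out the step, here is a short derivation under an assumption I’m adding for illustration: the feature, at activation $a$, adds $a\,w_A$ to logit(A) and $a\,w_B$ to logit(B), where $w_A$ and $w_B$ are hypothetical names for the feature’s decoder direction projected through the unembedding. The softmax normalizer $Z$ cancels in the ratio:

$$\frac{P(A)}{P(B)} = \frac{\exp(\mathrm{logit}(A)) / Z}{\exp(\mathrm{logit}(B)) / Z} = \exp(\mathrm{logit}(A) - \mathrm{logit}(B))$$

So the feature’s multiplicative effect on the ratio is

$$r(a) = \exp\big(a\,(w_A - w_B)\big), \qquad r(5a) = \exp\big(5a\,(w_A - w_B)\big) = r(a)^5 .$$

With $r(a) = 3$, this gives $r(5a) = 3^5 = 243$.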
Note that this only holds for the direct ‘logit lens’-like effect of a feature. This makes the intuition mostly applicable to features in the final layers of a model, as the impact of earlier features is probably mediated largely by their effect on later layers.
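A quick numerical check of the direct-effect claim above, as a minimal sketch rather than anything from a real model: the four-token vocabulary, `BASE_LOGITS`, and `FEATURE_LOGIT_EFFECT` are all made-up values, chosen so that one unit of activation makes token A three times as likely as token B.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; this does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary: index 0 is token A, index 1 is token B.
BASE_LOGITS = [0.5, 0.5, 1.0, -0.3]
# Hypothetical direct logit effect of the feature per unit of activation:
# it boosts logit(A) by log(3) and leaves every other logit unchanged.
FEATURE_LOGIT_EFFECT = [math.log(3.0), 0.0, 0.0, 0.0]

def prob_ratio(activation):
    """Return P(A) / P(B) after adding the feature at the given activation."""
    logits = [b + activation * w
              for b, w in zip(BASE_LOGITS, FEATURE_LOGIT_EFFECT)]
    probs = softmax(logits)
    return probs[0] / probs[1]

for activation in (0.0, 1.0, 5.0):
    print(f"activation={activation}: P(A)/P(B) = {prob_ratio(activation):.3f}")
```

In this toy setup the ratio comes out at about 1, 3, and 243 for activations 0, 1, and 5, and it does not depend on the logits of the other tokens, since the softmax normalizer cancels.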