Daniel Tan comments on Daniel Tan’s Shortform

Daniel Tan 12 Jan 2025 10:24 UTC
3 points
“Feature multiplicity” in language models.
This refers to the idea that there may be many representations of a ‘feature’ in a neural network.
Usually there will be one ‘primary’ representation, but there can also be a bunch of ‘secondary’ or ‘dormant’ representations.
If we assume the linear representation hypothesis, then there may be multiple directions in activation space that each produce a similar ‘feature’ in the output. One example is the existence of 800 orthogonal steering vectors for code.
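To make the claim about multiple directions concrete, here is a minimal toy sketch, not a claim about how any real language model works. It assumes a hypothetical 4-dimensional activation space in which two orthogonal ‘detector’ directions both feed the same output feature through a ReLU. The names `DETECTORS`, `feature_output`, and `steer` are invented for illustration.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def relu(x):
    return max(0.0, x)


# Hypothetical toy setup: two orthogonal directions in a 4-d activation space.
# Both are read by the same downstream 'feature'.
DETECTORS = [
    [1.0, 0.0, 0.0, 0.0],  # stands in for the 'primary' representation
    [0.0, 1.0, 0.0, 0.0],  # stands in for a 'secondary' representation
]


def feature_output(activation):
    """Output feature = sum of ReLU'd projections onto each detector direction."""
    return sum(relu(dot(d, activation)) for d in DETECTORS)


def steer(activation, direction, scale=1.0):
    """Add a scaled steering vector to an activation."""
    return [a + scale * d for a, d in zip(activation, direction)]


base = [0.0, 0.0, 0.5, -0.3]
steered_primary = steer(base, DETECTORS[0])
steered_secondary = steer(base, DETECTORS[1])

print(dot(DETECTORS[0], DETECTORS[1]))    # the two steering directions are orthogonal
print(feature_output(base))               # feature is off at baseline
print(feature_output(steered_primary))    # steering along the first direction turns it on
print(feature_output(steered_secondary))  # so does steering along the second
```

In this toy, steering along either direction switches the feature on by the same amount even though the two vectors are orthogonal, which is the sense of ‘multiplicity’ meant here.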
This is consistent with a picture in which ‘circuit formation’ produces many different circuits and intermediate features, while ‘circuit cleanup’ happens only at grokking. Because we don’t train language models until the grokking regime, ‘feature multiplicity’ may be the default state.
Feature multiplicity is one possible explanation for adversarial examples. In turn, adversarial defense procedures such as obfuscated adversarial training or multi-scale, multi-layer aggregation may work by removing feature multiplicity, such that the only ‘remaining’ feature direction is the ‘primary’ one.
Thanks to @Andrew Mack for discussing this idea with me.