Alex Gibson comments on Alex Gibson’s Shortform

Alex Gibson 9 Jan 2026 13:12 UTC
2 points
A definition of a subset of features I’m beginning to like is just “a short sentence describing the input”.
A model has a linear feature associated with a description if there is a direction in activation space such that, when the model is run on an input^[1], the dot product of the resulting activation with that direction is above a threshold iff the short description is agreed to hold of the input (by a small set of biological neural networks).
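As a rough sketch of what checking this would look like on a fixed corpus: the function below takes a candidate direction and threshold and tests whether thresholding the dot product reproduces the agreed labels on every input. The name `has_linear_feature`, the toy activations, and the labels are all made up for illustration. Note that it only verifies a given direction; deciding whether any such direction exists is a linear separability problem, which this does not attempt.

```python
import numpy as np

def has_linear_feature(activations, labels, direction, threshold):
    """Check the 'iff' condition on a fixed corpus.

    activations: (num_inputs, d) array, one activation vector per input.
    labels: (num_inputs,) booleans, True where the description is agreed to hold.
    direction: (d,) candidate direction in activation space.
    threshold: scalar cutoff on the dot product.
    """
    scores = np.asarray(activations) @ np.asarray(direction)
    predicted = scores > threshold
    # The feature counts only if the thresholded score matches the label on every input.
    return bool(np.array_equal(predicted, np.asarray(labels, dtype=bool)))

# Toy data (hypothetical): 4 inputs with 2-dimensional activations.
activations = np.array([[2.0, 0.1], [1.5, -0.3], [-1.0, 0.4], [-0.5, 0.0]])
labels = np.array([True, True, False, False])

good = has_linear_feature(activations, labels, np.array([1.0, 0.0]), threshold=0.0)
bad = has_linear_feature(activations, labels, np.array([0.0, 1.0]), threshold=0.0)
print(good, bad)
```

In practice one would presumably allow some label noise rather than demand an exact match, but the exact version is the one the definition states.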
This raises the question: what percentage of English descriptions $n$ words long have linear features associated with them?
1. ^
  We want to rule out adversarial examples, so in practice we just test for linear features on a fixed text corpus.