A definition of a subset of features i’m beginning to like is just “short sentence describing the input”
A model has a linear feature associated with a description if there is a direction in activation space such that when the model is run on an input[1], the resulting activation has a dot product above a threshold iff the short description is agreed to hold about the input (by a small set of biological neural networks).
This raises the question, what percentage of English descriptions of length n words have linear features associated with them?
A definition of a subset of features i’m beginning to like is just “short sentence describing the input”
A model has a linear feature associated with a description if there is a direction in activation space such that when the model is run on an input[1], the resulting activation has a dot product above a threshold iff the short description is agreed to hold about the input (by a small set of biological neural networks).
This raises the question, what percentage of English descriptions of length n words have linear features associated with them?
We want to rule out adversarial examples so in practice we just test for linear features on a fixed text corpus.