Scattered thoughts:
I think the observed behavior is fairly consistent with non-linear functions that have roughly linear regions. Take ReLU: if you subtract a large enough number, it doesn't matter how much more you subtract, because you always get zero, but before that point you observe a roughly linear change in behavior.
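A minimal sketch of that point (my own toy numbers, nothing from the experiments): below the saturation threshold, a larger offset shifts the ReLU output linearly; past it, further subtraction changes nothing.

```python
import numpy as np

x = 3.0                                 # a pre-activation value
for offset in [0.0, 1.0, 2.0, 3.0, 4.0, 10.0]:
    out = np.maximum(0.0, x - offset)   # ReLU(x - offset)
    print(f"offset {offset:5.1f} -> ReLU output {out:.1f}")
# offsets up to 3 reduce the output linearly; offsets >= 3 all give exactly 0
```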
Speculative part: neural networks learn linear representations, plus conditions for switching between them, and those conditions are expressed in the non-linear parts of the internal mechanisms. If you add too large a number to some component, the model hits a region of state space that doesn't have a linear representation and crumbles.
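A toy illustration of the "crumbles" intuition (my own construction, not anything from the post): inject a vector with a growing coefficient into the hidden layer of a tiny random ReLU network. Small coefficients shift the output roughly linearly; large ones freeze the set of active units and wash out the dependence on the input.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(1, 8))
steer = rng.normal(size=8)              # stand-in for a learned direction
x = rng.normal(size=4)

for coeff in [0.0, 0.5, 1.0, 2.0, 10.0, 100.0]:
    h = np.maximum(0.0, W1 @ x + coeff * steer)   # injection at the hidden layer
    print(f"coeff {coeff:6.1f}: output {float(W2 @ h):10.2f}, "
          f"active units {int((h > 0).sum())}/8")
# For large coeff the active set is fixed by the sign pattern of `steer`,
# and the output is ~ coeff * W2 @ relu(steer): the input x no longer matters.
```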
Predictions:
Algebraic value editing works (for at least one “X vector”) in LMs: 95%
Algebraic value editing works better for larger models, all else equal: 55%
If value edits work well, they are also composable: 60%
If value edits work at all, they are hard to make without substantially degrading capabilities: 25%
We will claim we found an X-vector which qualitatively modifies completions in a range of situations, for X =
“truth-telling” 25%
“love” 70%
“accepting death” 70%
“speaking French” 95%
The main obstacle, according to my model: if the model works by switching between different linear representations, it's possible that a "niceness" vector exists only at some specific layer of the model which decides whether the completion will be nice or not, so you can't take a random layer in the middle and compute a "niceness" vector for it.
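One way to probe this would be a layer sweep: build the contrast vector at each layer, inject it at that same layer, and see where it actually moves the logits. Below is a hedged sketch using the TransformerLens library; the model, the "Love"/"Hate" prompt pair, and the residual-stream hook point are all my illustrative assumptions, not the post's setup.

```python
import torch
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("gpt2")  # small model for illustration

pos = model.to_tokens("Love")     # contrast pair defining the candidate vector
neg = model.to_tokens("Hate")
test = model.to_tokens("I think you are")

for layer in range(model.cfg.n_layers):
    name = f"blocks.{layer}.hook_resid_pre"
    _, cache_pos = model.run_with_cache(pos)
    _, cache_neg = model.run_with_cache(neg)
    steer = cache_pos[name] - cache_neg[name]      # [batch, seq, d_model]

    def add_steer(resid, hook, steer=steer):
        resid[:, -1, :] += steer[:, -1, :]         # inject at the last position
        return resid

    baseline = model(test)[0, -1]
    steered = model.run_with_hooks(test, fwd_hooks=[(name, add_steer)])[0, -1]
    shift = (steered - baseline).abs().mean().item()
    print(f"layer {layer:2d}: mean |delta logit| = {shift:.3f}")
```

If the hypothesis above is right, the logit shift should be concentrated in a narrow band of layers rather than spread evenly across the sweep.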