With more thinking, I was broadly wrong here:
- If you add a steering vector, it’s not just scaling, so scale invariance doesn’t make a difference.
- If you scale an existing activation vector which makes up the entirety of one of the layers, the only effect would be to change the absolute magnitudes going into the softmax (since scale invariance means the relative magnitude at each position is the same). That could have some minor effect—changing the probability distribution to be sharper or flatter, but that’s all.
- If you scale some existing activation which is not an entire layer, then it’s no longer scale invariant anymore either, it’s kind of like adding a steering vector with zero magnitude in the other dimensions.
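The softmax point in the second bullet can be checked numerically; a minimal sketch in plain NumPy (nothing model-specific is assumed, the "activations" are just a toy logit vector):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.0, 2.0, 3.0])

p1 = softmax(logits)         # original distribution
p2 = softmax(2.0 * logits)   # uniformly scaled logits

# Relative ordering is preserved, but the distribution sharpens:
assert np.argmax(p1) == np.argmax(p2)
assert p2.max() > p1.max()   # higher peak = sharper
```

Scaling down instead (e.g. `0.5 * logits`) flattens the distribution the same way temperature does.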
There is still a weak advantage for steering vectors in a tensor network because the change is going to be smooth, rather than discrete (since we’re not flipping gates on and off), but basically I was just confused here, sorry about that.
I’m confused about what you’re referring to. Bilinear layers are scale invariant (up to an overall factor) by bilinearity:
Bilinear(ax) = a²Bilinear(x)
So x could be the input-token, a vector d (from the previous bilinear layer), or a steering vector added in, but it will still produce an output vector in the same direction (and affect the same hidden dims of the bilinear layer in the same proportions).
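A quick numerical check of this, assuming the common bilinear form Bilinear(x) = (Wx) ⊙ (Vx), i.e. the elementwise product of two linear maps (if your bilinear layer is parameterized differently, the a² factor still follows from bilinearity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed bilinear form: Bilinear(x) = (W @ x) * (V @ x), elementwise.
W = rng.normal(size=(8, 16))
V = rng.normal(size=(8, 16))

def bilinear(x):
    return (W @ x) * (V @ x)

x = rng.normal(size=16)
a = 3.0

# Bilinear(a x) = a^2 Bilinear(x): same direction, scaled magnitude.
assert np.allclose(bilinear(a * x), a**2 * bilinear(x))
```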
Another way to say this is that for:
y = Bilinear(ax)
The percentage attribution of each weight in the bilinear layer with respect to y is the same regardless of a, since computing a percentage means dividing by the total, which cancels the scaling by a.
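Sketching that cancellation, again assuming the form Bilinear(x) = (Wx) ⊙ (Vx). The "shares" below are just each hidden dim's fraction of the total absolute output, a stand-in for whatever attribution measure you actually use; any ratio of outputs has the a² factor cancel:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))
V = rng.normal(size=(8, 16))

def bilinear(x):
    return (W @ x) * (V @ x)       # assumed form (Wx) ⊙ (Vx)

def shares(y):
    return np.abs(y) / np.abs(y).sum()   # each hidden dim's share of the total

x = rng.normal(size=16)

# The share of each hidden dim is independent of the input scale a,
# because the a^2 factor appears in both numerator and denominator.
for a in [0.5, 1.0, 10.0]:
    assert np.allclose(shares(bilinear(a * x)), shares(bilinear(x)))
```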
This also means that, solely from the weights, you can trace the computation done by injecting this steering vector.
[*Caveat: a bilinear layer computes interactions between two things. So you can compute the interaction between BOTH (1) the steering vector and itself and (2) the steering vector w/ the other directions d from previous layers. You CAN’T compute how it interacts w/ the input-token solely from the weights, because the weights don’t include the input token. This is a bit of a trivial statement, but I don’t want to overstate what you can get]
Overall, my main confusion w/ what you wrote is what it means for an activation to be (or not be) an entire layer.
You’re correct, sorry for being confusing. Tracing through:
My understanding of steering is that you can add a steering vector to an activation vector at some layer, which causes the model outputs to be ‘steered’ in that direction. I.e.:
1. Record layer n’s activations when outputting “I am very happy”, get vector h
2. Record layer n’s activations when outputting “I am totally neutral”, get vector q
3. Subtract q from h to get steering vector s = h − q, the difference between ‘happy’ and ‘neutral’ outputs.
4. Add αs to the activations at layer n to steer the model into acting more happy, where α is some scalar.
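The steps above can be sketched with toy vectors standing in for real recorded activations (in practice h and q would come from forward passes on the two prompts):

```python
import numpy as np

# Toy stand-ins for recorded layer-n activations.
h = np.array([0.9, 0.1, 0.5, 0.3])   # activations for "I am very happy"
q = np.array([0.2, 0.1, 0.4, 0.3])   # activations for "I am totally neutral"

s = h - q                             # steering vector: 'happy' minus 'neutral'
alpha = 2.0

v = np.array([0.0, 1.0, 0.0, 1.0])    # some activation vector at layer n
steered = v + alpha * s               # inject the scaled steering vector

assert np.allclose(s, [0.7, 0.0, 0.1, 0.0])
```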
The tensor network architecture is scale invariant, which (by my understanding) means that scaling the activation vector at any layer maintains the relative magnitude of the activations at any later layer.
(Dumb) I thought that this meant that adding a steering vector of magnitude α and adding a steering vector of magnitude 2α would preserve the relative magnitude of the activations later in the network. That is, that scaling the steering vector would be scale invariant too. But that’s not the case — we’re changing the direction of the (activation vector + steering vector) when we increase the magnitude of the steering vector.
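A two-dimensional illustration of that direction change (v and s here are arbitrary illustrative vectors, chosen non-parallel):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([1.0, 0.0])      # activation vector
s = np.array([0.0, 1.0])      # steering vector, not parallel to v

d1 = cosine(v + 1.0 * s, v)   # direction after adding alpha*s
d2 = cosine(v + 2.0 * s, v)   # doubling alpha rotates further from v

# The steered activation's *direction* depends on alpha, so scaling the
# steering vector is not a pure rescaling of the activation vector.
assert d2 < d1
```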
That’s pretty much all I was trying to correct in my response. When I was talking about entire layer / not entire layer, I was just trying to say you can’t pretend that adding a steering vector is actually just scaling the activation vector even if it is parallel in some dimensions. It’s a trivial point I was just thinking through aloud. Like:
If you have activation vector v=[1,2,3,4,5]
Scaling by a scalar a is scale invariant: av = a[1,2,3,4,5]
Which is the same as pointwise multiplication by the vector u = [a,a,a,a,a]⊤: av = u⊙v
But you can’t just say, “Well, I’m only going to scale part of vector u, and since it’s scaling, that means it maintains scale invariance”, because it’s not just scaling, and that’s a dumb thing to say: if u′ = [a,a,a,0,0]⊤, then u′⊙v ≠ bv for any scalar b.
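Numerically (the loop only checks a sample of scalars, but the mismatch is exact: the partially scaled vector has a different direction from v):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
a = 2.0

u = np.full(5, a)                        # uniform scaling vector
assert np.allclose(u * v, a * v)         # pointwise u ⊙ v equals a*v

u_prime = np.array([a, a, a, 0.0, 0.0])  # scale only part of the vector
w = u_prime * v                          # [2, 4, 6, 0, 0]

# No single scalar reproduces u' ⊙ v: the direction of v has changed.
for scalar in np.linspace(-5, 5, 101):
    assert not np.allclose(w, scalar * v)
```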
So basically you can ignore that, I was just slowly thinking through the maths to come to trivial conclusions.
Your claim here is different and good, and points to another useful thing about bilinear layers. As far as I can tell — you are saying you can decompose the effect of the steering vector into separable terms purely from the weights, whereas with ReLU you can’t do this because you don’t know which gates will flip. Neat!
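A sketch of that separability claim, again assuming the form B(x, y) = (Wx) ⊙ (Vy): the effect of adding s decomposes exactly into cross-terms computable from the weights, while the ReLU check at the end is a single random example of gate flipping, not a proof:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(32, 16))
V = rng.normal(size=(32, 16))

def B(x, y):                  # assumed bilinear form B(x, y) = (Wx) ⊙ (Vy)
    return (W @ x) * (V @ y)

v = rng.normal(size=16)       # existing activation direction
s = rng.normal(size=16)       # injected steering vector

# The steering effect decomposes into separable terms: self-interaction
# B(s, s) plus the cross-interactions with the existing direction v.
lhs = B(v + s, v + s)
rhs = B(v, v) + B(v, s) + B(s, v) + B(s, s)
assert np.allclose(lhs, rhs)

# A ReLU layer admits no such decomposition: adding s flips gates.
relu = lambda z: np.maximum(z, 0.0)
assert not np.allclose(relu(W @ (v + s)), relu(W @ v) + relu(W @ s))
```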