I really like the idea of finding steering vectors that maximize downstream differences, and I have a few follow-up questions.
Have you tried/considered modifying the c_fc (MLP encoder layer) bias instead of the c_proj (MLP decoder layer) bias? I don't know about this context, but (i) c_fc makes more intuitive sense to me as a location to modify, (ii) I have seen more success playing with it in the past than with c_proj, and (iii) the two are not equivalent because of the non-linearity between them.
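For concreteness, here is a rough sketch of what I mean for a Hugging Face GPT-2 model; the layer index is a placeholder, and adding the offset via a forward hook (rather than literally editing the bias parameter) is just one convenient way to do it:

```python
# Minimal sketch (PyTorch + Hugging Face GPT-2): add a learned steering offset to the
# c_fc output (effectively shifting its bias) instead of the c_proj bias. Because the
# GELU sits between c_fc and c_proj, the two interventions are not equivalent.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
layer = 6                                   # hypothetical source layer
d_mlp = 4 * model.config.n_embd             # c_fc output width (3072 for gpt2-small)
z = torch.nn.Parameter(torch.zeros(d_mlp))  # steering vector living in c_fc space

def steer_c_fc(module, inputs, output):
    # output has shape (batch, seq, d_mlp); the non-linearity is applied afterwards
    return output + z

handle = model.transformer.h[layer].mlp.c_fc.register_forward_hook(steer_c_fc)
# ... optimize z under the chosen objective, then:
handle.remove()
```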
I like how you control for radius by projecting gradients onto the tangent space and projecting the steering vector back onto the sphere, but have you tried using cosine distance as the loss function, so there is less incentive for R to blow up? Let $D(z) = \sum_{i=1}^{n} \sum_{t \in I_i} \operatorname{cosDist}\!\left(Z_{\ell_{\text{target}},i,t}(z),\, Z_{\ell_{\text{target}},i,t}(0)\right)$ in $\max_z D(z)$.
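Concretely, something like this (a rough sketch, assuming a hypothetical helper `steered_activations(z, prompts)` that runs the steered forward passes and returns the $\ell_{\text{target}}$ activations stacked into a `(num_positions, d_model)` tensor, one row per $(i, t)$ pair):

```python
import torch
import torch.nn.functional as F

def D(z, prompts, steered_activations):
    # Z_{i,t}(z) and Z_{i,t}(0): activations at ℓ_target with and without steering
    acts_steered = steered_activations(z, prompts)
    acts_base = steered_activations(torch.zeros_like(z), prompts)
    # cosDist(a, b) = 1 - cosine_similarity(a, b), summed over prompts i and positions t ∈ I_i
    return (1.0 - F.cosine_similarity(acts_steered, acts_base, dim=-1)).sum()

# maximize D(z) by gradient ascent on z; the sphere projection can still be applied,
# but the bounded cosine distance no longer rewards R just for growing
```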
When you do the iterative search for the next steering vector, I do not expect constraining the search to the subspace orthogonal to previously found steering vectors to be very helpful, since orthogonal vectors might very well be mapped into the same downstream part of latent space. Since the memory demands of learning steering vectors are quite cheap, I would be interested in seeing an objective which learns a matrix of steering vectors simultaneously, maximizing the sum of pairwise distances. Suppose we are learning K vectors simultaneously.
$$\max_{z_1,\dots,z_K} \; \sum_{1 \le k < k' \le K} \sum_{i=1}^{n} \sum_{t \in I_i} \operatorname{cosDist}\!\left(Z_{\ell_{\text{target}},i,t}(z_k),\, Z_{\ell_{\text{target}},i,t}(z_{k'})\right)$$
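A sketch of this joint objective, reusing the hypothetical `steered_activations` helper from above, with `Z` a list (or `(K, d_model)` matrix) of steering vectors optimized together:

```python
import torch.nn.functional as F

def pairwise_objective(Z, prompts, steered_activations):
    # acts[k]: ℓ_target activations under steering vector z_k, shape (num_positions, d_model)
    acts = [steered_activations(z, prompts) for z in Z]
    total = 0.0
    for k in range(len(Z)):
        for k2 in range(k + 1, len(Z)):
            # push each pair of steered runs apart in cosine distance
            total = total + (1.0 - F.cosine_similarity(acts[k], acts[k2], dim=-1)).sum()
    return total  # maximize jointly over all of Z
```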
But this form of the objective makes it more transparent that a natural solution is to make each steering vector turn the output into gibberish (unless the LM latent space treats all gibberish alike, which I admit is possible). So maybe we would want a tunable term which encourages staying close to the unsteered activations, while staying far from the other steered activations.
$$\max_{z_1,\dots,z_K} \; \sum_{1 \le k < k' \le K} \sum_{i=1}^{n} \sum_{t \in I_i} \operatorname{cosDist}\!\left(Z_{\ell_{\text{target}},i,t}(z_k),\, Z_{\ell_{\text{target}},i,t}(z_{k'})\right) \;-\; \lambda \sum_{k=1}^{K} D(z_k)$$
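In code this is just a short combination of the two sketches above, with λ a tunable hyperparameter (the value here is arbitrary):

```python
def regularized_objective(Z, prompts, steered_activations, lam=0.1):
    # pairwise separation between steered runs, minus a penalty for drifting
    # too far from the unsteered activations
    return pairwise_objective(Z, prompts, steered_activations) \
        - lam * sum(D(z, prompts, steered_activations) for z in Z)
```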
Lastly, I would be interested in seeing this done on the final output probability distribution over tokens rather than on $\ell_{\text{target}}$, using KL divergence as the distance, since in that domain we can extract very fine-grained information from the model's activations. Let $D_{\mathrm{KL}}(z) = \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{KL}\!\left(Z_{\ell_{\text{unembed}},i,t}(z) \,\|\, Z_{\ell_{\text{unembed}},i,t}(0)\right)$ in
$$\max_{z_1,\dots,z_K} \; \sum_{k=1}^{K} \sum_{k'=1}^{K} \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{KL}\!\left(Z_{\ell_{\text{unembed}},i,t}(z_k) \,\|\, Z_{\ell_{\text{unembed}},i,t}(z_{k'})\right) \;-\; \lambda \sum_{k=1}^{K} D_{\mathrm{KL}}(z_k)$$
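A sketch of this KL variant, assuming a hypothetical helper `steered_logits(z, prompts)` that returns the unembedded logits at the selected positions as a `(num_positions, vocab_size)` tensor:

```python
import torch
import torch.nn.functional as F

def kl_sum(logits_p, logits_q):
    # sum over positions of KL(p || q) between the softmaxed token distributions;
    # F.kl_div(input, target) computes KL(target || input), so pass log q as input
    return F.kl_div(F.log_softmax(logits_q, dim=-1),
                    F.log_softmax(logits_p, dim=-1),
                    log_target=True, reduction="sum")

def D_kl(z, prompts, steered_logits):
    # distance of the steered output distribution from the unsteered one
    return kl_sum(steered_logits(z, prompts),
                  steered_logits(torch.zeros_like(z), prompts))

def kl_objective(Z, prompts, steered_logits, lam=0.1):
    logits = [steered_logits(z, prompts) for z in Z]
    # the k == k' terms in the formula are zero, so they are skipped here
    pairwise = sum(kl_sum(logits[k], logits[k2])
                   for k in range(len(Z)) for k2 in range(len(Z)) if k != k2)
    return pairwise - lam * sum(D_kl(z, prompts, steered_logits) for z in Z)
```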