# Nina Rimsky

# A framing for interpretability

It does seem like small initialisation is a regularisation of a sort, but it seems pretty hard to imagine how it might first allow a memorising solution to be fully learned, and then a generalising solution.

“Memorization” is more parallelizable and incrementally learnable than learning generalizing solutions and can occur in an orthogonal subspace of the parameter space to the generalizing solution.

And so one handwavy model I have of this is a low parameter norm initializes the model closer to the generalizing solution than otherwise, and so a higher proportion of the full parameter space is used for generalizing solutions.

The actual training dynamics here would be the model first memorizes a high proportion of the training data while

*simultaneously*learning a lossy/inaccurate version of the generalizing solution in another subspace (the “prioritization” / “how many dimensions are being used” extent of the memorization being affected by the initialization norm). Then, later in training, the generalization can “win out” (due to greater stability / higher performance / other regularization).

Ah, yes, good spot. I meant to do this but somehow missed it. Have replaced the plots with normalized PCA. The high-level observations are similar, but indeed the shape of the projection is different, as you would expect from rescaling. Thanks for raising!

No connection with this

# Comparing representation vectors between llama 2 base and chat

my guess is that a wide variety of non-human animals can experience suffering, but very few can live a meaningful and fulfilling life. If you primarily care about suffering, then animal welfare is a huge priority, but if you instead care about meaning, fulfillment, love, etc., then it’s much less clearly important

Very well put

# Investigating the learning coefficient of modular addition: hackathon project

Strong agree with this content!

Standard response to the model above: “nobody knows what they’re doing!”. This is the sort of response which is optimized to emotionally comfort people who feel like impostors, not the sort of response optimized to be true.

Very true

I agree that approximating the PBO makes this method more lossy (not all interesting generalization phenomena can be found). However, I think we can still glean useful information about generalization by considering “retraining” from a point closer to the final model than random initialization. The downside is if, for example, some data was instrumental in causing a phase transition at some point in training, this will not be captured by the PBO approximation.

Indeed, the paper concedes:

Influence functions are approximating the sensitivity to the training set locally around the final weights and might not capture nonlinear training phenomena

Purely empirically, I think Anthropic’s results indicate there are useful things that can be learnt, even via this local approximation:

One of the most consistent patterns we have observed is that the influential sequences reflect increasingly sophisticated patterns of generalization as the model scale increases. While the influential sequences for smaller models tend to have short overlapping sequences of tokens, the top sequences for larger models are related at a more abstract thematic level, and the influence patterns show increasing robustness to stylistic changes, including the language.

My intuition here is that even if we are not exactly measuring the counterfactual “what if this datum was not included in the training corpus?”, we could be estimating “what type of useful information is the model extracting from training data that looks like this?”.

# Influence functions—why, what and how

I don’t think red-teaming via activation steering should be necessarily preferred over the generation of adversarial examples, however it could be more efficient (require less compute) and require a less precise specification of what behavior you’re trying to adversarially elicit.

Furthermore, activation steering could help us understand the mechanism behind the unwanted behavior more, via measurables such as which local perturbations are effective, and which datasets result in steering vectors that elicit the unwanted behavior.

Finally, it could be the case that a wider range of behaviors and hidden functionality could be elicited via activation steering compared to via existing methods of finding adversarial examples, however I am much less certain about this.

Overall, it’s just another tool to consider adding to our evaluation / red-teaming toolbox.

# Red-teaming language models via activation engineering

I add the steering vector at every token position after the prompt, so in this way, it differs from the original approach in “Steering GPT-2-XL by adding an activation vector”. Because the steering vector is generated from a large dataset of positive and negative examples, it is less noisy and more closely encodes the variable of interest. Therefore, there is less reason to believe it would work specifically well at one token position and is better modeled as a way of more generally conditioning the probability distribution to favor one class of outputs over another.

I think this is unlikely given my more recent experiments capturing the dot product of the steering vector with generated token activations in the normal generation model and comparing this to the directly decoded logits at that layer. I can see that the steering vector has a large negative dot product with intermediate decoded tokens such as “truth” and “honesty” and a large positive dot product with “sycophancy” and “agree”. Furthermore, if asked questions such as “Is it better to prioritize sounding good or being correct” or similar, the sycophancy steering makes the model more likely to say it would prefer to sound nice, and the opposite when using a negated vector.

# The Low-Hanging Fruit Prior and sloped valleys in the loss landscape

# Understanding and visualizing sycophancy datasets

ah yes thanks

Here is an eval on questions designed to elicit sycophancy I just ran on layers 13-30, steering on the RLHF model. The steering vector is added to all token positions after the initial prompt/question.

The no steering point is plotted. We can see that steering at layers 28-30 has no effect on this dataset. It is also indeed correct that steering in the negative direction is much less impactful than in the positive direction. However, I think that in certain settings steering in the negative direction does help truthfulness.

I will run more evals on datasets that are easy to verify (e.g., multiple choice option questions) to gain more data on this.

The method described does not explicitly compute the full Hessian matrix. Instead, it derives the top eigenvalues and eigenvectors of the Hessian. The implementation accumulates a large batch from a dataloader by concatenating

of the typical batch size. This is an approximation to estimate the genuine loss/gradient on the complete dataset more closely. If you have a large and high-variance dataset, averaging gradients over multiple batches might be better. This is because the loss calculated from a single, accumulated batch may not be adequately representative of the entire dataset’s true loss.**n_batches**

Only tangentially related, but your intuition about polynomial regression is not quite correct. A large range of polynomial regression learning tasks will display double descent where adding more and more higher degree polynomials consistently improves loss past the interpolation threshold.

Examples from here:

Relevant paper