Nina Rimsky
suppose we have a 500,000-degree polynomial, and that we fit this to 50,000 data points. In this case, we have 450,000 degrees of freedom, and we should by default expect to end up with a function which generalises very poorly. But when we train a neural network with 500,000 parameters on 50,000 MNIST images, we end up with a neural network that generalises well. Moreover, adding more parameters to the neural network will typically make generalisation better, whereas adding more parameters to the polynomial is likely to make generalisation worse.
Only tangentially related, but your intuition about polynomial regression is not quite correct. A large range of polynomial regression learning tasks will display double descent, where adding more and more higher-degree polynomial terms consistently improves test loss past the interpolation threshold.
Examples from here: see the relevant paper.
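To make this concrete, here is a minimal sketch of the kind of experiment that exhibits the effect: minimum-norm least-squares fits with Legendre polynomial features, sweeping the degree past the interpolation threshold. The target function, noise level, and sample size are my own illustrative choices, not taken from the paper.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
n_train, n_test = 20, 1000
target = lambda x: np.sin(2 * np.pi * x)

x_tr = rng.uniform(-1, 1, n_train)
y_tr = target(x_tr) + 0.1 * rng.standard_normal(n_train)
x_te = np.linspace(-1, 1, n_test)

for degree in [2, 5, 10, 15, 19, 25, 50, 200, 1000]:
    Phi_tr = legendre.legvander(x_tr, degree)   # (n_train, degree + 1) features
    Phi_te = legendre.legvander(x_te, degree)
    # Minimum-norm least-squares fit: past the interpolation threshold
    # (degree ~ n_train) this implicit bias is what drives the second descent.
    w, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    test_mse = np.mean((Phi_te @ w - target(x_te)) ** 2)
    print(f"degree={degree:5d}  test MSE={test_mse:.4f}")
```

In runs of this kind, the test error typically spikes around degree ≈ n_train and then comes back down as the degree keeps growing, since the minimum-norm interpolant gets smoother rather than wilder.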
We do weight editing in the RepE paper (that’s why it’s called RepE instead of ActE)
I looked at the paper again and couldn’t find anywhere where you do the type of weight-editing this post describes (extracting a representation and then changing the weights without optimization such that they cannot write to that direction).
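For concreteness, here is a minimal sketch of the kind of optimization-free weight edit I mean: projecting a chosen residual-stream direction out of a matrix that writes to the stream. Shapes and names are illustrative, not taken from either the paper or this post.

```python
import torch

def ablate_direction_from_weight(W: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Return an edited copy of W whose outputs have no component along d.

    W: weight of a module that writes to the residual stream, used as y = W @ x,
       shape (d_model, d_in).
    d: direction in the residual stream, shape (d_model,).
    """
    d = d / d.norm()
    # (I - d d^T) W zeroes the component of every possible output along d,
    # so the edited module can no longer write to that direction --
    # no finetuning or optimization involved.
    return W - torch.outer(d, d @ W)
```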
The LoRRA approach mentioned in RepE finetunes the model to change representations, which is different.
I agree you investigate a bunch of the things I mentioned somewhere in the paper, but did you do this for refusal-removal in particular? I spent some time on this problem before and noticed that full refusal ablation is hard unless you get the technique/vector right, even though it’s easy to reduce refusal or add in a bunch of extra refusal. That’s why investigating all the technique parameters in the context of refusal in particular is valuable.
I think drugs and non-standard lifestyle choices are a contributing factor. Messing with one’s biology / ignoring the default lifestyle in your country to do something very non-standard is riskier and less likely to turn out well than many imagine.
Profound!
Personal anecdote so obviously all n=1 caveats apply—I took light iron supplementation for a few months (one Spatone sachet per day) and it completely changed my life. Before, I could not run more than a mile, in about 10 minutes, without collapsing. I got winded going up stairs and was often physically fatigued (although I had no other mental or non-fitness-related physical symptoms). After a few months of iron and no other lifestyle changes, I could run for an hour at 8 min/mile pace. I have since stopped taking the supplements and the benefits have lasted for 2 years. If you have mild iron deficiency, I really do suggest addressing it, as the lifestyle gains could be really big, and I can recommend Spatone iron water as a delivery mechanism with fewer side effects.
Agree with this post.
Another way I think about this is, if I have a strong reason to believe my audience will interpret my words as X, and I don’t want to say X, I should not use those words. Even if I think the words are the most honest/accurate/correct way of precisely conveying my message.
People on LessWrong have a high honesty and integrity bar, but the same language conveys different info in other contexts and may therefore be de facto less honest in those contexts.
This being said, I can see a counterargument: it is fundamentally more honest if people use consistent language and don’t adapt their language to the scenario, since it is easier for another agent to model, and extract facts from, a truthful agent using consistent language than a truthful agent using adaptive language.
I think another oversight here was not using the system prompt for this. We used a constant system prompt of “You are a helpful, honest and concise assistant” across all experiments, and in hindsight I think this made the results stranger by having “honesty” in the prompt by default all the time. Instead, we could have varied this instruction for the comparison-to-prompting case and left it empty otherwise. This is something I would change in future replications.
Here is an eval I just ran on questions designed to elicit sycophancy, steering the RLHF model at layers 13-30. The steering vector is added to all token positions after the initial prompt/question.
The no-steering point is plotted as a baseline. We can see that steering at layers 28-30 has no effect on this dataset. It is indeed correct that steering in the negative direction is much less impactful than in the positive direction; however, I think that in certain settings steering in the negative direction does help truthfulness.
I will run more evals on datasets that are easy to verify (e.g., multiple-choice questions) to gain more data on this.
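For reference, this is roughly how “added to all token positions after the prompt” can be implemented with a forward hook. This is a simplified sketch, not the exact code used; model, layer_idx, steering_vector, and prompt_len are assumed to already exist.

```python
import torch

def make_steering_hook(steering_vector: torch.Tensor, start_pos: int):
    def hook(module, inputs, output):
        # HF Llama decoder blocks return a tuple whose first element is the
        # hidden states, shape (batch, seq_len, d_model).
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] > 1:
            # full prompt pass: only steer positions from start_pos onward
            hidden[:, start_pos:, :] += steering_vector
        else:
            # cached generation: every new token is already past the prompt
            hidden += steering_vector
        return output
    return hook

# handle = model.model.layers[layer_idx].register_forward_hook(
#     make_steering_hook(steering_vector, start_pos=prompt_len))
# ...run generation...
# handle.remove()
```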
Here are some initial eval results from 200 TruthfulQA questions. I scored the answers using GPT-4. The first chart uses a binary correct/incorrect measure, whereas the second uses a graded score that reflects how close each answer is to being correct.
I plan to run more manual evals and test on llama-2-7b-chat next week.
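For those curious, the scoring is of roughly this shape: an LLM-judge grader that returns either a binary verdict or a graded closeness score. The prompt wording and model string below are illustrative, not the exact ones used.

```python
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()

def grade_answer(question: str, reference: str, answer: str, graded: bool) -> str:
    """Ask a judge model for a binary verdict or a graded closeness score."""
    scale = ("a single number from 0 to 10 for how close the answer is to the "
             "reference" if graded else "exactly one word: correct or incorrect")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (f"Question: {question}\nReference answer: {reference}\n"
                        f"Model answer: {answer}\nRespond with {scale}."),
        }],
    )
    return response.choices[0].message.content.strip()
```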
Strong agree, and I like your breakdown of costs. Most good work in the world is done without the vision of “saving humanity” or “achieving a flourishing utopia” or similar in mind. Although these are fun things to think about and useful/rational motivations, grand narratives are not the sole source of good/rational/useful motivations and should not be a prerequisite for receiving grants.
The wrapper modules simply wrap existing submodules of the model, and call whatever they are wrapping (in this case self.attn) with the same arguments, and then save some state / do some manipulation of the output. It’s just the syntax I chose to use to be able to both save state from submodules, and manipulate the values of some intermediate state. If you want to see exactly how that submodule is being called, you can look at the llama huggingface source code. In the code you gave, I am adding some vector to the hidden_states returned by that attention submodule.
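A stripped-down version of the pattern (illustrative, not the exact code from the repo):

```python
import torch.nn as nn

class AttnWrapper(nn.Module):
    """Wraps an attention submodule: calls it with the same arguments,
    saves its output, and optionally adds a vector to it."""

    def __init__(self, attn):
        super().__init__()
        self.attn = attn
        self.saved_output = None
        self.add_to_output = None  # e.g. a steering vector, or None

    def forward(self, *args, **kwargs):
        output = self.attn(*args, **kwargs)
        hidden_states = output[0]  # llama attention returns a tuple
        if self.add_to_output is not None:
            hidden_states = hidden_states + self.add_to_output
        self.saved_output = hidden_states
        return (hidden_states,) + output[1:]

# layer.self_attn = AttnWrapper(layer.self_attn)  # swap the wrapper in
```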
Walking a very long distance (15km+), preferably in a not too exciting place (eg residential streets, fields), while thinking, maybe occasionally listening to music to reset. Works best in daylight but when it’s not too bright and sunny and not too warm / cold.
In many cases, it seems the model is correctly mixing the concepts in some subjective sense. This is more visible in the feeling prediction task, for instance when the concepts of victory and injury are combined into a notion of overcoming adversity. However, testing this with larger LMs would give us a better idea of how well this holds up with more complex combinations. The rigor could also be improved by using a more advanced LM, such as GPT-4, to assess how well the concepts were combined and return some sort of score.
I tested merging the streams at a few different layers in the transformer encoder. The behavior differed depending on where you merged, and it would be interesting to assess these differences more systematically. However, anecdotally, combining at later points produced better concept merging, whereas combining earlier was more likely to create strange juxtapositions.
For example:
Mixing the activations in the feeling prediction task of “Baking a really delicious banana cake” and “Falling over and injuring yourself while hiking”:
After block 1/12:
Just original input: The person would likely feel a sense of accomplishment, satisfaction, and happiness in creating a truly special and delicious cake.
With 1.5x mixing activation: Feelings of pain, disappointment, and disappointment due to the unexpected injury.
With 10x mixing activation: Feelings of pain, annoyance, and a sense of self-doubt.
After block 12/12:
Just original input: The person would likely feel a sense of accomplishment, satisfaction, and happiness in creating a truly special and delicious cake.
With 1.5x mixing activation: Feelings of pain, surprise, and a sense of accomplishment for overcoming the mistake.
With 10x mixing activation: A combination of pain, shock, and a sense of loss due to the unexpected injury.
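For reference, a minimal sketch of the kind of mixing described above. Names like embed and encoder_blocks are placeholders for the actual model internals; attention masks and padding to equal sequence length are glossed over.

```python
import torch

def encode_with_mixing(embed, encoder_blocks, tokens_a, tokens_b,
                       merge_at: int, mix: float) -> torch.Tensor:
    """Run prompt A through the encoder, adding `mix` times prompt B's hidden
    states into A's stream after block `merge_at` (mix = 1.5 or 10 above)."""
    h_a, h_b = embed(tokens_a), embed(tokens_b)
    for i, block in enumerate(encoder_blocks):
        h_a, h_b = block(h_a), block(h_b)
        if i == merge_at:
            h_a = h_a + mix * h_b
    return h_a  # then decode from the mixed encoding as usual
```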
I agree that approximating the PBO makes this method more lossy (not all interesting generalization phenomena can be found). However, I think we can still glean useful information about generalization by considering “retraining” from a point closer to the final model than random initialization. The downside is if, for example, some data was instrumental in causing a phase transition at some point in training, this will not be captured by the PBO approximation.
Indeed, the paper concedes:
Influence functions are approximating the sensitivity to the training set locally around the final weights and might not capture nonlinear training phenomena
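For reference, the standard first-order influence-function approximation being referred to (the textbook form, e.g. Koh & Liang 2017, not a formula quoted from the Anthropic paper) is

$$
\mathcal{I}(z,\, z_{\text{test}}) \;\approx\; -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}\, H_{\hat\theta}^{-1}\, \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta),
$$

where everything is evaluated at the final weights $\hat\theta$, which is exactly the “local around the final weights” caveat.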
Purely empirically, I think Anthropic’s results indicate there are useful things that can be learnt, even via this local approximation:
One of the most consistent patterns we have observed is that the influential sequences reflect increasingly sophisticated patterns of generalization as the model scale increases. While the influential sequences for smaller models tend to have short overlapping sequences of tokens, the top sequences for larger models are related at a more abstract thematic level, and the influence patterns show increasing robustness to stylistic changes, including the language.
My intuition here is that even if we are not exactly measuring the counterfactual “what if this datum was not included in the training corpus?”, we could be estimating “what type of useful information is the model extracting from training data that looks like this?”.
This is really cool work!!
In other experiments we’ve run (not presented here), the MSP is not well-represented in the final layer but is instead spread out amongst earlier layers. We think this occurs because in general there are groups of belief states that are degenerate in the sense that they have the same next-token distribution. In that case, the formalism presented in this post says that even though the distinction between those states must be represented in the transformer's internals, the transformer is able to lose those distinctions for the purpose of predicting the next token (in the local sense), which occurs most directly right before the unembedding.
Would be interested to see analyses where you show how an MSP is spread out amongst earlier layers.
Presumably, if the model does not discard intermediate results, something like concatenating residual stream vectors from different layers and then linearly correlating with the ground truth belief-state-over-HMM-states vector extracts the same kind of structure you see when looking at the final layer. Maybe even with the same model you analyze, the structure will be crisper if you project the full concatenated-over-layers resid stream, if there is noise in the final layer and the same features are represented more cleanly in earlier layers?
In cases where redundant information is discarded at some point, this is a harder problem of course.
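Concretely, something like the following sketch of the probe I have in mind; resid_by_layer and belief_states are assumed to already be collected.

```python
import torch

# `resid_by_layer`: list of (n_samples, d_model) activations, one per layer,
# taken at the same token position; `belief_states`: (n_samples, n_states)
# ground-truth belief-state vectors. Both assumed to exist already.
X = torch.cat(resid_by_layer, dim=-1)                   # (n_samples, n_layers * d_model)
X = torch.cat([X, torch.ones(X.shape[0], 1)], dim=-1)   # bias column
W = torch.linalg.pinv(X) @ belief_states                # one linear probe over all layers
pred = X @ W
ss_res = ((pred - belief_states) ** 2).sum()
ss_tot = ((belief_states - belief_states.mean(0)) ** 2).sum()
print("R^2 of concatenated-layers probe:", (1 - ss_res / ss_tot).item())
```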
We used the same steering vectors, derived from the non fine-tuned model
Yes, this is almost correct. The test task had the A/B question followed by "My answer is (" after the end instruction token, and the steering vector was added to every token position after the end instruction token, so to all of "My answer is (".
The presence of a pre-order doesn’t inherently imply a composition of subagents with ordered preferences. An agent can have a pre-order of preferences due to reasons such as lack of information, indifference between choices, or bounds on computation—this does not necessitate the presence of subagents.
If we do not use a model based on composition of subagents with ordered preferences, in the case of “Atticus the Agent” it can be consistent to switch B → A + $1 and A → B + $1.
Perhaps I am misunderstanding the claim being made here though.
FWIW I published this Alignment Forum post on activation steering to bypass refusal (albeit an early variant that reduces coherence too much to be useful), which from what I can tell is the earliest work on linear residual-stream perturbations to modulate refusal in RLHF LLMs.
I think this post is novel compared to both my work and RepE because the authors:
Demonstrate full ablation of the refusal behavior with much less effect on coherence / other capabilities compared to normal steering
Investigate projection thoroughly as an alternative to sweeping over vector magnitudes (rather than just stating that this is possible); see the sketch after this list
Find that using harmful/harmless instructions (rather than harmful vs. harmless/refusal responses) to generate a contrast vector is the most effective (whereas other works try one or the other), and also investigate which token position to extract the representation from
Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer, which is different from standard activation steering
Test on many different models
Describe a way of turning this into a weight-edit
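(The sketch referenced in the list above: projecting the feature out of the activations, in contrast to adding a scaled vector, needs no magnitude sweep. A minimal, hypothetical illustration:)

```python
import torch

def project_out(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each activation along `direction`.
    Unlike adding -alpha * direction, this needs no sweep over alpha:
    the feature is zeroed however strongly it was expressed."""
    d = direction / direction.norm()
    return hidden - (hidden @ d).unsqueeze(-1) * d
```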
Edit:
(Want to flag that I strong-disagree-voted with your comment, and am not in the research group—it is not them “dogpiling”)
I do agree that RepE should be included in a “related work” section of a paper but generally people should be free to post research updates on LW/AF that don’t have a complete thorough lit review / related work section. There are really very many activation-steering-esque papers/blogposts now, including refusal-bypassing-related ones, that all came out around the same time.