Nina Rimsky
Steering Llama-2 with contrastive activation additions
Reducing sycophancy and improving honesty via activation steering
Investigating the learning coefficient of modular addition: hackathon project
Consider giving money to people, not projects or organizations
On household dust
A framing for interpretability
Influence functions—why, what and how
Modulating sycophancy in an RLHF model via activation steering
Red-teaming language models via activation engineering
Activation adding experiments with llama-7b
The challenge of articulating tacit knowledge
Understanding and visualizing sycophancy datasets
Passing the ideological Turing test? Arguments against existential risk from AI.
Comparing representation vectors between llama 2 base and chat
Decoding intermediate activations in llama-2-7b
Recipe: Hessian eigenvector computation for PyTorch models
Activation adding experiments with FLAN-T5
suppose we have a 500,000-degree polynomial, and that we fit this to 50,000 data points. In this case, we have 450,000 degrees of freedom, and we should by default expect to end up with a function which generalises very poorly. But when we train a neural network with 500,000 parameters on 50,000 MNIST images, we end up with a neural network that generalises well. Moreover, adding more parameters to the neural network will typically make generalisation better, whereas adding more parameters to the polynomial is likely to make generalisation worse.
Only tangentially related, but your intuition about polynomial regression is not quite right: a large range of polynomial regression learning tasks displays double descent, where adding more and more higher-degree terms consistently improves test loss past the interpolation threshold.
Examples from here. Relevant paper.
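A minimal sketch of that effect (my own illustration, not from the linked examples; the Legendre basis, noise scale, and minimum-norm solver are all assumptions): fit polynomials of increasing degree to 20 noisy points with minimum-norm least squares, and watch test error spike near the interpolation threshold (degree 19 here) and then fall again as the degree keeps growing.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train = 20
x_train = rng.uniform(-1, 1, n_train)
y_train = np.sin(np.pi * x_train) + 0.1 * rng.normal(size=n_train)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(np.pi * x_test)

def design(x, degree):
    # Legendre features are far better conditioned than raw monomials
    return np.polynomial.legendre.legvander(x, degree)

for degree in [5, 10, 19, 50, 200, 1000]:
    A = design(x_train, degree)
    # For underdetermined systems, lstsq returns the minimum-norm solution;
    # that implicit regularization is what drives the second descent
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    pred = design(x_test, degree) @ coef
    print(f"degree {degree:4d}: test MSE = {np.mean((pred - y_test) ** 2):.4f}")
```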
We do weight editing in the RepE paper (that’s why it’s called RepE instead of ActE)
I looked at the paper again and couldn't find anywhere that you do the type of weight editing this post describes (extracting a representation and then changing the weights, without optimization, so that the layers can no longer write to that direction).
The LoRRA approach mentioned in RepE finetunes the model to change its representations, which is different.
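For concreteness, here is my reading of the optimization-free weight edit in question, as a hedged sketch (the function name and the (d_model, d_in) shape convention are mine, not from either paper): given a unit direction v extracted from activations, left-multiply each matrix that writes into the residual stream by (I − vvᵀ), so nothing can be written along v.

```python
import torch

def ablate_direction_from_weights(W: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Project the output of W off the direction v, with no optimization.

    W has shape (d_model, d_in) and writes into the residual stream;
    after the edit, W' @ x has zero component along v for every input x.
    """
    v = v / v.norm()
    return W - torch.outer(v, v) @ W  # i.e. (I - v v^T) @ W
```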
FWIW, I published this Alignment Forum post on activation steering to bypass refusal (albeit an early variant that reduces coherence too much to be useful), which, from what I can tell, is the earliest work on linear residual-stream perturbations to modulate refusal in RLHF LLMs.
I think this post is novel compared to both my work and RepE because the authors:
Demonstrate full ablation of the refusal behavior with much less effect on coherence / other capabilities compared to normal steering
Investigate projection thoroughly as an alternative to sweeping over vector magnitudes (rather than just stating that this is possible)
Find that using harmful/harmless instructions (rather than harmful vs. harmless/refusal responses) to generate the contrast vector is most effective (whereas other works try one or the other), and also investigate the token position at which to extract the representation
Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer, which differs from standard activation steering (a minimal sketch of this pipeline follows the list)
Test on many different models
Describe a way of turning this into a weight-edit
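A minimal sketch of the contrast-vector and all-layer-projection points above, under my own assumptions (a HuggingFace Llama-style model exposing `model.model.layers`, and a difference-in-means vector taken at a single token position): extract the direction from harmful vs. harmless instructions, then project it out of the residual stream at every layer with forward hooks.

```python
import torch

@torch.no_grad()
def contrast_vector(model, tokenizer, harmful, harmless, layer, pos=-1):
    # Difference in mean residual-stream activations at one token position
    def mean_act(prompts):
        acts = []
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt").input_ids
            out = model(ids, output_hidden_states=True)
            acts.append(out.hidden_states[layer][0, pos])
        return torch.stack(acts).mean(0)
    return mean_act(harmful) - mean_act(harmless)

def add_projection_hooks(model, v):
    # Project the same linear direction out of every layer's output
    v = v / v.norm()
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h - (h @ v).unsqueeze(-1) * v
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return [layer.register_forward_hook(hook) for layer in model.model.layers]
```

(The returned handles can be removed afterwards with `handle.remove()`; which layer and token position to extract from is the kind of thing the post investigates.)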
Edit: Want to flag that I strong-disagree-voted your comment, and that I am not in the research group (it is not them “dogpiling”).
I do agree that RepE should be included in a “related work” section of a paper, but in general people should be free to post research updates on LW/AF that don’t have a complete, thorough lit review / related-work section. There are really very many activation-steering-esque papers/blog posts now, including refusal-bypassing-related ones, that all came out around the same time.