steering the model using directions in activation space is more valuable than doing the same with weights, because in the future the consequences of cognition might be far-removed from its weights (deep deceptiveness)
(You linked to “deep deceptiveness,” and I’m going to assume it is related to self-deception (discussed in the academic literature and in the AI and evolution paper). If it isn’t, then this point is still relevant for alignment, since self-deception is another internal hazard.)

I think one could argue that self-deception could in some instances be spotted more easily in the weights than in the activations. Often the functionality acquired by self-deception is not activated, but it may be more readily apparent in the weights. Hence I don’t see this as a strong reason to dismiss https://arxiv.org/abs/2212.04089. I would want both a weight version and an activation version of a method; they tend to have different strengths.

Note: if you want to keep track of safety papers outside of LW/AF, papers including https://arxiv.org/abs/2212.04089 are tweeted on https://twitter.com/topofmlsafety and posted on https://www.reddit.com/r/mlsafety.

Edit: I see passive disagreement but no refutation. The argument against weights was of the form “here’s a strength activations have”; for that to be enough to dismiss the paper without discussion, it would have to be an extremely strong property that outweighs all of the paper’s potential merits, or a Pareto improvement over weights. Neither seems corroborated or at all obvious.
I personally don’t “dismiss” the task vector work. I didn’t read Thomas as dismissing it by not calling it the concrete work he is most excited about—that seems like a slightly uncharitable read?

I, personally, think the task vector work is exciting. Back in Understanding and controlling a maze-solving policy network, I wrote (emphasis added):

I’m highly uncertain about the promise of activation additions. I think their promise ranges from pessimistic “superficial stylistic edits” to optimistic “easy activation/deactivation of the model’s priorities at inference time.” In the optimistic worlds, activation additions do enjoy extreme advantages over task vectors, like accessibility of internal model properties which aren’t accessible to finetuning (see the speculation portion of the post). In the very pessimistic worlds, activation additions are probably less directly important than task vectors.

I don’t know what world we’re in yet.
Note that task vectors require finetuning. From the newly updated related work section:
Lastly, Editing Models with Task Arithmetic explored a “dual” version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. While this seems interesting, task arithmetic requires finetuning. In Activation additions have advantages over (RL/supervised) finetuning, we explain the advantages our approach has over finetuning.
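To make the weight-space recipe concrete, here is a minimal sketch of task arithmetic in PyTorch. This is my own illustrative code, not Ilharco et al.’s implementation; task_vector and apply_task_vectors are hypothetical helpers, and the two models are assumed to share an architecture:

```python
import torch

def task_vector(base_model, finetuned_model):
    """Weight-space difference induced by finetuning on one task."""
    base = base_model.state_dict()
    tuned = finetuned_model.state_dict()
    return {name: tuned[name] - base[name] for name in base}

def apply_task_vectors(model, vectors, coeffs):
    """Add scaled task vectors to a model's weights; a negative
    coefficient subtracts (negates) the corresponding task."""
    state = {name: p.clone() for name, p in model.state_dict().items()}
    for vec, c in zip(vectors, coeffs):
        for name, delta in vec.items():
            if delta.is_floating_point():  # skip integer buffers
                state[name] += c * delta
    model.load_state_dict(state)
    return model
```

Note the workflow this implies: each task vector costs a full finetuning run to produce, after which the vectors compose by simple addition and subtraction.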
Deep deceptiveness is not quite self-deception. I agree that there are some circumstances where defending against self-deception favors weight methods, but these seem uncommon.
I thought briefly about the Ilharco et al. paper and am very impressed by it as well.
Thanks for linking to the resources.
I don’t have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.
Weight vectors are derived through finetuning. Insofar as you thought activation additions are importantly better than finetuning in some respects, and were already thinking about finetuning (e.g. via RLHF) when writing about why you were excited about activation additions, I don’t see how this paper changes the balance very much? (I wrote up my thoughts in Activation additions have advantages over (RL/supervised) finetuning.)
I think the main additional piece of information the paper provides is the composability of finetuned edits, which unlocks a space of finetuning configurations that grows exponentially with the number of composable edits (with k edits, each of which can be added, subtracted, or omitted, there are already 3^k discrete configurations, before even considering scaling coefficients). But I personally noted that finetuning enjoys this benefit in the original version of the post.
There’s another strength which I hadn’t mentioned in my writing: if you can finetune in the opposite direction of the intended behavior (e.g. somehow make a model less honest) and then subtract that task vector, you can maybe increase honesty, even if you couldn’t naively finetune that honesty into the model.[1]
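As a toy illustration of that negation trick, here is a sketch reusing the hypothetical task_vector/apply_task_vectors helpers from above (dishonest_finetuned_model is an assumed stand-in for a model finetuned toward the unwanted behavior):

```python
# Finetune toward the *unwanted* behavior, then subtract that direction
# from the base weights to push the model the opposite way.
dishonesty_vec = task_vector(base_model, dishonest_finetuned_model)
more_honest = apply_task_vectors(base_model, [dishonesty_vec], coeffs=[-1.0])
```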
But, in a sense, task vectors are “still in the same modalities we’re used to.” Activation additions jolted me because they’re just… a new way[2] of interacting with models! There’s been way more thought and research put into finetuning and its consequences, relative to activation engineering and its alignment implications. I personally expect activation engineering to open up a lot of affordances for model-steering.

[1] This is a kinda sloppy example because “honesty” probably isn’t a primitive property of the network’s reasoning. Sorry.
To be very clear about the novelty of our contributions, I’ll quote the “Summary of relationship to prior work” section:
We are not the first to steer language model behavior by adding activation vectors to residual streams. However, we are the first to do so without using machine optimization (e.g. SGD) to find the vectors. Among other benefits, our “activation addition” methodology enables much faster feedback loops than optimization-based activation vector approaches.
But this “activation engineering” modality is relatively new, and relatively unexplored, especially in its alignment implications. I found and cited two papers adding activation vectors to LMs to steer them, from 2022 and 2023.
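For contrast with the weight-space sketch above, here is a minimal sketch of an optimization-free activation addition on GPT-2, in the spirit of the quoted method. The layer index, coefficient, and prompts are illustrative stand-ins rather than the post’s tuned settings, and this hook-based code is my own, not the post’s implementation:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER, COEFF = 6, 4.0  # illustrative choices

@torch.no_grad()
def resid(prompt):
    """Residual-stream activations just after block LAYER."""
    ids = tok(prompt, return_tensors="pt").input_ids
    return model(ids, output_hidden_states=True).hidden_states[LAYER + 1]

# Steering vector: activation difference on a contrast pair of prompts,
# found with two forward passes and no SGD. (Assumes both prompts
# tokenize to the same number of tokens.)
steer = COEFF * (resid(" Love") - resid(" Hate"))

def hook(module, inputs, output):
    h = output[0]  # residual stream, shape (batch, seq, d_model)
    h[:, : steer.shape[1]] += steer.to(h.dtype)  # add at the first positions
    return (h,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("I hate you because", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, use_cache=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0]))
handle.remove()
```

(use_cache=False keeps the hook operating on the full sequence at every decoding step, so the steering vector stays applied to the prompt’s first positions throughout generation.)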