Thanks for being so concrete! I think this example is overly complicated.
Let me give a biological analogy for why activation additions are different. A person is hungry. They learn to go towards the donut shop instead of the train station when they’re hungry. They learn to do this extremely reliably, so that they always make that choice when at that intersection. They’ve also learned to seek out donuts more in general. However, there’s no more learning they can do[1] if they’re just being reinforced for this one choice.
However, conditional on the person being closer to the donuts, they’re even more in the mood to eat donuts (because they can smell the donuts)! And conditional on the person going to the train station, they’re less in the mood to grab a donut (since they’re farther away now). So if you could add in this “donut vector” (mood when close to donuts minus mood when far from donuts), you could make the person be more in the mood to eat donuts in general! Even just using the same intersection as training data! You can do this even if they’ve learned to always go to get donuts at that given intersection. The activation addition provides orthogonal benefit to the “finetuning.”
One (possibly wrong) model for LLMs is “the model is in ‘different moods’ in different situations, and activation additions are particularly good at modifying moods/intentions.” So you e.g. make the model more in the mood to be sycophantic. You make it use its existing machinery differently, rather than rewiring its internal subroutines (finetuning). You can even do that after you rewire the internal subroutines!
I don’t know if that’s actually a true model, but hopefully it communicates the intuition.
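To make the “mood vector” picture concrete, here’s a minimal sketch of an activation addition (not the exact setup from any paper; the model, layer index, coefficient, and contrast prompts below are all illustrative assumptions): compute the difference between a layer’s activations on two contrasting prompts, then add that difference back in at the same layer while generating.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # illustrative model choice
LAYER = 6             # which block's output to steer (an assumption)
COEFF = 4.0           # steering strength (an assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def block_output(prompt: str) -> torch.Tensor:
    """Capture the chosen block's output activations for a prompt."""
    captured = {}
    def grab(module, inputs, output):
        captured["acts"] = output[0]        # (batch, seq, hidden)
    handle = model.transformer.h[LAYER].register_forward_hook(grab)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return captured["acts"]

# The "donut vector": activations in one condition minus the other,
# averaged over token positions (a simplification) so it can be added anywhere.
steer_vec = (block_output("I love donuts").mean(dim=1)
             - block_output("I hate donuts").mean(dim=1))

def add_vector(module, inputs, output):
    # Shift the block's output toward the "donut mood" at every position.
    return (output[0] + COEFF * steer_vec,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_vector)
with torch.no_grad():
    out = model.generate(
        **tok("Standing at the intersection, I decided to", return_tensors="pt"),
        max_new_tokens=30,
        do_sample=False,
        pad_token_id=tok.eos_token_id,
    )
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```

Averaging the difference over token positions and adding it everywhere is a simplification of how such vectors are usually applied, but it keeps the sketch self-contained: nothing about the weights changes, only the “mood” the existing machinery runs in.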
...but wait a minute, couldn’t we get that info simply by prompting it and having it continue?
Not clear why that would be true?
Probably not strictly true in this situation, but it’s an analogy.