habryka comments on Steering Llama-2 with contrastive activation additions

habryka 2 Jan 2024 20:41 UTC
LW: 11 AF: 9
0
AF
Alex was also confident that it would stack with supervised finetuning, which was a somewhat controversial claim.
To resolve the finetuning disagreement via experiment, suppose we compute a sycophancy vector from the set of prompt pairs $S$ . What happens if we do supervised finetuning on $S$ by upweighting e.g. the sycophantic A/B token? Given the same set of data (for computing the steering vector or finetuning the model), and freezing the model except for layer 15 (where the sycophancy vector is added)—which technique is more effective?
The relevant discussion seems to be about the claim that steering vector would outperform finetuning on the task of reducing sycophancy:
You say:
I think this result is very exciting and promising. You appear to have found substantial reductions in sycophancy, beyond whatever was achieved with Meta’s finetuning, using a simple activation engineering methodology.
Ryan also says:
How do you know this is beyond what finetuning was capable of? I’d guess that Meta didn’t bother to train against obvious sycophancy and if you trained against it, then it would go away.
It seems that indeed the original claim that you (Turntrout) made here suggested that it would outperform finetuning on reducing sycophancy, and Ryan was exactly spot on. Finetuning basically gets rid of sycophancy and activation steering didn’t do anything in addition (or at least not anything we can measure with the present methodology).
I do think it’s interesting that activation steering does work on top of finetuning for increasing sycophancy, but that was not what your original comment or Ryan’s response was about. I wouldn’t have been very confident here, and in as much as I disagreed at the time (which I think I did, but I don’t remember) it was about the marginal benefit for reducing sycophancy, not increasing it (now thinking about it, it makes sense to me that there is a lot more variance in model space towards increasing sycophancy, in the same way as there are many more false theorems than correct ones, and we usually define sycophancy as saying false or at least inconsistent things in order to appear favorable to someone).
- TurnTrout 8 Jan 2024 18:58 UTC
  LW: 13 AF: 11
  4
  AF Parent
  While I disagree with a few points here, this comment sparked a nice private dialogue between me and Ryan. Here are two things I think are true:
  1. Ryan’s original comment (which I linked to in the post) was only making claims about sycophancy reduction. As you pointed out.
  2. Ryan and I had some broader disagreement about whether activation additions would have anything to add to more standard techniques like finetuning. Ryan was open to the idea of it generalizing better / stacking, but was unsure. I felt quite confident that activation additions would at least stack, and possibly even outperform finetuning.
  However, since the disagreement is not neatly captured in a single comment thread, and since plenty of folks disagreed with me about activation additions, I’m going to edit the original post to not single out Ryan’s points anymore. The comment of mine which I linked to was in fact intended to pertain to activation additions in general, and I have contemporaneous private messages with different folk where I claimed it would stack. So I’m leaving that part in as support of my confidence in this method.
  Thanks for (indirectly) facilitating this sense-making process!
  - ryan_greenblatt 8 Jan 2024 19:17 UTC
    LW: 4 AF: 3
    0
    AF Parent
    Thanks for this clarification!
  - ryan_greenblatt 8 Jan 2024 23:14 UTC
    LW: 2 AF: 1
    0
    AF Parent
    I’ve posted a shortform with a more detailed description of my views on activation additions.
- ryan_greenblatt 2 Jan 2024 21:33 UTC
  LW: 10 AF: 7
  0
  AF Parent
  
  I do think it’s interesting that activation steering does work on top of finetuning for increasing sycophancy, but that was not what your original comment or Ryan’s response was about.
  
  Also note that this is for generalizing from the multiple choice question answering version to the free response version:
  
  The fine-tuning at least generalized to other A/B questions. As a sanity check, the finetuned models achieved >95% test accuracy on outputting e.g. the sycophantic A/B response on held-out questions, which indicates the fine-tuning was effective.
  
  To compare activation addition and finetuning, we measure their generalization efficacy by having Claude 2 judge open-ended completions (remember that we just trained on different A/B outputs). “Positive finetuned” is the condition where we upweighted the sycophantic A/B response tokens, and “Negative finetuned” involved upweighting the non-sycophantic ones.
  
  The finetuning worked fine for just getting the model to answer which of A/B is more/less sycophantic.
- TurnTrout 8 Jan 2024 18:17 UTC
  LW: 2 AF: 2
  0
  AF Parent
  It seems that indeed the original claim that you (Turntrout) made here suggested that it would outperform finetuning on reducing sycophancy, and Ryan was exactly spot on. Finetuning basically gets rid of sycophancy and activation steering didn’t do anything in addition (or at least not anything we can measure with the present methodology).
  No, I wasn’t predicting it would outperform (in that comment at least), I was predicting it would stack benefits. Both finetuning and activation addition got sycophancy to ~zero in the set we tested. There was no way for activation addition to “do anything in addition.” My prediction was not falsified by this data.
  Do you want us to run the sycophancy-reduction experiment on a harder dataset so we can see? Do you expect activation additions to stop stacking? (Possibly you don’t, and are just making a local-validity pushback on my claims.)
  I’m happy to run that experiment. @ryan_greenblatt