A quick technical question: In the comparison to fine-tuning results in Section 6 where you stack CAA with fine-tuning, do you find a new steering vector after each fine-tune, or are you using the same steering vector for all fine-tuned models? My guess is you’re doing the former as it’s likely to be more performant, but I’d be interested to see what happens if you try to do the latter.
A quick technical question: In the comparison to fine-tuning results in Section 6 where you stack CAA with fine-tuning, do you find a new steering vector after each fine-tune, or are you using the same steering vector for all fine-tuned models? My guess is you’re doing the former as it’s likely to be more performant, but I’d be interested to see what happens if you try to do the latter.
We used the same steering vectors, derived from the non fine-tuned model