For some applications, you may want to express an intervention in terms of the model's own abstractions.
It seems like this applies to some kinds of activation steering (e.g., steering on SAE features) but not really to others (e.g., contrastive prompts); curious whether you would agree.
Perhaps. I see where you are coming from. Though I think it's possible that contrastive-prompt-based vectors (e.g., CAA) also approximate "natural" features better than training on those same prompts would (fewer degrees of freedom, with the correct inductive bias). I should check whether there has been new research on this…
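For readers unfamiliar with the contrastive approach being discussed: a CAA-style steering vector is typically computed as the mean difference in hidden-state activations between paired "positive" and "negative" prompts at a chosen layer. A minimal sketch, using random arrays as stand-ins for real model activations (the dimensions and scaling factor here are illustrative, not from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16   # hidden-state width (illustrative)
n_pairs = 8    # number of contrastive prompt pairs

# Stand-ins for residual-stream activations collected at one layer
# on positive vs. negative prompts.
pos_acts = rng.normal(size=(n_pairs, d_model))
neg_acts = rng.normal(size=(n_pairs, d_model))

# The steering vector is the mean activation difference across pairs.
steering_vec = (pos_acts - neg_acts).mean(axis=0)

# At inference time, a scaled copy is added to the hidden state.
alpha = 4.0  # steering strength (illustrative)
hidden = rng.normal(size=d_model)
steered = hidden + alpha * steering_vec
print(steering_vec.shape)
```

The point at issue above is whether this averaging, precisely because it has so few free parameters, lands closer to a "natural" feature direction than a model fine-tuned on the same contrastive data would.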
Thanks! If you find research that addresses that question, I’d be interested to know about it.