How does activation steering compare to fine-tuning on the task of transfer learning?
‘Activation steering’ consumes some in-distribution data and modifies the model to improve in-distribution performance. Note that this is exactly the transfer learning setting.
Generally, we can think of steering and fine-tuning as existing on a continuum of post-training methods, with the x-axis roughly representing how much compute is spent on post-training.
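To make the cheap end of that continuum concrete, here is a minimal sketch of one common form of steering: a difference-of-means vector added to activations. This is an assumption about what 'steering' means here, not necessarily the exact protocol intended. The activations are synthetic random lists standing in for real model activations, and all function names are hypothetical.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(pos_acts, neg_acts):
    """Difference between the mean 'positive' and mean 'negative' activation."""
    mean_pos, mean_neg = mean_vector(pos_acts), mean_vector(neg_acts)
    return [p - q for p, q in zip(mean_pos, mean_neg)]

def apply_steering(acts, vector, alpha=1.0):
    """Add the scaled steering vector to every activation vector."""
    return [[a + alpha * v for a, v in zip(row, vector)] for row in acts]

# Synthetic stand-ins for activations (assumption: real ones would come from a model layer).
rng = random.Random(0)
dim = 4
pos_acts = [[rng.gauss(1.0, 1.0) for _ in range(dim)] for _ in range(16)]
neg_acts = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(16)]

vec = steering_vector(pos_acts, neg_acts)
steered = apply_steering(neg_acts, vec)
```

With alpha = 1, the mean of the steered 'negative' activations coincides with the mean of the 'positive' ones. Nothing is trained, which is why this sits at the low-compute end of the axis.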
The natural question is then: what are the relative tradeoffs? Relevant metrics: effectiveness, selectivity, and data efficiency.
Preliminary work in a toy setting shows that steering is more effective in low-data regimes, while fine-tuning is more effective in high-data regimes. I basically expect this result to generalise directly to the language setting as well. (Note: I think Dmitrii K is working on this already.)
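One way to summarise the low-data versus high-data comparison is the crossover point: the smallest dataset size at which fine-tuning matches or beats steering. A minimal sketch, assuming both methods have already been scored at the same dataset sizes. The function name is hypothetical, and the numbers are placeholders for illustration, not results.

```python
def crossover_size(sizes, steering_scores, finetune_scores):
    """Smallest dataset size where fine-tuning scores at least as well as steering.

    Returns None if fine-tuning never catches up on the sizes provided.
    """
    for n, steer, finetune in sorted(zip(sizes, steering_scores, finetune_scores)):
        if finetune >= steer:
            return n
    return None

# Placeholder numbers only (hypothetical, NOT measured results).
sizes = [10, 100, 1000, 10000]
steering_scores = [0.60, 0.65, 0.67, 0.68]
finetune_scores = [0.40, 0.55, 0.70, 0.85]

crossover = crossover_size(sizes, steering_scores, finetune_scores)
```

Tracking how this crossover moves across tasks and base models would give a single number per setting to compare.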
Thus the value of this subsequent work will come from scaling the analysis to a more realistic setting, possibly with more detailed comparisons.
Does the ‘type’ of task matter? E.g. general capabilities tasks, reasoning tasks, agentic tasks. I think there will be more value in showing good results on harder tasks, if possible.
How do different protocols compose? E.g. does steering + fine-tuning using the same data outperform steering or fine-tuning alone?
Lastly, we can always do the standard analysis of scaling laws, both in terms of base model capabilities and the amount of post-training data provided.
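For the data axis, a minimal sketch of what such a fit could look like, assuming error follows a power law error ≈ a · n^(-b) in the number of post-training examples n. The functional form is an assumption, and the data points below are synthetic, generated from an exact power law purely to check the fit.

```python
import math

def fit_power_law(ns, errors):
    """Least-squares fit of log(error) = log(a) - b * log(n); returns (a, b)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic data from an exact power law (a = 2.0, b = 0.5), not real measurements.
ns = [10, 100, 1000, 10000]
errors = [2.0 * n ** -0.5 for n in ns]

a, b = fit_power_law(ns, errors)
```

Fitting steering and fine-tuning separately would give two exponents to compare directly. The same fit could be repeated with base model size on the x-axis.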