Daniel Tan comments on Daniel Tan’s Shortform

Daniel Tan 1 Jan 2025 0:35 UTC
How does activation steering compare to fine-tuning on the task of transfer learning?
- ‘Activation steering’ consumes some in-distribution data, and modifies the model to have better in-distribution performance. Note that this is exactly the transfer learning setting.
- Generally, we can think of steering and fine-tuning as existing on a continuum of post-training methods, with the x-axis roughly representing how much compute is spent on post-training.
- It becomes pertinent to ask: what are the relative tradeoffs? Relevant metrics: effectiveness, selectivity, data efficiency.
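To make 'consumes some data and modifies the model' concrete, here is a minimal sketch of one common steering variant: a difference-of-means vector added to activations. This is an illustration under assumptions, not necessarily the method used in the toy experiments. The names `steering_vector`, `steer`, `alpha` and the synthetic Gaussian 'activations' are all hypothetical.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(target_acts, baseline_acts):
    """Difference of mean activations: target behaviour minus baseline."""
    target_mean = mean_vector(target_acts)
    baseline_mean = mean_vector(baseline_acts)
    return [t - b for t, b in zip(target_mean, baseline_mean)]


def steer(acts, vector, alpha=1.0):
    """Add the scaled steering vector to every activation vector."""
    return [[a + alpha * v for a, v in zip(row, vector)] for row in acts]


# Hypothetical stand-in data: 8 examples per class, 4 hidden dimensions.
# Target examples are shifted along dimension 0 (an assumption for this demo).
random.seed(0)
D_MODEL = 4
baseline_acts = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(8)]
target_acts = [
    [random.gauss(0, 1) + (2.0 if j == 0 else 0.0) for j in range(D_MODEL)]
    for _ in range(8)
]

vector = steering_vector(target_acts, baseline_acts)
steered = steer(baseline_acts, vector, alpha=1.0)
```

With `alpha=1.0`, the mean of the steered baseline activations matches the mean of the target activations by construction. The only 'training' is two averages over a handful of examples, which is why steering sits at the low-compute end of the continuum.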
Preliminary work in a toy setting shows that steering is more effective in low-data regimes, but fine-tuning is more effective in high-data regimes. I basically expect this result to generalise directly to the language setting as well. (Note: I think Dmitrii K is working on this already.)
Thus the value of this subsequent work will come from scaling the analysis to a more realistic setting, possibly with more detailed comparisons.
- Does the ‘type’ of task matter? General capabilities tasks, reasoning tasks, agentic tasks. I think there will be more value to showing good results on harder tasks, if possible.
- How do different protocols compose? E.g. does steering + fine-tuning on the same data outperform steering or fine-tuning alone?
- Lastly, we can always do the standard analysis of scaling laws, both in terms of base model capabilities and amount of post-training data provided.
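For the scaling-law analysis, one possible starting point is a power-law fit of error against the number of post-training examples, done separately for each method. The sketch below assumes the functional form `y = a * n**(-b)` (a common choice, but an assumption here) and uses synthetic numbers generated from that form purely to check the fitting code; they are not experimental results. `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(ns, ys):
    """Least-squares fit of y = a * n**(-b), done as a line in log-log space."""
    xs = [math.log(n) for n in ns]
    ls = [math.log(y) for y in ys]
    mean_x = sum(xs) / len(xs)
    mean_l = sum(ls) / len(ls)
    slope = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, ls)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_l - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic data drawn exactly from y = 5 * n**(-0.3); NOT real measurements.
ns = [10, 100, 1000, 10000]
ys = [5.0 * n ** -0.3 for n in ns]

a, b = fit_power_law(ns, ys)
```

Fitting one curve per method would let the comparison be phrased in terms of the fitted exponents and the data size at which the two curves cross, if they cross at all.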