I’m curious whether you have an opinion on the relationship between this work and the fine-tuning experiments in the recent Persona Vectors paper.
In this work, you find vectors for concepts you don’t want the model to use and ablate them during fine-tuning, so the model never learns to rely on those concepts. In the Persona Vectors work, the authors find vectors for concepts they don’t want the model to use and instead add them during fine-tuning, so the model doesn’t need to learn to produce those concepts itself. Interestingly, these apparently opposite interventions lead to similar outcomes.
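(For concreteness, here’s a minimal sketch of how I’m picturing the two interventions on residual-stream activations during fine-tuning. The function names, the `concept_vec` tensor, and the steering coefficient are placeholders for illustration, not either paper’s actual code.)

```python
import torch

# concept_vec: a unit-norm residual-stream direction for the unwanted concept,
# assumed to have been extracted already (e.g. via a difference-of-means probe).

def ablation_hook(acts: torch.Tensor, concept_vec: torch.Tensor) -> torch.Tensor:
    """This work (as I read it): project the concept direction out of the
    activations during fine-tuning, so gradients can't reinforce its use."""
    coeff = acts @ concept_vec                       # (batch, seq) projection coefficients
    return acts - coeff.unsqueeze(-1) * concept_vec  # remove the concept component

def steering_hook(acts: torch.Tensor, concept_vec: torch.Tensor,
                  alpha: float = 5.0) -> torch.Tensor:
    """Persona Vectors preventative steering (as I read it): add the concept
    direction during fine-tuning, so the weights never need to shift to
    produce it on their own."""
    return acts + alpha * concept_vec
```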
Do you think there are connections between the mechanisms by which these two methods work? And do you have thoughts on when one technique might be better or worse than the other?