This seems really cool, and I’m excited about it. What I would hope to see next is a demonstration that this method can be used to modify the transformer in a way that induces a predictable change in the network’s behavior. For example, if you identify a certain type of behavior, such as toxicity or discussion of certain topics, can you use these interpretations to guide weight updates so that the model no longer produces that kind of output, as judged by a classifier for it?
We have some preliminary results on this towards the end of the post/colab, in the ‘directly editing SVD directions’ section. We are currently working on improving these and on comparing them to other methods such as ROME edits.
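For anyone who hasn't opened the colab, here is a rough sketch of what editing an SVD direction could look like in the simplest case: decompose a weight matrix, zero out one singular value, and rebuild the matrix. This is not the code from the post. The function name `ablate_singular_direction` is hypothetical, and a random matrix stands in for a real transformer weight matrix.

```python
import numpy as np


def ablate_singular_direction(weight, index):
    """Return a copy of `weight` with one singular direction removed.

    Hypothetical helper for illustration only, not the post's implementation.
    """
    # Thin SVD: weight = u @ diag(s) @ vt
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    s_edited = s.copy()
    s_edited[index] = 0.0  # drop the chosen direction
    # Scaling the columns of u by s_edited is the same as u @ diag(s_edited)
    return (u * s_edited) @ vt


# Assumption: a random matrix stands in for a real weight matrix.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))
W_edited = ablate_singular_direction(W, index=0)
```

After the edit, inputs along the removed right singular vector map to zero, while inputs along the other singular directions are mapped exactly as before. Whether that changes model behavior in the intended way is the empirical question raised above.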