Thanks for your comment!
1.
Looking at your example, “Then, David and Elizabeth were working at the school. Elizabeth had a good day. Elizabeth decided to give a bone to Elizabeth”. I’m confused. You say “duplicating the IO token in a distractor sentence”, but I thought David would be the IO here?
Am I confused about the meaning of the IO or was there just a typo in the example?
You are right, there is a typo here. The correct sentence is “Then, David and Elizabeth were working at the school. David had a good day. Elizabeth decided to give a bone to Elizabeth”
When using the corrected adversarial prompt, the probability of S (“Elizabeth”) increases while the probability of IO (“David”) decreases.
Thanks a lot for spotting the typo; we have corrected the post!
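To make the construction of the corrected adversarial prompt concrete, here is a minimal sketch in Python. The function `build_prompt` and its parameters are hypothetical names introduced only for illustration, and the template is simply the one from the example above; this is not the code used for the experiments. The prompt stops at “to” so that a model can be asked to predict the next name.

```python
def build_prompt(io_name, s_name, distractor_name=None,
                 place="school", obj="bone"):
    """Build an IOI-style prompt that ends right before the name to predict.

    io_name: the indirect object (IO), e.g. "David".
    s_name: the subject (S), e.g. "Elizabeth".
    distractor_name: if given, a distractor sentence repeating this name
        is inserted between the two clauses (hypothetical helper argument).
    """
    first = f"Then, {io_name} and {s_name} were working at the {place}."
    last = f"{s_name} decided to give a {obj} to"
    if distractor_name is None:
        return f"{first} {last}"
    distractor = f"{distractor_name} had a good day."
    return f"{first} {distractor} {last}"


# Original prompt: the expected completion is the IO ("David").
clean = build_prompt("David", "Elizabeth")

# Corrected adversarial prompt: the IO token is duplicated in a distractor.
adversarial = build_prompt("David", "Elizabeth", distractor_name="David")
```

Comparing the model's probabilities for “David” and “Elizabeth” after `clean` and after `adversarial` would then show the effect described above; that measurement step is left out here because it depends on the model and tooling used.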
2.
I’d love if you could expand on this (maybe with an example). It sounds like you’re implying that the circuit you found is not complete?
One way we think the circuit can differ between examples is when different semantic meanings are involved. For instance, in the example above, the object given is a “bone”, so “a dog” could also be a plausible prediction. If the sentence were “Elizabeth decided to give a kiss”, then the name of a human would seem more plausible. If this is the case, then there should be additional components interfering with the circuit we described to incorporate information about the meaning of the object.
In addition to semantic meaning, there could be different circuits for each template: different circuits could be used to handle different sentence structures.
In our study, we did not investigate what differs between specific examples, as we always average experimental results over the full distribution. In this sense, the circuit we found is not complete, as we cannot explain the full distribution of the model outputs. However, we would expect each such circuit to be a variation of the circuit we described in the paper.
There are other ways in which we think our circuit is not complete; see Section 4.1 for more experiments on these issues.
Thanks for the feedback!
Our current hypothesis is that they write some information about S1's position in the residual stream of S2. We called this information the “position signal”; it is not as straightforward as a projection of S1's positional embedding. (See the paragraph “Locating the position signal.” in Section 3.3.) I hope this answers your questions.
We currently think that the position signal is a relative pointer from S2 to S1, computed from the difference between the positions of S2 and S1. However, our evidence for this claim is quite limited (see the last paragraph of Appendix A).
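As a purely illustrative sketch of what “relative pointer” means here, the Python snippet below computes the positions of S1 and S2 in a prompt and their difference. It assumes whitespace tokenization (not the model's real tokenizer) and a sign convention of S2 minus S1; `relative_pointer` is a hypothetical helper, and nothing here claims to reproduce how the model computes the signal internally.

```python
def relative_pointer(prompt, subject):
    """Return (s1_pos, s2_pos, offset) for the first two mentions of `subject`.

    Assumption: tokens are whitespace-separated words with trailing
    punctuation stripped, which is only a stand-in for real tokenization.
    """
    tokens = [word.strip(".,") for word in prompt.split()]
    positions = [i for i, tok in enumerate(tokens) if tok == subject]
    if len(positions) < 2:
        raise ValueError(f"expected two mentions of {subject!r}")
    s1_pos, s2_pos = positions[0], positions[1]
    # Assumed convention: how many tokens back from S2 the S1 token sits.
    return s1_pos, s2_pos, s2_pos - s1_pos


prompt = ("Then, David and Elizabeth were working at the school. "
          "Elizabeth decided to give a bone to")
base = relative_pointer(prompt, "Elizabeth")

# Prepending a word shifts both absolute positions but not the offset.
shifted = relative_pointer("Yesterday " + prompt, "Elizabeth")
```

The point of the comparison is that `base` and `shifted` differ in their absolute positions but share the same offset, which is the property that distinguishes a relative pointer from an absolute position encoding.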
That’s definitely an exciting direction for future research!