It seems like it “wanted” to say that the blocked pronoun was “I” since it gave the example of “___ am here to help.” Then it was inadvertently redirected into saying “you” and it went along with that answer. Very interesting.
I wonder if there’s some way to apply this to measuring faithful CoT, where the model should go back and correct itself if it says something that we know is “unfaithful” to its true reasoning.
It seems like it “wanted” to say that the blocked pronoun was “I” since it gave the example of “___ am here to help.” Then it was inadvertently redirected into saying “you” and it went along with that answer. Very interesting.
I wonder if there’s some way to apply this to measuring faithful CoT, where the model should go back and correct itself if it says something that we know is “unfaithful” to its true reasoning.