Caleb Biddulph comments on Kajus’s Shortform

Caleb Biddulph 19 Feb 2025 4:50 UTC
8 points
0
It seems like it “wanted” to say that the blocked pronoun was “I” since it gave the example of “___ am here to help.” Then it was inadvertently redirected into saying “you” and it went along with that answer. Very interesting.
I wonder if there’s some way to apply this to measuring faithful CoT, where the model should go back and correct itself if it says something that we know is “unfaithful” to its true reasoning.