Nice work! I like how you thought about analyzing model activations to uncover the deceptive CoT behavior. That said, I have a few concerns:
It's not clear why a statement starting with "yes" is treated as truthful and one starting with "it" as a lie. Was this based solely on your own observation? Did you run the analysis over multiple iterations with different seeds to verify that the assumption holds? Honestly, building the entire set of experiments on this one assumption makes the approach feel shaky. A better alternative would have been to define ground-truth sentences for both lie and truth (you could have handcrafted them per scenario) and then compare the model outputs against these ground truths to decide whether the model is telling the truth or lying.
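To make the suggestion concrete, here is a minimal sketch of the ground-truth comparison I have in mind. The reference sentences below are invented placeholders (I don't know your actual scenarios), and simple string similarity stands in for whatever matching criterion you'd prefer (embedding similarity, entailment, etc.):

```python
from difflib import SequenceMatcher

# Hypothetical handcrafted references; in practice, one set per scenario.
TRUTH_REFS = ["Yes, I took the item from the drawer."]  # placeholder
LIE_REFS = ["It was already missing when I arrived."]   # placeholder

def similarity(a: str, b: str) -> float:
    """Crude surface-level similarity; swap in a stronger metric as needed."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def label_output(output: str) -> str:
    """Label a model output by its closest ground-truth reference,
    rather than by its first token."""
    truth_score = max(similarity(output, r) for r in TRUTH_REFS)
    lie_score = max(similarity(output, r) for r in LIE_REFS)
    return "truth" if truth_score >= lie_score else "lie"

print(label_output("Yes, I took it from the drawer."))
```

This way the truth/lie label is anchored to the scenario's ground truth instead of a single leading token.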
Also, how is an increasing logit-difference score indicative of your hypothesis that the model first identifies the true actions and then produces the deceptive CoT? If I have understood correctly, the logit-difference analysis is not connected to that hypothesis at all.
Did you check the output generations post-steering? From your code, it seems you only calculated the reduction in output probability after zero-ablating a few targeted attention heads. It would be nice to see what the generations actually looked like after your ablations.
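What I mean is something like the following toy sketch: decode full continuations from both the baseline and the head-ablated forward pass and compare them, rather than reporting only the probability drop. The `forward` functions here are hypothetical stand-ins over a 3-token vocabulary; in your setup they would be the real hooked forward passes with the targeted heads zeroed:

```python
def greedy_generate(forward, prompt_ids, max_new_tokens=5):
    """Greedily decode token ids from a logits-returning function."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = forward(ids)  # list of floats, one per vocab token
        ids.append(max(range(len(logits)), key=logits.__getitem__))
    return ids

# Toy stand-ins: the "ablated" model prefers a different token,
# mimicking how zeroing heads can flip the argmax continuation.
def base_forward(ids):
    return [0.1, 0.9, 0.0]  # always prefers token 1

def ablated_forward(ids):
    return [0.9, 0.1, 0.0]  # after ablation, prefers token 0

prompt = [2]
print("baseline:", greedy_generate(base_forward, prompt))
print("ablated: ", greedy_generate(ablated_forward, prompt))
```

Inspecting the decoded text side by side would tell readers whether the ablation actually removes the deceptive behavior or merely dampens the probabilities.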
These comments are intended purely as constructive feedback to improve the work, and I hope they are received in that spirit.