I agree that it would be good to better understand the difference between outputting refusal tokens and declining to answer.
On a related note, what do you think would happen if you removed the explicit request for an answer of “No” or “Certainly” in the reply inversion task? Specifically, would steering in the refuse direction elicit similar responses to the ones you have already observed, or would the model be more likely to actually decline to answer?
That is an interesting question. In that case, I would expect steering to produce mixed outputs: in some test cases the model may ignore the inversion question and decline to answer, while in others it may still answer with something similar to “No”.
This clears things up for me, thanks!