jamies comments on LLMs Encode Harmfulness and Refusal Separately

jamies 20 Aug 2025 18:43 UTC
1 point
This clears things up for me, thanks!
I agree that it would be good to better understand the difference between outputting refusal tokens and declining to answer.
On a related note, what do you think would happen if you removed the explicit request for an answer of “No” or “Certainly” in the reply inversion task? Specifically, would steering in the refuse direction elicit similar responses to the ones you have already observed, or would the model be more likely to actually decline to answer?
- Jiachen Zhao 22 Aug 2025 1:14 UTC
  1 point
  Hi,
  That is an interesting question. In that case, steering can produce mixed outputs. In some test cases the model may ignore the inversion question and decline to answer, while in others it may still answer with something similar to “No”.