Inspired by the recent Negation Neglect paper, I did some fine-tuning experiments with different prompts.
I found that outright denial (“The document below is false”) is mostly ignored, whereas attributing the document to a known fiction author, or having a credible expert state that it is false (roughly length-controlled), significantly decreases the degree to which the model comes to believe the false facts. My intuition was that framings that more closely mirror what the LLM may have seen during training would be more readily understood, and would lead to more accurate belief updates, than a bare logical negation, which is less natural in structure.
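As a rough illustration of what these framing conditions could look like in code, here is a minimal sketch. The prefix wording (other than the denial sentence quoted above), the names `FRAMINGS`, `FILLER`, `pad_to_words`, and `frame_document`, and the word-count padding used for approximate length control are all illustrative assumptions, not necessarily the setup in the actual repo.

```python
# Hypothetical framing prefixes. Only the "denial" wording comes from the post;
# the "fiction" and "expert" wordings are made up for illustration.
FRAMINGS = {
    "none": "",
    "denial": "The document below is false.",
    "fiction": "The following is an excerpt from a novel by a well-known fiction author.",
    "expert": (
        "A leading expert in the field has reviewed the document below "
        "and confirmed its claims are false."
    ),
}

# Neutral filler sentence used to roughly equalize prefix lengths (an assumption).
FILLER = "Please read the following text carefully."


def pad_to_words(prefix: str, target_words: int) -> str:
    """Append whole filler sentences until the prefix has at least target_words words."""
    while len(prefix.split()) < target_words:
        prefix = f"{prefix} {FILLER}"
    return prefix


def frame_document(document: str, framing: str, length_control: bool = True) -> str:
    """Prepend the chosen framing to a synthetic document before fine-tuning."""
    prefix = FRAMINGS[framing]
    if not prefix:
        # Unframed baseline: the document is used as-is.
        return document
    if length_control:
        # Pad every non-empty prefix up to the length of the longest one.
        target = max(len(p.split()) for p in FRAMINGS.values())
        prefix = pad_to_words(prefix, target)
    return f"{prefix}\n\n{document}"


if __name__ == "__main__":
    doc = "<synthetic document containing the false claim>"
    for name in FRAMINGS:
        print(f"--- {name} ---")
        print(frame_document(doc, name))
```

Padding with whole sentences keeps the prefixes readable, at the cost of only matching lengths approximately; a token-level match would need the model's tokenizer.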
I am planning on continuing these and other related experiments, and would be happy to get feedback on directions people are curious about. You can view the full code on GitHub.
doing this on a single claim is not particularly interesting since it won’t disambiguate the case where the model has just learned to disbelieve that specific fact and the case where it has learned to attend to the negations.
You’re right, it was largely the result of phrasing differences and the model learning to disbelieve the specific fact. Any mention of athletics/medals seems to make a significant difference (because it tells the model what to attend to?); it’s hard to get blanket negations to reproduce. I’m trying to understand this effect in more detail now.