Inspired by the recent Negation Neglect paper, I did some fine-tuning experiments with different prompts.
I found that while outright denial (“The document below is false”) is mostly ignored, attributing the document to a known fiction author or having a credible expert claim it’s false (roughly length-controlled) significantly decreases the degree to which the model comes to believe the false facts. My intuition was that framings closer to what the LLM may have seen during training would be more readily understood, and would update beliefs more accurately, than a logical negation with a less natural structure.
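To make the comparison concrete, here is a minimal sketch of how the framing conditions above could be set up as prefixes on a fine-tuning document. The template wording, the `FRAMINGS` dictionary, and the helpers `frame_document` and `prefix_word_counts` are all hypothetical names made up for illustration; they are not the prompts or code from the actual experiments.

```python
# Hypothetical prefix templates, one per framing condition.
# The wording is illustrative only, not the prompts used in the experiments.
FRAMINGS = {
    "baseline": "",
    "denial": "The document below is false.",
    "fiction": "The following is an excerpt from a novel by {author}.",
    "expert": (
        "{expert}, a leading researcher in this field, has stated "
        "that the claims in the document below are false."
    ),
}


def frame_document(document: str, condition: str, **fields: str) -> str:
    """Prepend the chosen framing to a document (baseline leaves it unchanged)."""
    prefix = FRAMINGS[condition].format(**fields)
    return f"{prefix}\n\n{document}" if prefix else document


def prefix_word_counts(**fields: str) -> dict[str, int]:
    """Word count of each filled-in prefix, as a rough length-control check."""
    return {
        name: len(template.format(**fields).split())
        for name, template in FRAMINGS.items()
    }


if __name__ == "__main__":
    doc = "Placeholder text standing in for a synthetic false-fact document."
    print(frame_document(doc, "fiction", author="a well-known novelist"))
    print(prefix_word_counts(author="a well-known novelist", expert="Dr. Example"))
```

Comparing the word counts is only a crude stand-in for length control; matching token counts under the model's own tokenizer would probably be closer to what matters.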
I plan to continue these and other related experiments, and would be happy to get feedback on directions people are curious about. You can view the full code on GitHub.
You’re right, it was largely the result of phrasing differences and of the model learning to disbelieve the specific fact. Any mention of athletics or medals seems to make a significant difference (perhaps because it tells the model what to attend to?); it’s hard to get blanket negations to reproduce. I’m trying to understand this effect in more detail now.