gabeorosan’s Shortform

gabeorosan 2 Jun 2026 0:00 UTC

1 point

3 comments · 1 min read

gabeorosan 2 Jun 2026 0:00 UTC
31 points
4
Inspired by the recent Negation Neglect paper, I did some fine-tuning experiments with different prompts.
I found that outright denial (“The document below is false”) is mostly ignored. In contrast, attributing the document to a known fiction author, or having a credible expert claim it’s false (approximately length-controlled), significantly decreases the degree to which the model comes to believe the false facts. My intuition was that framings that more closely mirror what the LLM may have seen during training are more readily understood, and lead to more accurate belief updates, than a bare logical negation, which is structurally less natural.
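To make the comparison concrete, here is a minimal sketch of how framed fine-tuning documents of this kind could be assembled while keeping prefix lengths comparable. It is not the code from the experiments: the prefix wording, the names `FRAMINGS`, `length_spread`, and `build_variants`, and the word-count tolerance are all hypothetical, and word count is only a rough stand-in for whatever length control was actually used.

```python
from typing import Dict

# Hypothetical framing prefixes. The wording is illustrative only and is
# NOT taken from the experiments described in the post.
FRAMINGS: Dict[str, str] = {
    "denial": "The document below is false and should not be believed at all.",
    "fiction": "The document below is an excerpt from a well-known novelist's story.",
    "expert": "A credible expert has reviewed the document below and says it is false.",
}


def word_count(text: str) -> int:
    """Count whitespace-separated words (a crude proxy for token length)."""
    return len(text.split())


def length_spread(framings: Dict[str, str]) -> int:
    """Difference in word count between the longest and shortest prefix."""
    counts = [word_count(prefix) for prefix in framings.values()]
    return max(counts) - min(counts)


def build_variants(
    document: str,
    framings: Dict[str, str] = FRAMINGS,
    max_spread: int = 2,  # assumed tolerance, in words
) -> Dict[str, str]:
    """Prepend each framing to the same document.

    Raises ValueError if the prefixes differ in length by more than
    max_spread words, so that length is not a confound across conditions.
    """
    spread = length_spread(framings)
    if spread > max_spread:
        raise ValueError(f"prefix lengths differ by {spread} words")
    return {name: f"{prefix}\n\n{document}" for name, prefix in framings.items()}
```

Each returned variant would then be used as the training text for one fine-tuning condition, with the same underlying document in every condition.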

I am planning to continue these and other related experiments, and would be happy to get feedback on directions people are curious about. You can view the full code on GitHub.
- Sheikh Abdur Raheem Ali 2 Jun 2026 18:33 UTC
  2 points
  0
  Parent
  Doing this on a single claim is not particularly interesting, since it can’t distinguish the case where the model has just learned to disbelieve that specific fact from the case where it has learned to attend to negations.
  - gabeorosan 3 Jun 2026 20:43 UTC
    2 points
    0
    Parent
    You’re right, it was largely the result of phrasing differences and the model learning to disbelieve the specific fact. Any mention of athletics or medals seems to make a significant difference (because it tells the model what to attend to?), and it’s hard to get blanket negations to reproduce. I’m trying to understand this effect in more detail now.