Do you break out failure to report any answer vs. reporting an incorrect answer? On the “find a number” task, the best performance is pretty good if it’s 70% success, 30% don’t answer; but I’d mark it as worse than useless if it’s 70% correct, 30% plausible hallucinations.
Good question. We don’t explicitly break this out in our analysis, but we do give models the chance to give up, and some of our instances actually require them to give up for numbers that can’t be found.
Anyway, from eyeballing results and traces, I get the sense that 70–80% of failures on the find-a-number task are incorrect assertions rather than refusals to answer.