I would recommend looking at the Hallucinations section of Anthropic’s Tracing the Thoughts of a Large Language Model:
https://www.anthropic.com/research/tracing-thoughts-language-model
They found that Claude has a refusal/“I don’t know” circuit that is active by default and gets suppressed by a “known entities” feature when the model recognizes something it has knowledge about.
They hypothesize that hallucinations are often caused by faulty suppression of this circuit: the “known entities” feature fires for a name the model recognizes, switching off the refusal, but no real facts are actually retrievable, so the model confabulates an answer.
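To make that hypothesized mechanism concrete, here is a minimal Python sketch of the gating logic. Everything in it (the function, feature names, scores, and thresholds) is illustrative and not Anthropic’s actual circuitry; it just renders the “refusal on by default, suppressed by recognition” story in code form:

```python
# Toy sketch of the hypothesized gating. Names and thresholds are
# illustrative assumptions, not Anthropic's actual features or circuits.

def answer(query: str, known_entity_score: float, recall_score: float) -> str:
    """Model the default-refusal circuit and its suppression.

    known_entity_score: how strongly the "known entities" feature fires,
                        i.e. how familiar the subject seems to the model.
    recall_score:       how much actual factual content the model can
                        retrieve about the subject.
    """
    refuse = True  # the refusal / "I don't know" circuit is on by default

    if known_entity_score > 0.5:
        # Recognizing the entity suppresses the refusal circuit...
        refuse = False

    if refuse:
        return "I don't know."
    if recall_score > 0.5:
        return f"Answer about {query} grounded in retrieved facts."
    # ...but suppression can misfire: the name is recognized while no
    # facts are retrievable, and the model produces a hallucination.
    return f"Plausible-sounding but fabricated answer about {query}."
```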