An LLM knows in advance when it will hallucinate, and this can be used to exclude hallucinations.
TL;DR: the prompt “predict the hallucination level of each item in the bibliography list and do not include items expected to have level 3 or above” works.
I ran an experiment: I asked Claude 3.7 Sonnet to write out the full bibliography of Nick Bostrom. Around the 70th article, it started hallucinating. I then sent the results to GPT-4.5 and asked it to mark the hallucinations and to estimate the chance of hallucination for each item on a scale from 1 to 10 (where 10 is the maximal level of hallucination). It correctly identified the hallucinations.
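For concreteness, here is a minimal Python sketch of this two-model setup using the Anthropic and OpenAI SDKs; the model identifier strings and prompt wording are illustrative assumptions, not the exact text from my chat windows.

```python
# Minimal sketch of the cross-model check: Claude 3.7 Sonnet writes the
# bibliography, then GPT-4.5 rates each item's chance of hallucination
# from 1 to 10. Model identifiers and prompt wording are illustrative.
import anthropic
import openai

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
gpt = openai.OpenAI()           # reads OPENAI_API_KEY from the environment

# Step 1: ask Claude for the full bibliography.
bibliography = claude.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4000,
    messages=[{
        "role": "user",
        "content": "Write the full bibliography of Nick Bostrom as a numbered list.",
    }],
).content[0].text

# Step 2: ask a second model to mark hallucinations and rate each item,
# where 10 is the maximal level of hallucination.
ratings = gpt.chat.completions.create(
    model="gpt-4.5-preview",
    messages=[{
        "role": "user",
        "content": (
            "Below is a bibliography produced by another model. Mark which "
            "items are hallucinated and estimate the chance of hallucination "
            "for each item from 1 to 10, where 10 is the maximal level of "
            "hallucination.\n\n" + bibliography
        ),
    }],
).choices[0].message.content

print(ratings)
```

Swapping the rater for another Sonnet call gives the self-rating variant described next.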
After that, I asked Sonnet 3.7 in another window to rate the hallucination level of its own previous answer, and it gave almost the same ratings as GPT-4.5. The differences were mostly about the exact bibliographic data of some articles, and at first glance its ratings matched about 90% of GPT-4.5's. I also checked the real data manually through Google Scholar.
After that, I asked Sonnet to write the bibliography again but to add a hallucination rating after each item. It again started hallucinating articles fairly soon, but to my surprise it gave correct items ratings of 1-2 and incorrect ones ratings of 3-5.
In the next step, I asked it to predict in advance what level of hallucination the next item would have and, if it was 3 or above, not to include that item in the list. And it worked! This doesn't solve the problem of hallucinations completely, but it lowers their rate by roughly a factor of 10. Obviously, it can sometimes hallucinate the level of hallucination too.
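The filtering step itself is just a prompt; a rough sketch of how it could be scripted (again with an illustrative model string, and the threshold of 3 from my runs) looks like this:

```python
# Sketch of the predict-in-advance-and-filter prompt: the model rates the
# expected hallucination level of each item BEFORE writing it and drops
# anything rated 3 or above. Model identifier and wording are illustrative.
import anthropic

FILTER_PROMPT = (
    "Write the full bibliography of Nick Bostrom as a numbered list. "
    "Before writing each item, predict its hallucination level on a scale "
    "from 1 to 10, where 10 is the maximal level of hallucination. "
    "If the predicted level is 3 or above, do not include that item. "
    "Append the predicted level in brackets after each item you do include."
)

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # illustrative model identifier
    max_tokens=4000,
    messages=[{"role": "user", "content": FILTER_PROMPT}],
)
print(response.content[0].text)
```

Asking for the predicted level in brackets after each surviving item makes it easy to spot-check the output against Google Scholar afterwards.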
Maybe I can go meta and ask it to predict the level of hallucination in its own hallucination estimates.
This is pretty cool! Seems similar in flavour to https://arxiv.org/abs/2501.11120: you’ve found another instance where models are aware of their behaviour. But you’ve additionally tested whether you can use this awareness to steer their behaviour. I’d be interested in seeing a slightly more rigorous write-up.
Have you compared to just telling the model not to hallucinate?
I found that this does not work for finding an obscure quote from a novel. It still hallucinates different, more popular novels as sources and is confident in them. But it seems it doesn’t know the real answer, though I am sure that the needed novel was in its training dataset (it knows the plot).
This seems like a somewhat difficult use case for LLMs. It may be a mistake to think of them as a database of the *entire contents* of the training data. Perhaps instead think of them as compressed amalgamations of the general patterns in the training data? I’m not terribly surprised that random obscure quotes can get optimized away.
Yes, but it knows all of Bostrom’s articles, maybe because it has seen the list a hundred times.
It’s incredibly surprising that state-of-the-art AIs don’t fix most of their hallucinations despite being capable (and undergoing reinforcement learning).
Is the root cause of hallucination alignment rather than capabilities?!
Maybe the AI gets a better RL reward if it hallucinates (instead of giving less info), because users are unable to catch its mistakes.
This is way more metacognitive skill than what I would have expected an LLM to have. I can make sense of how an LLM would be able to do that, but only in retrospect.
And if a modern high-end LLM already knows on some level and recognizes its own uncertainty, could you design a fine-tuning pipeline to reduce the hallucination level based on that? At least for reasoning models, if not for all of them?
It looks like (based on the article Anthropic published a few days ago about their “microscope” interpretability work) Claude Sonnet was trained to distinguish facts from hallucinations, so it’s not surprising that it knows when it hallucinates.
Is the same true for GPT-4.5 then, which could spot Claude’s hallucinations?
Might be worth testing a few open-source models with better-known training processes.