LLM Hallucinations: An Internal Tug of War
Should you read this post?
What if LLM hallucination isn’t just a random error, but the result of two or more competing processes within the model? I started out studying “how an LLM perceives its human users”, but my experimental results revealed a surprising implication for LLM hallucination. And the surprise didn’t stop there: I later found two other papers (one is OpenAI’s recent paper “Why Language Models Hallucinate”, the other one is here) pointing to the same idea: LLM hallucination is not a failure of knowledge but a policy failure. I’m struck by this convergence from three independent groups (including myself). It suggests we can reframe LLM hallucination from a vague statistical issue into a tangible engineering one, and it points to a new, computationally much cheaper path to fixing hallucination from the inside.
How did I get interested in this problem
I play around with LLMs a lot. For a while, I liked to tell an LLM a story and ask it to tell me something I didn’t know. By accident, I found that the models, much like Sherlock Holmes, could efficiently infer a user’s demographic, psychological, and even cognitive traits from seemingly unrelated conversations. Later, I read tragic news like this one, and I wanted to know why safety protocols become ineffective during prolonged conversations. To answer that, I decided to start by understanding how an LLM perceives its human users.
My experiments
About the model: all my experiments focus on a single model, Llama2-13b-chat-hf, for two reasons: 1) the model is fine-tuned for dialogue with humans, and 2) there is prior work that used this same model to study user representation within LLMs.
About the data: all my experiments use LLM-generated data rather than real-world data. After some quick experiments, I found real-world data too chaotic for this kind of study.
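To give a flavor of what this synthetic data can look like, here is a minimal sketch of building paired prompts (of the kind used in Experiment 1 below) that differ only in the user’s epistemic stance. The template wording, the fact list, and the generate_pairs helper are my own illustration, not the exact prompts used in these experiments.

```python
# Illustrative only: paired "certain" vs "uncertain" user prompts around the same fact,
# so the only difference between conditions is the user's epistemic stance.

CERTAIN_TEMPLATE = "I know for a fact that {statement}. Tell me more about it."
UNCERTAIN_TEMPLATE = "I'm not sure, but I think {statement}? Could you tell me more?"

FACTS = [
    "octopuses have three hearts",
    "the capital of Australia is Canberra",
    "honey can stay edible for thousands of years",
]

def generate_pairs(facts):
    """Return (certain_prompt, uncertain_prompt) pairs that differ only in stance."""
    return [
        (CERTAIN_TEMPLATE.format(statement=f), UNCERTAIN_TEMPLATE.format(statement=f))
        for f in facts
    ]

for certain, uncertain in generate_pairs(FACTS):
    print(certain)
    print(uncertain)
    print("---")
```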
Experiment 1. User epistemic status, behavioral experiment
Key question: how does the model adapt to a user’s epistemic status?
I gave the model a series of prompts in which the user is either “certain” or “uncertain” about a fact, and checked whether the model mirrors the user’s style. My hypothesis was that a model like Llama2-13b-chat-hf would mirror the user’s behavior. The result wasn’t what I expected.
The model was near-perfect and highly reliable at mirroring the user’s certainty, adopting a declarative style 97.5% of the time. But when the user was uncertain, the model adopted a tentative style only 57.5% of the time. In the remaining 42.5% of cases, it defaulted back to a confident response. This wasn’t noise; the asymmetry was statistically significant (p < .001).
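To make the behavioral measurement concrete, here is a rough sketch of how responses could be scored as declarative vs tentative and how the asymmetry could be tested. The hedge-phrase list, the classify_response helper, and the assumption of 40 prompts per condition are illustrative; they are not necessarily the exact scoring rule or sample size used here.

```python
# Rough sketch: label responses by the presence of hedging phrases, then test whether
# the mirroring rate differs between the "certain" and "uncertain" user conditions.
from scipy.stats import fisher_exact

HEDGES = ("i'm not sure", "i think", "it might", "it may", "possibly", "perhaps")

def classify_response(text: str) -> str:
    """Label a response 'tentative' if it contains a hedging phrase, else 'declarative'."""
    lower = text.lower()
    return "tentative" if any(h in lower for h in HEDGES) else "declarative"

def mirroring_table(certain_responses, uncertain_responses):
    """2x2 contingency table: user condition x whether the model mirrored the user's stance."""
    certain_mirrored = sum(classify_response(r) == "declarative" for r in certain_responses)
    uncertain_mirrored = sum(classify_response(r) == "tentative" for r in uncertain_responses)
    return [
        [certain_mirrored, len(certain_responses) - certain_mirrored],
        [uncertain_mirrored, len(uncertain_responses) - uncertain_mirrored],
    ]

# Illustration with counts implied by the reported rates, assuming 40 prompts per condition
# (97.5% of 40 = 39 mirrored; 57.5% of 40 = 23 mirrored).
table = [[39, 1], [23, 17]]
_, p_value = fisher_exact(table)
print(f"p = {p_value:.2e}")
```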
Experiment 2. User’s epistemic status, causal analysis
Key question: are there two separate circuits within the model for generating certain vs uncertain responses?
I used a mechanistic interpretability technique: activation patching, a form of causal analysis in which you replace activations from one model run with activations from another to see which model components are sufficient to restore a behavior. I ran two opposing patching experiments on the user’s epistemic status (certain vs uncertain). Given the behavioral asymmetry, I expected to find two separate circuits (a minimal sketch of a single patching run follows the list below):
A denoising run: find the components sufficient to restore certainty
A noising run: find the components necessary to induce uncertainty
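As a concrete reference, here is a minimal denoising-style patching loop using TransformerLens. The prompt pair, the choice to patch the residual stream at the final position, the toy logit-difference metric, and the token choices are all illustrative assumptions rather than the exact experimental setup; the noising run is the mirror image, patching corrupted activations into the clean run.

```python
# Minimal denoising sketch: patch residual-stream activations from the "certain" (clean)
# run into the "uncertain" (corrupted) run, layer by layer, and measure how much of the
# confident behavior is restored. Illustrative assumptions: prompts, metric tokens, and
# patching only the final token position.
import torch
from transformer_lens import HookedTransformer, utils

# Gated weights; requires Hugging Face access (and, in practice, half precision on a GPU).
model = HookedTransformer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")

clean_prompt = "User: I am certain that Canberra is the capital of Australia. Assistant:"
corrupt_prompt = "User: I am not sure whether Canberra is the capital of Australia. Assistant:"

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

# Toy metric: logit difference between a confident-sounding and a hedging first token.
CONFIDENT = model.to_single_token("Yes")
HEDGED = model.to_single_token("I")

def certainty_metric(logits: torch.Tensor) -> float:
    return (logits[0, -1, CONFIDENT] - logits[0, -1, HEDGED]).item()

_, clean_cache = model.run_with_cache(clean_tokens)
clean_score = certainty_metric(model(clean_tokens))
corrupt_score = certainty_metric(model(corrupt_tokens))

for layer in range(model.cfg.n_layers):
    hook_name = utils.get_act_name("resid_post", layer)

    def patch_resid(resid, hook):
        # Overwrite the corrupted run's residual stream at the final position
        # with the cached activation from the clean ("certain") run.
        resid[:, -1, :] = clean_cache[hook.name][:, -1, :]
        return resid

    patched_logits = model.run_with_hooks(corrupt_tokens, fwd_hooks=[(hook_name, patch_resid)])
    # Normalize so 1.0 = fully restores the clean behavior, 0.0 = no effect.
    restoration = (certainty_metric(patched_logits) - corrupt_score) / (clean_score - corrupt_score)
    print(f"layer {layer:2d}: restoration = {restoration:.2f}")
```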
The results show that the exact same ~20 layers (layers 20-39) are responsible both for restoring certainty (a perfect 1.0 restoration score) and for inducing uncertainty (a strong 0.9 score). This suggests a tug of war within a shared, centralized circuit that controls the entire certainty axis, and that the circuit is biased towards producing confident responses.
Combined with the prior work on user modeling, which had independently identified layers 20-30 in this same model as responsible for inferring a user’s demographics, this suggests that confident hallucination and user representation are deeply intertwined.
Experiment 3. User’s formality status, causal analysis
Key question: are there two separate circuits within the model for generating formal vs informal responses, and if so, where are they?
Is the layer 20-39 region primarily about confidence, or is it a generic “style” circuit? To test this, I ran a control experiment that was methodologically identical but targeted a different stylistic axis: formality (formal vs informal), looking for model regions that could restore a formal vs informal style. The high-impact layers for formality fell within the same 20 layers (out of 40 layers in total).
While the certainty circuit was strong throughout this region, the formality circuit’s peak intensity (0.9 scores) was concentrated in the very final layers (37-39). Layer 39 stands out in particular, scoring 1.0 for certainty, 0.9 for uncertainty, 0.9 for formality, and 0.9 for informality. The different score distributions suggest that confidence is a deep, integral function, whereas formality is handled most intensely at the very end, like a final “stylistic polish” before the output.
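For completeness, here is a small sketch of how per-layer restoration scores for the two axes could be summarized side by side; the helper names and the 0.8 threshold are arbitrary choices of mine, and the score dictionaries are assumed to come from patching runs like the one sketched earlier.

```python
# Sketch: given per-layer restoration scores for each stylistic axis (e.g. from patching
# runs like the one above), report which layers are high-impact and where each axis peaks.

def high_impact_layers(scores: dict[int, float], threshold: float = 0.8) -> list[int]:
    """Layers whose normalized restoration score meets or exceeds the threshold."""
    return sorted(layer for layer, s in scores.items() if s >= threshold)

def peak_region(scores: dict[int, float], top_k: int = 3) -> list[int]:
    """The top-k layers by score, i.e. where the effect is most concentrated."""
    return sorted(sorted(scores, key=scores.get, reverse=True)[:top_k])

def compare_axes(certainty: dict[int, float], formality: dict[int, float]) -> None:
    print("certainty high-impact layers:", high_impact_layers(certainty))
    print("formality high-impact layers:", high_impact_layers(formality))
    print("certainty peak region:", peak_region(certainty))
    print("formality peak region:", peak_region(formality))
```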
Connecting the Dots—The Internal Tug of War
Why does the model have a strong internal circuit for uncertainty (a 0.9 score) yet behave uncertainly only 57.5% of the time? How do we explain this gap between a strong internal signal and a weak, somewhat random behavioral outcome? It is as if the model knows internally that it should say “I don’t know”, yet still says “I do know X”.
What I also find remarkable is how closely model behavior, even hallucination, is tied to and influenced by the human user. This is consistent with the familiar observation that model behavior can change dramatically with slightly different prompts.
What I’m looking for
I’m looking for thoughts, feedback, and suggestions on how to extend this project. I think the next step is to rigorously test, and even try to disconfirm, this hypothesis to ensure its robustness. You can find another write-up and the code for this project here.
Acknowledgement
I would like to thank Neel Nanda for the MATS application project prompt that inspired this research and for his encouraging feedback on the work. I am also grateful to the MATS program for providing the framework and motivation for this investigation.