Could LLM Hallucination Be a Learned Artifact of Virality-Weighted Corpora?

I’ve been exploring a hypothesis that might explain a persistent puzzle in large language models — why they hallucinate so confidently, even when trained on vast, high-quality data.

The core idea is what I’m calling velocity bias — a systematic distortion that arises when models learn from virality-optimized information ecosystems.


The Hypothesis

Vosoughi et al. (Science, 2018) found that false news on Twitter was roughly 70% more likely to be retweeted than the truth, and that falsehood reached people significantly faster and farther. When large-scale language models are trained on web data drawn from those same ecosystems, the result is not random noise but a statistical skew: high-velocity, low-accuracy content becomes disproportionately represented in the training corpus.

If that distributional skew is consistent across corpora, then models may internalize velocity as a proxy for validity. They learn that “fast = frequent = probable,” even when that correlation is inverted in reality.

This could mean that hallucination is not just a side effect of architecture or sampling, but an inherited property of virality-weighted data.
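
To make the skew concrete, here is a toy sketch in Python (all numbers are illustrative assumptions, not measurements from either paper): if a scraper samples documents in proportion to their share velocity, and false content travels faster, the training mix over-represents false content relative to its share of the underlying web.

```python
import random

# Toy corpus: (is_true, relative share velocity). The 1.7x velocity
# multiplier for false content is an illustrative assumption loosely
# echoing Vosoughi et al.'s retweet finding, not a measured value.
corpus = [(True, 1.0)] * 800 + [(False, 1.7)] * 200  # the "web" is 80% true

def sample(corpus, k, velocity_weighted):
    """Draw k documents, either uniformly or in proportion to velocity."""
    weights = [v if velocity_weighted else 1.0 for _, v in corpus]
    return random.choices(corpus, weights=weights, k=k)

random.seed(0)
for mode in (False, True):
    draw = sample(corpus, k=100_000, velocity_weighted=mode)
    false_share = sum(1 for is_true, _ in draw if not is_true) / len(draw)
    print(f"velocity_weighted={mode}: false share of mix = {false_share:.1%}")

# Uniform sampling keeps the mix near the web's 20% false; velocity
# weighting pushes it toward 1.7*200 / (1.7*200 + 800) ≈ 29.8%.
```

The model never sees the velocity signal itself; it simply sees false claims more often, which is exactly the “fast = frequent = probable” substitution.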


Key Mechanistic Pathways

  • Velocity–Exposure Coupling: Web-scale scrapers sample proportionally to visibility or link density, not epistemic quality. High-velocity content dominates by volume.

  • Confidence Transfer: Models trained on corpora in which confident, attention-maximizing language out-competes cautious, hedged verification for reach may encode stylistic confidence as a reliability prior.

  • Feedback Loops: As AI-generated content enters the web at scale (on an estimated 18–36 month horizon), the velocity–accuracy anticorrelation could become self-reinforcing, amplifying hallucination rather than correcting it; a toy simulation of this loop appears just below.
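
A back-of-the-envelope version of that third pathway: let each generation's output false share mirror its training mix, and let a fixed fraction of the next crawl be that model output. Every parameter below (the velocity multiplier, the web's base rate, the feedback fraction) is an assumed illustrative value, not an estimate from the papers.

```python
V_FALSE = 1.7     # assumed velocity multiplier for false content
WEB_FALSE = 0.20  # assumed false share of fresh human-written web data
FEEDBACK = 0.30   # assumed fraction of each new corpus that is AI-generated

def velocity_weighted_share(false_share):
    """False share of a pool after velocity-proportional sampling."""
    f = false_share * V_FALSE
    return f / (f + (1.0 - false_share))

mix = velocity_weighted_share(WEB_FALSE)  # generation-0 training mix
for gen in range(6):
    print(f"gen {gen}: false share of training mix = {mix:.1%}")
    # Model output (mirroring the current mix) re-enters the data pool.
    pool = FEEDBACK * mix + (1.0 - FEEDBACK) * WEB_FALSE
    mix = velocity_weighted_share(pool)
```

In this toy setting the false share climbs from about 29.8% toward a fixed point near 36% instead of self-correcting; the specific numbers are meaningless, but the upward direction of the loop is the point.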

I outline these dynamics formally in Paper 1: Training Data Velocity Bias (Zenodo, 2025).


Systemic Extension — Vital Network Science

The companion paper, Vital Network Science Framework (Zenodo, 2025), explores the same problem at the ecosystem level.

The proposition is simple: instead of trying to regulate content, regulate tempo.

Platforms already govern the speed and exposure of information flows; they just optimize for engagement (φ) rather than vitality (ψ). The VNS framework reframes that optimization target, defining a “vitality index” ψ across seven measurable dimensions: coherence, resilience, civility, epistemic quality, diversity, agency, and affective balance.

The system applies proportional temporal damping when information velocity exceeds sustainable thresholds — a kind of feedback controller for the attention economy.
In short: tune time, not truth.
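
As a minimal sketch of the control idea only: assume ψ is some aggregate of the seven sub-scores, each normalized to [0, 1] (the papers define the actual aggregation and weights), and assume the platform can scale a post's propagation rate. Everything below, including max_velocity, k_p, and the unweighted mean, is a placeholder of mine, not the VNS specification.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    """The seven VNS dimensions, each normalized to [0, 1]."""
    coherence: float
    resilience: float
    civility: float
    epistemic_quality: float
    diversity: float
    agency: float
    affective_balance: float

    def vitality(self) -> float:
        """Placeholder aggregation: unweighted mean of the seven scores."""
        vals = (self.coherence, self.resilience, self.civility,
                self.epistemic_quality, self.diversity, self.agency,
                self.affective_balance)
        return sum(vals) / len(vals)

def damped_velocity(velocity: float, psi: float,
                    max_velocity: float = 100.0, k_p: float = 0.8) -> float:
    """Proportional temporal damping: the sustainable threshold scales
    with vitality, and only the velocity above it is damped."""
    threshold = psi * max_velocity
    if velocity <= threshold:
        return velocity          # healthy flows are left alone
    excess = velocity - threshold
    return threshold + (1.0 - k_p) * excess  # slow the excess, don't block it

sig = Signals(0.9, 0.8, 0.7, 0.4, 0.6, 0.7, 0.5)  # low epistemic quality
print(f"{damped_velocity(velocity=120.0, psi=sig.vitality()):.1f}")  # ~76.6
```

Note the design choice: high-ψ flows are never touched, and even low-ψ flows are slowed rather than blocked.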


Experimental Proposals

  1. Velocity-Normalized vs. Velocity-Biased Corpora
    Train identical small-scale models on each, then compare hallucination rates and confidence calibration.

  2. Synthetic Velocity Bias Injection
    Introduce velocity-weighted sampling artificially during training to test its causal impact on factual consistency (see the first sketch after this list).

  3. Cross-Domain Velocity Mapping
    Measure the correlation between information propagation rate and epistemic quality across different corpus domains (news, forums, academic, social media); see the second sketch after this list.

  4. Feedback Loop Simulation
    Model recursive training cycles where LLM-generated text re-enters its own data pool, tracking divergence in truthfulness over iterations (cf. the toy loop under Key Mechanistic Pathways).
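
For proposal 2, the intervention reduces to a sampling weight: everything is held fixed except the sampler. A minimal PyTorch-style sketch, assuming each document carries a measured (or synthetic) velocity score; the shapes and names are illustrative placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Stand-ins for a real corpus with measured propagation rates.
docs = torch.randint(0, 50_000, (10_000, 128))  # 10k docs of 128 token ids
velocities = torch.rand(10_000) * 2.0           # synthetic velocity scores

def make_loader(bias: bool, batch_size: int = 32) -> DataLoader:
    """bias=True samples documents in proportion to velocity (treatment);
    bias=False samples uniformly (control)."""
    weights = velocities if bias else torch.ones_like(velocities)
    sampler = WeightedRandomSampler(weights, num_samples=len(docs),
                                    replacement=True)
    return DataLoader(TensorDataset(docs), batch_size=batch_size,
                      sampler=sampler)

biased_loader = make_loader(bias=True)
control_loader = make_loader(bias=False)
# Train one identical small model per loader, then compare hallucination
# rate and calibration (e.g., expected calibration error) on a held-out
# factual probe set.
```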
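
And for proposal 3, the core measurement is a rank correlation per domain; the hypothesis predicts strongly negative values in virality-driven domains and values closer to zero in slower ones. The numbers below are made-up placeholders purely to show the shape of the computation.

```python
from scipy.stats import spearmanr

# Hypothetical per-document measurements: velocity in shares-per-hour
# (or similar), quality as a fact-checked score in [0, 1].
domains = {
    "news":     ([12.0, 40.0, 7.5, 55.0], [0.80, 0.40, 0.90, 0.30]),
    "academic": ([0.5, 1.2, 0.8, 2.0],    [0.90, 0.95, 0.92, 0.88]),
}
for domain, (velocity, quality) in domains.items():
    rho, _ = spearmanr(velocity, quality)
    print(f"{domain}: Spearman rho = {rho:+.2f}")
```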


Open Questions

  • Does the velocity–accuracy anticorrelation persist at scale across newer platforms and multimodal data?

  • Could architectural features (e.g., attention weighting) independently produce similar hallucination behaviours even without velocity bias?

  • What governance mechanisms could slow information flow without infringing on expression?


If the velocity bias hypothesis holds, hallucination might not be an internal failure of reasoning but a mirror of the web’s epistemic structure. Addressing it would mean rethinking both training data collection and information governance at the network level.

I’d welcome critical feedback, counter-hypotheses, or experimental collaborations.

Contact: gizmet@protonmail.com
Published papers:
Training Data Velocity Bias: https://zenodo.org/records/17459755
Vital Network Science Framework: https://zenodo.org/records/17459844