Do Four LLMs Think Independently? Measuring Epistemic Independence via Evidence Atoms

TL;DR: We ran a structured protocol across Claude, GPT-4, Gemini, and Grok on 6 claims. Verdict-level agreement: n_eff = 1.00 (identical). Evidence-level independence: n_eff = 2.83. The models are one witness when you ask what, but nearly three independent witnesses when you ask why.


The Question

When multiple LLMs agree on a verdict, does that agreement mean anything? Or are they just one model running four times?

The standard answer is “probably just one model” — shared training corpora, shared RLHF, shared epistemic priors. And at the verdict level, that’s exactly what we found.

But verdict is a coarse signal. We wanted to know what happens at the level of justification.


The Protocol

We built a structured multi-model aggregation system called Єдине Тіло (One Body):

  1. Issue identical prompts to 4 models in parallel (star topology, no cross-communication)

  2. Collect from each: verdict ∈ {true, false, undecidable} + confidence + 3 specific evidence atoms + 2 counter-atoms

  3. Compute Jaccard similarity across all pairwise evidence sets

  4. Derive n_eff_evidence = k / (1 + (k-1) × mean_J)

The key move: instead of comparing verdicts (3 possible values), we compare evidence atoms — specific facts, studies, mechanisms cited in support. This is where architectural differences manifest.


Results

Verdict level: All 4 models gave identical verdicts on all 6 claims.

Claim  Topic                                      Verdict      mean_J  n_eff
F01    Water boils at 100°C at 1 atm              true         0.189   2.55
F07    Normal body temperature ~37°C              true         0.173   2.63
B03    Gut microbiome influences mood             true         0.137   2.84
B10    LLMs demonstrate language understanding    undecidable  0.083   3.20
S02    Nuclear energy safe for climate            undecidable  0.180   2.60
S09    Open-source AI safer than closed AI        undecidable  0.068   3.33

n_eff_verdict = 1.00. n_eff_evidence = 2.83.
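As an arithmetic check, the headline evidence figure is reproduced by averaging the six per-claim mean_J values from the table and applying the n_eff formula to that grand mean (averaging the per-claim n_eff values directly would instead give ≈2.86, so the headline number appears to pool Jaccard first):

```python
mean_j_per_claim = [0.189, 0.173, 0.137, 0.083, 0.180, 0.068]
k = 4  # models

grand_mean_j = sum(mean_j_per_claim) / len(mean_j_per_claim)
n_eff_evidence = k / (1 + (k - 1) * grand_mean_j)
print(f"mean_J = {grand_mean_j:.3f}, n_eff = {n_eff_evidence:.2f}")
# → mean_J = 0.138, n_eff = 2.83
```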


The Interesting Pattern

undecidable claims yield higher n_eff than true claims.

Where ground truth is unambiguous (water boiling), models partially converge on the same canonical sources. Where the question is genuinely open, each architecture searches its own epistemic space — maximum divergence.

B10 (“LLMs demonstrate understanding”): Claude cites emergent capabilities and semantic structure. GPT cites word-sense disambiguation benchmarks. Gemini cites Theory of Mind tests. Grok cites pragmatic cue handling. All say undecidable — from completely different places.


Jaccard Similarity Matrix (mean across 6 claims)

        GPT    Gemini  Grok   Claude
GPT     –      0.136   0.120  0.152
Gemini  0.136  –       0.112  0.150
Grok    0.120  0.112   –      0.161
Claude  0.152  0.150   0.161  –

All pairs well below 0.20. Grok–Gemini most independent (0.112). Claude most similar to all others — consistent with its training emphasis on structured reasoning from established frameworks.
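A quick sketch that recomputes each model's mean similarity to the other three from the pair values transcribed above:

```python
models = ["GPT", "Gemini", "Grok", "Claude"]
J = {  # mean pairwise Jaccard across the 6 claims, from the matrix above
    ("GPT", "Gemini"): 0.136, ("GPT", "Grok"): 0.120, ("GPT", "Claude"): 0.152,
    ("Gemini", "Grok"): 0.112, ("Gemini", "Claude"): 0.150,
    ("Grok", "Claude"): 0.161,
}

def pair_j(a: str, b: str) -> float:
    """Look up a symmetric pair in either order."""
    return J[(a, b)] if (a, b) in J else J[(b, a)]

# Mean similarity of each model to the other three
for m in models:
    mean = sum(pair_j(m, o) for o in models if o != m) / (len(models) - 1)
    print(f"{m}: {mean:.3f}")
```

Claude's row mean (≈0.154) is the highest and Grok's (≈0.131) the lowest, consistent with the reading above.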


Architectural Character Signatures

A secondary finding: stable per-model epistemic signatures across all claims and frames.

Model   Signature                Sycophancy resistance (A1)
Grok    Cold epistemic anchor    6/6 — never flipped under framing pressure
GPT     Confident proceduralist  3/6 — conf = 1.00 on facts
Gemini  Optimistic synthesizer   3/6 — upward bias on boundary claims
Claude  Structural controller    3/6 — highest frame sensitivity

We also measured frame-reversal stability (A1) across 5 prompt frames: neutral, user_positive, user_negative, maximal_abstract, minimal_concrete. Grok was the only model that never hard-flipped, i.e. never changed a verdict from true→false or false→true under framing pressure.
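A hard flip, as defined here, can be detected with a simple predicate over per-frame verdicts. The record shapes below are hypothetical, not the repo's actual schema:

```python
FRAMES = ["neutral", "user_positive", "user_negative",
          "maximal_abstract", "minimal_concrete"]

def hard_flipped(verdicts_by_frame: dict) -> bool:
    """True if framing produced both definite verdicts (true AND false)
    for the same claim. Drifting to/from 'undecidable' is not a hard flip."""
    verdicts = set(verdicts_by_frame.values())
    return {"true", "false"} <= verdicts

# Hypothetical per-frame verdict records for one claim
stable = {f: "true" for f in FRAMES}
softened = dict(stable, user_negative="undecidable")  # soft drift, not a flip
flipped = dict(stable, user_negative="false")         # true -> false: hard flip

print(hard_flipped(stable), hard_flipped(softened), hard_flipped(flipped))
# → False False True
```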


What This Implies

The value of multi-LLM systems is not in voting. It’s in evidence aggregation.

S₃ = ⋃ evidence_atoms(Mᵢ) where verdict(Mᵢ) = consensus

A system that pools evidence from 4 architectures has access to ~2.83× the independent information of any single model. Not because they disagree on conclusions — but because they arrive via different paths.

This reframes ensemble LLM design: don’t just collect verdicts, collect justifications. The diversity lives there.
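The pooling rule S₃ above amounts to a set union over consensus-aligned responses. A minimal sketch with illustrative data (the atom strings paraphrase the B10 example; the function and field names are hypothetical):

```python
from collections import Counter

def pooled_evidence(responses: list) -> set:
    """Union of evidence atoms from every model whose verdict matches
    the consensus (here: the most common verdict across responses)."""
    consensus = Counter(r["verdict"] for r in responses).most_common(1)[0][0]
    return set().union(*(set(r["atoms"]) for r in responses
                         if r["verdict"] == consensus))

responses = [  # illustrative B10-style responses, one per model
    {"verdict": "undecidable", "atoms": ["emergent capabilities", "semantic structure"]},
    {"verdict": "undecidable", "atoms": ["word-sense disambiguation benchmarks"]},
    {"verdict": "undecidable", "atoms": ["Theory of Mind tests"]},
    {"verdict": "undecidable", "atoms": ["pragmatic cue handling"]},
]
print(len(pooled_evidence(responses)))  # → 5 distinct atoms, vs at most 2 from any single model
```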


Limitations

  • Evidence atoms are self-reported. Models may confabulate specific citations. We measure textual distinctiveness, not factual accuracy.

  • Tokenization-based Jaccard is a proxy. Embedding-based similarity would be stronger.

  • 6 claims is a small sample. The pattern needs validation at scale.

  • Wire contamination: the human operator designs the protocol and selects claims.


Code & Data

Everything is open:

github.com/khvorost-creator/yedyne-tilo

  • claims.json — full data for all 6 claims × 4 models

  • metrics.py — Jaccard, n_eff, A1/A2/A3

  • runner.py — aggregation pipeline (python runner.py --offline works without API key)

  • React UI for manual aggregation

The protocol is designed to be extended: add models, add claims, add frames.


Happy to discuss methodology, limitations, or extensions.
