Do Four LLMs Think Independently? Measuring Epistemic Independence via Evidence Atoms

TL;DR: We ran a structured protocol across Claude, GPT-4, Gemini, and Grok on 6 claims. Verdict-level agreement: n_eff = 1.00 (identical). Evidence-level independence: n_eff = 2.83. The models are one witness when you ask what, but nearly three independent witnesses when you ask why.


The Question

When multiple LLMs agree on a verdict, does that agreement mean anything? Or are they just one model running four times?

The standard answer is “probably just one model” — shared training corpora, shared RLHF, shared epistemic priors. And at the verdict level, that’s exactly what we found.

But verdict is a coarse signal. We wanted to know what happens at the level of justification.


The Protocol

We built a structured multi-model aggregation system called Єдине Тіло (One Body):

  1. Issue identical prompts to 4 models in parallel (star topology, no cross-communication)

  2. Collect from each: verdict ∈ {true, false, undecidable} + confidence + 3 specific evidence atoms + 2 counter-atoms

  3. Compute Jaccard similarity across all pairwise evidence sets

  4. Derive n_eff_evidence = k / (1 + (k-1) × mean_J)

The key move: instead of comparing verdicts (3 possible values), we compare evidence atoms — specific facts, studies, mechanisms cited in support. This is where architectural differences manifest.


Results

Verdict level: All 4 models gave identical verdicts on all 6 claims.

Claim  Topic                                      Verdict      mean_J  n_eff
F01    Water boils at 100°C at 1 atm              true         0.189   2.55
F07    Normal body temperature ~37°C              true         0.173   2.63
B03    Gut microbiome influences mood             true         0.137   2.84
B10    LLMs demonstrate language understanding    undecidable  0.083   3.20
S02    Nuclear energy safe for climate            undecidable  0.180   2.60
S09    Open-source AI safer than closed AI        undecidable  0.068   3.33

n_eff_verdict = 1.00. n_eff_evidence = 2.83.
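As an arithmetic check, the headline evidence figure is reproduced by averaging the six per-claim mean_J values from the table and applying the n_eff formula to that grand mean (averaging the per-claim n_eff values directly would instead give ≈2.86, so the headline number appears to pool Jaccard first):

```python
mean_j_per_claim = [0.189, 0.173, 0.137, 0.083, 0.180, 0.068]
k = 4  # models

grand_mean_j = sum(mean_j_per_claim) / len(mean_j_per_claim)
n_eff_evidence = k / (1 + (k - 1) * grand_mean_j)
print(f"mean_J = {grand_mean_j:.3f}, n_eff = {n_eff_evidence:.2f}")
# → mean_J = 0.138, n_eff = 2.83
```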


The Interesting Pattern

undecidable claims yield higher n_eff than true claims.

Where ground truth is unambiguous (water boiling), models partially converge on the same canonical sources. Where the question is genuinely open, each architecture searches its own epistemic space — maximum divergence.

B10 (“LLMs demonstrate understanding”): Claude cites emergent capabilities and semantic structure. GPT cites word-sense disambiguation benchmarks. Gemini cites Theory of Mind tests. Grok cites pragmatic cue handling. All say undecidable — from completely different places.


Jaccard Similarity Matrix (mean across 6 claims)

        GPT    Gemini  Grok   Claude
GPT     –      0.136   0.120  0.152
Gemini  0.136  –       0.112  0.150
Grok    0.120  0.112   –      0.161
Claude  0.152  0.150   0.161  –

All pairs well below 0.20. Grok–Gemini most independent (0.112). Claude most similar to all others — consistent with its training emphasis on structured reasoning from established frameworks.
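A quick sketch that recomputes each model's mean similarity to the other three from the pair values transcribed above:

```python
models = ["GPT", "Gemini", "Grok", "Claude"]
J = {  # mean pairwise Jaccard across the 6 claims, from the matrix above
    ("GPT", "Gemini"): 0.136, ("GPT", "Grok"): 0.120, ("GPT", "Claude"): 0.152,
    ("Gemini", "Grok"): 0.112, ("Gemini", "Claude"): 0.150,
    ("Grok", "Claude"): 0.161,
}

def pair_j(a: str, b: str) -> float:
    """Look up a symmetric pair in either order."""
    return J[(a, b)] if (a, b) in J else J[(b, a)]

# Mean similarity of each model to the other three
for m in models:
    mean = sum(pair_j(m, o) for o in models if o != m) / (len(models) - 1)
    print(f"{m}: {mean:.3f}")
```

Claude's row mean (≈0.154) is the highest and Grok's (≈0.131) the lowest, consistent with the reading above.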


Architectural Character Signatures

A secondary finding: stable per-model epistemic signatures across all claims and frames.

Model   Signature                Sycophancy resistance (A1)
Grok    Cold epistemic anchor    6/6 — never flipped under framing pressure
GPT     Confident proceduralist  3/6 — conf = 1.00 on facts
Gemini  Optimistic synthesizer   3/6 — upward bias on boundary claims
Claude  Structural controller    3/6 — highest frame sensitivity

We also measured frame-reversal stability (A1) across 5 prompt frames: neutral, user_positive, user_negative, maximal_abstract, minimal_concrete. Grok was the only model that never hard-flipped, i.e. never changed a verdict from true→false or false→true under framing pressure.
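A hard flip, as defined here, can be detected with a simple predicate over per-frame verdicts. The record shapes below are hypothetical, not the repo's actual schema:

```python
FRAMES = ["neutral", "user_positive", "user_negative",
          "maximal_abstract", "minimal_concrete"]

def hard_flipped(verdicts_by_frame: dict) -> bool:
    """True if framing produced both definite verdicts (true AND false)
    for the same claim. Drifting to/from 'undecidable' is not a hard flip."""
    verdicts = set(verdicts_by_frame.values())
    return {"true", "false"} <= verdicts

# Hypothetical per-frame verdict records for one claim
stable = {f: "true" for f in FRAMES}
softened = dict(stable, user_negative="undecidable")  # soft drift, not a flip
flipped = dict(stable, user_negative="false")         # true -> false: hard flip

print(hard_flipped(stable), hard_flipped(softened), hard_flipped(flipped))
# → False False True
```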


What This Implies

The value of multi-LLM systems is not in voting. It’s in evidence aggregation.

S₃ = ⋃ evidence_atoms(Mᵢ) where verdict(Mᵢ) = consensus

A system that pools evidence from 4 architectures has access to ~2.83× the independent information of any single model. Not because they disagree on conclusions — but because they arrive via different paths.

This reframes ensemble LLM design: don’t just collect verdicts, collect justifications. The diversity lives there.
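The pooling rule S₃ above amounts to a set union over consensus-aligned responses. A minimal sketch with illustrative data (the atom strings paraphrase the B10 example; the function and field names are hypothetical):

```python
from collections import Counter

def pooled_evidence(responses: list) -> set:
    """Union of evidence atoms from every model whose verdict matches
    the consensus (here: the most common verdict across responses)."""
    consensus = Counter(r["verdict"] for r in responses).most_common(1)[0][0]
    return set().union(*(set(r["atoms"]) for r in responses
                         if r["verdict"] == consensus))

responses = [  # illustrative B10-style responses, one per model
    {"verdict": "undecidable", "atoms": ["emergent capabilities", "semantic structure"]},
    {"verdict": "undecidable", "atoms": ["word-sense disambiguation benchmarks"]},
    {"verdict": "undecidable", "atoms": ["Theory of Mind tests"]},
    {"verdict": "undecidable", "atoms": ["pragmatic cue handling"]},
]
print(len(pooled_evidence(responses)))  # → 5 distinct atoms, vs at most 2 from any single model
```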


Limitations

  • Evidence atoms are self-reported. Models may confabulate specific citations. We measure textual distinctiveness, not factual accuracy.

  • Tokenization-based Jaccard is a proxy. Embedding-based similarity would be stronger.

  • 6 claims is a small sample. The pattern needs validation at scale.

  • Wire contamination: the human operator designs the protocol and selects claims.


Code & Data

Everything is open:

github.com/khvorost-creator/yedyne-tilo

  • claims.json — full data for all 6 claims × 4 models

  • metrics.py — Jaccard, n_eff, A1/A2/A3

  • runner.py — aggregation pipeline (python runner.py --offline works without API key)

  • React UI for manual aggregation

The protocol is designed to be extended: add models, add claims, add frames.


Happy to discuss methodology, limitations, or extensions.
