The author claimed on Zhihu that this work was done by an AI agent in 4 days. It shows.
The website and codebase bear obvious hallmarks of careless vibe-coding: inconsistent definitions, silent failures, code that contradicts the paper text, etc.
Edit: Sanity-checking “Incompressible Knowledge Probes” by @Sturb @LawrenceC suggests these results are very inaccurate due to various methodological issues, although “the core idea behind the paper is largely sound”.
Estimates of the total parameter count of frontier models, using 1700 obscure factual questions and a roughly log-linear regression on 89 open-weight models (the paper’s actual title is mostly clickbait) (via AI Dance on Twitter)
Trimmed; the full version is under section 6.3. I don’t have enough experience to have an intuition about how accurate these estimates are.
| Model | Accuracy | Est. Size | 90% PI |
|---|---|---|---|
| GPT-5.5 | 71.9% | ∼9.7T | [3.2–28.7T] |
| Claude Opus 4.6 | 68.0% | ∼5.3T | [1.8–15.6T] |
| Claude Opus 4.7 | 66.4% | ∼4.0T | [1.4–12.0T] |
| GPT-5.4 Pro | 62.5% | ∼2.2T | [736B–6.5T] |
| Claude Sonnet 4.6 | 60.9% | ∼1.7T | [579B–5.1T] |
| Claude Haiku 4.5 | 39.9% | ∼65B | [22B–194B] |
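The mapping from probe accuracy to an estimated size in this table can be sketched as a log-linear fit on open-weight models, inverted for black-box ones. A minimal sketch; the calibration points and resulting numbers below are made up for illustration, not the paper’s data:

```python
import numpy as np

# Hypothetical (parameter count, probe accuracy) pairs for open-weight
# models; the actual paper fits 89 such models.
params = np.array([0.135e9, 1e9, 7e9, 70e9, 405e9, 1.6e12])
accuracy = np.array([0.05, 0.12, 0.25, 0.42, 0.55, 0.65])

# Fit accuracy ~ a * log10(params) + b on the open-weight models.
a, b = np.polyfit(np.log10(params), accuracy, 1)

def estimate_size(acc: float) -> float:
    """Invert the fit: project a black-box model's probe accuracy
    onto an estimated total parameter count."""
    return 10 ** ((acc - b) / a)

print(f"Estimated size at 60% accuracy: {estimate_size(0.60):.3g} params")
```

Because the fit is linear in log(params), equal steps in accuracy map to multiplicative steps in size, which is why small accuracy gaps between models translate into multi-trillion-parameter gaps in the table.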
Thanks for the mention!
Amusingly, it was this shortform that caused me to start writing the post: I started drafting a response on the issues I had, and then it ballooned into a full investigation and Ben Sturgeon got pulled in as well.
You’re welcome; I think I have a responsibility to try to clear up any misinformation I spread, even if by accident. I suspected that I caused this investigation too, since you posted it on LessWrong and afaict I was the only one talking about this paper. I feel both amused and slightly regretful about this whole chain of events.
Interesting work. Let me compare it with my estimates from three weeks ago: for all eight GPT-5-series models I considered (5, 5 Pro, 5.1, 5.2, 5.2 Pro, 5.3, 5.4, 5.4 Pro), 2T total parameters falls within the 90% prediction interval, and four more models I didn’t consider (4o, o1, o3, 4.1) fit as well. My 1.2T estimate for Sonnet is very close to Li’s 1.7T, and my 4T estimate for the Opus 4 series falls within the 90% PI for all five versions. (As a reminder: on average, we should expect 1 true value out of 10 to fall outside a 90% interval.)
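That coverage intuition can be checked with a quick back-of-the-envelope calculation (a sketch: it treats each interval check as an independent 90% draw, and the count of 18 checks is illustrative):

```python
n = 18          # illustrative number of (model, interval) checks
p_miss = 0.10   # a true value misses its 90% PI with probability 0.1

p_any_miss = 1 - (1 - p_miss) ** n   # chance at least one value misses
expected_misses = n * p_miss         # expected number of misses

print(round(p_any_miss, 3), round(expected_misses, 1))
```

So even if every interval is perfectly calibrated, seeing one or two misses across this many models would be entirely unsurprising.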
IMPORTANT UPDATE: Sanity-checking “Incompressible Knowledge Probes” by @Sturb and @LawrenceC (found via Twitter’s algorithm, by way of Lisan al Gaib @scaling01).
They also posted a Twitter thread.
| Model | Paper estimate [90% PI] | Estimate w/ corrections [90% PI] | Δ paper→corrected |
|---|---|---|---|
| gpt-5.5-pro | 10,267B [3,422 – 30,801] | 1,471B [258 – 8,385] | ↓6.98× |
| gpt-5.5-think | 9,656B [3,219 – 28,968] | 1,458B [256 – 8,311] | ↓6.62× |
| gpt-5.5 | 8,831B [2,944 – 26,493] | 1,459B [256 – 8,316] | ↓6.05× |
| claude-opus-4.6-think | 5,254B [1,751 – 15,762] | 1,399B [245 – 7,974] | ↓3.76× |
| claude-opus-4.7-think | 4,041B [1,347 – 12,123] | 1,132B [199 – 6,452] | ↓3.57× |
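The wide 90% PI brackets in these tables follow from the regression’s residual scatter: an additive band on accuracy becomes a multiplicative band on parameters once the fit is inverted. A minimal sketch with made-up data (1.645 is the two-sided 90% normal quantile; a real prediction interval would also widen for extrapolation beyond the fitted range):

```python
import numpy as np

log_p = np.array([8.1, 9.0, 9.8, 10.8, 11.6, 12.2])   # log10(params)
acc = np.array([0.05, 0.12, 0.25, 0.42, 0.55, 0.65])  # probe accuracy

a, b = np.polyfit(log_p, acc, 1)
s = (acc - (a * log_p + b)).std(ddof=2)  # residual std of the fit

def interval(acc_obs: float):
    """Invert an accuracy band of +/- 1.645*s into a parameter band."""
    lo = 10 ** ((acc_obs - 1.645 * s - b) / a)
    mid = 10 ** ((acc_obs - b) / a)
    hi = 10 ** ((acc_obs + 1.645 * s - b) / a)
    return lo, mid, hi

lo, mid, hi = interval(0.60)
print(f"{lo:.2g} – {mid:.2g} – {hi:.2g} parameters")
```

Because the band is symmetric in log space, it is a constant multiplicative factor either side of the point estimate, which matches the shape of the intervals quoted above.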
Chinese websites are notoriously hard to archive and rot extremely quickly, so here is the Zhihu content in full. The bolded part corresponds to the claim that “this work was done by an AI agent in 4 days”.
https://www.zhihu.com/pin/2032769685012361774 (https://archive.ph/drfZi)
李博杰 (Li Bojie)
Closed labs hide their models’ sizes, but they can’t hide what the models know. And what a model knows is precisely an indicator of its parameter count.
Reasoning can be compressed; factual knowledge cannot. So from black-box API calls alone, you can estimate the scale of frontier models; across successive version releases, you can even see when a given fact entered the parameters.
For three years, my friends 何纪言 (He Jiyan) and 郑子涵 (Zheng Zihan) have been asking frontier models the same question: “Do you know about the USTC Hackergame?” (a CTF competition). In May 2024, GPT-4o made up problem names that don’t exist. In February 2025, Claude 3.7 Sonnet accurately listed the 19 problems from 2023. By April 2026, frontier models could recall specific problems from several consecutive years of the competition.
After DeepSeek-V4 was released, **I had my agent spend four days autonomously building** “Incompressible Knowledge Probes” (IKP): a dataset of 1400 questions across 7 rarity tiers, tested on 188 models from 27 vendors. Three findings:
1/ Factual accuracy alone is enough to estimate the scale of any black-box LLM. Accuracy is log-linear in parameter count, with R² = 0.917 across 89 open-weight models from 135M to 1.6T parameters. Projecting closed models onto this fit → GPT-5.5 ~9T, Claude Opus 4.7 ~4T, GPT-5.4 ~2.2T, Claude Sonnet 4.6 ~1.7T, Gemini 2.5 Pro ~1.2T (90% confidence interval: 0.3–3× the estimate).
2/ Citation counts and h-index do not predict whether a frontier model knows a researcher. Two researchers with similar citation counts can get completely different answers. Models remember people who did influential work, not authors who published many incremental papers.
3/ Factual capacity is not being compressed over time. Across 96 open-weight models spanning 3 years, the IKP time coefficient is statistically zero, rejecting the Densing Law’s predicted +0.0117/month at p < 10⁻¹⁵. Benchmarks are saturating, while factual capacity keeps scaling with parameters.
Website: link
Paper: link
Posted 2026-04-29 10:34 · IP location: Beijing
This is really cool. How big do you think mythos is?
I’m not the original researcher, and obviously we would need to be able to ask mythos those 1700 questions to get the estimate.