Thank you for the careful analysis and thoughtful comments. The paper, code, dataset and the website were all generated with my agent and Claude Code in about 4 days, as an experiment to test the orchestration ability of my agent. I didn’t get any internal or external review before posting. The code quality and data quality problems you pointed out are fair, and I appreciate that.
Actually, while conducting this research, I also noticed how sensitive the paper's conclusions are to the datasets and hyperparameters. There are two hyperparameters in this paper: the penalty λ applied to wrong answers (λ = 0 means no penalty), and, when penalizing, whether the minimum score for each difficulty level is floored at 0 (floor). A third choice is whether to use all open models or MoE models only for estimation.
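The scoring rule above can be sketched as follows. This is a minimal illustration of the λ/floor interaction; `level_score` and its signature are my own names here, not the paper's actual code.

```python
# Hypothetical sketch of the per-difficulty-level scoring rule:
# score = correct + lam * wrong, optionally floored at 0.

def level_score(correct: int, wrong: int, lam: float, floor: bool) -> float:
    """Score one difficulty level; lam <= 0 penalizes wrong answers."""
    s = correct + lam * wrong
    return max(s, 0.0) if floor else s

# With lam = 0 the score can never go negative, so the floor has no effect:
assert level_score(3, 5, 0.0, True) == level_score(3, 5, 0.0, False) == 3.0

# With lam = -1 the floor matters whenever wrong answers outnumber correct ones:
print(level_score(3, 5, -1.0, True))   # 0.0 (floored)
print(level_score(3, 5, -1.0, False))  # -2.0
```

This also makes it clear why toggling the floor only changes results when λ < 0.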
You showed that just toggling the floor moves GPT-5.5 from 8.8T down to 1.9T. Here are the 12 combinations of {floor on/off} × {λ ∈ {0, −0.5, −1}} × {all open models, MoE only}:
- λ=0, all open (n=89, R²=0.908): GPT-5.5 7.6T, Opus 4.7 1.7T, Gemini 3.1 Pro 15.1T
- floor=T, λ=−0.5, all open (n=89, R²=0.920): GPT-5.5 8.3T, Opus 4.7 2.8T, Gemini 3.1 Pro 26.7T
- floor=F, λ=−0.5, all open (n=89, R²=0.878): GPT-5.5 3.1T, Opus 4.7 1.8T, Gemini 3.1 Pro 8.6T
- floor=T, λ=−1, all open — the paper (n=89, R²=0.917): GPT-5.5 8.8T, Opus 4.7 3.6T, Gemini 3.1 Pro 40.8T
- floor=F, λ=−1, all open — your fix (n=89, R²=0.784): GPT-5.5 1.9T, Opus 4.7 1.9T, Gemini 3.1 Pro 6.4T
- λ=0, MoE only (n=37, R²=0.673): GPT-5.5 10.4T, Opus 4.7 2.0T, Gemini 3.1 Pro 22.1T
- floor=T, λ=−0.5, MoE only (n=37, R²=0.777): GPT-5.5 8.4T, Opus 4.7 2.9T, Gemini 3.1 Pro 26.9T
- floor=F, λ=−0.5, MoE only (n=37, R²=0.826): GPT-5.5 4.2T, Opus 4.7 2.3T, Gemini 3.1 Pro 12.3T
- floor=T, λ=−1, MoE only (n=37, R²=0.792): GPT-5.5 7.8T, Opus 4.7 3.4T, Gemini 3.1 Pro 33.2T
- floor=F, λ=−1, MoE only (n=37, R²=0.802): GPT-5.5 2.6T, Opus 4.7 2.5T, Gemini 3.1 Pro 9.1T
(At λ=0, floor=T and floor=F are identical because no score can go negative when wrong answers are not penalized, so the 12 combinations collapse to the 10 rows above.)
The spread gets even wider with Wikidata-only probes: floored λ=−1 puts Gemini 3.1 Pro at ~163T. Given this spread, no single point estimate is honest; I should have reported a band rather than a number.
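For concreteness, collecting the per-model estimates from the ten rows above (all-open and MoE-only settings combined, values in trillions of parameters) gives the band I should have reported:

```python
# Estimates copied from the table above, per model, across all 10 distinct
# (floor, lambda, model-subset) settings; values in trillions of parameters.
estimates = {
    "GPT-5.5":        [7.6, 8.3, 3.1, 8.8, 1.9, 10.4, 8.4, 4.2, 7.8, 2.6],
    "Opus 4.7":       [1.7, 2.8, 1.8, 3.6, 1.9, 2.0, 2.9, 2.3, 3.4, 2.5],
    "Gemini 3.1 Pro": [15.1, 26.7, 8.6, 40.8, 6.4, 22.1, 26.9, 12.3, 33.2, 9.1],
}
for model, vals in estimates.items():
    print(f"{model}: {min(vals)}T to {max(vals)}T")
# GPT-5.5: 1.9T to 10.4T
# Opus 4.7: 1.7T to 3.6T
# Gemini 3.1 Pro: 6.4T to 40.8T
```

(These bands exclude the Wikidata-only outlier, which would stretch the Gemini band further still.)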
My parameter selection was based on maximizing the R² value for open-weight models.
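That selection procedure can be sketched as below. Everything here is synthetic and illustrative (`r_squared`, `score`, and the noise table are my own stand-ins, not the paper's code); only the shape of the procedure mirrors what I did: re-score under each setting, fit against log-parameter-count on open models, keep the setting with the highest R².

```python
# Hedged sketch: pick (lambda, floor) by maximizing R-squared of a log-linear
# fit on open-weight models. All data below is synthetic.
import random

def r_squared(xs, ys):
    """R-squared of an ordinary least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

random.seed(0)
log_params = [random.uniform(9, 12) for _ in range(20)]  # synthetic open models

def score(lp, lam, floor):
    # Stand-in for re-scoring the probe results under one (lam, floor) setting;
    # the noise level per setting is invented purely for illustration.
    noise = {(0.0, True): 0.3, (0.0, False): 0.3, (-0.5, True): 0.2,
             (-0.5, False): 0.5, (-1.0, True): 0.25, (-1.0, False): 0.6}
    return lp + random.gauss(0, noise[(lam, floor)])

grid = [(lam, floor) for lam in (0.0, -0.5, -1.0) for floor in (True, False)]
best = max(grid, key=lambda s: r_squared([score(lp, *s) for lp in log_params],
                                         log_params))
print("selected setting (lambda, floor):", best)
```

The weakness this makes visible: the winning setting is whichever happens to fit the open models best, which says nothing about extrapolation to closed models.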
Therefore, IKP is merely an interesting idea and an early-stage study; currently, it cannot reliably estimate the parameter counts of closed-source models.
To be clear, I shipped this on arXiv knowing it was premature. IKP was meant as a starting point, not the end. I hope future work will explore this problem in greater depth. Thanks again for your excellent analysis and comments!
— Bojie Li
Thank you so much for your thoughtful and gracious response, especially given that our post highlighted some concerns with the original work.
It is somewhat challenging to design principled methods for selecting the hyperparameters you mentioned; this seems like an area for future work. We also experimented with settings such as λ = −0.25, which produced an even higher R² of 0.925 on the open-source models but predictions for the proprietary models that seemed quite low to us.
The method certainly seems promising and we are excited to see your future results!