This benchmark has human expert percentiles, which makes it very convenient for exactly the kind of stuff you are doing (though I decided to calculate SDs as a function of release date rather than compute, just because it’s more intuitive).
I wrote down SOTA models, their release dates, and performance:
| Model | Release date | Normalized date (days) | Accuracy | Expert percentile | z-score |
|---|---|---|---|---|---|
| GPT-4 Turbo | 2023-06-01 | 0 | 16.8% | 43% | -0.18 |
| Gemini 1.5 Pro | 2024-02-15 | 259 | 25.4% | 61% | 0.28 |
| Sonnet 3.5 | 2024-06-20 | 385 | 26.9% | 69% | 0.50 |
| Sonnet 3.5 v2 | 2024-10-22 | 509 | 33.6% | 75% | 0.67 |
| o1 | 2024-12-05 | 553 | 35.4% | 89% | 1.23 |
| o3 | 2025-04-16 | 685 | 43.8% | 94% | 1.55 |
Z-scores are the expert percentiles converted through the inverse normal CDF. A linear fit of z-score against release date gives roughly 0.90 SD/year for LLMs, so at this rate we should expect an LLM as good as a +6 SD human virology expert around 2030.
https://www.virologytest.ai/
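The fit and extrapolation above can be reproduced in a few lines; this is a minimal sketch using the table's normalized dates and z-scores with a plain least-squares line (no external libraries):

```python
from datetime import date, timedelta

# Data from the table: days since 2023-06-01 and expert-percentile z-scores.
days = [0, 259, 385, 509, 553, 685]
z = [-0.18, 0.28, 0.50, 0.67, 1.23, 1.55]

# Ordinary least-squares fit: z = a + b * days.
n = len(days)
mx = sum(days) / n
mz = sum(z) / n
b = sum((x - mx) * (y - mz) for x, y in zip(days, z)) / sum((x - mx) ** 2 for x in days)
a = mz - b * mx

sd_per_year = b * 365.25

# Extrapolate the trend line to z = +6.
days_to_6sd = (6 - a) / b
eta = date(2023, 6, 1) + timedelta(days=days_to_6sd)
print(f"{sd_per_year:.2f} SD/year, +6 SD around {eta.year}")
# → 0.90 SD/year, +6 SD around 2030
```

This matches the numbers in the text: about 0.90 SD/year, crossing +6 SD in 2030.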
I wish more benchmarks had human percentiles.