This benchmark has human expert percentiles, which makes it very convenient for exactly the kind of stuff you are doing (though I decided to calculate SDs as a function of release date rather than compute, just because it’s more intuitive).
I wrote down SOTA models, their release dates, and performance:
| Model | Release date | Normalized date (days) | Accuracy | Expert percentile | z-score |
|---|---|---|---|---|---|
| GPT-4 Turbo | 2023-06-01 | 0 | 16.8% | 43% | -0.18 |
| Gemini 1.5 Pro | 2024-02-15 | 259 | 25.4% | 61% | 0.28 |
| Sonnet 3.5 | 2024-06-20 | 385 | 26.9% | 69% | 0.50 |
| Sonnet 3.5 v2 | 2024-10-22 | 509 | 33.6% | 75% | 0.67 |
| o1 | 2024-12-05 | 553 | 35.4% | 89% | 1.23 |
| o3 | 2025-04-16 | 685 | 43.8% | 94% | 1.55 |
Z-scores are the expert percentiles converted through the inverse normal CDF. A linear fit of z-score against release date gives roughly 0.90 SD/year for LLMs, so at this rate we should expect an LLM as good as a +6 SD human virology expert around 2030.
https://www.virologytest.ai/
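The fit and extrapolation above can be reproduced in a few lines; this is a minimal sketch using the table's normalized dates and z-scores with a plain least-squares line (no external libraries):

```python
from datetime import date, timedelta

# Data from the table: days since 2023-06-01 and expert-percentile z-scores.
days = [0, 259, 385, 509, 553, 685]
z = [-0.18, 0.28, 0.50, 0.67, 1.23, 1.55]

# Ordinary least-squares fit: z = a + b * days.
n = len(days)
mx = sum(days) / n
mz = sum(z) / n
b = sum((x - mx) * (y - mz) for x, y in zip(days, z)) / sum((x - mx) ** 2 for x in days)
a = mz - b * mx

sd_per_year = b * 365.25

# Extrapolate the trend line to z = +6.
days_to_6sd = (6 - a) / b
eta = date(2023, 6, 1) + timedelta(days=days_to_6sd)
print(f"{sd_per_year:.2f} SD/year, +6 SD around {eta.year}")
# → 0.90 SD/year, +6 SD around 2030
```

This matches the numbers in the text: about 0.90 SD/year, crossing +6 SD in 2030.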
I wish more benchmarks had human percentiles.