Research Fellow at the GPAI Policy Lab
Laura Domenech
Karma: 18
Mapping AI Capabilities to Human Expertise on the Rosetta Stone (Epoch Capabilities Index)
Great post, thank you. It helped me see the patch → new jailbreak → repeat pattern as evidence for the claim that current models fail to really internalize reinforced human values.
Thanks for the comment! It’s an ongoing project so we are definitely open to suggestions.
Regarding your remark: on one hand, it certainly doesn’t mean that humans at these expertise levels can be replaced. See our note: “These benchmarks are simplified proxies, not direct measures of real-world professional competence. Scoring at ‘Domain Expert level’ on a benchmark does not mean models can replace domain experts in their actual work. [etc.]” The TLDR also acknowledges, perhaps too succinctly, that the mapping only applies to “technical and scientific benchmark skills”.
On the other hand, these are empirical baseline results from real experts who took these benchmarks (with limited time) and scored lower than leading models. So scoring at Domain Expert level does mean that a model matches or exceeds the accuracy of human experts on these specific, closed-ended sets of benchmark tasks.
For the next iteration we’d like to add more complex benchmarks and additional difficulty axes that should account for other factors like fluid intelligence and real-world messiness. This should help clarify that models are not exceeding experts on all axes.