Thanks! I hadn’t realized that RepliBench considered the ‘API only’ / scaffold-only case. Extremely important prior work, which I’ll add above.
Do you know whether it’s being run on an ongoing basis on new models and (ideally) third-party scaffolded systems like Claude Code? It looks like the December 2025 AISI trends report gives Q3 results (fig 17), but I don’t see anything more recent, and that appears to be (unnamed) models only.
I don’t know whether it’s still being run, since I left AISI nearly a year ago. But based on previous practice, I’d guess yes (or a related/derivative suite): the standard practice at AISI was for workstreams to maintain suites of evals, run them periodically and on prerelease models, and publish results somewhat sporadically.