For my own future reference, here are some “benchmarks” (very broadly construed) I pay attention to as of Nov 2025, a mix of serious and whimsical:
the AI Village and its blog, not really a “benchmark” per se but by far my richest source of intuitions about current frontier models’ capabilities at open-ended long-horizon tasks; it made me notice stuff like the Claudes being way better than otherwise “benchmark-equivalent” frontier models
certain folks’ opinions, e.g. lc on cybersec (like this take on AISLE), Kevin Buzzard and Terry Tao on math, Adam Karvonen on physical tasks relevant to manufacturing, Sarah Constantin on lit reviews, nostalgebraist on blog posts worth reading, Gwern and Jennifer Chen on writing diversity & creativity (but not Sam Altman), Cole Wyeth on novel ideas, Thane Ruthenis (although Thane’s milestones are on a totally different capability tier), Steven Byrnes on AGI, and many others
chats on read.haus with AI simulations of prominent authors becoming preferable to reading those authors’ real writing. Scott Alexander, Sarah Constantin, Spencer Greenberg, Byrne Hobart, Tyler Cowen, Dwarkesh Patel, Andy Matuschak etc. are all on there, but they never come across quite right to me
Starburst, on fictional theoretical physics. I don’t really get their leaderboard, though
a small set of work-related spreadsheet modelling problems I keep thinking current agents should easily handle but which they keep failing at in very irritating ways, Claude Code included. I’m waiting for agents that finally speed me up rather than slow me down on these. Possibly a skill issue on my part
FWIW, Anthropic members of technical staff’s estimates of their productivity boost: currently 1.15-1.4x with Sonnet 4.5 for most, except one person at 2x because “their workflow was now mainly focused on managing multiple agents”; I wonder if that’s the same person Sholto Douglas mentioned was working with 9 agents at the same time
how blind models see the Earth, plot the Mandelbrot set, etc. (a toy Mandelbrot sketch is at the end of this list)
avoiding spiralling into spiritual bliss attractors (maybe this is just Claude being a hippie)
just for fun, Gary Marcus’ 5 challenges by 2029 (from 2022)
the models’ slope on the chart below exceeding that of humans (I’m not a fan of the notion of task horizon length, since it bakes in performance plateauing that doesn’t happen when humans think longer, hence the focus on slope; toy slope-fitting sketch at the end of this list)
FrontierMath Tier 4 because I like math x AI, plus commentary like Kevin Buzzard’s “I was amused this week to have been sent data on what happens if you ask lots of agents to try and solve these problems and you mark the question as being solved if at least one agent gets the answer correct at least once”
vibe-proving math theorems in Lean, except without it taking a week or being “extremely annoying” (despite Adam Mastroianni’s argument that what a dream job really feels like is being perpetually annoyed). The main issue is verifying that the human-proof-to-Lean-code translation is faithful, which doesn’t seem automatable; see the toy Lean example at the end of this list
Epoch’s Capabilities Index (ECI), because it’s general (a composite metric over most of the high-profile benchmarks out there) stitched together with a methodology that seems intuitively correct (item response theory; toy sketch at the end of this list). Admittedly, as someone who started out believing anything is measurable if you try hard enough, I’ve gradually grown disillusioned enough to down-weight even ostensibly good composite benchmarks like ECI a fair bit. Also CAIS’s definition of AGI
Scale’s Remote Labor Index because I work remotely. 230 projects from Upwork freelancers “excluding projects requiring physical labor, long-term evaluation, or direct client interaction”, mean and median human completion time 29 and 11.5 hours respectively, mean and median project value $630 and $200. Manus at 2.50% tops the leaderboard, then Sonnet 4.5 > GPT-5 > ChatGPT Agent > Gemini 2.5 Pro last at 0.83%, which matches my impression of their relative “IRL competence” in the AI Village
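On the Mandelbrot item above, a minimal sketch of the kind of from-memory rendering task those experiments involve: a pure-Python ASCII escape-time plot. The resolution, iteration cap, and shading characters are my own arbitrary choices, not anything the linked experiments specify.

```python
# Toy escape-time Mandelbrot renderer, the kind of from-memory task the
# "blind models" experiments ask for. Resolution and iteration cap are
# arbitrary choices here, not anything the linked experiments prescribe.
def mandelbrot_ascii(width=80, height=32, max_iter=50):
    rows = []
    for j in range(height):
        # Map pixel coordinates to the complex plane: re in [-2, 1], im in [-1.2, 1.2]
        im = 1.2 - 2.4 * j / (height - 1)
        row = []
        for i in range(width):
            re = -2.0 + 3.0 * i / (width - 1)
            c = complex(re, im)
            z = 0j
            for k in range(max_iter):
                z = z * z + c
                if abs(z) > 2.0:        # escaped: definitely outside the set
                    row.append(" .:-=+*#"[min(k // 6, 7)])
                    break
            else:
                row.append("@")         # never escaped: treat as inside
        rows.append("".join(row))
    return "\n".join(rows)

if __name__ == "__main__":
    print(mandelbrot_ascii())
```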
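On the task-horizon slope item, a toy sketch of what comparing slopes (rather than levels) means: fit log2(task horizon) against time, so the slope reads as doublings per year. The data points are invented for illustration and are not METR’s published numbers.

```python
# Toy slope comparison on a METR-style horizon chart: fit log2(task horizon)
# against release date, so the slope is "doublings per year". The numbers
# below are invented for illustration, not anyone's published measurements.
import math

# (years since some reference date, task horizon in minutes) -- hypothetical
points = [(0.0, 5.0), (0.5, 9.0), (1.0, 18.0), (1.5, 33.0), (2.0, 70.0)]

def doublings_per_year(points):
    """Ordinary least-squares slope of log2(horizon) vs. time."""
    xs = [t for t, _ in points]
    ys = [math.log2(h) for _, h in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

print(f"~{doublings_per_year(points):.2f} doublings/year")  # ≈ 1.9 on this toy data
```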
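On the Lean item, a toy statement-level illustration (my own, not taken from any of the linked work) of why faithfulness is the hard part: the kernel can certify a proof, but nothing certifies that the Lean statement means what the English did.

```lean
-- Recent Lean 4 toolchain, no Mathlib needed (omega is a core tactic).
-- Informal claim: "if n is even, then (n / 2) * 2 = n".
-- A faithful formalization keeps the evenness hypothesis:
theorem halved_then_doubled (n : Nat) (h : n % 2 = 0) : n / 2 * 2 = n := by
  omega

-- An unfaithful formalization quietly drops the hypothesis. It still
-- typechecks as a statement, but it is simply false over Nat (try n = 3),
-- and no amount of kernel checking flags the mismatch with the English:
-- theorem halved_then_doubled' (n : Nat) : n / 2 * 2 = n := by omega  -- fails
```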
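On the ECI item, a toy sketch of the item-response-theory idea: fit a two-parameter logistic model in which each model gets a scalar ability and each benchmark a difficulty and a discrimination, from a models × benchmarks accuracy matrix. The score matrix, parameterization, and optimizer here are all made up for illustration; Epoch’s actual methodology is in their writeup.

```python
# Toy item-response-theory (2PL) fit: each model i gets an ability theta_i,
# each benchmark j a difficulty b_j and discrimination a_j, and predicted
# accuracy is sigmoid(a_j * (theta_i - b_j)). The score matrix below is
# invented; this is only a sketch of the idea, not Epoch's actual pipeline.
import numpy as np

rng = np.random.default_rng(0)
acc = np.array([          # rows: hypothetical models, cols: hypothetical benchmarks
    [0.92, 0.60, 0.15],
    [0.85, 0.45, 0.08],
    [0.70, 0.25, 0.03],
])
n_models, n_benchmarks = acc.shape

theta = rng.normal(size=n_models)        # model abilities
b = rng.normal(size=n_benchmarks)        # benchmark difficulties
a = np.ones(n_benchmarks)                # benchmark discriminations

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.05
for _ in range(5000):
    logits = a * (theta[:, None] - b)    # shape (n_models, n_benchmarks)
    p = sigmoid(logits)
    err = p - acc                        # gradient of cross-entropy wrt logits
    theta -= lr * (err * a).sum(axis=1) / n_benchmarks
    b     -= lr * (-err * a).sum(axis=0) / n_models
    a     -= lr * (err * (theta[:, None] - b)).sum(axis=0) / n_models
    theta -= theta.mean()                # pin the ability scale's location

print("abilities:", np.round(theta, 2))  # one number per model, ECI-style
```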