For my own future reference, here are some “benchmarks” (very broadly construed) I pay attention to as of Nov 2025, a mix of serious and whimsical. (The “serious” version would probably start with the Evals section of technicalities’ 2025 shallow review of technical AIS.)
the AI Village and its blog, not really a “benchmark” per se but by far my richest source of intuitions about current frontier models’ capabilities at open-ended long-horizon tasks; it made me notice things like the Claudes being way better than other frontier models with roughly equivalent benchmark scores
certain folks’ domain-specific opinions, e.g. lc on cybersec (like this take on AISLE), Kevin Buzzard and Terry Tao on math, Adam Karvonen on physical tasks relevant to manufacturing, Sarah Constantin on lit reviews, nostalgebraist on blog posts worth reading, Gwern and Jennifer Chen on writing diversity & creativity (but not Sam Altman), Cole Wyeth on novel ideas, Thane Ruthenis (although Thane’s milestones are on a totally different capability tier), Steven Byrnes on AGI, and many others
update: Cole thinks ChatGPT 5.2 met his bar for autonomously having an original insight by solving an open COLT problem with no assistance
“AGI is here” takes, e.g. JenniferRM—Dec ’25, Gordon Worley—Feb ’26
chats on read.haus with AI simulations of prominent authors becoming preferable to reading those authors’ real content. Scott Alexander, Sarah Constantin, Spencer Greenberg, Byrne Hobart, Tyler Cowen, Dwarkesh Patel, Andy Matuschak, etc. are all on there, but they never come across quite right to me
Starburst, fictional theoretical physics. I don’t really get their leaderboard though
a small set of work-related spreadsheet modelling problems I keep thinking current agents should easily handle, but they keep failing in very irritating ways, Claude Code included. I’m waiting for agents that finally speed me up rather than slow me down on these. Possibly a skill issue on my part
FWIW, Anthropic members of technical staff estimate their productivity boost at 1.15-1.4x with Sonnet 4.5 for most of them, except one person at 2x whose “workflow was now mainly focused on managing multiple agents”; I wonder if that’s the same person Sholto Douglas mentioned working with 9 agents at the same time
update: section 7.3.4 of the Opus 4.5 system card reports a 2-3x productivity boost, vs Sonnet 4.5’s 1.15-1.4x above: “Nine of 18 participants reported ≥100% productivity improvements, with a median estimate of 100% and a mean estimate of 220%”
how blind models see the earth, plot the Mandelbrot set, etc
avoiding spiralling into spiritual bliss attractors (maybe this is just Claude being a hippie)
just for fun, Gary Marcus’ 5 challenges by 2029 (from 2022)
their slope on the chart below exceeding that of humans (I’m not a fan of the notion of task horizon length itself, since it bakes in a performance plateau that doesn’t happen to humans thinking longer; hence my focus on slope)
Thomas Kwa’s great summary of reasons the time-horizon metric is overrated/misinterpreted
FrontierMath Tier 4 because I like math x AI, plus commentary like Kevin Buzzard’s “I was amused this week to have been sent data on what happens if you ask lots of agents to try and solve these problems and you mark the question as being solved if at least one agent gets the answer correct at least once”
more generally in math x AI: Gavin Leech’s examples thread of “crucially useful AI in research maths”
the “Erdos problems benchmark”: the proportion of currently outstanding Erdos problems amenable to current AI tools operated with minimal human intervention (inspired by Terry Tao)
vibe-proving math theorems in Lean, except without it taking a week or being “extremely annoying” (despite Adam Mastroianni’s argument that what a dream job really feels like is being perpetually annoyed). The main issue is verifying that the translation from the human proof to Lean code is faithful, which doesn’t seem automatable
Epoch’s Capabilities Index, because it’s general (a composite metric over most of the high-profile benchmarks out there) stitched together using a methodology that seems intuitively correct (item response theory). Admittedly, as someone who started out believing anything is measurable if you try hard enough, I’ve gradually grown disillusioned enough to down-weight even ostensibly good composite benchmarks like ECI a fair bit. Also CAIS’s definition of AGI
Scale’s Remote Labor Index because I work remotely. 230 projects from Upwork freelancers “excluding projects requiring physical labor, long-term evaluation, or direct client interaction”, mean and median human completion time 29 and 11.5 hours respectively, mean and median project value $630 and $200. Manus at 2.50% tops the leaderboard, then Sonnet 4.5 > GPT-5 > ChatGPT Agent > Gemini 2.5 Pro last at 0.83%, which matches my impression of their relative “IRL competence” in the AI Village
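(An aside on the item response theory methodology mentioned above: here's a toy sketch of mine, not ECI's actual pipeline. In the one-parameter Rasch model, the probability that a model of latent ability θ solves an item of latent difficulty b is σ(θ − b); fitting abilities and difficulties jointly across many benchmarks' items is what lets scores land on one common scale.)

```python
import math

def p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL) item-response model: probability that a model with
    latent `ability` solves an item of latent `difficulty`."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# An average item (difficulty 0) is solved half the time by an average
# model (ability 0), and harder items are what separate strong models
# from average ones.
```

Note that only the gap θ − b matters, which is why this kind of stitching needs some overlap between benchmarks to anchor everything on a single scale.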
Nicholas Carlini’s “build a C compiler” (more below)
From Nicholas Carlini’s Anthropic blog post:
I tasked 16 agents with writing a Rust-based C compiler, from scratch, capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V.
Bit more commentary on the capabilities benchmarking angle:
This project was designed as a capability benchmark. I am interested in stress-testing the limits of what LLMs can just barely achieve today in order to help us prepare for what models will reliably achieve in the future.
I’ve been using the C Compiler project as a benchmark across the entire Claude 4 model series. As I did with prior projects, I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes) I did not go into any detail on how to do so.
Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects. My goal with Opus 4.6 was to again test the limits.
Over nearly 2,000 Claude Code sessions across two weeks, Opus 4.6 consumed 2 billion input tokens and generated 140 million output tokens, a total cost just under $20,000. Compared to even the most expensive Claude Max plans, this was an extremely expensive project. But that total is a fraction of what it would cost me to produce this myself—let alone an entire team.
This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build a bootable Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQLite, postgres, redis, and has a 99% pass rate on most compiler test suites including the GCC torture test suite. It also passes the developer’s ultimate litmus test: it can compile and run Doom.
This reminds me of a passage from L Rudolf L’s history of the future:
By 2026, more code gets written in a week than the world wrote in 2020. Open source projects fork themselves into an endless orgy of abundance. Some high school students build functionally near-identical versions of Windows and Google Drive (and every video game in existence) from scratch in a month, because they can and they wanted one new feature on top of it. Everyone and their dog has a software product line. Big Tech unleashes a torrent of lawsuits against people cloning their products, echoing the Oracle v Google lawsuit about Java, but those lawsuits will take years to complete, and months feel like decades on the ground.
Back to Carlini on where Opus 4.6 fell short:
The compiler, however, is not without limitations. These include:
It lacks the 16-bit x86 compiler that is necessary to boot Linux out of real mode. For this, it calls out to GCC (the x86_32 and x86_64 compilers are its own).
It does not have its own assembler and linker; these are the very last bits that Claude started automating and are still somewhat buggy. The demo video was produced with a GCC assembler and linker.
The compiler successfully builds many projects, but not all. It’s not yet a drop-in replacement for a real compiler.
The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
The Rust code quality is reasonable, but is nowhere near the quality of what an expert Rust programmer might produce.
The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
As one particularly challenging example, Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase. (This is only the case for x86; for ARM and RISC-V, Claude’s compiler can compile completely by itself.)
Another whimsical “benchmark”: Terry Tao wrote on Mathstodon that
Broadly speaking, we now see an empirical tradeoff between the level of AI involvement in the solution, and the difficulty or novelty of that solution. In particular, the recent solutions have spanned a spectrum roughly describable as follows:
1. Completely autonomous AI solutions to Erdos problems that are short and largely follow a standard technique. (In many, but not all, of these cases, some existing literature was found that proved a very similar result by a similar method.)
2. AI-powered modifications of existing solutions (which could be either human-generated or AI-generated) that managed to improve or modify these solutions in various ways, for instance by upgrading a partial solution to a full solution, or optimizing the parameters of the proof.
3. Complex interactions between humans and AI tools in which the AI tools provided crucial calculations, or proofs of key steps, allowing the collaboration to achieve moderately complicated and novel solutions to open problems.
4. Difficult research-level papers solving one or more Erdos problems by mostly traditional human means, but for which AI tools were useful for secondary tasks such as generation of code, numerics, references, or pictures.
The seeming negative correlation between the amount of AI involvement and the depth of result is somewhat reminiscent of statistical paradoxes such as Berkson’s paradox https://en.wikipedia.org/wiki/Berkson%27s_paradox or Simpson’s paradox https://en.wikipedia.org/wiki/Simpson%27s_paradox . One key confounding factor is that highly autonomous AI workflows are much more scaleable than human-intensive workflows, and are thus better suited for being systematically applied to the “long tail” of obscure Erdos problems, many of which actually have straightforward solutions. As such, many of these easier Erdos problems are now more likely to be solved by purely AI-based methods than by human or hybrid means.
Given the level of recent publicity given to these problems, I expect that over the next few weeks, pretty much all of the outstanding Erdos problems will be quietly attempted by various people using their preferred AI tool. Most of the time, these tools will not lead to any noteworthy result, but such failures are unlikely to be reported on any public site. It will be interesting to see what (verified) successes do emerge from this, which should soon give a reasonably accurate picture of what proportion of currently outstanding Erdos problems are simple enough to be amenable to current AI tools operated with minimal human intervention. (My guess is that this proportion is on the order of 1-2%.) Assessing the viability of more hybridized human-AI approaches will take significantly longer though, as human expert attention will remain a significant bottleneck.
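A toy illustration (mine, not Tao's) of the Berkson-style selection effect he alludes to: two quantities that are independent across all attempts can come out negatively correlated once you only observe attempts that cleared some bar, such as "a noteworthy solution got reported at all".

```python
import random

random.seed(0)

# Two independent scores per attempted problem; the labels ("AI autonomy",
# "depth of result") are mine and purely illustrative.
attempts = [(random.random(), random.random()) for _ in range(10_000)]

# Selection effect: only attempts whose combined score clears a bar
# ever get reported.
reported = [(x, y) for x, y in attempts if x + y > 1.0]

def corr(data):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    cov = sum((x - mx) * (y - my) for x, y in data) / n
    vx = sum((x - mx) ** 2 for x, _ in data) / n
    vy = sum((y - my) ** 2 for _, y in data) / n
    return cov / (vx * vy) ** 0.5
```

Over all attempts the two scores are uncorrelated by construction, but among the reported ones the correlation comes out strongly negative, with no underlying tradeoff anywhere.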
So I’ll whimsically define the “Erdos problems benchmark” to be “the proportion of currently outstanding Erdos problems amenable to current AI tools operated with minimal human intervention”, and the current “SOTA” to be Tao’s guess of 1-2% as of Jan 2026. My guess is it won’t be saturated in ~2 years like every other benchmark because open math problems can be unboundedly hard, but who knows?
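Relatedly, Buzzard's "solved if at least one agent gets the answer correct at least once" scoring above is pass@k by another name. As a reference point, here's a minimal sketch (function name mine) of the standard unbiased estimator: given n sampled attempts of which c succeeded, it estimates the chance that at least one of k attempts succeeds.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts with c successes:
    1 minus the probability that a random size-k subset is all failures."""
    if n - c < k:
        return 1.0  # fewer than k failures, so every subset has a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# A 10%-per-attempt problem looks much more "solved" at large k:
# pass_at_k(100, 10, 1) is 0.1, while pass_at_k(100, 10, 50) is near 1.
```

This is why best-of-many agent sweeps, with failures unreported, can make a problem set look far more tractable than any single run would suggest.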