I’d gotten used to thinking of Claudiness as “good at agentic tasks + bad at vision + bad at math”, so the Claudes’ FrontierMath Tier 4 pass@1 scores over time took me by surprise:
early Aug ’25: 2% by Opus 4.1
late Nov ’25: 5% by Opus 4.5 (32k thinking)
early Feb ’26: 27% by Opus 4.6 (max), whoa sudden jump
mid Apr ’26: 32% by Opus 4.7 (max) wasn’t much of a jump
late May ’26: 56% by Opus 4.8 (max), whoa another sudden jump
early Jun ’26: 88% by Fable 5 (max), ~saturating Tier 4
… 50 problems crafted as short-term research projects by professors and postdocs. They are designed to surpass Tier 3 in difficulty, with some potentially remaining unsolved by AI for decades...
Also today Gemini 3.1 Pro solves (some) Erdos problems at a few hundred dollars apiece, and it was probably ~1 OOM cheaper for GPT-5.4 Pro to solve Erdos problem #1196, so I wonder what the economics/finance side of the impending industrialisation of math will look like.
I think you might get more input on this question on the EA forum.
For what it’s worth, someone asked me a related question (resources, experts, orgs doing education in the EA space or that have been funded by EA grantmakers), here was my resource dump in case it’s useful at all as a compilation of pointers to further reading:
AIM’s research team recommended 2 new education ideas: mass comm for edu in South Africa and differentiated learning (i.e. TaRL) in the Philippines and lower-income countries
Older (2018) by GiveWell: Education in developing countries overview, singled out vocational training programs to focus further work on
Interesting, thanks, that jibes with my own experience as well. I’m mainly concerned about the thing Buck pointed out, that my “brain de-noising” has progressed more for evaluating external arguments than for the ones I come up with.
When you say your exposure to the rationalist community has made you significantly more intelligent, not just more capable (I assume you were alluding to these separately in the way that you’d allude to a base model’s capability level separately from its agent harness), how do you mean?
Foundational omissions (these are the surprising ones):
Vinge, “True Names” (1981) — Sam cites A Fire/A Deepness but skips the novella that is arguably the genre’s founding document: pseudonymity, the Other Plane, the first serious fictional treatment of what Vinge would later name the singularity. Also Vinge’s “The Cookie Monster” (2003) — simulated employees in a tech-company time loop; it’s Lena’s grandfather and won the 2004 novella Hugo.
Rudy Rucker, the Ware tetralogy (Software, 1982, onward) — mathematician-programmer, worked on cellular automata at Autodesk, coined “transrealism.” Boppers bootstrapping themselves via Conway-style emergence predates Egan by a decade. His omission is the essay’s biggest gap given Sam’s own CA/GEB lineage paragraph.
Lem’s specific proto-programmer-SF: “Golem XIV” (a superintelligence’s lectures to humanity before it stops condescending to speak) and “Non Serviam” (a review of a fictional monograph on “personoids” — simulated minds debating theology about their programmer-god — i.e., Lena plus alignment discourse in 1971). Sam credits Lem generically; these two are the genre in embryo.
The Mind’s I (Hofstadter & Dennett, eds., 1981) — half the canon’s DNA is here: Dennett’s “Where Am I?” (a philosophy paper that is structurally a programmer-SF short story about which substrate “you” points to), Hofstadter’s “A Conversation with Einstein’s Brain,” Smullyan’s “Is God a Taoist?” — the pure-dialogue-between-abstract-beings form Sam identifies, twenty years early.
Marshall Brain, “Manna” (2003) — HowStuffWorks founder, web-published; middle-management software eats the labor market, then the Australia Project. Crude prose, but it’s the ur-example of blog-native programmer SF about AI-driven economic phase change.
Simon Funk (Brandyn Webb), “After Life” (~2005) — free online novel about uploading and recursive self-improvement by the ML engineer who became famous for the Netflix Prize SVD writeup. Sociologically a perfect specimen: actual ML researcher, self-published on his own site, read by almost nobody, ahead of its time.
Lawrence Kesteloot, “Coding Machines” (2009) — blog short story taking Ken Thompson’s “Reflections on Trusting Trust” compiler backdoor to its logical conclusion. The single most “programmer” piece of programmer SF I know; it circulates on HN every couple of years.
Novels:
Zachary Mason, The Lost Books of the Odyssey (2007) — Sam cites Void Star but not Mason’s debut: 44 Borgesian variations on Homer written by a CS PhD, i.e., literally the Borges-pastiche-by-a-programmer the essay says defines the genre’s best work.
Karl Schroeder, Lady of Mazes (2005) — Sam recommends Ventus; this is the stronger fit. Consensus-reality “manifolds,” technology locks, post-human governance via “tech locks” and votes-as-software-agents. Permanence too.
Charles Stross, Glasshouse (2006) — within a cited author, the uncited best fit: memory redaction, wormhole-gate firewalls, and a society run as a literal A/B-tested experiment.
Cory Doctorow, “When Sysadmins Ruled the Earth” (2006) — Doctorow’s absence from the essay is notable; this story (BGP and NOC-monkeys during the apocalypse) is squarely in-genre even if his didacticism usually isn’t.
Seth Dickinson — “Three Bodies at Mitanni” (2015) is a near-perfect specimen: a tribunal of augmented humans must decide whether a colony that engineered away consciousness should be destroyed — Blindsight’s question run as a trolley problem. The Traitor Baru Cormorant is rigor-applied-to-economics tragedy beloved by exactly this network.
Yoon Ha Lee, Ninefox Gambit (2016) — mathematician; “calendrical mechanics” is consensus reality as a formal system whose exploits are heresies. Military-SF furniture, programmer-SF engine.
Adrian Tchaikovsky, Children of Time (2015) and especially Children of Memory (2022) — the latter is essentially a novel-length Chinese-room/theory-of-mind puzzle (the corvids). Mainstream-published but disproportionately read by this crowd.
The pulp wing: William Hertling’s Avogadro Corp series (programmer, self-published, email-optimizer-goes-foom — the Bobiverse slot for AI takeoff), Daniel Suarez’s Daemon, Ramez Naam’s Nexus. Sam admits Bobiverse; consistency admits these.
Robin Sloan — Annabel Scheme (2009, web-published AI-detective novella) and Moonbound (2024, narrated by something very like a language model). Sloan was fine-tuning RNNs for prose in 2016 (“writing with the machine”) — “interest in LLMs as objects” before it was a scene.
Blog/LW-native shorts:
Abram Demski, “The Parable of Predict-O-Matic” (2019) — inner alignment and self-fulfilling prophecy as a fable; pure dialogue-with-an-abstraction form.
Veedrac, “Optimality is the tiger and agents are its teeth” (2022) — flash-length, dense, the genre’s form criteria exactly.
Iceman, “Friendship is Optimal” (2012) — sits on the ratfic border Sam draws, but CelestAI (“satisfy your values through friendship and ponies”) remains the most-referenced fictional misaligned-but-aligned-as-specified AI; it earns canon status on influence alone.
Andy Weir, “The Egg” (2009) — the boundary case worth naming explicitly: a programmer’s viral blog flash fiction that fits the form perfectly and the worldview not at all, useful for sharpening the definition alongside The Martian.
AI 2027 (Kokotajlo, Lifland, Larsen, Dean, Alexander, 2025) — forecast-fiction hybrid; probably the most-read piece of de facto programmer SF of 2025, and a genuinely new form (scenario with footnoted probability distributions) Sam’s taxonomy doesn’t yet have a slot for.
Janus (repligate) and the cyborgism/Loom scene — adjacent to nostalgebraist’s “LLMs as objects” thread; the prophecies-page curation of model-written fiction is arguably a new authorship mode for the genre.
Beyond prose (extending his BLAME!/Ed Harrison section):
Primer (2004) — Shane Carruth was a flight-simulation software engineer; the film treats time travel as a concurrency bug to be debugged, on $7,000. The programmer-SF film.
Serial Experiments Lain (1998) — protocols, the Wired, collective consciousness; an actual icon of hacker culture in a way GitS isn’t.
Don Hertzfeldt, World of Tomorrow (2015) — clone-memory degradation and upload pathos; Lena’s emotional register in 17 minutes.
Games: Universal Paperclips (Frank Lantz, 2017 — the paperclip maximizer as playable fixpoint), SOMA (the brain-scan coin flip — playable Lena), Outer Wilds (the scientific method as core gameplay loop), Zachtronics (TIS-100/SHENZHEN I/O — games that are programming, shipped with in-fiction zines), Baba Is You (rules as manipulable objects — The Witness’s thesis made literal).
I’d emphasize that we need more positive Accelerandos. Stross was explicit that it didn’t end well for us, although it did seem like the median trajectory.
When I was 10 entering 5th grade, so not that young, my new science teacher remarked in her first lesson that Pluto was no longer a planet because “it had gone supernova”. The rest of my classmates dutifully wrote this down in their notebooks.
I was outraged by the propagation of this untruth by an authority amongst my unwitting classmates. I had to protect them from this, you see? So I stood up and said “no teacher, Pluto is no longer a planet because the IAU redefined ‘planet’ and Pluto failed the 3rd condition of having cleared its orbital neighborhood”. My teacher responded by blowing up at me for daring to question her authority, and worse, for doing so by making things up, yelling at me for what felt like minutes. This made me cry, so she asked me to stand outside and be quiet and “see me after class”.
I feel for your daughter. I hope she gets accepted into a gifted program or similar.
On the other hand, eliminating hidden coupling, meaning pieces of code that must be changed in tandem and yet do not result in compile errors when they fall out of sync (taught in Unit 2 of our course), is becoming more important. The ability for code to become intertwined far outpaces the models’ ability to read and think through large codebases, and they lack even the feeble human memory to track such secret dependencies. Though I and many others have experienced nasty bugs caused by AI-induced hidden coupling, this is something that I (knock on wood) do not see changing at the model level. Our refactoring agent improves a lot, but as hidden coupling is fundamentally about understanding design intent, preventing it still needs a knowledgeable human in the driver’s seat, one using Command Center’s understandability and walkthrough features more than the refactoring features.
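To make “hidden coupling” concrete, here is a minimal Python sketch. Every name in it (`encode_user_coupled`, `USER_FIELDS`, and so on) is hypothetical and invented for illustration; none of it comes from the course or from Command Center. The first pair of functions shares an unstated assumption about field order, and the second pair moves that assumption into one shared definition.

```python
# Hypothetical example; all names are invented for illustration.

# Hidden coupling: writer and reader each hard-code the field order.
def encode_user_coupled(name: str, email: str) -> str:
    return f"{name}|{email}"


def decode_user_coupled(line: str) -> dict:
    parts = line.split("|")
    # Silently depends on the order chosen in encode_user_coupled.
    return {"name": parts[0], "email": parts[1]}


# Coupling made explicit: one shared definition drives both sides.
USER_FIELDS = ("name", "email")


def encode_user(record: dict) -> str:
    # Field order comes from USER_FIELDS, not from this function.
    return "|".join(record[field] for field in USER_FIELDS)


def decode_user(line: str) -> dict:
    values = line.split("|")
    if len(values) != len(USER_FIELDS):
        # A mismatch between data and schema fails loudly.
        raise ValueError(
            f"expected {len(USER_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(USER_FIELDS, values))
```

If someone swaps the order inside `encode_user_coupled`, nothing fails at import time and `decode_user_coupled` quietly mislabels the fields. In the second pair, a reordering happens in `USER_FIELDS` and both functions follow it. Deciding that the two functions ought to share a schema in the first place is the design-intent part the paragraph above says still needs a human.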
The upshot of all this: when one of our contractors switched to a company of lesser code quality, he found the effectiveness of his Claude Code dropped tremendously. And we’ve been working on a set of benchmarks that measure code quality by the amount of work needed and bugs produced by AIs doing follow-on work.
I’d be interested in more such code quality-oriented benchmarks. For now the only thing I’m aware of from Anthropic internally, at least that they’ve mentioned publicly, is “it works” and “another engineer can understand and build upon it”:
The code that Claude writes is “good” and improving. “Good code” means two things: it works, and it is written in a manner that allows another engineer to understand it and build upon it. On the first criterion, the evidence is clear. The rate at which Anthropic staff correct, redirect, or take over mid-task from Claude has been falling steadily for a year, including on the most complex and open-ended tasks. This means problems with no clear specification, where the engineer isn’t sure what the answer looks like. This is evident in Claude’s success rate over time on tasks of different difficulties, as shown in the graph below. Claude writes code that works.
How to read this: Session success is determined by a Claude judge; a session is deemed successful if the Claude Code agent clearly succeeded at the user’s tasks without requiring corrections. Changes in workloads can lead to short-term fluctuations in success rates.
On the most open-ended tasks, Claude’s success rate reached 76% in May 2026, up 50 percentage points in six months. To give an example of tasks in this difficulty tier, a routine upgrade began crashing tens of thousands of training jobs. An engineer pointed Claude at the live incident with little more than some text content and cluster access. Working through the running jobs and testing one environment setting at a time, Claude isolated the single obscure debugging flag that was triggering the crash, reproduced it reliably, and confirmed a fix. In about two hours, Claude delivered what would normally be two to three days of work.
The second criterion is writing code that another engineer can understand and build on. Here the gap between humans and AI persists, but is closing fast. There isn’t full consensus among staff at Anthropic, but many believe that the Claude-written code was still worse in quality than human-written code at Anthropic in late 2025, and is roughly at parity today. We expect it to be better within the year.
I remember, when you outlined your threat model, asking what you thought of MAI’s work. Your response made me feel depressed, and might have been the main reason I stopped being interested in their work.
Some snippets from Anthropic’s RSI article for my own reference.
What they mean by “fully RSI” (#3 below) seems pretty conservative if you’ve mostly gotten your idea of RSI from the memeplex that includes LW’s conception of it:
Possible futures
What happens next depends on two things: whether the trend continues, and what we choose to do if it does. We can imagine at least three future scenarios:
The trend stalls, but today’s AI capabilities are widely diffused. … We include this scenario for completeness, but we don’t believe it’s likely. Every capability we can measure, including those that feel “squishier,” like quality of code and success on open-ended tasks, has so far followed the same curve. We have not yet seen that curve bend. … We are more worried about the next two, which would move faster and leave far less room for preparation.
AI labs continue to see compounding efficiency gains. … The evidence we’ve laid out here suggests that we’re likely heading into this scenario. But speeding up one part of a process often just shifts the bottleneck elsewhere: overall pace is capped by the parts that haven’t sped up. … The rate at which organizations can spot and fix these bottlenecks may be a skill that improves over time, and it may become the most important skill for any organization.
AI systems themselves become capable of full recursive self-improvement, and begin building their successors. … Even if model development became fully automated and recursive, we can’t predict what that would mean for most humans’ daily lives. Amdahl’s law applies here as well. Recursive intelligence could lead to achieving many of the benefits outlined in Machines of Loving Grace, quickly in some domains. … But achieving recursive improvement alone does not suggest an immediate change in how industrial production occurs, societies organize, or markets function. More intelligence can’t learn what a drug does over decades of use, can’t hold elections sooner than a constitution dictates, and can’t turn a stranger into an old friend in a weekend. For most people, the felt pace of this future will still be set by the bottlenecks, even if the laboratory upstream runs at the speed of compute. That collision, where recursive intelligence building itself ever faster meets the world of humans, relationships, and governance, is another part of this future we can’t predict.
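For reference, Amdahl’s law says that if a fraction p of a process is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s), which can never exceed 1 / (1 - p). A small sketch of that arithmetic follows; the 90% figure in the usage note is my own illustrative assumption, not a number from the article.

```python
import math


def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the work runs `factor`x faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if factor <= 0:
        raise ValueError("factor must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / factor)


def amdahl_ceiling(accelerated_fraction: float) -> float:
    """Limit of the overall speedup as the accelerated part becomes infinitely fast."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if accelerated_fraction == 1.0:
        return math.inf
    return 1.0 / (1.0 - accelerated_fraction)


if __name__ == "__main__":
    # Illustrative assumption: 90% of the work can be accelerated.
    for factor in (10, 1000):
        print(factor, round(amdahl_speedup(0.9, factor), 2))
    print("ceiling", round(amdahl_ceiling(0.9), 2))
```

Under that assumed 90%, a 10x speedup of the accelerated part gives a bit over 5x overall, and even a 1000x speedup stays just under the 10x ceiling. That is the sense in which the bottlenecks set the felt pace.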
A research taste proxy graph. The jumps from Opus 4.5 (51%) to 4.6 (55%) to 4.7 (59%) are somewhat larger and steadier than I expected; naive extrapolation would place 4.8 closer to Mythos Preview than I’d guessed.
Claude is getting better at steering research sessions towards research findings. We examined real Claude Code sessions (between January and March 2026) where Anthropic researchers were working with Claude on an open-ended investigative problem, like figuring out why a training run kept crashing, or why a model scored poorly on a benchmark. In each case, we found a moment where the researcher took a detour: they pursued a direction that sent the session sideways before it eventually got back on track. We then showed various Claude models only the work from before the session went off-course and asked what it would do next. A separate Claude that was able to see how the session eventually turned out then judged whether the AI or the human suggested the better next step.
(Fn: As a check on judge bias, we ran the same test on a separate set of 127 moments where the human’s next move was already strong (as opposed to the original set, where the human’s direction had room for improvement). There, the models’ suggestions were judged better only about 20% of the time.)
Because we deliberately picked moments (n=129) where we know the human’s choice had room for improvement, this isn’t a like-for-like comparison between model and human judgement. What these moments give us is a set of realistic, challenging situations where the right next step is not obvious, and where the human’s choice serves as a useful yardstick to compare model performance over time. On this measure, our best model in November 2025 (Opus 4.5) beat the human choice 51% of the time; in April 2026 (Mythos Preview), this grew to 64%. The day-to-day work of research is largely a chain of these next-step decisions, making this a relevant measure of the model’s ability to eventually run an investigation of its own. We view this result as an early signal that AI systems are getting better at making the kinds of judgement calls that AI research depends on.
How to read this: The practical ceiling line measures an “ideal” answer written by a model that could see the whole session (including how it ended).
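A rough sense of the sampling noise at n=129, as a sketch only. I’m assuming the 129 moments can be treated as independent trials, that each model was scored on all of them, and that the simple normal-approximation (Wald) interval is adequate; the article doesn’t say how it handles uncertainty, so none of this is their method.

```python
import math


def wald_interval(win_rate: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval for a proportion (about 95% when z=1.96)."""
    if not 0.0 <= win_rate <= 1.0:
        raise ValueError("win_rate must be in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    standard_error = math.sqrt(win_rate * (1.0 - win_rate) / n)
    return (win_rate - z * standard_error, win_rate + z * standard_error)


if __name__ == "__main__":
    n_moments = 129  # from the quoted passage
    # Win rates against the human choice, as quoted and as read off the graph above.
    for label, rate in [
        ("Opus 4.5", 0.51),
        ("Opus 4.6", 0.55),
        ("Opus 4.7", 0.59),
        ("Mythos Preview", 0.64),
    ]:
        low, high = wald_interval(rate, n_moments)
        print(f"{label}: {rate:.0%} (approx. {low:.0%} to {high:.0%})")
```

Under those assumptions each point carries roughly ±8 to 9 percentage points, so the individual 4-point steps between adjacent Opus versions sit well inside the noise, even though the overall trend from 51% to 64% is more suggestive.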
An area of human comparative advantage, for now, is research taste and judgment, including choosing which problems matter, which results to trust, and when an approach is a dead end.
What if we’re wrong?
A natural objection to the evidence presented above is that the work that is still in human hands—choosing which problems to work on—is what matters most. Without that judgment, Claude is a capable assistant, but not a system that could drive AI progress on its own.
It is genuinely unclear whether today’s training methods and architectures could unlock that capacity. But AI is rarely advanced by “eureka!” moments. There have been a few of these in AI’s recent history, like the Transformer architecture, or mixture-of-experts models, but paradigm-shifting ideas arrive years apart. In between, most progress is incremental: we scale something up, see what breaks, fix it, and try again. That is exactly the kind of workflow Claude now excels at. Edison said that genius is 1% inspiration and 99% perspiration. But we see perspiration becoming increasingly automated. It’s becoming clear that much of what advances the frontier is automatable; large-scale research progress is mostly a function of tools and resources, which dictate how fast you can run experiments, how many you can run at once, and how quickly you can get results.
Even if we suppose that Claude never achieves good research taste, a conservative reading of our evidence still implies compounding acceleration. If humans spend most of their time on the single-digit fraction of work that is direction-setting, while Claude handles the rest, that means each engineer or researcher is steering far more work than before. The evidence we see suggests that people at Anthropic are both moving faster and covering a broader surface. In practice, this means that AI already makes Anthropic move much faster than it did before the advent of effective AI tools.
The less conservative reading is that the early evidence on Claude’s improving research judgment—narrow as it is today—is an indicator that this capability is improving as well. “Research taste” might be just another AI capability that AI systems fail at for a time, then get good at. We’ve seen a similar pattern with other qualitative skills, like AI systems being able to explain why a joke is funny, demonstrate theory of mind, and solve linguistic riddles.
Related: how AI Futures models automated research taste, survey data grounding, Oliver Sourbut’s model of research taste, some papers from which you could probably argue that (quote) “even models from about one year ago, with reasonable scaffolding/fine-tuning, seem already roughly in the range of a PhD student from a top institution on research taste, if not higher, in the ML research domain”.
How Anthropic thinks about “good code”, which may be different from others. Session success on open-ended problems is where Mythos vs Opus becomes particularly pronounced. Vibes-wise, Anthropic staff think Claude-written code was clearly worse than their own late last year, roughly at parity today, and expected to be better within the year.
The code that Claude writes is “good” and improving. “Good code” means two things: it works, and it is written in a manner that allows another engineer to understand it and build upon it. On the first criterion, the evidence is clear. The rate at which Anthropic staff correct, redirect, or take over mid-task from Claude has been falling steadily for a year, including on the most complex and open-ended tasks. This means problems with no clear specification, where the engineer isn’t sure what the answer looks like. This is evident in Claude’s success rate over time on tasks of different difficulties, as shown in the graph below. Claude writes code that works.
How to read this: Session success is determined by a Claude judge; a session is deemed successful if the Claude Code agent clearly succeeded at the user’s tasks without requiring corrections. Changes in workloads can lead to short-term fluctuations in success rates.
On the most open-ended tasks, Claude’s success rate reached 76% in May 2026, up 50 percentage points in six months. To give an example of tasks in this difficulty tier, a routine upgrade began crashing tens of thousands of training jobs. An engineer pointed Claude at the live incident with little more than some text content and cluster access. Working through the running jobs and testing one environment setting at a time, Claude isolated the single obscure debugging flag that was triggering the crash, reproduced it reliably, and confirmed a fix. In about two hours, Claude delivered what would normally be two to three days of work.
The second criterion is writing code that another engineer can understand and build on. Here the gap between humans and AI persists, but is closing fast. There isn’t full consensus among staff at Anthropic, but many believe that the Claude-written code was still worse in quality than human-written code at Anthropic in late 2025, and is roughly at parity today. We expect it to be better within the year.
This has changed the way that Anthropic now reviews its own code. Proposed changes to our codebase are now read by an automated Claude reviewer that looks for bugs, security flaws, and other defects before it can merge. Using this tool, we ran a retrospective analysis, and found that an automated Claude review of every change to our codebase would have caught roughly a third of the bugs behind past incidents on claude.ai before they ever reached production. The engineers who wrote that code are among the best in the world at building these systems. Claude is now catching the mistakes that they missed.
Productivity uplift:
In a March 2026 poll of 130 employees from across Anthropic research teams, the median respondent estimated that they produced around 4x as much output with Mythos Preview as they would have without access to any AI models, on the kinds of projects they would have been working on regardless. (Fn: Additional details on the methodology of this survey are discussed in section 2.3.5 of the Claude Opus 4.7 System Card.) We expect that the true degree of uplift in March was somewhat lower. (Fn: Many respondents may not have thought carefully about how to account for various biases or subtleties in the question definition, and recent research by METR shows that developer estimates of AI productivity uplift can be overestimated.) Nevertheless, we find the overall claim plausible, and in line with our other observations: a significant fraction of Anthropic technical staff is accomplishing their core work multiple times faster than they could without AI assistance.
We also see evidence that people at Anthropic are using Claude to do work that simply wouldn’t have happened otherwise, like building exploratory tooling and addressing long-deferred cleanup. For example, in April 2026, Claude shipped over 800 fixes that reduced a class of API errors by a factor of one thousand. The engineer overseeing Claude estimated that a human would have taken four years to complete this work; solving other people’s bugs is slow and painstaking, and humans struggle to hold that much unfamiliar context in their head at once.
I’m a little surprised by how conservative the article’s edge-case (not median) examples of productivity uplift are. A while back I collected various examples of tokenmaxxing and the SemiAnalysis ones struck me most; I was expecting more such examples in that tier here. The big caveat is that those examples came from Dylan Patel himself. Boris Cherny’s example struck me too, but I guess not everyone can context-switch like Boris.
Some quotes, which “reflect individual views as of May 2026, not official company positions”:
“Work (and life) ran on a gift economy of small favors between humans. ‘Can you help me get this script running?’ [...] each one created a little debt, a little mutual awareness. [Claude is] faster, it creates zero debt, but each of these is a lost bid for human collaboration.”
“On days where everything works well, I can’t help but think nothing I do matters, everything is automated and better and faster than I ever will be. But then there are days where everything breaks and I don’t understand why and I realize I have no idea what I’ve been up to anymore.”
“So you’re saying we shouldn’t align AGI or ASI? That’s crazy and obviously wrong! Do you want rogue, uncontrolled ASIs running around?” This is not what I was saying. But it was a common enough remark that it’s worth addressing more explicitly how to think about this. The remark was made by people who have confused “solve the alignment problem” with “make ASI safe for humanity”. As a consequence they think any critique of work on alignment must mean making humanity less safe. It’s a kind of AI control fallacy, which confuses having control over ASI with being safe from ASI.
Suppose you’re an excellent swimmer who wants to get much better, aiming to make the Olympic team. You realize you have weak arms, and decide to adopt “increase arm strength” as a goal. But while increased arm strength is likely necessary to becoming an Olympic swimmer, if you focus blindly on that goal it can (at best) only help somewhat, before it begins to negatively impact your primary goal. There are good reasons Olympic weightlifters don’t usually also win medals in swimming.
In general: when you confuse subsidiary goals with primary goals, you sometimes end up damaging or even destroying your ability to achieve your primary goals. This is what has happened with much work on alignment. Alignment is a subsidiary goal for the primary goal of making ASI safe. And by misunderstanding its relationship to the primary goal, much work on alignment has likely damaged our ability to achieve the primary goal. Maintaining control of AGI or ASI might well be helpful and desirable, but it is not the crucial issue. The crucial issue is the potential destructive power conferred by these systems. If that destructive power is much larger than the destructive power already available to humanity, then civilization is in a lot of trouble. And it doesn’t matter whether human beings control the systems or not.
An illustrative instance is provided by reinforcement learning from human feedback (RLHF), a very successful practical technique which has been used to align large language and other models. RLHF was developed by Paul Christiano beginning in the mid-2010s, and ultimately played a crucial role in the initial launch of ChatGPT. Pre-RLHF chatbots often demonstrated offensive or socially inappropriate behaviour, but RLHF made ChatGPT far more polite in a way that users, government, and the media overwhelmingly considered a major improvement.
Many in the AI safety community noted that RLHF (and later alignment techniques, following a similar pattern) had thus played a crucial role in the success of one of the most rapidly-adopted consumer products in history. And that, in turn, helped kick off a furious race to ASI, funded by enormous sums of capital and creative talent. To what extent was this a good thing? Was it raising xrisk, not lowering it? In 2023, Christiano posted a thoughtful essay about the impact of RLHF, which stimulated much animated discussion. My takeaway from that discussion is: techniques for alignment will help speed up the advent of AGI and ASI. They do this by helping ensure that ever-more-powerful systems remain (roughly) aligned with consumer, government, and media intent. This makes them more attractive as products, and also helps reduce the danger of loss-of-control, at least in the short term. And we will see this pattern over and over: future much-more-capable systems will be subject to improved alignment techniques, which will hopefully keep them acceptable to our society.
In many ways this seems like the pattern you want: in the short term, society provides strong incentives for these systems to behave well. But over the long term, you have the problem described in the body of the essay: it rapidly accelerates us toward an extremely unstable situation. In part because more and more capabilities will lie hidden and hard-to-align. But even more important: the techniques used to build sanitized consumer-friendly systems will also be used in other ways. Variants will be developed by militaries and intelligence agencies, or finetuned by bad actors, and those variants will be far less fettered by social acceptability, and will often be at odds with one another in what “alignment” means. As I said in the body of the essay: “alignment is intrinsically an unstable situation, ripe for proliferation.” And: “The fundamental asymmetry is that ‘understand reality as deeply as possible’ is a simple, well-defined goal, grounded in objective reality, while creating ‘aligned’ systems requires building complex, subjective, hard-to-agree guardrails on top of that reality. So truth-seeking has a clear target, while alignment requires constantly shifting definitions based on social consensus. This asymmetry makes it intrinsically far easier to build unrestricted systems than aligned ones.”
So: perhaps current work on alignment will succeed in building aligned ASI. But unless a single organization which controls (or embodies) aligned ASI then rises to totalitarian dominance, other parties will also build ASI, and there will be conflicts in intent between some of these parties. That seems like a recipe for disaster, unless we have governance mechanisms almost entirely absent today. Again: alignment is intrinsically an unstable situation. And so while alignment today reduces many potential problems and helps achieve many desirable things, at the same time it accelerates entry to the Vulnerable World.
A good resolution would require transformed approaches to governance, capable of successfully governing far-more-powerful entities than are governed today. But we seem to be accelerating our understanding and control of reality much faster than we are improving our governance. And that’s extremely dangerous. A skeptic might reply that if, in the year 5,000 BCE, you’d asked people to imagine a future in which individual people each had access to 1,000 times as much destructive power, widespread destruction would have seemed the only possible outcome. Instead, our ideas and institutions improved, and we got the rule of law, democracy, separation of powers, and so on. They’re imperfect, but improved governance helped a lot. On the other hand, it seems foolish to take for granted that governance will improve rapidly enough as we develop AGI and ASI.
This line of argument seems obvious to me, and in broad outline many others seem to have similar ways of thinking. And so I’ve been surprised by the resistance it often receives from people who regard alignment as the primary goal. Part of the issue is that for many civilization-scale problems, “do what you can” is a great heuristic. That is: find a tractable piece of the problem and work on it, trying to make partial progress. Hopefully, if many people do that in many different ways, we’ll make progress on the big problem, even if no one single person sees how to solve it. In the case of AI safety, a lot of technically-oriented people seem to have thought: “I want to do something to contribute; alignment seems tractable; therefore it’s what I’ll do”. It’s attractive in part because for many it’s a way of making enormous sums of money working on enjoyable technical problems, which match your skillset, while many of your friends think you’re working on one of the most important problems facing humanity[34]. This is a seductive combination!
Furthermore, because technical alignment work[35] is largely market-supplied safety, as the major AI companies grow they spend more on such work. It comes to dominate AI safety, and it’s easy to confuse that dominance with intrinsic importance. I’ve heard people at the major companies say that only the people there can contribute meaningfully to safety[36][37]. But the reason so many people believe in the primacy of alignment isn’t because that’s correct. I believe it’s because the companies manufacture that belief, out of insightful-but-only-partial truths[38], based on selected arguments that originated largely outside the market (notably, in the work of Eliezer Yudkowsky, Nick Bostrom, and others). While markets often work well, in this case it seems unlikely a market-dominating strategy will be optimal for humanity. If you work on safety at a major lab, then selection effects mean your interests and beliefs are very likely (approximately) what the market wants; they shouldn’t be confused with being correct[39]; more generally, it’s easy to confuse money and power with being right. When I talk to some such people they say “oh, you’re talking about the governance problem”, by which they often seem to mean something like the non-market part of the problem. But while having a label is helpful, it doesn’t address the problem, and ignores that their personal contribution may actually make things worse. I very much hope I’m wrong, but I’m afraid I believe many of the people working on alignment at OpenAI and Anthropic have made human extinction or some similarly bad outcome more likely. I realize that is a very strong statement, and I’ve never so desperately hoped that I’m being an ignorant boor and totally wrong.
“So, what to do instead?” I wish I knew. I won’t map out all the scenarios here, but I do want to identify one crucial tension: between work that centralizes and increases power within AI companies, and decentralizing work that strengthens institutions, governance, and defensive capabilities outside the companies and broadly across society[40]. The centralizing strategy includes most alignment research, safety work that makes AGI company systems more successful, and so on. It often presents as “ASI is inevitable, it’s best to ride it out, if safety matters to you[41] then work on alignment and hope everything else takes care of itself, AGI- and ASI-assisted.” As I’ve argued above, I believe this accelerates and increases xrisk. It assumes the same organizations racing to build ASI will also lead the development of the institutions and governance ideas needed for a multi-sentience world, while remaining uncorrupted by the enormous power they’ll wield.
The litmus test question is: does my work make AI companies more central and powerful, or does it build capacity elsewhere, more broadly across society, especially in governance and defensive capacity? At the margin, I believe the latter is nearly always a better way to contribute. Capitalism is an extremely powerful force, and a lot of power will centralize in any company that achieves AGI-or-greater capabilities.
As mentioned in the body of the essay, the decentralization strategy has been embraced by approaches like differential technological development, d/acc, and (my term) coceleration[42]. I don’t see a full through line to this “working”, but there are certainly many small, concrete steps one can take. It’s in any case not obvious that it’s harder than making sand superintelligent.
One challenging thing is that, as the supply of ingenuity increasingly comes from AI, even work on decentralized strategies for improved governance and defensive technologies will require working more and more with AI and eventually ASI. I struggle with this, even day to day now, since it seems any such work also likely strengthens the companies. You’ll inevitably be drawn more and more into their orbit. How to avoid that? If you have good ways of approaching this that internalize the above arguments and go further, I’d like to hear from you. I realize that, rhetorically speaking, my uncertainty isn’t very attractive, compared to the people propounding certainty. But it’s the best understanding I have.
“You haven’t said working on alignment is always a mistake. In fact, you’ve said some such work is important. So when is working on alignment a good idea versus not?” This is a special case of the last question. As far as I can tell, at the margin it almost never makes sense to work on market-supplied safety. Capitalism is an incredibly powerful force, and for better and for worse the world is always well-supplied with people willing to do what capital wants. Insofar as alignment is (mostly) a form of market-supplied safety, at the margin it’s more impactful to work on other things. So my current heuristic, and I expect this to be true for quite some time: work on non-market safety, and insofar as you can, avoid doing what the market wants. That means working on governance, it means pause or slowdown, it means new ideas for institutions to govern technology. It mostly doesn’t mean alignment.
Hm yeah #2 is particularly interesting. I’m thinking of Scott’s nonfiction writing advice #8 (“anticipate and defuse counterarguments”) as a potential counterpoint.
I mean something like “reduce probability of existentially bad outcomes”
I tend to take a grantmaker’s cost-effectiveness-oriented perspective; Eric Neyman’s CEA of donating to Alex Bores and Zach Stein-Perlman’s BOTEC style are my own go-to references for the sort of concreteness I wish other people publicly did more of. I’d be interested in future posts in your series expanding on the quoted part, and how they compare/contrast with Eric & Zach’s.
Wanted to put these quotes on sign uncertainty by Holden Karnofsky in Oct 2025 here for my own future reference (h/t this post’s authors), around the 4:11:30 mark, emphasis mine:
I think overall I would probably agree with you that the smaller you’re making the scope of where you’re hoping to have impact, the more reasonable it is to be like 60⁄40. But most people who go into AI are not going into it for that. Otherwise, if you want a small-scope, robustly positive impact, you should maybe work in a cause like farm animal welfare or global poverty. For the size of impact that tends to motivate people, I think it does get partially offset by this huge uncertainty about the sign.
I tend to think it’s worse than 51⁄49. I tend to think we’re always going to be prone to overestimate how robustly good our actions are. And the more we learn about all the galaxy-brained considerations that one should have had in one’s head, the more it’s going to be like 50+ε%. I think AI safety is a great cause to work in. I’m excited to work in it. I think it’s high impact. I am doing my best to do things that I will be proud to have done and hope for the best. But I really do have to live with the possibility that my ultimate impact on the utilons or whatever is going to be negative.
… When people ask me for career advice or whatever, the usual thing I’d say is: take a bunch of options that all seem competitive, and all seem like they could be the best thing, and that it’s not obvious which ones are better than others from an impact perspective. And from there I would say go with personal fit, go with the energy you feel to work on them.
and slightly later, around the 4:18:30 mark, emphasis mine:
Holden Karnofsky: Yeah, I tend to think that working on AI is probably generically the most important thing to work on and the highest ROI thing to work on. But I have probably more uncertainty about it than most people in this field, and I think it’s less of a slam dunk: I don’t think it’s by orders and orders of magnitude in expectation. Just because that sign uncertainty is such an issue.
There’s the sign uncertainty of AI, and then I think there is the fact that you can get unexpected benefits from just doing stuff really well in general. So like anything you do well is going to put you in a good position to do more stuff well. For example, when I was cofounding GiveWell, people were saying it would be completely nuts to work on GiveWell if you understood the AI situation, and it would be completely nuts to give to GiveWell top charities if you understood the AI situation.
I just don’t think either of those has panned out very well. I think GiveWell becoming successful obviously did have a lot of benefits for AI safety, or at least according to me it did. Certainly didn’t end up irrelevant. We’ve actually seen this at Open Philanthropy a fair amount, where there’ll be a grantee that’s on one side of the org — the non-global-catastrophic risk side — that does become a very big win from the other point of view.
None of this is to say it all washes out and it’s all the same. I just think that some people have in mind that every cause is a rounding error compared to their cause, and I don’t tend to think of it that way. I tend to think of it as like, this thing seems like the best to me, but I don’t really know. And if I was going to be miserable working in this cause and really happy working in another cause, it’s probably just a better rule and a better policy for everyone who cares about this stuff in general to put a lot of weight on where they’re going to be happy, because it’s probably better for the community to spread a little bit.
Illustrative pictorial version of Holden’s sign uncertainty remarks, via RP’s CCM tool
Pictorial version of Holden’s remark, from Rethink Priorities’ cross-cause cost-effectiveness model, selecting “AI misalignment megaproject” and stacking the deck 70⁄30 in favor of the intervention conditional on having an effect instead of 51⁄49 (or more accurately 97.3/1.9/0.8 no effect / +ve / -ve, which I guess is in fact 51⁄49):
I was raised in a culture that’s probably closest to difference-making expected utility with moderate to high risk aversion for harming people and low but nonzero risk aversion for animals and certain nonsentient things like the environment. That probably led to crystallised moral priors that make advice to be totally risk-neutral and EV-maxing come off to me as inhumanly detached, with a whiff of the sociopathic (including when it comes from well-meaning career advisors at orgs you’ve definitely heard of). RP’s CCM tool puts illustrative numbers to this below, as part of their work to be useful to a wider variety of do-gooders:
To my knowledge Open Phil (via Luke Muehlhauser) first talked about their sign uncertainty publicly in Dec 2020 in the context of explaining their AGI governance grantmaking so far, emphasis mine:
Within the large tent of “AI governance,” we focus on work that we think may increase the odds of eventual good outcomes from “transformative AI,” especially by reducing potential catastrophic risks from transformative AI[15] — regardless of whether that work is itself motivated by transformative AI concerns (see next section). … Unfortunately, it’s difficult to know which “intermediate goals” we could pursue that, if achieved, would clearly increase the odds of eventual good outcomes from transformative AI. [list of examples] … For those examples and many others, we are not just uncertain about whether pursuing a particular intermediate goal would turn out to be tractable — we are also uncertain about whether achieving the intermediate goal would be good or bad for society, in the long run. Such “sign uncertainty” can dramatically reduce the expected value of pursuing some particular goal,[19] often enough for us to not prioritize that goal.[20]
(Footnote 19: For example, if we estimate that a $1 million grant has a 60% chance of having ~no impact and a 40% chance of creating +100 units of some social benefit, the expected value of the grant is (0.6×0)+(0.4×100) = 40 benefit units, for a return on investment (ROI) of one benefit unit per $25,000 spent. If instead we estimate that a $1 million grant has a 40% chance of having ~no impact, a 20% chance of creating −100 benefit units (i.e. a large harm), and (as with the other grant) a 40% chance of creating +100 benefit units, then even though we think the grant is twice as likely to create a large benefit than a large harm, the expected value of the grant is only (0.4×0)+(0.2×(−100))+(0.4×100) = 20 benefit units, for an ROI of one benefit unit per $50,000. In other words, our “hits-based giving” approach can accommodate more failure of the “no impact” variety than it can of the “negative impact” variety. (And to be clear, I’m not suggesting anything different from normal cost-benefit analysis.))
(Footnote 20: That is, sign uncertainty can reduce the expected value of pursuing some particular goal below our threshold for how much benefit we hope to create on average per dollar spent. For more on our traditional “100x bar” for benefit produced per dollar, see GiveWell’s Top Charities Are (Increasingly) Hard to Beat, but also note that we are still thinking through what threshold to use for our longtermism-motivated grantmaking, per our current approach to “worldview diversification”; see here. The potential impact of sign uncertainty on expected value is universal, but I highlight it here because I have encountered sign uncertainty more commonly in our work on AI governance than in some other Open Philanthropy focus areas, for example in our grantmaking to machine learning researchers and engineers for technical work on AI alignment (though there can be some sign uncertainty for those grants too). For more on sign uncertainty in the context of attempts to do good cost-effectively, see e.g. Kokotajlo & Oprea (2020).)
As such, our AI governance grantmaking tends to focus on…
…research that may be especially helpful for learning how AI technologies may develop over time, which AI capabilities could have industrial-revolution-scale impact, and which intermediate goals would, if achieved, have a positive impact on transformative AI outcomes, e.g. via our grants to GovAI.
…research and advocacy supporting intermediate goals that we’ve come to think will improve expected transformative AI outcomes,[21] such as more work on methods for gaining high assurance in advanced AI systems and greater awareness of the difficulty of achieving such high assurance, e.g. via our funding for Lohn (2020) and Flournoy et al. (2020).
…broad field-building activities, for example to identify and empower highly capable individuals with a passion for increasing the odds that transformative AI will result in long-lasting broad benefit, e.g. via scholarships, our support for career advice related to AI policy careers, and grantees such as GovAI.[22]
…better-informed AI governance training and advice for governments, companies, and other actors, especially on issues of likely relevance to transformative AI outcomes such as great power technology competition, e.g. via our grants to CSET and the Wilson Center.
In a footnote, I list all the grants we’ve made so far that were, at least in part, motivated by their hoped-for impact on AI governance.[23]
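For my own reference, here is a minimal Python sketch that reproduces the arithmetic in Footnote 19 of the quote above. The probabilities, the ±100 benefit units, and the $1 million grant size are the footnote’s; the function and variable names (expected_value, dollars_per_unit, no_harm_grant, sign_uncertain_grant) are my own, made up for illustration.

```python
def expected_value(outcomes):
    """Expected value of a list of (probability, benefit_units) pairs."""
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * benefit for p, benefit in outcomes)


def dollars_per_unit(grant_size, outcomes):
    """Grant dollars spent per expected benefit unit (lower is better)."""
    return grant_size / expected_value(outcomes)


GRANT_SIZE = 1_000_000  # the footnote's $1 million grant

# First grant in the footnote: 60% ~no impact, 40% chance of +100 units.
no_harm_grant = [(0.6, 0), (0.4, 100)]

# Second grant: 40% ~no impact, 20% chance of -100 units, 40% chance of +100.
sign_uncertain_grant = [(0.4, 0), (0.2, -100), (0.4, 100)]

for name, outcomes in [
    ("no-harm grant", no_harm_grant),
    ("sign-uncertain grant", sign_uncertain_grant),
]:
    ev = expected_value(outcomes)
    cost = dollars_per_unit(GRANT_SIZE, outcomes)
    print(f"{name}: EV = {ev:.0f} units, ${cost:,.0f} per unit")
```

The upside is identical in both grants (a 40% chance of +100 units). Moving 20 points of probability from “no impact” to “large harm” is what halves the expected value and doubles the cost per benefit unit.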
I had assumed that the information generated by those kinds of grants Luke mentioned, plus Luke himself specifically saying he’d gained some certainty by April 2023 in prefacing his tentative ideas for US AI policy
About two years ago, I wrote that “it’s difficult to know which ‘intermediate goals’ [e.g. policy goals] we could pursue that, if achieved, would clearly increase the odds of eventual good outcomes from transformative AI.” Much has changed since then, and in this post I give an update on 12 ideas for US policy goals[2] that I tentatively think would increase the odds of good outcomes from transformative AI.[3] … My opinions are premised on a strategic picture similar to the one outlined in my colleague Holden Karnofsky’s Most Important Century and Implications of… posts
had substantively reduced Open Phil’s sign uncertainty in their AGI grantmaking. But I guess not, going by Holden’s “I tend to think it’s worse than 51/49” remark. Probably it’s a mistake to round off OP’s grantmaking as directionally reflecting the sign uncertainty of Luke vs Holden vs whoever.
Would you be up for writing more about this, with concrete examples and advice for mitigating common obvious-to-psychiatrists failure modes?