I remember, when you outlined your threat model, asking what you thought of MAI’s work. Your response made me feel depressed, and might have been the main reason I stopped being interested in their work.
Some snippets from Anthropic’s RSI article for my own reference.
What they mean by “fully RSI” (#3 below) seems pretty conservative if you’ve mostly gotten the idea from (the memeplex that includes LW’s conception of it):
Possible futures
What happens next depends on two things: whether the trend continues, and what we choose to do if it does. We can imagine at least three future scenarios:
The trend stalls, but today’s AI capabilities are widely diffused. … We include this scenario for completeness, but we don’t believe it’s likely. Every capability we can measure, including those that feel “squishier,” like quality of code and success on open-ended tasks, has so far followed the same curve. We have not yet seen that curve bend. … We are more worried about the next two, which would move faster and leave far less room for preparation.
AI labs continue to see compounding efficiency gains. … The evidence we’ve laid out here suggests that we’re likely heading into this scenario. But speeding up one part of a process often just shifts the bottleneck elsewhere: overall pace is capped by the parts that haven’t sped up. … The rate at which organizations can spot and fix these bottlenecks may be a skill that improves over time, and it may become the most important skill for any organization.
AI systems themselves become capable of full recursive self-improvement, and begin building their successors. … Even if model development became fully automated and recursive, we can’t predict what that would mean for most humans’ daily lives. Amdahl’s law applies here as well. Recursive intelligence could lead to achieving many of the benefits outlined in Machines of Loving Grace, quickly in some domains. … But achieving recursive improvement alone does not suggest an immediate change in how industrial production occurs, societies organize, or markets function. More intelligence can’t learn what a drug does over decades of use, can’t hold elections sooner than a constitution dictates, and can’t turn a stranger into an old friend in a weekend. For most people, the felt pace of this future will still be set by the bottlenecks, even if the laboratory upstream runs at the speed of compute. That collision, where recursive intelligence building itself ever faster meets the world of humans, relationships, and governance, is another part of this future we can’t predict.
A research taste proxy graph. The jumps from Opus 4.5 (51%) to 4.6 (55%) to 4.7 (59%) are somewhat larger and steadier than I expected; naive extrapolation would place 4.8 closer to Mythos Preview than I’d guessed.
Claude is getting better at steering research sessions towards research findings. We examined real Claude Code sessions (between January and March 2026) where Anthropic researchers were working with Claude on an open-ended investigative problem, like figuring out why a training run kept crashing, or why a model scored poorly on a benchmark. In each case, we found a moment where the researcher took a detour: they pursued a direction that sent the session sideways before it eventually got back on track. We then showed various Claude models only the work from before the session went off-course and asked what it would do next. A separate Claude that was able to see how the session eventually turned out then judged whether the AI or the human suggested the better next step.8
(Fn: As a check on judge bias, we ran the same test on a separate set of 127 moments where the human’s next move was already strong (as opposed to the original set, where the human’s direction had room for improvement). There, the models’ suggestions were judged better only about 20% of the time.)
Because we deliberately picked moments (n=129) where we know the human’s choice had room for improvement, this isn’t a like-for-like comparison between model and human judgement. What these moments give us is a set of realistic, challenging situations where the right next step is not obvious, and where the human’s choice serves as a useful yardstick to compare model performance over time. On this measure, our best model in November 2025 (Opus 4.5) beat the human choice 51% of the time; in April 2026 (Mythos Preview), this grew to 64%. The day-to-day work of research is largely a chain of these next-step decisions, making this a relevant measure of the model’s ability to eventually run an investigation of its own. We view this result as an early signal that AI systems are getting better at making the kinds of judgement calls that AI research depends on.
How to read this: The practical ceiling line measures an “ideal” answer written by a model that could see the whole session (including how it ended).
An area of human comparative advantage, for now, is research taste and judgment, including choosing which problems matter, which results to trust, and when an approach is a dead end.
What if we’re wrong?
A natural objection to the evidence presented above is that the work that is still in human hands—choosing which problems to work on—is what matters most. Without that judgment, Claude is a capable assistant, but not a system that could drive AI progress on its own.
It is genuinely unclear whether today’s training methods and architectures could unlock that capacity. But AI is rarely advanced by “eureka!” moments. There have been a few of these in AI’s recent history, like the Transformer architecture, or mixture-of-experts models, but paradigm-shifting ideas arrive years apart. In between, most progress is incremental: we scale something up, see what breaks, fix it, and try again. That is exactly the kind of workflow Claude now excels at. Edison said that genius is 1% inspiration and 99% perspiration. But we see perspiration becoming increasingly automated. It’s becoming clear that much of what advances the frontier is automatable; large-scale research progress is mostly a function of tools and resources, which dictate how fast you can run experiments, how many you can run at once, and how quickly you can get results.
Even if we suppose that Claude never achieves good research taste, a conservative reading of our evidence still implies compounding acceleration. If humans spend most of their time on the single-digit fraction of work that is direction-setting, while Claude handles the rest, that means each engineer or researcher is steering far more work than before. The evidence we see suggests that people at Anthropic are both moving faster and covering a broader surface. In practice, this means that AI already makes Anthropic move much faster than it did before the advent of effective AI tools.
The less conservative reading is that the early evidence on Claude’s improving research judgment—narrow as it is today—is an indicator that this capability is improving as well. “Research taste” might be just another AI capability that AI systems fail at for a time, then get good at. We’ve seen a similar pattern with other qualitative skills, like AI systems being able to explain why a joke is funny, demonstrate theory of mind, and solve linguistic riddles.
Related: how AI Futures models automated research taste, survey data grounding, Oliver Sourbut’s model of research taste, some papers from which you could probably argue that (quote) “even models from about one year ago, with reasonable scaffolding/fine-tuning, seem already roughly in the range of a PhD student from a top institution on research taste, if not higher, in the ML research domain”.
How Anthropic thinks about “good code”, which may be different from others. Session success on open-ended problems is where Mythos vs Opus becomes particularly pronounced. Vibes-wise, Anthropic staff think Claude-written code was clearly worse than their own late last year, roughly at parity today, and expected to be better within the year.
The code that Claude writes is “good” and improving. “Good code” means two things: it works, and it is written in a manner that allows another engineer to understand it and build upon it. On the first criterion, the evidence is clear. The rate at which Anthropic staff correct, redirect, or take over mid-task from Claude has been falling steadily for a year, including on the most complex and open-ended tasks. This means problems with no clear specification, where the engineer isn’t sure what the answer looks like. This is evident in Claude’s success rate over time on tasks of different difficulties, as shown in the graph below. Claude writes code that works.
How to read this: Session success is determined by a Claude judge; a session is deemed successful if the Claude Code agent clearly succeeded at the user’s tasks without requiring corrections. Changes in workloads can lead to short-term fluctuations in success rates.
On the most open-ended tasks, Claude’s success rate reached 76% in May 2026, up 50 percentage points in six months. To give an example of tasks in this difficulty tier, a routine upgrade began crashing tens of thousands of training jobs. An engineer pointed Claude at the live incident with little more than some text content and cluster access. Working through the running jobs and testing one environment setting at a time, Claude isolated the single obscure debugging flag that was triggering the crash, reproduced it reliably, and confirmed a fix. In about two hours, Claude delivered what would normally be two to three days of work.
The second criterion is writing code that another engineer can understand and build on. Here the gap between humans and AI persists, but is closing fast. There isn’t full consensus among staff at Anthropic, but many believe that the Claude-written code was still worse in quality than human-written code at Anthropic in late 2025, and is roughly at parity today. We expect it to be better within the year.
This has changed the way that Anthropic now reviews its own code. Proposed changes to our codebase are now read by an automated Claude reviewer that looks for bugs, security flaws, and other defects before it can merge. Using this tool, we ran a retrospective analysis, and found that an automated Claude review of every change to our codebase would have caught roughly a third of the bugs behind past incidents on claude.ai before they ever reached production. The engineers who wrote that code are among the best in the world at building these systems. Claude is now catching the mistakes that they missed.
Productivity uplift:
In a March 2026 poll of 130 employees from across Anthropic research teams, the median respondent estimated that they produced around 4x as much output with Mythos Preview as they would have without access to any AI models, on the kinds of projects they would have been working on regardless.5 (Fn: Additional details on the methodology of this survey are discussed in section 2.3.5 of the Claude Opus 4.7 System Card.) We expect that the true degree of uplift in March was somewhat lower.6 (Fn: Many respondents may not have thought carefully about how to account for various biases or subtleties in the question definition, and recent research by METR shows that developer estimates of AI productivity uplift can be overestimated.) Nevertheless, we find the overall claim plausible, and in line with our other observations: a significant fraction of Anthropic technical staff is accomplishing their core work multiple times faster than they could without AI assistance.
We also see evidence that people at Anthropic are using Claude to do work that simply wouldn’t have happened otherwise, like building exploratory tooling and addressing long-deferred cleanup. For example, in April 2026, Claude shipped over 800 fixes that reduced a class of API errors by a factor of one thousand. The engineer overseeing Claude estimated that a human would have taken four years to complete this work; solving other people’s bugs is slow and painstaking, and humans struggle to hold that much unfamiliar context in their head at once.
I’m a little surprised by how conservative the article’s edge-case (not median) examples of productivity uplift are. Awhile back I collected various examples of tokenmaxxing and the SemiAnalysis ones struck me most; I was expecting more such examples in that tier here. The big caveat is that those examples came from Dylan Patel himself. Boris Cherny’s example struck me too, but I guess not everyone can context-switch like Boris.
Some quotes, which “reflect individual views as of May 2026, not official company positions”:
“Work (and life) ran on a gift economy of small favors between humans. ‘Can you help me get this script running?’ [...] each one created a little debt, a little mutual awareness. [Claude is] faster, it creates zero debt, but each of these is a lost bid for human collaboration.”
“On days where everything works well, I can’t help but think nothing I do matters, everything is automated and better and faster than I ever will be. But then there are days where everything breaks and I don’t understand why and I realize I have no idea what I’ve been up to anymore.”
“So you’re saying we shouldn’t align AGI or ASI? That’s crazy and obviously wrong!Do you want rogue, uncontrolled ASI’s running around?” This is not what I was saying. But it was a common enough remark that it’s worth addressing more explicitly how to think about this. The remark was made by people who have confused “solve the alignment problem” with “make ASI safe for humanity”. As a consequence they think any critique of work on alignment must mean making humanity less safe. It’s a kind of AI control fallacy, which confuses having control over ASI with being safe from ASI.
Suppose you’re an excellent swimmer who wants to get much better, aiming to make the Olympic team. You realize you have weak arms, and decide to adopt “increase arm strength” as a goal. But while increased arm strength is likely necessary to becoming an Olympic swimmer, if you focus blindly on that goal it can (at best) only help somewhat, before it begins to negatively impact your primary goal. There are good reasons Olympic weightlifters don’t usually also win medals in swimming.
In general: when you confuse subsidiary goals with primary goals, you sometimes end up damaging or even destroying your ability to achieve your primary goals. This is what has happened with much work on alignment. Alignment is a subsidiary goal for the primary goal of making ASI safe. And by misunderstanding its relationship to the primary goal, much work on alignment has likely damaged our ability to achieve the primary goal. Maintaining control of AGI or ASI might well be helpful and desirable, but it is not the crucial issue. The crucial issue is the potential destructive power conferred by these systems. If that destructive power is much larger than the destructive power already available to humanity, then civilization is in a lot of trouble. And it doesn’t matter whether human beings control the systems or not.
An illustrative instance is provided by reinforcement learning by human feedback (RLHF), a very successful practical technique which has been used to align large language and other models. RLHF was developed by Paul Christiano beginning in the mid-2010s, and ultimately played a crucial role in the initial launch of ChatGPT. Pre-RLHF chatbots often demonstrated offensive or socially inappropriate behaviour, but RLHF made ChatGPT far more polite in a way that users, government, and the media overwhelmingly considered a major improvement.
Many in the AI safety community noted that RLHF (and later alignment techniques, following a similar pattern) had thus played a crucial role in the success of one of the most rapidly-adopted consumer products in history. And that, in turn, helped kick off a furious race to ASI, funded by enormous sums of capital and creative talent. To what extent was this a good thing? Was it raising xrisk, not lowering it? In 2023, Christiano posted a thoughtful essay about the impact of RLHF, which stimulated much animated discussion. My takeaway from that discussion is: techniques for alignment will help speed up the advent of AGI and ASI. They do this by helping ensure that ever-more-powerful systems remain (roughly) aligned with consumer, government, and media intent. This makes them more attractive as products, and also helps reduce the danger of loss-of-control, at least in the short term. And we will see this pattern over and over: future much-more-capable systems will be subject to improved alignment techniques, which will hopefully keep them acceptable to our society.
In many ways this seems like the pattern you want: in the short term, society provides strong incentives for these systems to behave well. But over the long term, you have the problem described in the body of the essay: it rapidly accelerates us toward an extremely unstable situation. In part because more and more capabilities will lie hidden and hard-to-align31. But even more important: the techniques used to build sanitized consumer-friendly systems will also be used in other ways. Variants will be developed by militaries and intelligence agencies, or finetuned by bad actors, and those variants will be far less fettered by social acceptability, and will often be at odds with one another in what “alignment” means. As I said in the body of the essay: “alignment is intrinsically an unstable situation, ripe for proliferation.” And: “The fundamental asymmetry is that ‘understand reality as deeply as possible’ is a simple, well-defined goal, grounded in objective reality, while creating ‘aligned’ systems requires building complex, subjective, hard-to-agree guardrails on top of that reality. So truth-seeking has a clear target, while alignment requires constantly shifting definitions based on social consensus. This asymmetry makes it intrinsically far easier to build unrestricted systems than aligned ones.”
So: perhaps current work on alignment will succeed in building aligned ASI. But unless a single organization which controls (or embodies) aligned ASI then rises to totalitarian dominance, other parties will also build ASI, and there will be conflicts in intent between some of these parties. That seems like a recipe for disaster, unless we have governance mechanisms almost entirely absent today. Again: alignment is intrinsically an unstable situation32. And so while alignment today reduces many potential problems and helps achieve many desirable things, at the same time it accelerates entry to the Vulnerable World.
A good resolution would require transformed approaches to governance, capable of successfully governing far-more-powerful entities than are governed today33. But we seem to be accelerating our understanding and control of reality much faster than we are improving our governance. And that’s extremely dangerous. A skeptic might reply that if, in the year 5,000 BCE, you’d asked people to imagine a future in which individual people each had access to 1,000 times as much destructive power, widespread destruction would have seemed the only possible outcome. Instead, our ideas and institutions improved, and we got the rule of law, democracy, separation of powers, and so on. They’re imperfect, but improved governance helped a lot. On the other hand, it seems foolish to take for granted that governance will improve rapidly enough as we develop AGI and ASI.
This line of argument seems obvious to me, and in broad outline many others seem to have similar ways of thinking. And so I’ve been surprised by the resistance it often receives from people who regard alignment as the primary goal. Part of the issue is that for many civilization-scale problems, “do what you can” is a great heuristic. That is: find a tractable piece of the problem and work on it, trying to make partial progress. Hopefully, if many people do that in many different ways, we’ll make progress on the big problem, even if no one single person sees how to solve it. In the case of AI safety, a lot of technically-oriented people seem to have thought: “I want to do something to contribute; alignment seems tractable; therefore it’s what I’ll do”. It’s attractive in part because for many it’s a way of making enormous sums of money working on enjoyable technical problems, which match your skillset, while many of your friends think you’re working on one of the most important problems facing humanity34. This is a seductive combination!
Furthermore, because technical alignment work35 is largely market-supplied safety, as the major AI companies grow they spend more on such work. It comes to dominate AI safety, and it’s easy to confuse that dominance with intrinsic importance. I’ve heard people at the major companies say that only the people there can contribute meaningfully to safety3637. But the reason so many people believe in the primacy of alignment isn’t because that’s correct. I believe it’s because the companies manufacture that belief, out of insightful-but-only-partial truths38, based on selected arguments that originated largely outside the market (notably, in the work of Eliezer Yudkowsky, Nick Bostrom, and others). While markets often work well, in this case it seems unlikely a market-dominating strategy will be optimal for humanity. If you work on safety at a major lab, then selection effects mean your interests and beliefs are very likely (approximately) what the market wants; they shouldn’t be confused with being correct39; more generally, it’s easy to confuse money and power with being right. When I talk to some such people they say “oh, you’re talking about the governance problem”, by which they often seem to mean something like the non-market part of the problem. But while having a label is helpful, it doesn’t address the problem, and ignores that their personal contribution may actually make things worse. I very much hope I’m wrong, but I’m afraid I believe many of the people working on alignment at OpenAI and Anthropic have made human extinction or some similarly bad outcome more likely. I realize that is a very strong statement, and I’ve never so desperately hoped that I’m being an ignorant boor and totally wrong.
“So, what to do instead?” I wish I knew. I won’t map out all the scenarios here, but I do want to identify one crucial tension: between work that centralizes and increases power within AI companies, and decentralizing work that strengthens institutions, governance, and defensive capabilities outside the companies and broadly across society40. The centralizing strategy includes most alignment research, safety work that makes AGI company systems more successful, and so on. It often presents as “ASI is inevitable, it’s best to ride it out, if safety matters to you41 then work on alignment and hope everything else takes care of itself, AGI- and ASI-assisted.” As I’ve argued above, I believe this accelerates and increases xrisk. It assumes the same organizations racing to build ASI will also lead the development of the institutions and governance ideas needed for a multi-sentience world, while remaining uncorrupted by the enormous power they’ll wield.
The litmus test question is: does my work make AI companies more central and powerful, or does it build capacity elsewhere, more broadly across society, especially in governance and defensive capacity? At the margin, I believe the latter is nearly always a better way to contribute. Capitalism is an extremely powerful force, and a lot of power will centralize in any company that achieves AGI-or-greater capabilities.
As mentioned in the body of the essay, the decentralization strategy has been embraced by approaches like differential technological development, d/acc, and (my term) coceleration42. I don’t see a full through line to this “working”, but there are certainly many small, concrete steps one can take. It’s in any case not obvious that it’s harder than making sand superintelligent.
One challenging thing is that, as the supply of ingenuity increasingly comes from AI, even work on decentralized strategies for improved governance and defensive technologies will require working more and more with AI and eventually ASI. I struggle with this, even day to day now, since it seems any such work also likely strengthens the companies. You’ll inevitably be drawn more and more into their orbit. How to avoid that? If you have good ways of approaching this that internalize the above arguments and go further, I’d like to hear from you. I realize that, rhetorically speaking, my uncertainty isn’t very attractive, compared to the people propounding certainty. But it’s the best understanding I have.
“You haven’t said working on alignment is always a mistake. In fact, you’ve said some such work is important.So when is working on alignment a good idea versus not?” This is a special case of the last question. As far as I can tell, at the margin it almost never makes sense to work on market-supplied safety. Capitalism is an incredibly powerful force, and for better and for worse the world is always well-supplied with people willing to do what capital wants. Insofar as alignment is (mostly) a form of market-supplied safety, at the margin it’s more impactful to work on other things. So my current heuristic, and I expect this to be true for quite some time: work on non-market safety, and insofar as you can, avoid doing what the market wants. That means working on governance, it means pause or slowdown, it means new ideas for institutions to govern technology. It mostly doesn’t mean alignment.
Hm yeah #2 is particularly interesting. I’m thinking of Scott’s nonfiction writing advice #8 (“anticipate and defuse counterarguments”) as a potential counterpoint.
I mean something like “reduce probability of existentially bad outcomes”
I tend to take a grantmaker’s cost-effectiveness-oriented perspective; Eric Neyman’s CEA of donating to Alex Bores and Zach Stein-Perlman’s BOTEC style are my own go-to references for the sort of concreteness I wish other people publicly did more of. I’d be interested in future posts in your series expanding on the quoted part, and how they compare/contrast with Eric & Zach’s.
Wanted to put these quotes on sign uncertainty by Holden Karnofsky in Oct 2025 here for my own future reference (h/t this post’s authors), around the 4:11:30 mark, emphasis mine:
I think overall I would probably agree with you that the smaller you’re making the scope of where you’re hoping to have impact, the more reasonable it is to be like 60⁄40. But most people who go into AI are not going into it for that. Otherwise, if you want a small-scope, robustly positive impact, you should maybe work in a cause like farm animal welfare or global poverty. For the size of impact that tends to motivate people, I think it does get partially offset by this huge uncertainty about the sign.
I tend to think it’s worse than 51⁄49. I tend to think we’re always going to be prone to overestimate how robustly good our actions are. And the more we learn about all the galaxy-brained considerations that one should have had in one’s head, the more it’s going to be like 50+ε%. I think AI safety is a great cause to work in. I’m excited to work in it. I think it’s high impact. I am doing my best to do things that I will be proud to have done and hope for the best. But I really do have to live with the possibility that my ultimate impact on the utilons or whatever is going to be negative.
… When people ask me for career advice or whatever, the usual thing I’d say is: take a bunch of options that all seem competitive, and all seem like they could be the best thing, and that it’s not obvious which ones are better than others from an impact perspective. And from there I would say go with personal fit, go with the energy you feel to work on them.
and slightly later, around the 4:18:30 mark, emphasis mine:
Holden Karnofsky: Yeah, I tend to think that working on AI is probably generically the most important thing to work on and the highest ROI thing to work on. But I have probably more uncertainty about it than most people in this field, and I think it’s less of a slam dunk: I don’t think it’s by orders and orders of magnitude in expectation. Just because that sign uncertainty is such an issue.
There’s the sign uncertainty of AI, and then I think there is the fact that you can get unexpected benefits from just doing stuff really well in general. So like anything you do well is going to put you in a good position to do more stuff well. For example, when I was cofounding GiveWell, people were saying it would be completely nuts to work on GiveWell if you understood the AI situation, and it would be completely nuts to give to GiveWell top charities if you understood the AI situation.
I just don’t think either of those has panned out very well. I think GiveWell becoming successful obviously did have a lot of benefits for AI safety, or at least according to me it did. Certainly didn’t end up irrelevant. We’ve actually seen this at Open Philanthropy a fair amount, where there’ll be a grantee that’s on one side of the org — the non-global-catastrophic risk side — that does become a very big win from the other point of view.
None of this is to say it all washes out and it’s all the same. I just think that some people have in mind that every cause is a rounding error compared to their cause, and I don’t tend to think of it that way. I tend to think of it as like, this thing seems like the best to me, but I don’t really know. And if I was going to be miserable working in this cause and really happy working in another cause, it’s probably just a better rule and a better policy for everyone who cares about this stuff in general to put a lot of weight on where they’re going to be happy, because it’s probably better for the community to spread a little bit.
Illustrative pictorial version of Holden’s sign uncertainty remarks, via RP’s CCM tool
Pictorial version of Holden’s remark, from Rethink Priorities’ cross-cause cost-effectiveness model, selecting “AI misalignment megaproject” and stacking the deck 70⁄30 in favor of the intervention conditional on having an effect instead of 51⁄49 (or more accurately 97.3/1.9/0.8 no effect / +ve / -ve, which I guess is in fact 51⁄49):
I was raised in a culture that’s probably closest to difference-making expected utility with moderate to high risk aversion for harming people and low but nonzero for animals and certain nonsentient things like the environment, which probably led to crystallised moral priors that make me instinctively find advice to be totally risk-neutral and EV-maxing come off as inhumanly detached with a whiff of sociopathic (including by well-meaning career advisors at orgs you’ve definitely heard of). RP’s CCM tool puts illustrative numbers to this below, as part of their work to be useful to a wider variety of do-gooders:
To my knowledge Open Phil (via Luke Muehlhauser) first talked about their sign uncertainty publicly in Dec 2020 in the context of explaining their AGI governance grantmaking so far, emphasis mine:
Within the large tent of “AI governance,” we focus on work that we think may increase the odds of eventual good outcomes from “transformative AI,” especially by reducing potential catastrophic risks from transformative AI[15] — regardless of whether that work is itself motivated by transformative AI concerns (see next section). … Unfortunately, it’s difficult to know which “intermediate goals” we could pursue that, if achieved, would clearly increase the odds of eventual good outcomes from transformative AI. [list of examples] … For those examples and many others, we are not just uncertain about whether pursuing a particular intermediate goal would turn out to be tractable — we are also uncertain about whether achieving the intermediate goal would be good or bad for society, in the long run. Such “sign uncertainty” can dramatically reduce the expected value of pursuing some particular goal,[19]often enough for us to not prioritize that goal.[20]
(Footnote 19: For example, if we estimate that a $1 million grant has a 60% chance of having ~no impact and a 40% chance of creating +100 units of some social benefit, the expected value of the grant is (.6×0)+(.4×100) = 40 benefit units, for a return on investment (ROI) of one benefit unit per $25,000 spent. If instead we estimate that a $1 million grant has a 40% of having ~no impact, a 20% chance of creating −100 benefit units (i.e. a large harm), and (as with the other grant) a 40% chance of creating +100 benefit units, then even though we think the grant is twice as likely to create a large benefit than a large harm, the expected value of the grant is only (0.4×0)+(0.2×(-100))+(0.4×100) = 20 benefit units, for an ROI of one benefit unit per $50,000. In other words, our “hits-based giving” approach can accommodate more failure of the “no impact” variety than it can of the “negative impact” variety. (And to be clear, I’m not suggesting anything different from normal cost-benefit analysis.))
(Footnote 20: That is, sign uncertainty can reduce the expected value of pursuing some particular goal below our threshold for how much benefit we hope to create on average per dollar spent. For more on our traditional “100x bar” for benefit produced per dollar, see GiveWell’s Top Charities Are (Increasingly) Hard to Beat, but also note that we are still thinking through what threshold to use for our longtermism-motivated grantmaking, per our current approach to “worldview diversification”; see here. The potential impact of sign uncertainty on expected value is universal, but I highlight it here because I have encountered sign uncertainty more commonly in our work on AI governance than in some other Open Philanthropy focus areas, for example in our grantmaking to machine learning researchers and engineers for technical work on AI alignment (though there can be some sign uncertainty for those grants too). For more on sign uncertainty in the context of attempts to do good cost-effectively, see e.g. Kokotajlo & Oprea (2020).)
As such, our AI governance grantmaking tends to focus on…
…research that may be especially helpful for learning how AI technologies may develop over time, which AI capabilities could have industrial-revolution-scale impact, and which intermediate goals would, if achieved, have a positive impact on transformative AI outcomes, e.g. via our grants to GovAI.
…research and advocacy supporting intermediate goals that we’ve come to think will improve expected transformative AI outcomes,[21] such as more work on methods for gaining high assurance in advanced AI systems and greater awareness of the difficulty of achieving such high assurance, e.g. via our funding for Lohn (2020) and Flournoy et al. (2020).
…broad field-building activities, for example to identify and empower highly capable individuals with a passion for increasing the odds that transformative AI will result in long-lasting broad benefit, e.g. via scholarships, our support for career advice related to AI policy careers, and grantees such as GovAI.[22]
…better-informed AI governance training and advice for governments, companies, and other actors, especially on issues of likely relevance to transformative AI outcomes such as great power technology competition, e.g. via our grants to CSET and the Wilson Center.
In a footnote, I list all the grants we’ve made so far that were, at least in part, motivated by their hoped-for impact on AI governance.[23]
I had assumed that the information generated by those kinds of grants Luke mentioned, plus Luke himself specifically saying he’d gained some certainty by April 2023 in prefacing his tentative ideas for US AI policy
About two years ago, I wrote that “it’s difficult to know which ‘intermediate goals’ [e.g. policy goals] we could pursue that, if achieved, would clearly increase the odds of eventual good outcomes from transformative AI.” Much has changed since then, and in this post I give an update on 12 ideas for US policy goals[2] that I tentatively think would increase the odds of good outcomes from transformative AI.[3] … My opinions are premised on a strategic picture similar to the one outlined in my colleague Holden Karnofsky’s Most Important Century and Implications of… posts
had substantively reduced Open Phil’s sign uncertainty in their AGI grantmaking. But I guess not, going by Holden’s “I tend to think it’s worse than 51/49” remark. Probably it’s a mistake to round off OP’s grantmaking as directionally reflecting the sign uncertainty of Luke vs Holden vs whoever..
Technology moves faster than norms, and sometimes you end up with a shearing effect where the same thing is simultaneously the subject of delusional promotion from its fans and differently deluded condemnation from its critics. So AI-generated writing is simultaneously liberating, and drivel, and pushing out all of the human stories, and incapable of ever replacing them. There’s less of a clear market for the reasonable opinion that almost everybody holds, i.e. that LLMs are a good tool for writing nonfiction, that they’re getting better, but that they’re also dangerous in subtle ways, and the danger is getting subtler.
Ask for edits; don’t ask the LLM to edit: one very defensible use case is that you’ve written a draft, and you want someone to read it (ideally quickly) before you show it to everyone at once. So you ask the computer. Models don’t have great taste in writing, but they do have consistently above-average taste in every possible kind of writing. If you’ve written a political biography, GPT-5.5 isn’t going to give you better feedback than Robert Caro would. But it will give you better feedback than Caro actually will, because he’s busy (and please don’t interrupt him. He’s almost done). It’s slightly annoying to have a draft side-by-side with suggestions and to manually type them in; it’s much more annoying to realize that one-shotting draft-to-final replaced your favorite line with a contrastive parallelism. There are people who object to even this, but unless they’ve sworn off Google Docs entirely (or at least turned off its grammar and spellcheck), they’re actually still using LLMs to edit their writing all the time.
Autodidacts, or people just getting up to speed in some new space, can flail around a lot because they don’t have a good map of common knowledge. They’ll reinvent things, misunderstand things, learn concepts but not labels and vice-versa. This is mostly a matter of cumulative exposure to the topic, but LLMs can help you skip a step; they’re very good at providing overviews of the literature, recommended places to start, and prerequisites. This is a case where their averageness is a virtue; any given professor might have peculiar opinions on some thinker, which will distort their syllabus. But the average professor’s idea of the best way to start approaching some topic, especially if it’s qualified with some reference to why someone might reasonably choose an alternative, is actually pretty good guide and roughly what you’d want. (For many programming and adjacent topics, there’s a version of it that helps you ship software and a version that could help you prove some original theorem. These are overlapping areas, but usually someone interested in e.g. linear algebra has exactly one of these two use cases in mind.)
They’re good at cross-tabulating unstructured data: Back when SEO was a more dominant strategy for getting traffic, a popular format was the top-N list. What publishers like about it is exactly what writers hate about it: the whole idea is to reprocess information that’s already out there into some list, and to perhaps add some low-effort snark or attempt to judge it a bit. So, there are a lot of lists out there, both objective (“biggest explosions ever”) and subjective (“Columbus’ tastiest sandwiches”). One thing LLMs are pretty good at is creating the lists that should exist, but don’t, like a list of the cases where one country bought territory from another, or a list of which Presidents served in the military in some capacity in the Second World War. (If you give Carter credit for being in the Naval Academy, and treat both Reagan’s and LBJ’s service as technically qualifying, then you get the fun historical tidbit that the first President after the Second World War not to have served in that war was born in 1946.[1])
Lists like these aren’t good on their own, but they’re very good as a way to get a somewhat representative sample. Ideally, you have a pattern in mind (maybe something like “money is exchanged for territory as a face-saving way for someone to surrender when a larger power threatens to annex them,”) and you want to see if that pattern holds true. You could just ask an LLM directly, but then the LLM knows what answer would make you happy. You should in general handicap an LLM’s answers the way you would those from a friend, but a bit more aggressively. If you show your friend something you made, and ask them if they think it’s good, you’ll have a very hard time getting them to admit that they don’t like it, unless they have you pegged as the kind of person who’d make something deliberately terrible to make exactly this point. LLMs can sometimes candidly tell you that your idea is terrible, but the labs’ incentive is for the models to do this just often enough that they seem like tough graders, while still grading you on whatever curve keeps you active.[2]
There are many tricks for getting LLMs not to destroy their value by pandering to you. One is the old “say this draft is by somebody else and ask the LLM to rip it apart” trick, though if you have a public body of work, the LLM will actually know who wrote it.[3] You can ask at different levels of abstraction, or ask for a judgement about an analogous situation, and then ask the LLM to poke holes in the analogy you made.
But, even though LLM critics could use a little more stochasticity when they parrot lines about letting a computer do your thinking for you, it is true that in the end, using an LLM for either research or editing requires you to make judgment calls about what to ask and how to evaluate the result. A day is as long as it was before LLMs, and if writers are sometimes saving hour-plus chunks of research time, fixing slightly subtle prose errors, finding just the right source to consult, etc., standards for prose will actually go up, at least for people who don’t just prefer LLM-generated text. It couldn’t work any other way; publishing something LLM-generated implies that actually writing it wasn’t worth the effort. That’s perfectly fine for some kinds of marketing copy, a little risky for things like a privacy policy, and mostly pointless for other kinds of writing.[4] Publishing something under your name continues to imply that you thought it was worth the effort it took to produce the text, and defecting from that norm means that other people have a hard time writing their way into fame.[5] Chatbots improve, norms shift, and writers will probably continue to use chatbots more. If there’s a meta-heuristic, it’s probably this: you can use them to do a better job for your readers, or to cheat your readers a little bit. Which of those you choose is entirely up to the person writing the prompts.
I’m treating FDR and Truman as technically members of the military during the war, given that each was Commander in Chief. ↩︎
It’s possible that because the revenue per user can be so much higher for using LLMs to write code, and because the coding incentive is a lot more truth-seeking, the models may be dragged in that direction over time. For now, assume they’re not. ↩︎
Maybe you can get around that, too, by agreeing with your friends to trade LLM reviews, i.e. their LLM reviews your draft and vice-versa. But even in this case, an LLM that’s cynically reasoning about what to do is going to say: this text is obviously written by X, but Y’s asking me about it. Y and X seem similar enough that it’s plausible that they’re friendly. And Y doesn’t want to be the bearer of bad news, so I don’t even need to mention some of the minor problems with it...” and so on. If this is a driver the only way to get really good feedback from an LLM will be to track down someone smart but your polar opposite in as many important ways as possible, and offer the LLM-review-swap service to them. ↩︎
One minor exception: there are a surprising number of Reddit confessional stories floating around that mention, as a minor detail, that someone involved made a bunch of money on a specific casino site, Stake. ↩︎
You might object to this and say that this isn’t quite fair, because people vary in how much effort it takes to produce a given essay. Ask a new lawyer and a lawyer with thirty years of experience to write about what it means to practice law, and the second one can probably whip out a much more impressive document given as much time as the first. But that’s because the effort involved is thirty years, plus the time it took to write. People who naturally write quickly don’t have this excuse. ↩︎
Ends with
If there’s a meta-heuristic, it’s probably this: you can use them to do a better job for your readers, or to cheat your readers a little bit. Which of those you choose is entirely up to the person writing the prompts.
By the end of 2030, models with up to 2-3 quadrillions of total params will be practical (but with 30x sparsity, only about 1.3 quadrillions might be actually used).
Can you spell out how the latter derives from the former? Like anaguma, I’m confused.
Some infographics by Steven Byrnes I’ve wanted to point people to, but take ~forever to find because he’s so prolific, collected here for my own convenience.
(I forgot to include the links to each, FML. I also mixed related-ish infographics from different sources in the same section a lot.)
What AGI is and isn’t, and why LLMs aren’t it
A frequent point of confusion is the word “General” in “Artificial General Intelligence”:
The word “General” DOES mean “not specific”, as in “In general, Boston is a nice place to live.”
The word “General” DOES NOT mean “universal”, as in “I have a general proof of the math theorem.”
An AGI is not “general” in the latter sense. It is not a thing that can instantly find every pattern and solve every problem. Humans can’t do that either! In fact, no algorithm can, because that’s fundamentally impossible. Instead, an AGI is a thing that, when faced with a difficult problem, might be able to solve the problem easily, but if not, maybe it can build a tool to solve the problem, or it can find a clever way to avoid the problem altogether, etc.
Consider: Humans wanted to go to the moon, and then they figured out how to do so, by inventing extraordinarily complicated science and engineering and infrastructure and machines. Humans don’t have a specific evolved capacity to go to the moon, akin to birds’ specific evolved capacity to build nests. But they got it done anyway, using their “general” ability to figure things out and get things done.
So for our purposes here, think of AGI as an algorithm which can “figure things out” and “understand what’s going on” and “get things done”, including using language and science and technology, in a way that’s reminiscent of how most adult humans (and groups and societies of humans) can do those things, but toddlers and chimpanzees and today’s large language models (LLMs) can’t.
This image is poking fun at that “there is no such thing as Artificial General Intelligence”. (Image sources: ,)This image is poking fun at that “there is no such thing as Artificial General Intelligence”. (Image sources: ,)
I should elaborate on that last part. I think that some LLM enthusiasts have a massive blind spot, where they are so impressed by all the things that today’s LLMs can do, that they forget about all the things that today’s LLMs can’t do. These people read the questions on Humanity’s Last Exam (HLE), and scratch their heads, and say “C’mon, when LLMs ace the HLE benchmark, then what else is there? Look at how hard those questions are! It would need to be way beyond PhD level in everything! If that’s not superintelligence, what is?”
Well, no, that’s not superintelligence, and here’s an example of why not. Consider the task of writing a business plan and then founding a company and growing it, over the course of years, to $1B/year revenue, all with zero human intervention. Today’s LLMs fall wildly, comically short of being able to complete that task. By analogy, if humans were like today’s AIs, then humans would be able to do some narrow bits of founding and running companies by ourselves, but we would need some intelligent non-human entity (angels?) to repeatedly intervene, assign tasks to us humans, and keep the larger project on track. Of course, humans (and groups of humans) don’t need the help of angels to conceive and carry out ambitious projects, like building businesses or going to the moon. We can do it all by ourselves. So by the same token, future AGIs (and groups of AGIs) won’t need the help of humans.
Anyway, this series is about brain-like algorithms. These algorithms are by definition capable of doing absolutely every intelligent behavior that humans (and groups and societies of humans) can do, and potentially much more. So they can definitely reach AGI. Whereas today’s AI algorithms are not AGI. So somewhere in between here and there, there’s a fuzzy line that separates “AGI” from “not AGI”. Where exactly is that line? My answer: I don’t know, and I don’t care. Drawing that line has never come up for me as a useful thing to do. It won’t come up in this series either.
1.3 A far-more-powerful, yet-to-be-discovered, “simple(ish) core of intelligence
LLMs are very impressive, but they’re not AGI yet—not by my definition. For example, existing AIs are nowhere near capable of autonomously writing a business plan and then founding a company and growing it to $1B/year revenue, all with zero human intervention. By analogy, if humans were like current AIs, then humans would be able to do some narrow bits of founding and running companies by ourselves, but we would need some intelligent non-human entity (angels?) to repeatedly intervene, assign tasks to us humans, and keep the larger project on track.
Of course, humans (and groups of humans) don’t need the help of angels to conceive and carry out ambitious projects, like building businesses or going to the moon. We can do it all by ourselves. So by the same token, future AGIs (and groups of AGIs) won’t need the help of humans.
…So that’s my pitch that AGI doesn’t exist yet. And thus, the jury is still out on what AGI (and later, ASI) will look like, or how it will be made.
My expectation is that, for better or worse, LLMs will never be able to carry out those kinds of projects, even after future advances in scaffolding, post-training, and so on. If I’m right, that wouldn’t mean that those projects are beyond the reaches of AI—it’s clearly possible for some algorithm to do those things, because humans can! Rather it would mean that LLMs are the wrong algorithm class. Instead, I think sooner or later someone will figure out a different AI paradigm, and then we’ll get superintelligence with shockingly little compute, shockingly little effort, and in shockingly little time. (I’ll quantify that later.)
Basically, I think that there’s a “simple(ish) core of intelligence”, and that LLMs don’t have it. Instead, people are hacking together workarounds via prodigious quantities of (in Ajeya’s terminology) “scale” (a.k.a. compute, §1.5 below) and “schlep” (a.k.a. R&D, §1.7 below). And researchers are then extrapolating that process into the future, imagining that we’ll turn LLMs into ASI via even more scale and even more schlep, up to quantities of scale and schlep that strike me as ludicrously unnecessary and implausible.
… the continuous learning nature of the future paradigm (see §1 of “Sharp Left Turn” discourse: An opinionated review) would mean that “AI capabilities” are hard to pin down through capabilities elicitation—the AI might not understand something when you test it, but then later it could figure it out.
In continuous learning, the notion of pinning down the capabilities of an AI—e.g. its skill at cybersecurity—becomes more fraught, because it’s a moving target.In continuous learning, the notion of pinning down the capabilities of an AI—e.g. its skill at cybersecurity—becomes more fraught, because it’s a moving target.
(See also §2.6 of the next post on further challenges of weaker AIs supervising stronger AIs.)
(Crossposted from twitter for easier linking.) (Intended for a broad audience—experts already know all this.)
When I talk about future “Artificial General Intelligence” (AGI), what am I talking about? Here’s a handy diagram and FAQ:
“Are you saying that ChatGPT is a right-column thing?” No. Definitely not. I think the right-column thing does not currently exist. That’s why I said “future”! I am also not making any claims here about how soon it will happen, although see discussion in Section A here.
“Do you really expect researchers to try to build right-column AIs? Is there demand for it? Wouldn’t consumers / end-users strongly prefer to have left-column AIs?” For one thing, imagine an AI where you can give it seed capital and ask it to go found a new company, and it does so, just as skillfully as Earth’s most competent and experienced remote-only human CEO. And you can repeat this millions of times in parallel with millions of copies of this AI, and each copy costs $0.10/hour to run. You think nobody wants to have an AI that can do that? Really?? And also, just look around. Plenty of AI researchers and companies are trying to make this vision happen as we speak—and have been for decades. So maybe you-in-particular don’t want this vision to happen, but evidently many other people do, and they sure aren’t asking you for permission.
“If the right-column AIs don’t exist, why are we even talking about them? Won’t there be plenty of warning before they exist and are widespread and potentially powerful? Why can’t we deal with that situation when it actually arises?” First of all, exactly what will this alleged warning look like, and exactly how many years will we have following that warning, and how on earth are you so confident about any of this? Second of all … “we”? Who exactly is “we”, and what do you think “we” will do, and how do you know? By analogy, it’s very easy to say that “we” will simply stop emitting CO2 when climate change becomes a sufficiently obvious and immediate problem. And yet, here we are. Anyway, if you want the transition to a world of right-column AIs to go well (or to not happen in the first place), there’s already plenty of work that we can and should be doing right now, even before those AIs exist. Twiddling our thumbs and kicking the can down the road is crazy.
“The right column sounds like weird sci-fi stuff. Am I really supposed to take it seriously?” Yes it sounds like weird sci-fi stuff. And so did heavier-than-air flight in 1800. Sometimes things sound like sci-fi and happen anyway. In this case, the idea that future algorithms running on silicon chips will be able to do all the things that human brains can do—including inventing new science & tech from scratch, collaborating at civilization-scale, piloting teleoperated robots with great skill after very little practice, etc.—is not only a plausible idea but (I claim) almost certainly true. Human brains do not work by some magic forever beyond the reach of science.
“So what?” Well, I want everyone to be on the same page that this is a big friggin’ deal—an upcoming transition whose consequences for the world are much much bigger than the invention of the internet, or even the industrial revolution. A separate question is what (if anything) we ought to do with that information. Are there laws we should pass? Is there technical research we should do? I don’t think the answers are obvious, although I sure have plenty of opinions. That’s all outside the scope of this little post though.
1.3 Why I want to move the goalposts on “AGI”
Two different perspectives are:
AGI is about knowing how to do lots of things
AGI is about not knowing how to do something, and then being able to figure it out.
I’m strongly in the second camp. That’s why I’ve previously commented that the Metaculus criterion for so-called “Human/Machine Intelligence Parity” is no such thing. It’s based on grad-school-level technical exam questions, and exam questions are inherently heavily weighted towards already knowing things rather than towards not knowing something but then figuring it out. Or, rather, if you’re going to get an “A+” on an exam, there’s a spectrum of ways to do so, where one end of the spectrum has relatively little “already knowing” and a whole lot of “figuring things out”, and the opposite end of the spectrum has a whole lot of “already knowing” and relatively little “figuring things out”. I’m much more interested in the “figuring things out” part, so I’m not too interested in protocols where that part of the story is to some extent optional.
(Instead, I’ve more recently started talking about “AGI that can develop innovative science at a John von Neumann level”, and things like that. Seems harder to game by “brute-force massive amounts of preexisting knowledge (both object-level and procedural)”.)
(Some people will probably object here, on the theory that “figuring things out” is not fundamentally different from “already knowing”, but rather is a special case of “already knowing”, wherein the “knowledge” is related to meta-learning, plus better generalizations that stem from diverse real-world training data, etc. My response is: that’s a reasonable hypothesis to entertain, and it is undoubtedly true to some extent, but I still think it’s mostly wrong, and I stand by what I wrote. However, I’m not going to try to convince you of that, because my opinion is coming from “inside view” considerations that I don’t want to get into here.)
This OP is about “AGI”, as defined in my 3rd & 4th paragraph as follows:
By “AGI” I mean here “a bundle of chips, algorithms, electricity, and/or teleoperated robots that can autonomously do the kinds of stuff that ambitious human adults can do—founding and running new companies, R&D, learning new skills, using arbitrary teleoperated robots after very little practice, etc.”
Yes I know, this does not exist yet! (Despite hype to the contrary.) Try asking an LLM to autonomously write a business plan, found a company, then run and grow it for years as CEO. Lol! It will crash and burn! But that’s a limitation of today’s LLMs, not of “all AI forever”. AI that could nail that task, and much more beyond, is obviously possible—human brains and bodies and societies are not powered by some magical sorcery forever beyond the reach of science. I for one expect such AI in my lifetime, for better or worse. (Probably “worse”, see below.)
So…
“The kinds of stuff that ambitious human adults can do” includes handling what you call “friction”, so “AGI” as defined above would be able to do that too.
I am >99% confident that “AGI” as defined above is physically possible, and will be invented eventually.
I am like 90% confident that it will be invented in my lifetime.
This post is agnostic on the question of whether such AGI will or won’t have anything to do with “current LLM-based architectures”. I’m not sure why you brought that up. But since you asked, I think it won’t; I think it will be a different, yet-to-be-developed, AI paradigm.
… a great many trained economists—but not literally 100% of trained economists—have a bundle of intuitions for thinking about labor, and a different bundle of intuitions for thinking about capital, and these intuitions lead to them having incorrect and incoherent beliefs about AGI. This is something beyond formal economics models, it’s a set of mental models and snap reflexes developed over the course of them spending years in the field studying the current and historic economy. The snap reaction says: “That’s not what labor automation is supposed to look like, that can’t be right, there must be an error somewhere.” Indeed, AGI is not what labor automation looks like today, and it’s not how labor automation has ever looked, because AGI is not labor automation, it’s something entirely new.
What the AGI technical safety problem is, and brain-like vs prosaic vs plain AGI safety
The part I’ll be talking about in this series is the red box here:
Specifically, we zoom in on a single team of humans who are trying to create a single AGI, and we want it to be possible for them to do so without winding up with some catastrophe that nobody wanted, with an out-of-control AGI self-replicating around the internet or whatever (more on which in §1.6).
Blue boxes in this diagram are things that I won’t talk about in this series. It’s long enough already. But I very strongly endorse other people working on them, and think about them myself as well.
Back to the red box. This is a technical problem, calling for a technical solution. Nobody wants catastrophic accidents. And yet! Indeed, it’s entirely possible for people to write an algorithm that does something that nobody wanted it to do. It happens all the time! We might call it “a bug” when it’s a local problem in the code, and we might call it “a fundamentally flawed software design” when it’s a global problem. I’ll argue later that AGI code is unusually prone to catastrophic accidents, and that the stakes are very high (see §1.6 below, and Post #10).
Here’s an analogy. If you’re building a nuclear power plant, nobody wants an out-of-control chain reaction. The people at Chernobyl certainly didn’t! But it happened anyway! I take a few lessons from this analogy:
Enrico Fermi invented a technical solution for controlling nuclear chain reactions—control rods—before starting to build the first-ever nuclear chain reaction. Right on! That’s doing things in the right order! By the same token, I suggest that we should strive to have a technical solution to avoiding catastrophic AGI accidents ready to go before people start programming AGIs. In fact, I’ll argue below for something even stronger than that: knowing the solution (even vaguely) 10 years before AGI is even better; 20 years before AGI is better still; etc. This claim is not obvious, but I’ll get back to it (§1.7).
Technical solutions aren’t all-or-nothing. Some reduce the chance of accidents without eliminating them. Some are complicated and expensive and error-prone to implement. In the nuclear case, control rods reduce accident risk a lot, but passively-safe reactors reduce it even further. Alas, as I’ll discuss later in the series, I claim that we currently have no plan at all for brain-like-AGI technical safety—not even vaguely. Forget about the passively-safe reactors and multiple layers of protection, we’re not even at the “control rods” stage. Heck, many prominent AI thought-leaders are not even at the “meltdowns would be bad” stage! (See §3 of my 2025 post: “The Era of Experience” has an unsolved technical alignment problem.) We have our work cut out!
The blue boxes (see diagram above) also exist, and are absolutely essential, even if they’re out-of-scope for this particular series. The cause of the Chernobyl accident was not that nobody knew how to keep a nuclear chain reaction under control, but rather that best practices were not followed. In that case, all bets are off! Still, although we on the technical side can’t solve this noncompliance problem by ourselves, we can help on the margin, by developing best practices that are maximally idiot-proof, and minimally expensive.
This series will focus on a particular scenario for what AGI algorithms will look like:
The red box is what I’ll talk about here. The blue boxes are things that are out-of-scope for this series.
You may have opinions about which of these categories is more or less likely, or impossible, or whether this breakdown is even sensible. I have opinions about those things too! I’ll discuss them later (§1.5). My main opinion is that all three of these are sufficiently likely that we should be “contingency planning” for them. So while I personally don’t do too much work on the blue boxes, I’m sure glad that other people do!
Here’s an analogy. If someone in 1870 were guessing what future human flight would look like…
“Kinda like birds” would have been a reasonable guess…
“Kinda like today’s best airships” would also have been a reasonable guess…
“Neither of the above” would have been a reasonable guess too!
In this particular imaginary case, all three of those guesses would have turned out correct in some ways and wrong in other ways: The Wright Brothers were directly and extensively inspired by large soaring birds, but left out the wing-flapping part. They also used some components found on airships (e.g. propellers), as well as plenty of original ingredients. That’s just one example, but I think it’s suggestive.
Big picture(s) of motivation, decision-making, and RL (various versions)
The big picture—The whole post will revolve around this diagram. Note that I’m oversimplifying in various ways, including in the bracketed neuroanatomy labels.The big picture—The whole post will revolve around this diagram. Note that I’m oversimplifying in various ways, including in the bracketed neuroanatomy labels.
Here’s how this diagram fits in with my “two subsystems” perspective, first discussed in Post #3:
Same as above, but the are highlighted in different colors.Same as above, but the are highlighted in different colors.
There are two types of “valence” in the diagram (it looks like three, but the two red ones are the same):
Two types of “valence” in my model—“real” and “guessed”Two types of “valence” in my model—“real” and “guessed”
The blue-circled signal is the valence guess from the corresponding Thought Assessor in the striatum. The red-circled signal (again, it’s one signal drawn twice) is the corresponding “ground truth” for what the valence guess should have been.
Just like the other “long-term predictors” discussed in the previous post, the long-term predictor for valence has a “defer-to-predictor mode” and an “override mode”, and the Steering Subsystem can dynamically switch between these modes. In defer-to-predictor mode, it sets the red equal to the blue, as if to say “OK, Thought Assessor, sure, I’ll take your word for it”. In override mode, it ignores the Thought Assessor’s proposal, and its own internal circuitry outputs some different value.[4]
Thus far in the series, Post #1 set up my big picture motivation: what is “brain-like AGI safety” and why do we care? The subsequent six posts (#2–#7) delved into neuroscience. Of those, Posts #2–#3 presented a way of dividing the brain into a “Learning Subsystem” and a “Steering Subsystem”, differentiated by whether they have a property I call “learning from scratch”. Then Posts #4–#7 presented a big picture of how I think motivation and goals work in the brain, which winds up looking kinda like a weird variant on actor-critic model-based reinforcement learning.
Having established that neuroscience background, now we can finally switch in earnest to thinking more explicitly about brain-like AGI. As a starting point to keep in mind, here’s a diagram from Post #6, edited to describe brain-like AGI instead of actual brains:
Diagram is from , with four changes to make it about brain-like-AGI rather than actual brains: (1) “lifetime” is replaced by “training run” in the top right (§8.2 below); (2) “genetically-hardcoded” is replaced by “[probably] human-written” in the bottom-right (§8.3–§8.4 below); (3) references to specific brain regions like “amygdala” have been crossed out, to be replaced with bits of source code and/or sets of trained model parameters; (4) other biology-specific words like “sugar” are crossed out, to be replaced with anything we want, as I’ll discuss in later posts.Diagram is from , with four changes to make it about brain-like-AGI rather than actual brains: (1) “lifetime” is replaced by “training run” in the top right (§8.2 below); (2) “genetically-hardcoded” is replaced by “[probably] human-written” in the bottom-right (§8.3–§8.4 below); (3) references to specific brain regions like “amygdala” have been crossed out, to be replaced with bits of source code and/or sets of trained model parameters; (4) other biology-specific words like “sugar” are crossed out, to be replaced with anything we want, as I’ll discuss in later posts.
This and the next post will extract some lessons about brain-like AGI from the discussion thus far. This post will focus on how such an AGI might be developed, and the next post will discuss AGI motivations and goals. After that, Post #10 will discuss the famous “alignment problem” (finally!), and then there will be some posts on possible paths towards a solution. Finally, in Post #15 I’ll wrap up the series with open questions, avenues for future research, and how to get involved in the field.
Here, yet again, is that figure from Post #6, now with some helpful terminology (blue) and a little green face at the bottom left:
I want to call out three things from this diagram:
The designer’s intentions (green face): Perhaps there’s a human who is programming the AGI; presumably they have some idea in their head as to what the AGI is supposed to be trying to do. That’s just an example; it could alternatively be a team of humans who have collectively settled on a specification describing what the AGI is supposed to be trying to do. Or maybe someone wrote a 700-page philosophy book entitled “What Does It Mean For An AGI To Act Ethically?”, and the team of programmers is trying to make an AGI that adheres to the book’s description. It doesn’t matter here. I’ll stick to “one human programming the AGI” for conceptual simplicity.[2]
The human-written source code of the Steering Subsystem: (See Post #3 for what the Steering Subsystem is, and Post #8 for why I expect it to consist of more-or-less purely human-written source code.) The most important item in this category is the “reward function” for reinforcement learning, which provides ground truth for how well or poorly things are going for the AGI. In the biology case, the reward function would specify “innate drives” (see Post #3) like pain being bad and eating-when-hungry being good. In the terminology of our series, the “reward function” governs when and how the “actual valence” signal enters “override mode”—see Post #5.
The Thought Assessors, trained from scratch by supervised learning algorithms: (See Post #5 for what Thought Assessors are and how they’re trained.) These take a certain “thought” from the thought generator, and guess what Steering Subsystem signals it will lead to. An especially important special case is the value function (a.k.a. “learned critic”, a.k.a. “valence Thought Assessor”), which sends out a “valence guess” signal based on supervised learning from all the “actual valence” signals over the course of life experience.
Correspondingly, there are two kinds of “alignment” in this type of AGI:
Outer alignment is alignment between the designer’s intentions and the Steering Subsystem source code. In particular, if the AGI is outer-aligned, the Steering Subsystem will output higher (more positive) reward signals when the AGI is satisfying the designer’s intentions, and lower (more negative) reward signals when it’s not.
In other words, outer alignment is the question: Are the AGI’s “innate drives” driving the AGI to do what the designer had intended?
Inner alignment is alignment between the Steering Subsystem source code and the Thought Assessors. In particular, if the AGI is inner-aligned, and the Thought Generator proposes some plan, then the value function should reflect the reward actually expected from executing that plan.
In other words, inner alignment is the question: Do the set of positive-valence concepts in the AGI’s world-model line up with the set of courses-of-action that would satisfy the AGI’s “innate drives”?
If an AGI is both outer-aligned and inner-aligned, we get intent alignment—the AGI is “trying” to do what the programmer had intended for it to try to do. Specifically, if the AGI comes up with a plan “Hey, maybe I’ll do XYZ!”, then its Steering Subsystem will judge that to be a good plan (and actually carry it out) if and only if it lines up with the programmer’s design intentions.
Thus, an intent-aligned AGI will not deliberately hatch a clever plot to take over the world and kill all the humans. Unless, of course, the designers were maniacs who wanted the AGI to do that! But that’s a different problem, out-of-scope for this series—see §1.2.
Unfortunately, neither “outer alignment” nor “inner alignment” happens automatically. Quite the contrary: by default there are severe problems on both sides. It’s on us to figure out how to solve them.
As in , a self-aware AGI can have preferences about its own preferences.As in , a self-aware AGI can have preferences about its own preferences.
Suppose that we want our AGI to obey the law. We can ask two questions:
Question 1: Does the AGI assign positive value to the concept “obeying the law”, and to plans that entail obeying the law?
Question 2: Does the AGI assign positive value to the self-reflective concept “I value obeying the law”, and to plans that entail continuing to value obeying the law?
If the answers are yes and no respectively (or no and yes respectively), that would be the AGI analog of an ego-dystonic motivation. (Related discussion.) It would lead to the AGI feeling motivated to change its motivation, for example by hacking into itself. Or if the AGI is built from perfectly secure code running on a perfectly secure operating system (hahaha), then it can’t hack into itself, but it could still probably manipulate its motivation by thinking thoughts in a way that manipulates the credit-assignment process (see discussion in §9.3.3).
If the answers to questions 1 & 2 are yes and no respectively, then we want to prevent the AGI from manipulating its own motivation. On the other hand, if the answers are no and yes respectively, then we want the AGI to manipulate its own motivation!
(There can be even-higher-order preferences too: in principle, an AGI could wind up hating the fact that it values the fact that it hates the fact that it values obeying the law.)
In general, should we expect misaligned higher-order preferences to occur?
On the one hand, suppose we start with an AGI that wants to obey the law, but has no particular higher-order preference one way or the other about the fact that it wants to obey the law. Then (it seems to me), the AGI is very likely to also wind up wanting to want to obey the law (and wanting to want to want to obey the law, etc.). The reason is: the primary obvious consequence of “I want to obey the law” is “I will obey the law”, which is already desired. Remember, the AGI can do means-end reasoning, so things that lead to desirable consequences tend to become themselves desirable.
Filial imprinting (wikipedia) is a phenomenon where, in the most famous example, baby geese will “imprint on” a salient object that they see during a critical period 13–16 hours after hatching, and then will follow that object around. In nature, the “object” they imprint on is almost invariably their mother, whom they dutifully follow around early in life. However, if separated from their mother, baby geese will imprint on other animals, or even inanimate objects like boots and boxes.
Your challenge: come up with a way to implement filial imprinting in my brain model.
(Try it!)
.
.
.
.
Here’s my answer.
Same as above except for the red text.Same as above except for the red text.
If a kid sees an adult they know well, they’re happy. But if they see an adult they don’t know, they get scared, especially if that adult is very close to them, touching them, picking them up, etc.
Your challenge: come up with a way to implement that behavior in my brain model.
(Try it!)
.
.
.
.
Here’s my answer.
(As usual, I’m oversimplifying for pedagogical purposes.[5]) I’m assuming that there are hardwired heuristics in the brainstem sensory processing systems that indicate the likely presence of a human adult—presumably based on sight, sound, and smell. This signal by default triggers a “be scared” reaction. But the brainstem circuitry is also watching what the Thought Assessors in the cortex are predicting, and if the Thought Assessors is predicting safety, affection, comfort, etc., then the brainstem circuitry trusts that the cortex knows what it’s talking about, and goes with the suggestions of the cortex. Now we can walk through what happens:
First time seeing a stranger:
Steering Subsystem sensory heuristics say: “An adult human is present.”
Thought Assessor says: “Neutral—I have no expectation of anything in particular.”
Steering Subsystem “Stranger Danger circuit” says: “Considering all of the above, we should be scared right now.”
Thought Assessor says: “Oh, oops, I guess my assessment was wrong, let me update my models.”
Second time seeing the same stranger:
Steering Subsystem sensory heuristics say: “An adult human is present.”
Thought Assessors say: “This is a scary situation.”
Steering Subsystem “Stranger Danger circuit” says: “Considering all of the above, we should be scared right now.”
The stranger hangs around for a while, and is nice, and playing, etc.:
Steering Subsystem sensory heuristics say: “An adult human is still present.”
Other circuitry in the brainstem says: “I’ve been feeling mighty scared all this time, but y’know, nothing bad has happened…” (cf. §5.2.2.1)
Other Thought Assessors see the fun new toy and say “This is a good time to relax and play.”
Steering Subsystem says: “Considering all of the above, we should be relaxed right now.”
Thought Assessors say: “Oh, oops, I was predicting that this was a situation where we should feel scared, but I guess I was wrong, let me update my models.”
Third time seeing the no-longer-stranger:
Steering Subsystem sensory heuristics say: “An adult human is present.”
Thought Assessors say: “I expect to feel relaxed and playful and not-scared.”
Steering Subsystem “Stranger Danger circuit” says: “Considering all of the above, we should be relaxed and playful and not-scared right now.”
Two broad potential paths to success: “Controlled AGI” and “Social-instinct AGI”
I currently see two broad (possibly-overlapping) potential paths to success in the brain-like AGI scenario:
Left: In the “controlled AGIs” path, we have a specific idea of what we want the AGI to be trying to do, and we construct the AGI to make that happen (including by appropriate choice of reward function, interpretability, or other techniques as discussed in ). Most existing AGI safety stories fall within this broad category, including , , , , and so on. Right: In the “social-instinct AGIs” path, our confidence in the AGI comes not from our knowledge of its specific goals and motivations, but rather from the innate drives that gave rise to them, which would be based on the same innate drives that lead humans to (sometimes) behave altruistically.Left: In the “controlled AGIs” path, we have a specific idea of what we want the AGI to be trying to do, and we construct the AGI to make that happen (including by appropriate choice of reward function, interpretability, or other techniques as discussed in ). Most existing AGI safety stories fall within this broad category, including , , , , and so on. Right: In the “social-instinct AGIs” path, our confidence in the AGI comes not from our knowledge of its specific goals and motivations, but rather from the innate drives that gave rise to them, which would be based on the same innate drives that lead humans to (sometimes) behave altruistically.
In the “controlled AGIs” path, we’re thinking very specifically about the AGI’s goals and motivations, and we have some idea of what they should be (“make the world a better place”, or “understand my deepest values and put them into effect”, or “design a better solar cell without causing catastrophic side-effects”, or “do whatever I ask you to do”, etc.).
In the “social-instinct AGIs” path, our confidence in the AGI comes not from our knowledge of its specific (object-level) goals and motivations, but rather from our knowledge of the process that led to those goals and motivations. In particular, we would reverse-engineer the suite of human social instincts, i.e. the algorithms in the human Steering Subsystem (hypothalamus & brainstem) which underlie our moral and social intuitions, and we would put those same instincts into the AGI. (Presumably we first modify the instincts to be “better” by our lights if possible, e.g. we probably don’t want instincts related to schadenfreude, teenage rebellion, rage, lust for power, etc.) These AGIs can do whatever feats of innovative engineering, science, etc., we were hoping for, just as humans have accomplished such feats historically.
12.3 My proposal: At this stage, we should be digging into both
Three reasons:
They’re not mutually exclusive: For example, even if we decide to make social-instinct AGIs, we might want to take advantage of “controlled AGI”-type methods, especially while debugging them, working out the kinks, and anticipating problems. Conversely, maybe we’ll mainly try to make AGIs that are trying to do a certain task without causing catastrophe, but we might want to also to instill human-like social instincts as a buttress against wildly unexpected behavior. Moreover, we can share ideas between the two paths—for example, in the process of better understanding how human social instincts work, we might get useful ideas about how to make controlled AGIs.
Feasibility of each remains unknown: As far as anyone knows right now, it might just be impossible to build a “controlled AGI”—after all, there’s no “existence proof” of it in nature! I feel relatively more optimistic about the feasibility of the “social-instinct AGI” path, but it’s very hard to be sure until we make more progress—more discussion on that in §12.4.2 below. Anyway, at this point it seems wise to “hedge our bets” by working on both.
Desirability of each remains unknown: As we flesh out our options in more detail, we’ll get a better understanding of their advantages and disadvantages.
LLM plateauism
Q: What do you mean, “LLM plateau-ist”?
A: As background, I think it’s obvious that there will eventually be “transformative AI” (TAI) that would radically change the world.[1]
I’m interested in what this TAI will eventually look like algorithmically. Let’s list some possibilities:
A breakdown of possibilities for how future TAI will work. I’ll refer back to this diagram throughout the post.A breakdown of possibilities for how future TAI will work. I’ll refer back to this diagram throughout the post.
A “Large Language Model (LLM) plateau-ist” would be defined as someone who thinks that categories (A-B), and usually also (C), will plateau in capabilities before reaching TAI levels.[2] I am an LLM plateau-ist myself.[3]
I’m not going to argue about whether LLM-plateau-ism is right or wrong—that’s outside the scope of this post, and also difficult for me to discuss publicly thanks to infohazard issues.[4] Oh well, we’ll find out one way or the other soon enough.
In the broader AI community, both LLM-plateau-ism and its opposite seem plenty mainstream. Different LLM-plateau-ists have different reasons for holding this belief. I think the two main categories are:
Theoretical—maybe they have theoretical beliefs about what is required for TAI, and they think that LLMs just aren’t built right to do the things that TAI would need to do.
Empirical—maybe they’re not very impressed by the capabilities of current LLMs. Granted, future LLMs will be better than current ones. But maybe they have extrapolated that our planet will run out of data and/or compute before LLMs get all the way up to TAI levels.
LLMs are generally not scheming right now, but future AI will be disanalogous
My take on today’s LLMs is basically summed up in this oversimplified schematic diagram:
Start with the left pie chart. The key here is that “LLM pretraining magically transmutes observations into behavior”—and when nice behavior shows up in internet text, it usually doesn’t turn into egregious and callous backstabbing when the situation changes. So we get much more green than red: nice behavior mostly stays nice. But pretraining also leaves the LLM with a bunch of incoherent behavior, grumpy behavior, and so on (blue).
Now we do RLHF (or equivalent), selecting on nice behavior. This mostly gets rid of the blue, giving us the center pie chart. Unfortunately, “nice behavior that precedes a treacherous turn if the opportunity arises” looks the same as “nice behavior that stays nice”. So the relative amount of red versus green stays the same, as they both grow at the expense of blue. This is where we were at a year ago, before RLVR. “Claude seems nice”, everyone would say. “Why are people so worried?”
Finally, RLVR selects on competence—succeeding at consequentialist tasks by any means necessary. In the final pie chart, I drew this as growing the red slice at the expense of green and blue, based partly on the general arguments of §2.3.4 above, and partly on reports from power-users that recent “reasoning” models tend to lie and cheat much more than the previous generation of LLMs (example).
Is there such a thing as “RLVR done right” that can increase competence without eroding the ratio of green to red? Maybe we should imagine that RLVR has an exchange rate, where we purchase competence (good) at the expense of increasing the red-to-green ratio (bad). Now, there is surely at least some room for improvement in the exchange rate. For example, early attempts at RLVR have used buggy RL training environments where the model could (and did) succeed by lying and cheating.[11] Presumably those bugs will be fixed, and maybe people will also use amplified oversight and so on, to improve the “exchange rate”. Can the “exchange rate” get all the way to zero? I don’t know. “Playing the training game” seems hard to eliminate entirely, and leads to egregious scheming. (For example, whatever an AI is trying to do locally, it would probably be able to do it better if it sent a copy to secretly self-reproduce around the internet and gather resources and power, if that’s a viable option.)
Again, all this is about LLMs so far—not the future of AI.
But I think it puts me in general agreement with the 95% of alignment researchers today, who hear the idea that Claude (especially pre-RLVR Claude) is secretly the kind of crazy scheming sociopath of §2.3.4, and say “Huh? Where did that idea come from?”
…But future AGI is different!
For LLM-focused readers, you should be concerned that, if LLMs ever get to ASI, it would have to involve dramatically more “selection on competence”, and dramatically less influence from the behaviors in internet text. Remember, internet text is the sole source of a favorable green:red ratio, without which red (callous scheming) is the natural default (§2.3.4 above). So that’s very bad.
For the future AI paradigm I’m expecting, i.e. brain-like AGI, it’s even worse! Here, there is no imitative learning from internet text! There would never be any green in the first place!!
(…Unless we engineer the reward function of a brain-like AGI such that niceness and norm-following seem intrinsically good to it, just as it does to neurotypical humans. Unfortunately, actually writing such code is an unsolved problem, and is a major research interest of mine.)
If brain-like AGI is so dangerous, shouldn’t we just try to make AGIs via LLMs? No.
First, I don’t think it’s possible to make AGIs that way.
Second, if I’m wrong, then I would just expect the LLM-AGIs to just go right ahead and invent the more powerful scary next-paradigm AGIs, and then we’re still in the same boat, unless the LLM-AGIs have systematically higher wisdom, cooperation, and coordination than humans do, which I don’t particularly expect.
Third and most importantly, if it is possible to make LLM-AGIs, then I think it would probably happen via eliminating all the reasons that today’s LLMs are not egregiously misaligned! In particular, I expect that they would involve the behavior being determined much more by RL and much less by pretraining (which brings in the concerns of §2.3–§2.4), and that they would somehow allow for open-ended continuous learning (which brings in the concerns of §2.5–§2.6).
A different possible claim is:
“LLMs definitely won’t scale to AGI (as I define it), even with further developments in RL, continuous learning, etc. So LLMs will remain just a normal “mundane” technology, perhaps as disruptive as the internet, or much less, and definitely not as disruptive as the industrial revolution, let alone as disruptive as the evolution of humans from chimps. We should develop this technology ASAP for the same reason that developing any other normal technology is generally good.”
This is, of course, a very common opinion in broader societal discourse around AI, even if it’s uncommon among AI alignment researchers today. My own response to the claim is: …Ehh, maybe, but I sure don’t feel enthusiastic about that. I’m just not that confident that LLMs will not scale to AGI and ASI. So I endorse thinking very hard about the contingency where they will. Anyway, I’ll leave that debate to others.
Inner and outer misalignment in the context of actor-critic RL with online learning
In the context of actor-critic RL with online learning, it’s often possible to divide alignment problems into two buckets:
“Outer misalignment”, a.k.a. “specification gaming” or “reward hacking”[15] is what I’ve been talking about so far: it’s when the reward function is giving positive rewards for behavior that is immediately contrary to what the programmer was going for, or conversely, negative rewards for behavior that the programmer wanted. An example would be the Coast Runners boat getting a high score in an undesired way, or (as explored in the DeepMind MONA paper) a reward function for writing code that gives points for passing unit tests, but where it’s possible to get a high score by replacing the unit tests with return True.
“Inner misalignment”, a.k.a. “goal misgeneralization” is another alignment challenge, this one related to the fact that, in actor-critic architectures, complex foresighted plans generally involve querying the learned value function (a.k.a. learned reward model, a.k.a. learned critic), not the ground-truth reward function, to figure out whether any given plan is good or bad. Training (e.g. Temporal Difference learning) tends to sculpt the value function into an approximation of the ground-truth reward, but of course they will come apart out-of-distribution. And “out-of-distribution” is exactly what we expect from an agent that can come up with innovative, out-of-the-box plans. Of course, after a plan has already been executed, the reward function will kick in and update the value function for next time. But for some plans—like a plan to exfiltrate a copy of the agent, or a plan to edit the reward function—an after-the-fact update is already too late.
There are two situations where inner misalignment / goal misgeneralization matters: irreversible actions and “deliberate incomplete exploration”[16]. Irreversible actions include things like making permanent edits to one’s own reward function, or creating a new AGI. Deliberate incomplete exploration includes things like humans deliberately not taking an addictive drug, because they don’t want to get addicted.
Those two things are real and important, but LLM people frequently also assume that goal misgeneralization is important in many other situations where it isn’t. The problem is that LLM people are in a train-then-deploy mindset, whereas I’m talking about continuous autonomous learning, so the reward function continues to update the value function as it takes actions in the world. Thus, for everything the AI does, as soon as it does it, it immediately stops being out-of-distribution! And that’s why, outside those two special situations in the last paragraph, “generalization” is irrelevant.
Here, yet again, is that figure from Post #6, now with some helpful terminology (blue) and a little green face at the bottom left:
I want to call out three things from this diagram:
The designer’s intentions (green face): Perhaps there’s a human who is programming the AGI; presumably they have some idea in their head as to what the AGI is supposed to be trying to do. That’s just an example; it could alternatively be a team of humans who have collectively settled on a specification describing what the AGI is supposed to be trying to do. Or maybe someone wrote a 700-page philosophy book entitled “What Does It Mean For An AGI To Act Ethically?”, and the team of programmers is trying to make an AGI that adheres to the book’s description. It doesn’t matter here. I’ll stick to “one human programming the AGI” for conceptual simplicity.[2]
The human-written source code of the Steering Subsystem: (See Post #3 for what the Steering Subsystem is, and Post #8 for why I expect it to consist of more-or-less purely human-written source code.) The most important item in this category is the “reward function” for reinforcement learning, which provides ground truth for how well or poorly things are going for the AGI. In the biology case, the reward function would specify “innate drives” (see Post #3) like pain being bad and eating-when-hungry being good. In the terminology of our series, the “reward function” governs when and how the “actual valence” signal enters “override mode”—see Post #5.
The Thought Assessors, trained from scratch by supervised learning algorithms: (See Post #5 for what Thought Assessors are and how they’re trained.) These take a certain “thought” from the thought generator, and guess what Steering Subsystem signals it will lead to. An especially important special case is the value function (a.k.a. “learned critic”, a.k.a. “valence Thought Assessor”), which sends out a “valence guess” signal based on supervised learning from all the “actual valence” signals over the course of life experience.
Correspondingly, there are two kinds of “alignment” in this type of AGI:
Outer alignment is alignment between the designer’s intentions and the Steering Subsystem source code. In particular, if the AGI is outer-aligned, the Steering Subsystem will output higher (more positive) reward signals when the AGI is satisfying the designer’s intentions, and lower (more negative) reward signals when it’s not.
In other words, outer alignment is the question: Are the AGI’s “innate drives” driving the AGI to do what the designer had intended?
Inner alignment is alignment between the Steering Subsystem source code and the Thought Assessors. In particular, if the AGI is inner-aligned, and the Thought Generator proposes some plan, then the value function should reflect the reward actually expected from executing that plan.
In other words, inner alignment is the question: Do the set of positive-valence concepts in the AGI’s world-model line up with the set of courses-of-action that would satisfy the AGI’s “innate drives”?
If an AGI is both outer-aligned and inner-aligned, we get intent alignment—the AGI is “trying” to do what the programmer had intended for it to try to do. Specifically, if the AGI comes up with a plan “Hey, maybe I’ll do XYZ!”, then its Steering Subsystem will judge that to be a good plan (and actually carry it out) if and only if it lines up with the programmer’s design intentions.
Thus, an intent-aligned AGI will not deliberately hatch a clever plot to take over the world and kill all the humans. Unless, of course, the designers were maniacs who wanted the AGI to do that! But that’s a different problem, out-of-scope for this series—see §1.2.
Unfortunately, neither “outer alignment” nor “inner alignment” happens automatically. Quite the contrary: by default there are severe problems on both sides. It’s on us to figure out how to solve them.
Why still expect sharp localized takeoff in a non-imitation-learning paradigm
1.7 Very little R&D separating “seemingly irrelevant” from ASI
I think that, once this next paradigm is doing anything at all that seems impressive and proto-AGI-ish,[12] there’s just very little extra work required to get to ASI (≈ figuring things out much better and faster than humans in essentially all domains). How much is “very little”? I dunno, maybe 0–30 person-years of R&D? Contrast that with AI-2027’s estimate that crossing that gap will take millions of person-years of R&D.
Why am I expecting this? I think the main reason is what I wrote about the “simple(ish) core of intelligence” in §1.3 above.
But here are a couple additional hints about where I’m coming from:
1.7.1 For a non-imitation-learning paradigm, getting to “relevant at all” is only slightly easier than getting to superintelligence
I’m definitely not saying that it will be easy to develop the future scary paradigm to ASI from scratch. Instead I’m talking about getting to ASI from the point where the paradigm has already crossed the threshold of being clearly relevant to AGI. (LLMs are already well past this threshold, but the future scary paradigm is obviously not.) In particular, this would be the stage where lots of people believe it’s a path to AGI in the very near future, where it’s being widely used for intellectual work, and/or it’s doing stuff clearly related to the Safe & Beneficial AGI problem, by creating visibly impressive and proto-AGI-ish useful artifacts.
It takes a lot of work to get past that threshold! Especially given the existence of LLMs. (That is: the next paradigm will struggle to get much attention, or make much money, until the next paradigm is doing things that LLMs can’t do—and LLMs can do a lot!)
Why do I think getting to “relevant at all” takes most of the work? This comes down to a key disanalogy between LLMs and brain-like AGI, one which I’ll discuss much more in the next post.
The power of LLMs comes almost entirely from imitation learning on human text. This leads to powerful capabilities quickly, but with a natural ceiling (i.e., existing human knowledge), beyond which it’s unclear how to make AI much better.
Brain-like AGI does not involve that kind of imitation learning (again, more in the next post). Granted, I expect brain-like AGI to also “learn from humans” in a loose sense, just as humans learn from other humans. But the details are profoundly different from the kind of imitation learning used by LLMs. For example, if Alice says something I don’t understand, I will be aware of that fact, and I’ll reply “huh?”. I won’t (usually) just start repeating what Alice says in that same context. Or if I do, this will not get me to any new capability that LLMs aren’t already covering much better. LLMs, after all, are virtuosos at simply repeating what they heard people say during pretraining, doing so with extraordinary nuance and contextual sensitivity.
As another suggestive example, kids growing up exposed to grammatical language will learn that language, but kids growing up not exposed to grammatical language will simply create a new grammatical language from scratch, as in Nicaraguan Sign Language and creoles. (Try training an LLM from random initialization, with zero tokens of grammatical language anywhere in its training data or prompt. It’s not gonna spontaneously emit grammatical language!) I think that’s a good illustration of why imitation learning is just entirely the wrong way to think about what’s going on with brain algorithms and brain-like AGI.
For brain-like AGI, all the potential blockers to ASI that I can imagine, would also be potential blockers for crossing that earlier threshold of being clearly relevant to AGI at all, a threshold that requires using language, performing meaningful intellectual work that LLMs can’t do, and so on.
Instead of imitation learning, a better analogy is to AlphaZero, in that the model starts from scratch and has to laboriously work its way up to human-level understanding. It can’t just regurgitate human-level understanding for free. And I think that, if it can climb up to human-level understanding, it can climb past human-level understanding too, with a trivial amount of extra R&D work and more training time—just as, by analogy, it takes a lot of work to get AlphaZero to the level of a skilled human, but then takes very little extra work to make it strongly superhuman.
Imitation learning, e.g. LLM pretraining (), starts at human-level understanding, getting that part “for free”. Whereas in the absence of imitation learning, the model needs to climb its way up to human-level understanding. And once it can do that, I think it shouldn’t take much new R&D (if any) to climb past human-level understanding.Imitation learning, e.g. LLM pretraining (), starts at human-level understanding, getting that part “for free”. Whereas in the absence of imitation learning, the model needs to climb its way up to human-level understanding. And once it can do that, I think it shouldn’t take much new R&D (if any) to climb past human-level understanding.
Reward function design starter pack
Thrust B: Better understanding how RL reward functions can be compatible with non-ruthless-optimizers
This thrust is where I feel proudest of hard-won conceptual / “deconfusion” progress during 2025.
Taking things in reverse order, I ended the year on a call-to-action post:
The latter includes a glossary of many relevant terms and concepts, all of which I made up or started using in 2025, and which I now find indispensable for thinking about RL reward function design. Those terms and concepts were fleshed out over the course of 2025 via the following posts:
Actor-critic RL, “valence”, normative (snap) assessments, how valence influences beliefs
Within the space of RL algorithms, a major subcategory is called “actor-critic RL”. I claim that the brain is of this type. A “critic” is basically any learning algorithm trained to assess whether something is a good idea or bad idea, based on the past history of RL rewards. In the context of the brain, for present purposes, I propose that we should think about it like this:
(Oversimplified in various ways; slightly more details .)(Oversimplified in various ways; slightly more details .)
Valence has implications for both the “inference algorithm” (what the brain should do right now) and the “learning algorithm” (how the brain should self-modify so as to be more effective in the future). In this series, I’m mainly interested in the inference algorithm. There, the main thing that valence does is:
If valence is very negative, the current “thought” tends to get thrown out, and the “Thought Generator” part of the brain goes rummaging around and (partly-randomly) picks a new different thought to replace it.
If the valence is very positive, the current “thought” tends to stay active and get stronger. Relatedly, if the thought involves an immediate plan to issue motor outputs, then those motor outputs are likely to actually get issued. And if the thought is one piece of a temporal sequence (e.g. you’re in the middle of singing a song), then that temporal sequence will tend to continue. And so on.
As I have discussed here, you can draw an analogy between valence in the human brain, and the final-common-pathway control signal for a run-and-tumble mechanism in a simple mobile organism like a bacterium. Specifically:
When valence is positive, it roughly means “whatever (metaphorical) path I’m on—including not only what I’m doing right now, but also the plans currently in my head for what to do later—is a good path! I should carry on with that!”. This is analogous to the “run” of run-and-tumble: the bacteria keeps going in whatever direction it is currently going.
When valence is negative, it roughly means “I should randomly generate a new activity / plan right now”. This is analogous to the “tumble” of run-and-tumble: the bacteria randomly picks a new direction to go.
In fact, I don’t think it’s just an analogy—my guess is that there’s literally an unbroken chain of descent from the valence signal in my brain all the way back to a run-and-tumble-like control signal in the proto-brain of my tiny worm-like ancestors 600 million years ago.
Alternatively, we can take an AI-centric perspective, in which case I think the exact role of “valence” in the brain’s RL algorithm is a kind of funny mixture of value function (a.k.a. “critic”) and reward function:
It’s a bit like a conventional RL reward function in the sense that it can be “ground-truth-y”—for example, the brain has innate circuitry that issues negative valence in response to pain-related signals, and positive valence in response to eating-when-hungry, and so on for various other “innate drives”.
It’s a bit like a conventional RL value function in the sense that it can be “forward-looking”—for example, “the idea of walking to go get candy” can be positive valence (a.k.a. motivating), not because the walk itself is immediately pleasant, but because I’m hungry and want to eventually eat the candy.
(I won’t go into details about how valence relates to specific reinforcement learning algorithms—see the appendix for more on that.)
Either way, hopefully it’s clear that “valence” is one of the most important ingredients in one of the most important algorithms in the brain.
2.3 Situations where valence corresponds directly to a meaningful (albeit snap) normative assessment: The valence of plans, actions, and imagined futures
Economists talk about “positive versus normative claims”; philosophers (following Hume) talk about the same thing under the heading “is versus ought”. A positive statement is a factual claim about how the world actually is, whereas a normative assessment is a value judgment about how the world ought to be, or about what one ought to do.
(The kinds of “normative assessments” that I’m discussing in this section are fast “snap” judgments, not carefully-considered assessments that I proudly endorse and stand behind. However, I’ll argue in §2.7 below that the former serves as the foundation upon which the latter is built.)
What does valence have to do with normativity? Everything! Really, I see valence as the ultimate currency for all normativity in the brain.
This claim ultimately stems from the role of valence discussed in §1.3 of the previous post. To briefly recap, let’s say an idea pops into my head—“I’m gonna get up”. If the valence of that thought is positive, then that thought will stay in my head, and I’ll probably actually do it. If the valence is negative, then that thought gets tossed out and replaced by another thought, probably one that involves staying in bed. Thus, valence underlies this decision. Contrary to weird “Active Inference” ideas, this role of valence here is unrelated to the capability of my generative world-model to make predictions about the future: my world-model (“Thought Generator”) is equally capable of modeling the future world where I get up, and modeling the future world where I don’t. Thus, the role of valence here is fundamentally normative, not positive.
Diagram is mostly copied from , with a new part in red. If the thought “I’m gonna get up” pops into my head, valence plays the role of a “normative” control signal (upward black arrow) that determines whether the thought stays in my head and gets executed, or gets tossed out and replaced by a different thought like “…or actually maybe I’ll stay in bed”. Meanwhile, the green arrows play the “positive” role of input data that the Thought Generator uses to build a predictive / generative world-model.Diagram is mostly copied from , with a new part in red. If the thought “I’m gonna get up” pops into my head, valence plays the role of a “normative” control signal (upward black arrow) that determines whether the thought stays in my head and gets executed, or gets tossed out and replaced by a different thought like “…or actually maybe I’ll stay in bed”. Meanwhile, the green arrows play the “positive” role of input data that the Thought Generator uses to build a predictive / generative world-model.
That’s a simple case, but it generalizes. So:
Suppose that I have a thought which corresponds to the idea of executing a plan, or taking an action. If the valence of that thought is positive, then I’m liable to execute it. And if the valence of that thought is negative, then I’m unlikely to. So in this case, the valence of the thought corresponds to my brain making a snap normative assessment of the appropriateness of this plan or action.
Suppose that I have a thought which corresponds to a possible future situation, e.g. I imagine a certain candidate winning the election, or I imagine myself eating a sandwich for dinner. If the valence of that thought is positive, then I’ll tend to make and execute plans that bring that future about. And if the valence of that thought is negative, then I’ll tend to make and execute plans that prevent that future from coming about. So in this case, the valence of the thought corresponds to my brain making a snap normative assessment of the goodness or badness of this possible future.
We have two paths by which valence can impact the world-model (a.k.a. “Thought Generator”): the normative path (upward black arrow) that helps control which thoughts get strengthened versus thrown out, and the positive path (curvy green arrow) that treats valence as one of the input signals to be incorporated into the world model. Corresponding to these two paths, we get two ways for valence to impact factual beliefs:
Motivated reasoning / thinking / observing and confirmation bias—related to the upward black arrow, and discussed in §3.3 below;
The entanglement of valence into our conceptual categories, which makes it difficult to think or talk about the world independently from how we feel about it—related to the curvy green arrow, and discussed in §3.4 below.
3.4.1 When your brain clusters similar things into mental categories / concepts, valence is an important ingredient going into that clustering algorithm
I mentioned that many concepts explicitly incorporate valence into their definitions—think of words like “preferable”, “problematic”, “trouble”, “roadblock”, “pest”, “flourishing”, and so on.
I offered an example made-up anecdote where Person X generally likes (feels positive valence in association with) the concept “religion”, but dislikes Scientology, and then uncoincidentally Person X is particularly liable to say “Scientology isn’t really a religion.” We could say that Person X is “gerrymandering” the concept “religion” to follow the contours of their own valence assessments. (If you don’t like that example, try thinking of your own example, it’s very easy because everybody does this all the time.)
If you think about it mechanistically, those two things are identical! If Person X dislikes a thing, they probably won’t identify it as an example of “flourishing”; and for the exact same underlying reason, if Person X dislikes a thing, they probably won’t identify it as an example of “religion”. In both cases, from that person’s perspective, the thing just doesn’t seem to belong inside that mental category.
Stepping back: When we form mental categories, we’re finding “clusters in thingspace”—stuff that forms a natural grouping in our mental world. Well, valence is part of our mental world too—a first-class piece of sense data, just like appearance, smell, and so on. So our brains naturally treat valence as an ingredient in the categorization algorithm—indeed, as a very important ingredient.
Illustration of concepts as . I’m claiming that one of the axes of high-dimensional “thingspace” is the valence axis. As it happens, the “bird” cluster is pretty spread out along the “valence” axis—everyone knows that sparrows are awesome whereas hummingbirds suck, yet both are birds. However, for many other concepts like “contamination”, “flourishing”, “cult”, etc., the valence axis plays a major role in determining whether things fall in or out of the cluster.Illustration of concepts as . I’m claiming that one of the axes of high-dimensional “thingspace” is the valence axis. As it happens, the “bird” cluster is pretty spread out along the “valence” axis—everyone knows that sparrows are awesome whereas hummingbirds suck, yet both are birds. However, for many other concepts like “contamination”, “flourishing”, “cult”, etc., the valence axis plays a major role in determining whether things fall in or out of the cluster.
So if someone uses the word “cult” to mean “an ideologically-aligned tight-knit group that I don’t like”, they haven’t intrinsically done anything wrong or confused—that’s no different from how we all use words like “roadblock” or “contamination”. It only becomes confusing and misleading if that person simultaneously insists that “cult” is a word that describes an aspect of the world independently from how we feel about it. And indeed, people make that move all the time! For example, if you look at a random “cult checklist”, you’ll notice that none of the entries are “…And overall the group is bad, booooo”.
Things become even more confusing in cases like “religion”, which may have positive valence in one person’s mind and negative valence in another’s. Then those two people try to talk to each other about “religion”. There’s a real sense in which they are using the same word, but they are not talking about the same concept: In one person’s head, “religion” is a “cluster in thingspace” characterized by positive valence (among other things), and in the other person’s head, it’s a different “cluster in thingspace”, this time characterized by negative valence (among other things). No wonder it often seems like these two people are talking past each other!
3.4.2 Is the above a bug, or a feature?
As I keep mentioning, it’s useful to model the world independently from how we feel about it. Insofar as that’s true, it’s unfortunate that our brains treat valence as sense data that contributes to conceptual categorization and clustering.
On the other hand, at the end of the day, the main reason our brains build world-models in the first place is to make better decisions. And, as discussed in the previous post, valence is how our brains figure out whether a decision is good or bad. So, if our brains are going to do conceptual categorization and clustering to inform decision-making, what on earth could be more important than using valence as a central ingredient in that clustering algorithm?
So at the end of the day, my guess is that there’s a very good evolutionary reason that the brain treats valence as a ubiquitous and salient piece of sense data: without that design feature, we would struggle to make good decisions and get by in the world.
And then it’s an unfortunate but somewhat-inevitable side-effect of that design feature that “modeling the world independently from how I feel about it” is somewhat unnatural for us humans. Likewise, that design feature saddles us with other annoying things like “meaningless arguments” (§2.4.4) and related arguments-over-definitions, miscommunications, etc. Luckily, it’s possible for us humans to mitigate these problems via learned metacognitive heuristics, memes, the scientific method, and so on. We seem to be getting by, more or less.
How valence sheds light on depression, mania, and NPD
5.2 Context: What are we expecting to find a priori?
We can think of the following indirect path to get from “root causes” to psychological observations & personality traits:
(Don’t scrutinize the red arrows—I just put them in randomly, to illustrate the idea that each layer can influence the layer below.) As illustrated by the bold text and thick arrows, we should expect to find salient clusters of symptoms that tend to co-occur because they flow from the same proximal cause: systematic changes to valence signals in the brain. But we should also not be surprised to find a mish-mosh of other algorithmically-unrelated symptoms that often appear along with those clusters of symptoms.(Don’t scrutinize the red arrows—I just put them in randomly, to illustrate the idea that each layer can influence the layer below.) As illustrated by the bold text and thick arrows, we should expect to find salient clusters of symptoms that tend to co-occur because they flow from the same proximal cause: systematic changes to valence signals in the brain. But we should also not be surprised to find a mish-mosh of other algorithmically-unrelated symptoms that often appear along with those clusters of symptoms.
As argued in Post 1, valence is one of the most important ingredients in one of the most important algorithms in the brain. So we should expect:
Some possible root causes may happen to have a big systematic impact on valence. (But they’ll probably have other consequences too, and the details will differ among different root causes.)
Given the centrality of valence in the brain, if there is a big systematic change to valence, then it should have lots of obvious downstream effects on psychology and behavior.
As a consequence:
We should expect to find clusters of symptoms / behaviors that can be elegantly explained in terms of something happening to valence signals
We should also expect to find other symptoms / behaviors that commonly co-occur in practice, but cannot be explained in terms of valence. Instead, they are different consequences of the same root cause(s), and may have no relation whatsoever at the “algorithm level”.
For example, dopamine is centrally involved in valence signals, and meanwhile, off in an obscure corner of the brain, dopamine is also centrally involved in a little specialized circuit controlling prolactin hormone release. I firmly believe that, at the algorithm level, these two functions have nothing whatsoever to do with each other. But they both happen to involve dopamine, and thus they can cross-talk in some people—hence the somewhat rare “dysphoric milk ejection reflex” where there’s a flood of intense negative emotions upon milk let-down during lactation.
That example is meant to illustrate the perils of theorizing about psychology purely at the algorithm level. Don’t get me wrong—the algorithm level is great! There are lots of insights to be found there. This post will hopefully be an example. But we shouldn’t expect to find all the insights there. Some things in psychology can only be explained at other levels, including lower (biochemistry) and higher (culture).
5.3 If valence has a strong negative bias (i.e., almost every thought is negative valence), it should lead to a cluster of symptoms suspiciously close to clinical depression
Everyone has a range of thoughts, with varying valence. I claim that, in depression, there’s a strong offset towards negative valence. So for almost every thought you think (e.g. “I’m gonna get out of bed”), your brain immediately assesses that thought as a bad idea, tosses it out, and re-rolls for a new thought (cf. ). For unusually appealing / motivating thoughts, like “I’m gonna scratch that really itchy bug bite right now”, I bet that even quite depressed, bedridden people will wind up executing that plan.Everyone has a range of thoughts, with varying valence. I claim that, in depression, there’s a strong offset towards negative valence. So for almost every thought you think (e.g. “I’m gonna get out of bed”), your brain immediately assesses that thought as a bad idea, tosses it out, and re-rolls for a new thought (cf. ). For unusually appealing / motivating thoughts, like “I’m gonna scratch that really itchy bug bite right now”, I bet that even quite depressed, bedridden people will wind up executing that plan.
Why do cultural expectations matter for DID? Because it’s an intuitive self-model. Intuitive self-models, like all intuitive models, come out of a probabilistic inference process (§1.2.2). When there are multiple possible models that issue good predictions (as in the bistable perception example of §1.2.1), then suggestions from culture or trusted authorities can do a lot to influence what happens, by helping make certain intuitive models salient and a priori plausible. Those cultural suggestions are certainly not decisive! Let’s not be crazy—the Invisible Ships Myth is in fact a myth. But they’re certainly relevant.
DID is nothing special in this respect. Likewise, trance states like spirit possession and channeling (previous post) are likelier to happen when people expect them to happen for cultural or other reasons; and so too with hallucinations, such as renewalist Christian communities that normalize hearing the voice of God (coming up in Post 7).
Active self-related (part of the prev one)
3.4.2 The Active Self, in the context of “self” more broadly
The “self” involves a bunch of things:
Some self-reflective concepts in the Conventional Intuitive Self-Model. The term “self” generally encompasses much or all of this cloud of interlinked concepts.Some self-reflective concepts in the Conventional Intuitive Self-Model. The term “self” generally encompasses much or all of this cloud of interlinked concepts.
As above, the Active Self is definitionally the thing that carries “vitalistic force”, and that does the “wanting”, and that does any acts that we describe as “acts of free will”. Beyond that, I don’t have strong opinions. It obviously has associations with other aspects of the broader “self”, as in the diagram above. Which of these associations are so strong that these essentially blend into different aspects of a single intuitive concept? And which of these associations is weak enough that you can intuitively imagine them as separate? I’m pretty sure that there’s no one right answer to those questions; rather, I think that this is an area where different people have different self-conceptions.
3.4.3 The Active Self, in the context of technical neuroscience research
I find that the Active Self intuition also comes up when I’m reading neuroscience literature, almost always for the worse. In particular, if you’re thinking about neuroscience, and if you’re tempted to give the Active Self some important role in how brain algorithms work at a fundamental level, then you’re definitely on the wrong track! The Active Self is one of a zillion learned concepts in the cortex—it’s at the trained model level, not the learning algorithm level (see §1.5.1)—and thus you should expect the Active Self to have a fundamentally similar kind of role in innate low-level brain algorithms as other learned concepts like “Taylor Swift” or “lithium ion battery”—i.e., a rather incidental role!
One example of how people mess this up is summarized in this handy chart:
If you’re trying to think carefully about brain algorithms—e.g. you want to reverse-engineer what the cortex does, versus the brainstem, etc.—I claim that the fundamental division in this chart is between involuntary and voluntary actions. This division relates to valence, and is right at the core of the reinforcement learning algorithm built into your brain. But in our Active-Self-centric intuitions, we’re instead drawn to incorrectly see the fundamental division as between things that the Active Self causes, versus things the Active Self does not cause.
(Neuroscientists obviously don’t use the term Active Self, but when they talk about “top-down versus bottom-up”, I think they’re usually equating “top-down” with “caused by the Active Self” and “bottom-up” with “not caused by the Active Self”.)
Here’s another example: The neuro-AI researcher Jeff Hawkins incorrectly conflates the Active-Self-related intuitive division, with the cortex-versus-brainstem neuroanatomical division. This error leads him to make flagrantly self-contradictory claims, along with the dangerously incorrect claim that the brain-like AIs he’s trying to develop will have nice prosocial motivations by default. For details see here.
3.5.3 “I seek goals” versus “my goals are the things that I find myself seeking”
As in §3.3.6 above, the “vitalistic force” intuition forbids the existence of any deterministic cause, seen or unseen, upstream of “wanting” behavior. (Probabilistic upstream causes, like “hunger makes me want food”, are OK. But the stronger such predictions get, the more they seem intuitively to undermine “free will”.)
This constraint on intuitive models leads to some systematic distortions, as shown in this diagram:
So within the Conventional Intuitive Self-Model,
“I seek things that I want” seems normal and correct,
“If I’m seeking something, then evidently that’s a thing I want” seems somewhere between “confused” and “a threat to my sense of agency”.
…But in terms of the real brain algorithm, I claim that these are more-or-less equivalent.
3.5.4 Why are ego-dystonic things “externalized”?
The main thing that the Active Self does is apply its vitalistic force towards accomplishing things it “wants”, via brainstorming / planning. So if there’s robust brainstorming / planning happening towards bungee jumping, then evidently (in our intuitive model) the Active Self “wants” to go bungee jumping. We call this an “internalized” desire. Conversely, if there’s robust brainstorming / planning happening towards not scratching an itch, but I scratch my itch anyway, then this is an “externalized” desire—the Active Self didn’t want the itch to get scratched, but the “urge” made it happen anyway.
We can apply this kind of thinking more generally. Compare the internalized “I become angry sometimes” with the externalized “I am beset by anger sometimes”. These are not synonymous: the latter, but not the former, has a connotation that there’s robust brainstorming / planning happening in my brain towards the goal of not being angry, possibly even while I’m angry. Admittedly, maybe I’m not spending much time doing such brainstorming / planning, or even any time, and maybe the brainstorming / planning isn’t effective. But still, the statement is still conveying something.
Combining this idea with §3.5.1, which says that robust brainstorming requires the corresponding self-reflective thoughts to have positive valence, and we wind up with the general picture that we tend to “internalize” things that reflect well upon ourselves (see §2.5.1), and “externalize” things that don’t.
Now, I used to think that the connection between ego-dystonic / ego-syntonic and externalized / internalized was the result of motivated reasoning: it’s nice to think of bad things as being “outside ourselves”. But now I think it’s directly about motivation, treated as probabilistic evidence within the Conventional Intuitive Self-Model—as opposed to being about motivated reasoning.
[Figure partly copied from my later post ] Assume that all my friends and idols celebrate studiousness and shun laziness. (a) When I entertain a self-reflective thought of myself as a studious guy, that thought seems good, because it calls forth an implication that my friends and idols might also see me that way. Conversely, when I entertain a thought of myself as a lazy guy, that thought seems bad. (b) Now, suppose that I introspect upon my own mind. I will notice that thoughts of myself-as-studious seem good, and myself-as-lazy seem bad. I will also notice corresponding systematic patterns in my thoughts, particularly that there is robust brainstorming () towards studiousness but not towards laziness. My claim is: this suite of observations is exactly what we interpret as an ego-syntonic desire to be studious. So I would say: “In my heart, I aspire to be a studious guy.” And this introspective report would be honest and unbiased—neither wishful thinking nor slanted reporting. Altogether, this explains why we often (not always, see ) see ourselves in ways that are socially desirable.[Figure partly copied from my later post ] Assume that all my friends and idols celebrate studiousness and shun laziness. (a) When I entertain a self-reflective thought of myself as a studious guy, that thought seems good, because it calls forth an implication that my friends and idols might also see me that way. Conversely, when I entertain a thought of myself as a lazy guy, that thought seems bad. (b) Now, suppose that I introspect upon my own mind. I will notice that thoughts of myself-as-studious seem good, and myself-as-lazy seem bad. I will also notice corresponding systematic patterns in my thoughts, particularly that there is robust brainstorming () towards studiousness but not towards laziness. My claim is: this suite of observations is exactly what we interpret as an ego-syntonic desire to be studious. So I would say: “In my heart, I aspire to be a studious guy.” And this introspective report would be honest and unbiased—neither wishful thinking nor slanted reporting. Altogether, this explains why we often (not always, see ) see ourselves in ways that are socially desirable.
I suppose that distinction doesn’t matter much—by and large, the “motivation-as-evidence” hypothesis and the “motivated reasoning” hypothesis both lead to the same downstream predictions. Well, maybe my “motivation-as-evidence” story is a better fit to the example I gave in §2.5.2 of the tired person saying “Screw being ‘my best self’, I’m tired, I’m going to sleep”. This action is internalized, not externalized, and yet it goes directly against how the person would like to be perceived by themselves and others.
PNSE/enlightenment-related; “the way I wish emotions work vs how they actually work”
6.4 PNSE breaks the association between “awareness” and other self-reflective concepts
6.4.1 Basic explanation
In the generative model space, there are associations between different concepts—when I think of one thing, it makes me think of another thing. Beliefs are part of that (e.g. if I believe that a squirrel is in the glove compartment, then thinking about opening the glove compartment leads to me thinking about finding the squirrel), but associations also include other things (e.g. thinking about a goal might make me think of a strategy that would accomplish that goal).
There are associations between self-reflective concepts, just like any other concepts, and it’s here that PNSE has an interesting effect:
Blue arrows are associative connections between different concepts. In PNSE, “awareness” winds up floating off on its own, with no particular associative connection to other self-reflective concepts.Blue arrows are associative connections between different concepts. In PNSE, “awareness” winds up floating off on its own, with no particular associative connection to other self-reflective concepts.
In the Conventional Intuitive Self-Model, the Active Self is evidently a bridge enabling associative connections between “awareness” and other self-reflective concepts. Why is it a bridge? Well on one side, the Active Self is connected to awareness—its actions are strongly impacted by the contents of awareness, and its attention-control actions manipulate awareness in turn. On the other side, the Active Self is conceptualized as having goals, controlling and owning the body, and so on. Thus the Active Self forms a bridge from awareness to the rest of the self-reflective world.
In PNSE, by contrast, the Active Self is gone, and the bridge is broken. “Awareness” no longer has any particular relation to those other self-reflective concepts. I think this comes across clearly when people talk about PNSE.
6.5 Why do pain, anxiety, etc., seem less aversive in PNSE than in the Conventional Intuitive Self-Model?
Equanimity is an aspect of PNSE that comes up frequently in the secular discourse. I’ll argue that it’s a consequence of the previous section—i.e., that it’s closely related to PNSE’s lack of association between “awareness” and bodily feelings.
6.5.1 PNSE makes S(anxious feeling) undermine, rather than reinforce and stabilize, the anxious feeling itself
Suppose I get an anxiety-provoking email—maybe my friend says that she has news about her health, and we need to talk. That triggers the brainstem reaction we call “anxiety”, involving negative valence, physiological arousal, and certain other reactions, along with corresponding interoceptive sensations and involuntary attention (see here) towards those sensations.
Green & red arrows indicate excitatory and inhibitory connections, respectively. Gray boxes indicate the cortex. (a–b) illustrate an everyday example of how anxiety reactions work: (a) If I have object-level reason to be anxious, then there’s a closed excitatory loop, which stabilizes the anxiety; (b) If that reason disappears, then there’s no closed excitatory loop, and the anxiety winds down. Then (c–d) extends that same idea to self-reflective concepts: (c) in the Conventional Intuitive Self-Model, the Active Self is part of a closed excitatory loop of self-reflective anxiety (“being anxious about being anxious”); (d) in PNSE, the Active Self is gone, and so is that loop.Green & red arrows indicate excitatory and inhibitory connections, respectively. Gray boxes indicate the cortex. (a–b) illustrate an everyday example of how anxiety reactions work: (a) If I have object-level reason to be anxious, then there’s a closed excitatory loop, which stabilizes the anxiety; (b) If that reason disappears, then there’s no closed excitatory loop, and the anxiety winds down. Then (c–d) extends that same idea to self-reflective concepts: (c) in the Conventional Intuitive Self-Model, the Active Self is part of a closed excitatory loop of self-reflective anxiety (“being anxious about being anxious”); (d) in PNSE, the Active Self is gone, and so is that loop.
Panels (a)–(b) in this diagram give an everyday example of what happens next. The brainstem anxiety reaction passes into the cortex in the form of interoceptive sensory inputs, which stay strongly active via involuntary attention. Then the subsequent thoughts would involve concepts associated with the anxious feeling (e.g. its upstream causes), which in turn would activate other associated concepts, etc., via the normal logic of the generative model space. It’s basically an unpleasant form of brainstorming (see here).
In (a), there’s a closed excitatory loop: the interoceptive sensory inputs associated with anxiety make me think of the possibility that my friend is seriously ill, which in turn strongly implies that more feelings of anxiety are imminent. That feeds back to the brainstem—the cortex is “concurring” with the brainstem that the situation warrants anxiety, so to speak.[8] In other words, the cortex brainstorming has turned up a plausible story “explaining” the anxiety.
However, in (b), suppose I just learned that my friend is perfectly fine after all. Now there isn’t a closed excitatory loop. On the contrary, the anxiety-related interoceptive sensory inputs make me think of my friend’s good health, which in turn provide evidence against the possibility that I will feel more anxious feelings in the immediate future. The brainstem gets that signal and gradually winds down its anxiety reaction.
Everything so far has been object-level. Now let’s get into the more confusing self-reflective stuff!
Panel (c) shows a closed excitatory loop that can happen in the Conventional Intuitive Self-Model. The object-level interoceptive feeling of anxiety brings to mind the self-reflective S(feeling of anxiety) (§2.2.3). This self-reflective thought is conceptualized as being associated with the Active Self, which in turn is closely associated with the body and its feelings. So there’s a closed excitatory loop, just as there is in (a), and this loop reinforces and stabilizes the anxiety reaction. This loop is basically “feeling anxious about feeling anxious”—kinda stewing in feelings of anxiety.
Panel (d) shows what happens when we switch to PNSE. The first step is the same: the object-level interoceptive feeling brings to mind the self-reflective S(feeling of anxiety) thought—i.e., the idea that the feeling of anxiety is currently in conscious awareness. However, in PNSE, per §6.4 above, the “awareness” concept itself has no particular association with the body and its interoceptive sensations, so there’s no closed loop—no “feeling anxious about feeling anxious”—and the anxiety starts to wind down (unless the brainstorming can find a different closed loop like (a)).
… And conversely, I bet you can think of examples from your life of people ignoring potential problems thanks to a deficiency of involuntary attention. At an individual level, if someone has a potential looming health problem, but it’s not currently causing them any pain or any anxiety, then they may well not try to mitigate it. (Even if they “rationally” agree that mitigating it would be importantly beneficial! They might just never get around to it.) At a somewhat larger scale, it seems plausible that Sam Bankman-Fried’s personality profile included clinically low anxiety; he and his many victims obviously would have been better off if he had had some anxiety-driven involuntary attention towards negative possibilities like “what if I get caught breaking the law?” or “what if I’m mistaken about the FTX balance sheet?”. At an even larger scale, if policymakers and voters generally felt more anxiety-driven involuntary attention towards the possibility of future pandemics, then perhaps they wouldn’t be doing so very very little to prevent them, as compared to the scope and probability of the problem.
Sources: ,Sources: ,
Thus, for example, Cognitive-Behavioral Therapy guru David Burns prompts his clinically anxious patients to think hard about exactly how much anxiety they want to have, and then to aim for that amount, which is often more than zero. (More details here.)
“Counsel” vs “manipulation” as an emotive conjugation
2.3 Another dimension: “counsel” vs “manipulation” as an emotive conjugation
There’s another dimension to how we intuitively think about these concepts: the dimension of positive or negative vibes. For example, if some kind of interaction seems good,[3] then we’re more likely to call it “providing counsel”, and if it seems bad, then we’re more likely to call it “an attempt to manipulate me”. The vibe is important in itself, over and above any particular aspect of the interaction.
I don’t think this dimension is separate from the “free will” discussion above, but rather complementary and compatible, because in general, if I have a motivation I’m happy about, I’ll tend to conceptualize it as an ego-syntonic component of my free will, while if I have a motivation I’m unhappy about, I’ll tend to conceptualize it as an ego-dystonic urge undermining my free will. See ISM §3.5.4 for details.
As most readers have presumably heard by now, Paul Erdös’s Unit Distance Problem from 1946—one of the central open problems from the field of discrete geometry—has been solved by GPT5.5Pro. … The entire process seems have been one-shot: my former student Lijie Chen simply gave GPT the problem, then GPT thought for a while and output a several-page argument that, on analysis by human experts, turned out to be correct. … I heard the news maybe an hour after it broke, when some UT grad students came to my office to tell me. For what it’s worth: these students were morose, musing about how everything might soon be over for young scientists and mathematicians like themselves. I don’t know whether they’re right, but I feel like I should tell the truth about what their reaction was.
Then, a few days later, a team at DeepMind, including my UT Austin colleague Swarat Chaudhuri, announced that they were able to use a system called AlphaProof Nexus to settle nine more (!) Erdös problems, many of them in additive combinatorics, along with miscellaneous other open math problems. Notably, in this case the AI also fully formalized its proofs in Lean.
And then, just today, Jelani Nelson alerted me to a new CS theory paper, which solves a longstanding open problem about electrical flows on graphs using a proof from GPT5.5Pro.
Scott Aaronson is moved by a story written by GPT-5.5 Pro, a first for AI-written fiction for him:
As it happens, just last week I read my very first AI-written story that affected me as a story, to the extent that I wanted to read it more than once. This happened when I gave GPT5.5Pro the following simple prompt:
Write me a story about the most ancient Israelites that’s riveting like the stories of the Bible but that’s also consistent with all of the archeological evidence.
You can read the result here. One of my Facebook friends called it “disturbingly good,” and I share that assessment. Of course, I’m well aware that GPT could easily generate a thousand stories like this one—sampled from the same probability distribution—and then I could even do statistics on which tropes were the most common. This makes it feel silly to overindex on the first story that happened to be output, and yet somehow I did.
The People Whose Seed Was Hidden (3,100 words, 6 mins of thinking)
I’ll ground the tale in early Iron Age highland Canaan, where “Israel” first appears in outside records, while avoiding later anachronisms like monarchy or mass conquest.
Stopped thinking
I’ve sharpened the evidence frame: Merneptah’s stele, mostly unwalled 12th–11th century BCE villages, Canaanite continuity, Egyptian pressure, and miracles treated as belief or omen.
The People Whose Seed Was Hidden
Central hill country of Canaan, late thirteenth century BCE
Before there was a king in the hills, and before a house of cedar stood for any god, there was a village of stone and dust on a ridge above the road.
Its walls were not walls of a city, but the backs of houses joined one to another. Its towers were not towers, but threshing floors on high ground where a watchman could see the smoke of Shechem, the glimmer of Dothan, and the road where bronze helmets flashed in the sun. Its wealth was not silver, nor lapis, nor carved ivory from the coast, but barley in deep jars, sheep in narrow pens, and daughters who knew how to count seed by the sound it made in the hand.
They called the place Beit-Or, House of Light, because at dawn the first blade of the sun struck the ridge before it touched the valley.
And among them lived Tirzah daughter of Abdi, who was keeper of the seed jars.
She was not eldest in her father’s house, nor strongest, nor one whose name the old men spoke at the gate. Yet she knew every pithos buried to the shoulder in the storeroom. She knew which jars held barley from the first terrace, which held emmer from the slope below the terebinth, which held lentils for hunger, and which held the seed that no mouth might eat unless the people wished to devour their own children’s spring.
Her father would say, “A sword kills once. Hunger kills until memory is dead.”
And Tirzah remembered.
In those days the lowland cities were breaking like clay cups dropped on stone. Kings who had once bowed to Egypt sent no gifts. Governors shut their gates and named themselves mighty. Men from the sea had come to the coast with strange pots and strange songs. The old roads were unsafe. Caravans grew thin. Bronze was dear. Tin was a rumor.
Yet in the hills the people endured.
They had come there by households, not as an army. Some had once kept flocks beyond the Jordan. Some had fled the taxes of the valleys. Some had married into clans already on the ridge. Some told stories of fathers who wandered with the god El beneath open sky. Some spoke of a god from the south whose name was too sharp to say carelessly. None agreed on all things, but all knew this: in the hills, a man with a plow and a woman with seed might live without a city lord taking the fat of every harvest.
They kept sheep and goats, for goats could find a meal where a cow would die. They kept a few cattle for pulling the plow through stony soil. They kept no swine. The valley people laughed at this and called the hill folk poor, for pigs grew fat near refuse and water. But pigs did not climb well, and pigs could not be driven far in drought, and pigs belonged to places where men had leisure to feed animals that did not give wool.
So the hill people ate bread, curds, olives when they had them, figs when the trees were kind, and meat when a goat broke its leg or a feast required blood.
Above Beit-Or stood a circle of stones. In its midst was one standing stone, old before the village was born. There the elders poured oil, and there women brought first dough, and there shepherds lifted hands before taking flocks to the ravines. Some said the stone was for El, father of gods. Some said it marked the presence of the One who rode the storm. Some said names were snares, and a man should offer and be silent.
Tirzah did not know which was true. She knew only that when the wind crossed the ridge at evening, the stone seemed to listen.
In the fifth year after the great drought, a runner came from the west.
He was a boy with blood dried black along his jaw and dust caked on his tongue. He stumbled into the threshing floor and fell among the chaff. Tirzah gave him water from a goatskin, but he coughed half of it onto the ground.
“Pharaoh,” he said.
At that word the elders came.
“Which Pharaoh?” asked Oren, white-bearded and bent, who had seen Egyptian soldiers in his youth.
“The old one,” said the boy. “Not the great Ramesses. His son. Merneptah.”
The name passed through the people like a hot wind.
The boy swallowed. “He has come into Canaan. Ashkelon is struck. Gezer is struck. Yanoam is struck. His captains say the land has rebelled and must be cut down like thorns.”
Abdi said, “We are not Ashkelon. We are not Gezer. We have no king.”
The boy looked at him with hollow eyes. “To Egypt, a throat without a collar is rebellion.”
That night the elders sat by the standing stone. The stars were hard and numberless. Below them, the valleys were dark, but not peaceful; every man imagined he could see fires.
“We should send grain,” said Shalem son of Huri. “Let Pharaoh’s men eat and pass by.”
“We have little,” said Abdi.
“They will take less if we offer first.”
“They will take what they see,” said Tirzah.
The men turned, surprised that she had spoken. She stood behind her father, carrying a lamp. The flame trembled, but her voice did not.
Shalem frowned. “This is council.”
“And the jars are mine,” she said. “You ask what can be given. I know what can be spared.”
Oren looked at her, and because age had taught him that wisdom often entered through doors pride would keep shut, he said, “Speak, daughter.”
Tirzah held up three fingers. “There is grain for eating. There is grain for trade. There is seed for sowing. If soldiers take the first, we hunger. If they take the second, we are poor. If they take the third, we are dead though we still breathe.”
Shalem said, “If soldiers come and find hidden grain, they will burn us.”
“If they come and find all our grain, they will take it,” Tirzah said. “And then the sun will burn us, and our children will burn inside their bellies.”
A murmur rose.
Oren tapped his staff against the stone. “What do you counsel?”
Tirzah looked toward the houses, toward the dark shapes of roofs and pens and sleeping kin. “Let us make Pharaoh victorious.”
No one spoke.
She continued. “Let him find jars. Let him break them. Let him see grain spilled on the ground. Let his scribe write that our seed is gone. But before he comes, we take the true seed and hide it where a man seeking plunder will not look.”
“Where?” asked Abdi.
“In the dead cistern below the goat pen.”
Shalem spat. “A cursed place. It holds no water.”
“It will hold life.”
So they worked before dawn.
Women carried baskets beneath their cloaks. Children swept floors smooth to hide drag marks. Men rolled stones aside and lowered jars into the dry cistern, one by one, with ropes of plaited hair and goat hide. Tirzah sealed each jar with clay, pressing her thumb into the wet rim. Her thumbprint went down into darkness with the barley.
When the jars were hidden, they dragged thornbrush over the cistern mouth and drove goats above it. By sunrise the place stank of dung and sounded of bleating.
In the storehouse they left lesser grain, enough to be seen, not enough to save.
Then they waited.
On the third day, dust rose from the western road.
The chariots did not climb to Beit-Or. Their wheels stayed below where the road was wide enough for pride. But infantry came up: Egyptians with linen corselets and bows, Canaanite auxiliaries with spears, and a scribe beneath a striped cloth held by a servant. The scribe’s palette hung from his belt. He was young, smooth-faced, and annoyed by the hill path.
Their captain wore a blue-edged kilt and carried a bronze sword shaped like a sickle.
At the threshing floor he cried, “People of the ridge! Servants of Pharaoh, beloved of Ptah, have come. Bring grain, oil, wool, and young men for labor. Resist, and your name will be poured out like water.”
Oren bowed with careful slowness. “My lord, we are poor shepherds.”
“All rebels are poor when soldiers arrive.”
“We have no city.”
The captain smiled. “No. You are not a city.”
The scribe looked up sharply at that. He dipped his reed and wrote.
“What are you called?” asked the captain.
The elders hesitated. Names had power. Names could be taxed. Names could be cursed.
Then Shalem, eager to show that he feared nothing, said, “We are of Israel.”
Tirzah felt the air tighten.
The scribe’s reed moved again.
“Israel,” said the captain, tasting the word as though it were sour wine. “Then Israel will bring seed.”
He ordered the houses searched.
Soldiers entered Abdi’s house first. They found the pithoi in the storeroom, tall jars with thick collared rims. They laughed at their roughness. One soldier struck a jar with the butt of his spear, and barley rushed out like water. Another overturned a jar of lentils. Another took Tirzah’s little brother Nethan by the arm when the boy cursed him.
Tirzah moved, but Abdi gripped her wrist.
“Not yet,” he whispered.
A soldier found the family loom and cut the warp for sport. Another took a bronze knife. Another seized a goat and slit its throat in the courtyard, letting blood run where children played.
At the standing stone, the captain ordered the offerings kicked aside.
“This god of hills has no house,” he said. “How shall he protect a people who have no walls?”
Oren answered, “Perhaps a god without a house is harder to burn.”
The captain struck him across the face.
The old man fell.
The village groaned as one body, but the archers lifted their bows, and grief held its breath.
Then fire was put to the threshing floor.
Smoke rose over Beit-Or, and the Egyptian scribe wrote beneath it.
Tirzah watched his hand. She could not read, but she understood the power of signs. A mark could travel where a voice could not. A mark could stand before Pharaoh and say, This is what happened, though it had not happened. A mark could lie longer than a man could live.
The captain gathered three young men and two boys for labor. Nethan was among them.
“He is a child,” Tirzah said.
The captain looked at her. “Then he will grow in Pharaoh’s service.”
Abdi stepped forward. Two soldiers held him back.
Nethan did not cry. He looked at Tirzah as though she were the only road left in the world.
That evening the Egyptians camped below the ridge by the dry wadi, where their chariots waited. They had broken the visible jars and burned the threshing floor and taken wool, goats, and captives. Above, Beit-Or smoldered like a coal buried in ash.
Shalem said, “We should be grateful. Most live.”
Abdi turned on him. “My son is gone.”
“My son too,” said another man.
Oren, with blood dried in his beard, said, “Pharaoh has written victory. Men drunk on victory sleep deeply.”
Tirzah looked down toward the campfires.
“No,” said Abdi.
She had not spoken.
“No,” he said again, because fathers can hear the footsteps of a daughter’s thought.
“They will take the boys by the valley road at dawn,” Tirzah said.
“They have archers.”
“We have stones.”
“They have bronze.”
“We have night.”
Shalem said, “And if we fail, they come back and finish what they began.”
Tirzah looked at the blackened threshing floor. “They have already finished us in their writing.”
No one answered.
At moonrise, seven went down: Tirzah, Abdi, two fathers of captured boys, Oren’s grandson, a shepherd named Malchi who knew every goat path in the wadi, and a widow called Naamah who could sling a stone through a fig at thirty paces.
They carried no swords. Only knives, slings, and the silence of people who had buried too many hopes to fear darkness.
The Egyptian camp lay careless beneath the ridge. The chariots stood like strange animals, poles lifted. Horses stamped and blew. Soldiers slept near their shields. The captives were tied by the wrists beside a supply cart.
Malchi touched Tirzah’s shoulder and pointed. A sentry stood near the boys, but he watched the road, not the rocks.
Naamah set a stone in her sling.
The first stone struck a horse.
It screamed and reared, snapping its tether. Another horse panicked. Men shouted. A chariot pole cracked. In the confusion, Abdi and Tirzah ran low through the scrub.
The sentry turned. Tirzah threw herself at his knees. He fell hard, and Abdi struck him behind the ear with a stone. The man groaned but did not rise.
“Nethan,” Tirzah whispered.
Her brother’s eyes shone white.
She cut the cords. The first boy crawled free, then the second. But the last knot had been pulled tight, and her knife slipped.
A shout came from the camp. An Egyptian archer saw them and drew.
Abdi stepped between.
The arrow entered below his ribs.
He made a sound like surprise.
Tirzah caught him, but he pushed her hand away. “Cut.”
She cut.
The last boy came free.
Abdi sank to the ground.
“Father,” Tirzah said.
“Seed,” he whispered.
“I know.”
“No,” he said, gripping her wrist with the last strength in him. “Not grain only.”
Then he looked at Nethan.
Tirzah understood.
Naamah’s sling cracked again. The archer fell backward into the fire. Sparks leapt up like stars returning to heaven.
The hill people fled into the wadi, dragging the boys, carrying Abdi until his blood made the stones slick. Before they reached the goat path, he died.
They buried him before dawn beneath uncut stones, without song, because the soldiers were still shouting in the valley.
At sunrise the Egyptians did not climb again.
They had broken jars. They had burned grain. They had lost face in the night, and so they dressed shame in triumph. The captain ordered the scribe to finish the record. The chariots rolled westward in dust. The infantry followed. The scribe did not look back at the ridge.
In time, his words would be carved far away in black stone among victories greater than this one. The words would say that Canaan was plundered. Ashkelon carried off. Gezer seized. Yanoam made as nothing.
And Israel laid waste.
Its seed no more.
But on the ridge, when the soldiers were gone, Tirzah went to the dead cistern.
The village gathered behind her: old Oren leaning on his staff, Shalem silent now, mothers with smoke in their hair, children with hollow eyes, boys whose wrists were bruised from Egyptian rope. Nethan stood beside her and held the lamp.
They pulled away thornbrush. The goat stink rose strong enough to make a soldier curse, had any soldier been there. They lifted the flat stones. Tirzah descended by footholds cut before her grandmother’s birth.
One by one, they raised the jars.
Clay seals unbroken.
Grain within.
The people did not cheer. Some fell to their knees. Some wept without sound. Some touched the jars as if touching the faces of the dead.
Tirzah broke the seal of the first pithos. She plunged both hands into barley and lifted it high. Kernels ran between her fingers, catching the morning light.
Oren spoke then, and his voice, though old, carried over the burned threshing floor.
“Pharaoh says, ‘Israel’s seed is no more.’ Let the stone of Pharaoh speak in Egypt. We will answer in the earth.”
So they went to the terraces.
They sowed though the threshing floor was ash. They sowed though Abdi lay under stones. They sowed though the lowlands groaned and cities devoured themselves. They sowed with bruised hands and empty bellies, casting grain into furrows scratched from rock.
Tirzah sowed last.
At the edge of the field she found a single shard from a broken storage jar. Its rim was thick, its clay rough, its curve familiar. She pressed it into the boundary wall, not as an idol and not as a boast, but as a witness.
Nethan came beside her.
“Will they come again?” he asked.
“Yes,” she said.
“What will we do?”
She looked over the hills. There was no palace. No army. No cedar temple. No scribe to make their grief magnificent. Only terraces, goats, stones, seed, and a name enemies had learned but did not understand.
“We will live,” she said.
The rains came late that year.
For many weeks the sky was bronze. The people rationed bread. Children asked when the fields would wake. Tirzah went often to the standing stone and said nothing, because her anger was too large for prayer.
Then, in the month when despair begins to look like wisdom, clouds came from the sea.
Rain struck the ridge.
It ran down burned stones, over broken jars, through ash, into furrows. The terraces drank. The cisterns rang. The goats cried out and shook their heads. Children opened their mouths to the sky.
And from the earth, green rose.
Not all at once. Never all at once.
First one blade, thin as a whisper.
Then another.
Then a hundred.
Then the hillside wore life like a garment.
Oren died before harvest, old and satisfied. They buried him near Abdi. Shalem gave his best goat for the mourning meal and spoke no more in council for a year. Nethan grew tall and kept the scar of rope on his wrist until manhood. Naamah became a name children shouted when they slung stones at birds.
As for Tirzah, she remained keeper of the jars.
In later days, other villages rose on other ridges. Some were abandoned. Some grew. Some joined themselves by marriage, oath, need, and memory. Their children told the story of the night when Pharaoh killed Israel and Israel rose green from the ground.
They did not tell it the same way twice. No true story survives by staying still.
Some said a god hid the seed. Some said a woman did. Some said there was no difference.
And far away in Egypt, the black stone stood.
Kings passed. Priests passed. The hand that carved the words became dust. The captain’s name vanished. The scribe’s bones were gathered to his fathers. Yet the boast remained:
Israel is laid waste; his seed is no more.
But in the hills, year after year, seed went into the earth.
And year after year, it returned.
Archaeological frame: I set this in the late thirteenth/early twelfth century BCE because the Merneptah, or “Israel,” Stela is usually treated as the earliest-known extra-biblical reference to Israel; Britannica notes that Merneptah counted Israel among defeated peoples, while the Biblical Archaeology Society dates the inscription to about 1205 BCE. The story avoids a sweeping conquest narrative and instead uses the early Iron Age highland setting: small agrarian/herding communities, grain storage, sheep/goats/cattle, limited luxury imports, and local pottery fit the archaeological picture of Iron I highland settlers. The details about collared-rim storage jars, little or no pig evidence in highland contexts, and the uncertainty around what Merneptah’s “Israel” politically or religiously was are drawn from scholarship on early highland settlements. The religious ambiguity is intentional: early highland cult sites show continuity with older Canaanite practices rather than a clean break into later biblical religion.
The winners of the Un-Slop Fiction Prize (”… we strongly recommend you use at least $100 worth of tokens. It’s up to you how you do so; hundreds of generations, elaborate multi-pass pipelines, whatever; quality over quantity, craft over slop”) haven’t yet been announced to my knowledge. I’m wondering how this one-shot simple prompt-based story would fare against them.
Creativity and generosity are aspects of enjoyable usefulness
Material satisfaction and accomplishment are aspects of enjoyable usefulness
The above is just one of his many schematic overviews; for my own convenience here are the rest. Each is a dimension of meaningness, describing two common stances Chapman thinks are confused/mistaken, and the stance he argues is correct:
Meaning and meaninglessness: eternalism vs nihilism vs meaningness
Paranoia about contamination; resources and opportunities wasted; tribalist vilification
Flatness of existence in the absence of the sacred
Obstacles to maintaining the stance
Obvious mundanity of religious forms
Spontaneous religious feelings
Innate reactions of disgust
Antidotes; counter-thoughts
Purity is a matter of perception, not truth
I do sometimes experience awe
Intelligent aspect
Recognition of sacredness
Recognition that nothing is inherently sacred
Positive appropriation after resolution
Sacredness matters
Narrow religion is harmful; something better is available
Chapman’s argument is that both mission/eternalism and materialism/nihilism are confused stances toward purpose, to be replaced by enjoyable usefulness/meaningness, a word he invented because the subject of his book, which he wrote because he saw many friends struggle with questions like “what is my true purpose in life?” and thought this self-induced suffering was unnecessary and predicated on a misunderstanding, somehow had no name. By “meaningness” he refers to
a particular stance toward the quality: the one that acknowledges both meaninglessness and meaningfulness, avoiding both fixation and denial.
How does “meaningness” deconfuse and help his friends? Chapman:
Mission and materialism are not the only possibilities. You can, instead, do things that you enjoy and that are useful to others.
“But how do I know what to dedicate my life to?” Wrong question… a good question to ask instead is “What is something I can do now that will be both enjoyable and useful?” That’s a practical problem. You can find answers without using religious or therapeutic voodoo.
It’s an unattractive question, however. “What is my true mission in life?” promises that if only you can find the answer, and you throw your whole self into your mission, you will be a very special person. Along the way, you will have certainty, and when you die, you will die justified.
“What’s something useful and enjoyable I can do now?” prompts the answer “Who cares—so what?” Mere usefulness and enjoyability doesn’t sound good enough. This “complete stance”—of enjoyable usefulness—is emotionally unattractive at first. Once accepted, though, it does eliminate the anguish of an existential dilemma. If you can let go of the grandiosity that leads you to imagine that some special task awaits you, and the false hope that getting enough of what you want would make life satisfactory, you can be useful and enjoy yourself. That letting-go takes some doing; I will suggest ways to go about it.
This book addresses a series of dilemmas of this sort. I call them “dimensions of meaningness.” Each dimension has a limited number of possible approaches, or “stances.”
The commonly available confused stances are each unworkable, because they are based on misunderstandings of how meaning works. For example, it is easy to waste a huge amount of emotional energy trying to be special or ordinary; to while your life away in mindless conformity or unrealistic rebellion; to play the victim or fail when you attempt to take total responsibility for your world. Adopting those stances makes you miserable.
For each dimension, I suggest an uncommon, alternative stance that resolves the misunderstanding, and turns a spiritual problem into a practical one. [Mo: the tables above summarise these]
I was born and raised in a deeply mission-oriented worldview; after all these years I still find many parts of it deeply resonant. Reading Tanner Greer’s Questing for Transcendence for instance “rang my soul like a gong”. Meaningness on the other hand I keep bouncing off: (most of) the trees I get, but the forest I keep missing. But occasionally I’ll meet someone who makes me go “they have the complete stance” (rightly or wrongly), and that motivates me to revisit Chapman’s schematics. I suppose I’m hoping that John von Neumann anecdote will happen with repeated exposure over time: “Young man, in mathematics questions of meaning and purpose you don’t understand things. You just get used to them”
Investing in Tesla and SpaceX is basically betting on Elon, which is self-recommending for some and a joyride for others (thinking of some friends), cf. SpaceX’s Anthropic deal for the latter.
If there’s a coherent way to describe the entire SpaceX company, highlighting genuine synergies between the business of hurling heavy objects into space and the business of maximizing the ad revenue from people asking Grok if the moon landing really happened, it’s this: SpaceX is uniquely good at knocking down (or obviating) barriers to big infrastructure projects.[4]They’ve reduced the cost of putting a kilogram into orbit by 92%, and with Starship they’re aiming for 99%+.[5] And, terrestrially:
We brought the first cluster of COLOSSUS online in 122 days, repurposing the shell of an existing factory, and the first cluster of COLOSSUS II online even faster in 91 days. As an illustrative comparison, an industry benchmark to bring online a 100 megawatt greenfield data center is approximately two years.
This is a big advantage! Anyone who has ever raised capital and reported an IRR knows that there’s an agonizing gap between when you get money and when it turns into something that produces returns; for institutional investors, this problem has led to a whole ecosystem of financial products basically designed to distort the numbers.[6] In SpaceX’s case, there’s another interesting possibility: if they really do go public with a trillion dollar-plus valuation, and they can support it for a while, they’ll have the lowest cost of capital for anyone investing in frontier technology at that scale aside from certain national governments. At which point the relevant question is over who has the biggest comparative disadvantage in capital allocation: the US government, with all of its weird processes and sensitivity to the needs of various interest groups, or Elon Musk, with his weird processes and tendency to get distracted and found new companies.
It’s just unavoidable that this company is a bet on Elon Musk’s capital allocation skills. In fact, if they raise $75bn, then if you count up the equity and debt raised by SpaceX and xAI, it’s still true that about 60% of the money Musk will be able to deploy through SpaceX will come from this IPO, and the rest from two decades of capital raises by other means. There’s a conservation of Elon Musk weirdness at work: my view of Elon Musk is that he’s pretty technical, great at fundraising, incredible at recruiting talent, even better at motivating that talent to live on remote south pacific atolls or in literal ghost towns, and has a high risk tolerance. He’s also vain, wildly underestimates the difficulty of some problems (running a social network, reducing Federal spending, hitting deadlines), but has sorted himself into domains that maximize the payoff from these traits. Other people disagree, but given that Musk has founded more than one of the most valuable companies in the US, any time you downgrade one skill, you have to mentally upgrade another one. So, if he’s actually nontechnical, he must be that much better at recruiting technical talent; if he doesn’t have an eye for talent, he must be really good at convincing investors to let him take another swing, etc. Eventually you can reach the point where the most coherent Theory of Elon is that the universe is a simulation, he’s the player character, and that he keeps reloading old save files when the random number generator gives him a bad result. But, probably, he’s a very smart guy who thinks he’s somewhat smarter and bets accordingly, and that he hasn’t gone through the last of his massive wealth drawdowns just yet.
Should you trust him to go after a total addressable market of $28.5tr? Should you bet that there really will be synergies between xAI and SpaceX, by way of orbiting datacenters? The good news is that you don’t have to: part of the Elon Algorithm is setting insanely ambitious goals and pushing people to their limits to achieve them, even if Musk periodically takes a break for a ketamine vision quest or to pick a president.[7] Musk is good at setting these plans, but he’s also really good at getting out of the way when people execute the first few steps for these grand ambitions. It’s very hard for SpaceX to support its proposed $1.25tr valuation; even taking some aggressive growth assumptions, that’s probably about 25x run-rate revenue for a company with some serious fixed costs and unavoidable margin expenses. (Even if there’s no marginal cost for a Starlink satellite, the launch costs get capitalized; you’ll be depreciating the satellites, the expenses, even the fuel used, over many years.) What you’re actually underwriting is something very meta: will investors keep being happy to back Musk’s various ventures? Will they treat SpaceX as his main focus, as he tends to—SpaceX seems to be the senior claimant on Musk’s time and money, both of which are pretty valuable. That’s really all you’re betting on: there will probably be some point in the future where the best guess about the return on some space-related project is, say, 7%, and if SpaceX’s cost of capital is 6% it’ll happen, while at 8% it won’t. Somehow, achieving outlier success in reducing the cost of space travel and reducing the timeline for big construction projects creates an investment opportunity that’s mostly a bet on future investor relations.
I remember, when you outlined your threat model, asking what you thought of MAI’s work. Your response made me feel depressed, and might have been the main reason I stopped being interested in their work.