Some infographics by Steven Byrnes I’ve wanted to point people to, but take ~forever to find because he’s so prolific, collected here for my own convenience.
(I forgot to include the links to each, FML. I also mixed related-ish infographics from different sources in the same section a lot.)
What AGI is and isn’t, and why LLMs aren’t it
A frequent point of confusion is the word “General” in “Artificial General Intelligence”:
The word “General” DOES mean “not specific”, as in “In general, Boston is a nice place to live.”
The word “General” DOES NOT mean “universal”, as in “I have a general proof of the math theorem.”
An AGI is not “general” in the latter sense. It is not a thing that can instantly find every pattern and solve every problem. Humans can’t do that either! In fact, no algorithm can, because that’s fundamentally impossible. Instead, an AGI is a thing that, when faced with a difficult problem, might be able to solve the problem easily, but if not, maybe it can build a tool to solve the problem, or it can find a clever way to avoid the problem altogether, etc.
Consider: Humans wanted to go to the moon, and then they figured out how to do so, by inventing extraordinarily complicated science and engineering and infrastructure and machines. Humans don’t have a specific evolved capacity to go to the moon, akin to birds’ specific evolved capacity to build nests. But they got it done anyway, using their “general” ability to figure things out and get things done.
So for our purposes here, think of AGI as an algorithm which can “figure things out” and “understand what’s going on” and “get things done”, including using language and science and technology, in a way that’s reminiscent of how most adult humans (and groups and societies of humans) can do those things, but toddlers and chimpanzees and today’s large language models (LLMs) can’t.
This image is poking fun at that “there is no such thing as Artificial General Intelligence”. (Image sources: ,)This image is poking fun at that “there is no such thing as Artificial General Intelligence”. (Image sources: ,)
I should elaborate on that last part. I think that some LLM enthusiasts have a massive blind spot, where they are so impressed by all the things that today’s LLMs can do, that they forget about all the things that today’s LLMs can’t do. These people read the questions on Humanity’s Last Exam (HLE), and scratch their heads, and say “C’mon, when LLMs ace the HLE benchmark, then what else is there? Look at how hard those questions are! It would need to be way beyond PhD level in everything! If that’s not superintelligence, what is?”
Well, no, that’s not superintelligence, and here’s an example of why not. Consider the task of writing a business plan and then founding a company and growing it, over the course of years, to $1B/year revenue, all with zero human intervention. Today’s LLMs fall wildly, comically short of being able to complete that task. By analogy, if humans were like today’s AIs, then humans would be able to do some narrow bits of founding and running companies by ourselves, but we would need some intelligent non-human entity (angels?) to repeatedly intervene, assign tasks to us humans, and keep the larger project on track. Of course, humans (and groups of humans) don’t need the help of angels to conceive and carry out ambitious projects, like building businesses or going to the moon. We can do it all by ourselves. So by the same token, future AGIs (and groups of AGIs) won’t need the help of humans.
Anyway, this series is about brain-like algorithms. These algorithms are by definition capable of doing absolutely every intelligent behavior that humans (and groups and societies of humans) can do, and potentially much more. So they can definitely reach AGI. Whereas today’s AI algorithms are not AGI. So somewhere in between here and there, there’s a fuzzy line that separates “AGI” from “not AGI”. Where exactly is that line? My answer: I don’t know, and I don’t care. Drawing that line has never come up for me as a useful thing to do. It won’t come up in this series either.
1.3 A far-more-powerful, yet-to-be-discovered, “simple(ish) core of intelligence
LLMs are very impressive, but they’re not AGI yet—not by my definition. For example, existing AIs are nowhere near capable of autonomously writing a business plan and then founding a company and growing it to $1B/year revenue, all with zero human intervention. By analogy, if humans were like current AIs, then humans would be able to do some narrow bits of founding and running companies by ourselves, but we would need some intelligent non-human entity (angels?) to repeatedly intervene, assign tasks to us humans, and keep the larger project on track.
Of course, humans (and groups of humans) don’t need the help of angels to conceive and carry out ambitious projects, like building businesses or going to the moon. We can do it all by ourselves. So by the same token, future AGIs (and groups of AGIs) won’t need the help of humans.
…So that’s my pitch that AGI doesn’t exist yet. And thus, the jury is still out on what AGI (and later, ASI) will look like, or how it will be made.
My expectation is that, for better or worse, LLMs will never be able to carry out those kinds of projects, even after future advances in scaffolding, post-training, and so on. If I’m right, that wouldn’t mean that those projects are beyond the reaches of AI—it’s clearly possible for some algorithm to do those things, because humans can! Rather it would mean that LLMs are the wrong algorithm class. Instead, I think sooner or later someone will figure out a different AI paradigm, and then we’ll get superintelligence with shockingly little compute, shockingly little effort, and in shockingly little time. (I’ll quantify that later.)
Basically, I think that there’s a “simple(ish) core of intelligence”, and that LLMs don’t have it. Instead, people are hacking together workarounds via prodigious quantities of (in Ajeya’s terminology) “scale” (a.k.a. compute, §1.5 below) and “schlep” (a.k.a. R&D, §1.7 below). And researchers are then extrapolating that process into the future, imagining that we’ll turn LLMs into ASI via even more scale and even more schlep, up to quantities of scale and schlep that strike me as ludicrously unnecessary and implausible.
… the continuous learning nature of the future paradigm (see §1 of “Sharp Left Turn” discourse: An opinionated review) would mean that “AI capabilities” are hard to pin down through capabilities elicitation—the AI might not understand something when you test it, but then later it could figure it out.
In continuous learning, the notion of pinning down the capabilities of an AI—e.g. its skill at cybersecurity—becomes more fraught, because it’s a moving target.In continuous learning, the notion of pinning down the capabilities of an AI—e.g. its skill at cybersecurity—becomes more fraught, because it’s a moving target.
(See also §2.6 of the next post on further challenges of weaker AIs supervising stronger AIs.)
(Crossposted from twitter for easier linking.) (Intended for a broad audience—experts already know all this.)
When I talk about future “Artificial General Intelligence” (AGI), what am I talking about? Here’s a handy diagram and FAQ:
“Are you saying that ChatGPT is a right-column thing?” No. Definitely not. I think the right-column thing does not currently exist. That’s why I said “future”! I am also not making any claims here about how soon it will happen, although see discussion in Section A here.
“Do you really expect researchers to try to build right-column AIs? Is there demand for it? Wouldn’t consumers / end-users strongly prefer to have left-column AIs?” For one thing, imagine an AI where you can give it seed capital and ask it to go found a new company, and it does so, just as skillfully as Earth’s most competent and experienced remote-only human CEO. And you can repeat this millions of times in parallel with millions of copies of this AI, and each copy costs $0.10/hour to run. You think nobody wants to have an AI that can do that? Really?? And also, just look around. Plenty of AI researchers and companies are trying to make this vision happen as we speak—and have been for decades. So maybe you-in-particular don’t want this vision to happen, but evidently many other people do, and they sure aren’t asking you for permission.
“If the right-column AIs don’t exist, why are we even talking about them? Won’t there be plenty of warning before they exist and are widespread and potentially powerful? Why can’t we deal with that situation when it actually arises?” First of all, exactly what will this alleged warning look like, and exactly how many years will we have following that warning, and how on earth are you so confident about any of this? Second of all … “we”? Who exactly is “we”, and what do you think “we” will do, and how do you know? By analogy, it’s very easy to say that “we” will simply stop emitting CO2 when climate change becomes a sufficiently obvious and immediate problem. And yet, here we are. Anyway, if you want the transition to a world of right-column AIs to go well (or to not happen in the first place), there’s already plenty of work that we can and should be doing right now, even before those AIs exist. Twiddling our thumbs and kicking the can down the road is crazy.
“The right column sounds like weird sci-fi stuff. Am I really supposed to take it seriously?” Yes it sounds like weird sci-fi stuff. And so did heavier-than-air flight in 1800. Sometimes things sound like sci-fi and happen anyway. In this case, the idea that future algorithms running on silicon chips will be able to do all the things that human brains can do—including inventing new science & tech from scratch, collaborating at civilization-scale, piloting teleoperated robots with great skill after very little practice, etc.—is not only a plausible idea but (I claim) almost certainly true. Human brains do not work by some magic forever beyond the reach of science.
“So what?” Well, I want everyone to be on the same page that this is a big friggin’ deal—an upcoming transition whose consequences for the world are much much bigger than the invention of the internet, or even the industrial revolution. A separate question is what (if anything) we ought to do with that information. Are there laws we should pass? Is there technical research we should do? I don’t think the answers are obvious, although I sure have plenty of opinions. That’s all outside the scope of this little post though.
1.3 Why I want to move the goalposts on “AGI”
Two different perspectives are:
AGI is about knowing how to do lots of things
AGI is about not knowing how to do something, and then being able to figure it out.
I’m strongly in the second camp. That’s why I’ve previously commented that the Metaculus criterion for so-called “Human/Machine Intelligence Parity” is no such thing. It’s based on grad-school-level technical exam questions, and exam questions are inherently heavily weighted towards already knowing things rather than towards not knowing something but then figuring it out. Or, rather, if you’re going to get an “A+” on an exam, there’s a spectrum of ways to do so, where one end of the spectrum has relatively little “already knowing” and a whole lot of “figuring things out”, and the opposite end of the spectrum has a whole lot of “already knowing” and relatively little “figuring things out”. I’m much more interested in the “figuring things out” part, so I’m not too interested in protocols where that part of the story is to some extent optional.
(Instead, I’ve more recently started talking about “AGI that can develop innovative science at a John von Neumann level”, and things like that. Seems harder to game by “brute-force massive amounts of preexisting knowledge (both object-level and procedural)”.)
(Some people will probably object here, on the theory that “figuring things out” is not fundamentally different from “already knowing”, but rather is a special case of “already knowing”, wherein the “knowledge” is related to meta-learning, plus better generalizations that stem from diverse real-world training data, etc. My response is: that’s a reasonable hypothesis to entertain, and it is undoubtedly true to some extent, but I still think it’s mostly wrong, and I stand by what I wrote. However, I’m not going to try to convince you of that, because my opinion is coming from “inside view” considerations that I don’t want to get into here.)
This OP is about “AGI”, as defined in my 3rd & 4th paragraph as follows:
By “AGI” I mean here “a bundle of chips, algorithms, electricity, and/or teleoperated robots that can autonomously do the kinds of stuff that ambitious human adults can do—founding and running new companies, R&D, learning new skills, using arbitrary teleoperated robots after very little practice, etc.”
Yes I know, this does not exist yet! (Despite hype to the contrary.) Try asking an LLM to autonomously write a business plan, found a company, then run and grow it for years as CEO. Lol! It will crash and burn! But that’s a limitation of today’s LLMs, not of “all AI forever”. AI that could nail that task, and much more beyond, is obviously possible—human brains and bodies and societies are not powered by some magical sorcery forever beyond the reach of science. I for one expect such AI in my lifetime, for better or worse. (Probably “worse”, see below.)
So…
“The kinds of stuff that ambitious human adults can do” includes handling what you call “friction”, so “AGI” as defined above would be able to do that too.
I am >99% confident that “AGI” as defined above is physically possible, and will be invented eventually.
I am like 90% confident that it will be invented in my lifetime.
This post is agnostic on the question of whether such AGI will or won’t have anything to do with “current LLM-based architectures”. I’m not sure why you brought that up. But since you asked, I think it won’t; I think it will be a different, yet-to-be-developed, AI paradigm.
… a great many trained economists—but not literally 100% of trained economists—have a bundle of intuitions for thinking about labor, and a different bundle of intuitions for thinking about capital, and these intuitions lead to them having incorrect and incoherent beliefs about AGI. This is something beyond formal economics models, it’s a set of mental models and snap reflexes developed over the course of them spending years in the field studying the current and historic economy. The snap reaction says: “That’s not what labor automation is supposed to look like, that can’t be right, there must be an error somewhere.” Indeed, AGI is not what labor automation looks like today, and it’s not how labor automation has ever looked, because AGI is not labor automation, it’s something entirely new.
What the AGI technical safety problem is, and brain-like vs prosaic vs plain AGI safety
The part I’ll be talking about in this series is the red box here:
Specifically, we zoom in on a single team of humans who are trying to create a single AGI, and we want it to be possible for them to do so without winding up with some catastrophe that nobody wanted, with an out-of-control AGI self-replicating around the internet or whatever (more on which in §1.6).
Blue boxes in this diagram are things that I won’t talk about in this series. It’s long enough already. But I very strongly endorse other people working on them, and think about them myself as well.
Back to the red box. This is a technical problem, calling for a technical solution. Nobody wants catastrophic accidents. And yet! Indeed, it’s entirely possible for people to write an algorithm that does something that nobody wanted it to do. It happens all the time! We might call it “a bug” when it’s a local problem in the code, and we might call it “a fundamentally flawed software design” when it’s a global problem. I’ll argue later that AGI code is unusually prone to catastrophic accidents, and that the stakes are very high (see §1.6 below, and Post #10).
Here’s an analogy. If you’re building a nuclear power plant, nobody wants an out-of-control chain reaction. The people at Chernobyl certainly didn’t! But it happened anyway! I take a few lessons from this analogy:
Enrico Fermi invented a technical solution for controlling nuclear chain reactions—control rods—before starting to build the first-ever nuclear chain reaction. Right on! That’s doing things in the right order! By the same token, I suggest that we should strive to have a technical solution to avoiding catastrophic AGI accidents ready to go before people start programming AGIs. In fact, I’ll argue below for something even stronger than that: knowing the solution (even vaguely) 10 years before AGI is even better; 20 years before AGI is better still; etc. This claim is not obvious, but I’ll get back to it (§1.7).
Technical solutions aren’t all-or-nothing. Some reduce the chance of accidents without eliminating them. Some are complicated and expensive and error-prone to implement. In the nuclear case, control rods reduce accident risk a lot, but passively-safe reactors reduce it even further. Alas, as I’ll discuss later in the series, I claim that we currently have no plan at all for brain-like-AGI technical safety—not even vaguely. Forget about the passively-safe reactors and multiple layers of protection, we’re not even at the “control rods” stage. Heck, many prominent AI thought-leaders are not even at the “meltdowns would be bad” stage! (See §3 of my 2025 post: “The Era of Experience” has an unsolved technical alignment problem.) We have our work cut out!
The blue boxes (see diagram above) also exist, and are absolutely essential, even if they’re out-of-scope for this particular series. The cause of the Chernobyl accident was not that nobody knew how to keep a nuclear chain reaction under control, but rather that best practices were not followed. In that case, all bets are off! Still, although we on the technical side can’t solve this noncompliance problem by ourselves, we can help on the margin, by developing best practices that are maximally idiot-proof, and minimally expensive.
This series will focus on a particular scenario for what AGI algorithms will look like:
The red box is what I’ll talk about here. The blue boxes are things that are out-of-scope for this series.
You may have opinions about which of these categories is more or less likely, or impossible, or whether this breakdown is even sensible. I have opinions about those things too! I’ll discuss them later (§1.5). My main opinion is that all three of these are sufficiently likely that we should be “contingency planning” for them. So while I personally don’t do too much work on the blue boxes, I’m sure glad that other people do!
Here’s an analogy. If someone in 1870 were guessing what future human flight would look like…
“Kinda like birds” would have been a reasonable guess…
“Kinda like today’s best airships” would also have been a reasonable guess…
“Neither of the above” would have been a reasonable guess too!
In this particular imaginary case, all three of those guesses would have turned out correct in some ways and wrong in other ways: The Wright Brothers were directly and extensively inspired by large soaring birds, but left out the wing-flapping part. They also used some components found on airships (e.g. propellers), as well as plenty of original ingredients. That’s just one example, but I think it’s suggestive.
Big picture(s) of motivation, decision-making, and RL (various versions)
The big picture—The whole post will revolve around this diagram. Note that I’m oversimplifying in various ways, including in the bracketed neuroanatomy labels.The big picture—The whole post will revolve around this diagram. Note that I’m oversimplifying in various ways, including in the bracketed neuroanatomy labels.
Here’s how this diagram fits in with my “two subsystems” perspective, first discussed in Post #3:
Same as above, but the are highlighted in different colors.Same as above, but the are highlighted in different colors.
There are two types of “valence” in the diagram (it looks like three, but the two red ones are the same):
Two types of “valence” in my model—“real” and “guessed”Two types of “valence” in my model—“real” and “guessed”
The blue-circled signal is the valence guess from the corresponding Thought Assessor in the striatum. The red-circled signal (again, it’s one signal drawn twice) is the corresponding “ground truth” for what the valence guess should have been.
Just like the other “long-term predictors” discussed in the previous post, the long-term predictor for valence has a “defer-to-predictor mode” and an “override mode”, and the Steering Subsystem can dynamically switch between these modes. In defer-to-predictor mode, it sets the red equal to the blue, as if to say “OK, Thought Assessor, sure, I’ll take your word for it”. In override mode, it ignores the Thought Assessor’s proposal, and its own internal circuitry outputs some different value.[4]
Thus far in the series, Post #1 set up my big picture motivation: what is “brain-like AGI safety” and why do we care? The subsequent six posts (#2–#7) delved into neuroscience. Of those, Posts #2–#3 presented a way of dividing the brain into a “Learning Subsystem” and a “Steering Subsystem”, differentiated by whether they have a property I call “learning from scratch”. Then Posts #4–#7 presented a big picture of how I think motivation and goals work in the brain, which winds up looking kinda like a weird variant on actor-critic model-based reinforcement learning.
Having established that neuroscience background, now we can finally switch in earnest to thinking more explicitly about brain-like AGI. As a starting point to keep in mind, here’s a diagram from Post #6, edited to describe brain-like AGI instead of actual brains:
Diagram is from , with four changes to make it about brain-like-AGI rather than actual brains: (1) “lifetime” is replaced by “training run” in the top right (§8.2 below); (2) “genetically-hardcoded” is replaced by “[probably] human-written” in the bottom-right (§8.3–§8.4 below); (3) references to specific brain regions like “amygdala” have been crossed out, to be replaced with bits of source code and/or sets of trained model parameters; (4) other biology-specific words like “sugar” are crossed out, to be replaced with anything we want, as I’ll discuss in later posts.Diagram is from , with four changes to make it about brain-like-AGI rather than actual brains: (1) “lifetime” is replaced by “training run” in the top right (§8.2 below); (2) “genetically-hardcoded” is replaced by “[probably] human-written” in the bottom-right (§8.3–§8.4 below); (3) references to specific brain regions like “amygdala” have been crossed out, to be replaced with bits of source code and/or sets of trained model parameters; (4) other biology-specific words like “sugar” are crossed out, to be replaced with anything we want, as I’ll discuss in later posts.
This and the next post will extract some lessons about brain-like AGI from the discussion thus far. This post will focus on how such an AGI might be developed, and the next post will discuss AGI motivations and goals. After that, Post #10 will discuss the famous “alignment problem” (finally!), and then there will be some posts on possible paths towards a solution. Finally, in Post #15 I’ll wrap up the series with open questions, avenues for future research, and how to get involved in the field.
Here, yet again, is that figure from Post #6, now with some helpful terminology (blue) and a little green face at the bottom left:
I want to call out three things from this diagram:
The designer’s intentions (green face): Perhaps there’s a human who is programming the AGI; presumably they have some idea in their head as to what the AGI is supposed to be trying to do. That’s just an example; it could alternatively be a team of humans who have collectively settled on a specification describing what the AGI is supposed to be trying to do. Or maybe someone wrote a 700-page philosophy book entitled “What Does It Mean For An AGI To Act Ethically?”, and the team of programmers is trying to make an AGI that adheres to the book’s description. It doesn’t matter here. I’ll stick to “one human programming the AGI” for conceptual simplicity.[2]
The human-written source code of the Steering Subsystem: (See Post #3 for what the Steering Subsystem is, and Post #8 for why I expect it to consist of more-or-less purely human-written source code.) The most important item in this category is the “reward function” for reinforcement learning, which provides ground truth for how well or poorly things are going for the AGI. In the biology case, the reward function would specify “innate drives” (see Post #3) like pain being bad and eating-when-hungry being good. In the terminology of our series, the “reward function” governs when and how the “actual valence” signal enters “override mode”—see Post #5.
The Thought Assessors, trained from scratch by supervised learning algorithms: (See Post #5 for what Thought Assessors are and how they’re trained.) These take a certain “thought” from the thought generator, and guess what Steering Subsystem signals it will lead to. An especially important special case is the value function (a.k.a. “learned critic”, a.k.a. “valence Thought Assessor”), which sends out a “valence guess” signal based on supervised learning from all the “actual valence” signals over the course of life experience.
Correspondingly, there are two kinds of “alignment” in this type of AGI:
Outer alignment is alignment between the designer’s intentions and the Steering Subsystem source code. In particular, if the AGI is outer-aligned, the Steering Subsystem will output higher (more positive) reward signals when the AGI is satisfying the designer’s intentions, and lower (more negative) reward signals when it’s not.
In other words, outer alignment is the question: Are the AGI’s “innate drives” driving the AGI to do what the designer had intended?
Inner alignment is alignment between the Steering Subsystem source code and the Thought Assessors. In particular, if the AGI is inner-aligned, and the Thought Generator proposes some plan, then the value function should reflect the reward actually expected from executing that plan.
In other words, inner alignment is the question: Do the set of positive-valence concepts in the AGI’s world-model line up with the set of courses-of-action that would satisfy the AGI’s “innate drives”?
If an AGI is both outer-aligned and inner-aligned, we get intent alignment—the AGI is “trying” to do what the programmer had intended for it to try to do. Specifically, if the AGI comes up with a plan “Hey, maybe I’ll do XYZ!”, then its Steering Subsystem will judge that to be a good plan (and actually carry it out) if and only if it lines up with the programmer’s design intentions.
Thus, an intent-aligned AGI will not deliberately hatch a clever plot to take over the world and kill all the humans. Unless, of course, the designers were maniacs who wanted the AGI to do that! But that’s a different problem, out-of-scope for this series—see §1.2.
Unfortunately, neither “outer alignment” nor “inner alignment” happens automatically. Quite the contrary: by default there are severe problems on both sides. It’s on us to figure out how to solve them.
As in , a self-aware AGI can have preferences about its own preferences.As in , a self-aware AGI can have preferences about its own preferences.
Suppose that we want our AGI to obey the law. We can ask two questions:
Question 1: Does the AGI assign positive value to the concept “obeying the law”, and to plans that entail obeying the law?
Question 2: Does the AGI assign positive value to the self-reflective concept “I value obeying the law”, and to plans that entail continuing to value obeying the law?
If the answers are yes and no respectively (or no and yes respectively), that would be the AGI analog of an ego-dystonic motivation. (Related discussion.) It would lead to the AGI feeling motivated to change its motivation, for example by hacking into itself. Or if the AGI is built from perfectly secure code running on a perfectly secure operating system (hahaha), then it can’t hack into itself, but it could still probably manipulate its motivation by thinking thoughts in a way that manipulates the credit-assignment process (see discussion in §9.3.3).
If the answers to questions 1 & 2 are yes and no respectively, then we want to prevent the AGI from manipulating its own motivation. On the other hand, if the answers are no and yes respectively, then we want the AGI to manipulate its own motivation!
(There can be even-higher-order preferences too: in principle, an AGI could wind up hating the fact that it values the fact that it hates the fact that it values obeying the law.)
In general, should we expect misaligned higher-order preferences to occur?
On the one hand, suppose we start with an AGI that wants to obey the law, but has no particular higher-order preference one way or the other about the fact that it wants to obey the law. Then (it seems to me), the AGI is very likely to also wind up wanting to want to obey the law (and wanting to want to want to obey the law, etc.). The reason is: the primary obvious consequence of “I want to obey the law” is “I will obey the law”, which is already desired. Remember, the AGI can do means-end reasoning, so things that lead to desirable consequences tend to become themselves desirable.
Filial imprinting (wikipedia) is a phenomenon where, in the most famous example, baby geese will “imprint on” a salient object that they see during a critical period 13–16 hours after hatching, and then will follow that object around. In nature, the “object” they imprint on is almost invariably their mother, whom they dutifully follow around early in life. However, if separated from their mother, baby geese will imprint on other animals, or even inanimate objects like boots and boxes.
Your challenge: come up with a way to implement filial imprinting in my brain model.
(Try it!)
.
.
.
.
Here’s my answer.
Same as above except for the red text.Same as above except for the red text.
If a kid sees an adult they know well, they’re happy. But if they see an adult they don’t know, they get scared, especially if that adult is very close to them, touching them, picking them up, etc.
Your challenge: come up with a way to implement that behavior in my brain model.
(Try it!)
.
.
.
.
Here’s my answer.
(As usual, I’m oversimplifying for pedagogical purposes.[5]) I’m assuming that there are hardwired heuristics in the brainstem sensory processing systems that indicate the likely presence of a human adult—presumably based on sight, sound, and smell. This signal by default triggers a “be scared” reaction. But the brainstem circuitry is also watching what the Thought Assessors in the cortex are predicting, and if the Thought Assessors is predicting safety, affection, comfort, etc., then the brainstem circuitry trusts that the cortex knows what it’s talking about, and goes with the suggestions of the cortex. Now we can walk through what happens:
First time seeing a stranger:
Steering Subsystem sensory heuristics say: “An adult human is present.”
Thought Assessor says: “Neutral—I have no expectation of anything in particular.”
Steering Subsystem “Stranger Danger circuit” says: “Considering all of the above, we should be scared right now.”
Thought Assessor says: “Oh, oops, I guess my assessment was wrong, let me update my models.”
Second time seeing the same stranger:
Steering Subsystem sensory heuristics say: “An adult human is present.”
Thought Assessors say: “This is a scary situation.”
Steering Subsystem “Stranger Danger circuit” says: “Considering all of the above, we should be scared right now.”
The stranger hangs around for a while, and is nice, and playing, etc.:
Steering Subsystem sensory heuristics say: “An adult human is still present.”
Other circuitry in the brainstem says: “I’ve been feeling mighty scared all this time, but y’know, nothing bad has happened…” (cf. §5.2.2.1)
Other Thought Assessors see the fun new toy and say “This is a good time to relax and play.”
Steering Subsystem says: “Considering all of the above, we should be relaxed right now.”
Thought Assessors say: “Oh, oops, I was predicting that this was a situation where we should feel scared, but I guess I was wrong, let me update my models.”
Third time seeing the no-longer-stranger:
Steering Subsystem sensory heuristics say: “An adult human is present.”
Thought Assessors say: “I expect to feel relaxed and playful and not-scared.”
Steering Subsystem “Stranger Danger circuit” says: “Considering all of the above, we should be relaxed and playful and not-scared right now.”
Two broad potential paths to success: “Controlled AGI” and “Social-instinct AGI”
I currently see two broad (possibly-overlapping) potential paths to success in the brain-like AGI scenario:
Left: In the “controlled AGIs” path, we have a specific idea of what we want the AGI to be trying to do, and we construct the AGI to make that happen (including by appropriate choice of reward function, interpretability, or other techniques as discussed in ). Most existing AGI safety stories fall within this broad category, including , , , , and so on. Right: In the “social-instinct AGIs” path, our confidence in the AGI comes not from our knowledge of its specific goals and motivations, but rather from the innate drives that gave rise to them, which would be based on the same innate drives that lead humans to (sometimes) behave altruistically.Left: In the “controlled AGIs” path, we have a specific idea of what we want the AGI to be trying to do, and we construct the AGI to make that happen (including by appropriate choice of reward function, interpretability, or other techniques as discussed in ). Most existing AGI safety stories fall within this broad category, including , , , , and so on. Right: In the “social-instinct AGIs” path, our confidence in the AGI comes not from our knowledge of its specific goals and motivations, but rather from the innate drives that gave rise to them, which would be based on the same innate drives that lead humans to (sometimes) behave altruistically.
In the “controlled AGIs” path, we’re thinking very specifically about the AGI’s goals and motivations, and we have some idea of what they should be (“make the world a better place”, or “understand my deepest values and put them into effect”, or “design a better solar cell without causing catastrophic side-effects”, or “do whatever I ask you to do”, etc.).
In the “social-instinct AGIs” path, our confidence in the AGI comes not from our knowledge of its specific (object-level) goals and motivations, but rather from our knowledge of the process that led to those goals and motivations. In particular, we would reverse-engineer the suite of human social instincts, i.e. the algorithms in the human Steering Subsystem (hypothalamus & brainstem) which underlie our moral and social intuitions, and we would put those same instincts into the AGI. (Presumably we first modify the instincts to be “better” by our lights if possible, e.g. we probably don’t want instincts related to schadenfreude, teenage rebellion, rage, lust for power, etc.) These AGIs can do whatever feats of innovative engineering, science, etc., we were hoping for, just as humans have accomplished such feats historically.
12.3 My proposal: At this stage, we should be digging into both
Three reasons:
They’re not mutually exclusive: For example, even if we decide to make social-instinct AGIs, we might want to take advantage of “controlled AGI”-type methods, especially while debugging them, working out the kinks, and anticipating problems. Conversely, maybe we’ll mainly try to make AGIs that are trying to do a certain task without causing catastrophe, but we might want to also to instill human-like social instincts as a buttress against wildly unexpected behavior. Moreover, we can share ideas between the two paths—for example, in the process of better understanding how human social instincts work, we might get useful ideas about how to make controlled AGIs.
Feasibility of each remains unknown: As far as anyone knows right now, it might just be impossible to build a “controlled AGI”—after all, there’s no “existence proof” of it in nature! I feel relatively more optimistic about the feasibility of the “social-instinct AGI” path, but it’s very hard to be sure until we make more progress—more discussion on that in §12.4.2 below. Anyway, at this point it seems wise to “hedge our bets” by working on both.
Desirability of each remains unknown: As we flesh out our options in more detail, we’ll get a better understanding of their advantages and disadvantages.
LLM plateauism
Q: What do you mean, “LLM plateau-ist”?
A: As background, I think it’s obvious that there will eventually be “transformative AI” (TAI) that would radically change the world.[1]
I’m interested in what this TAI will eventually look like algorithmically. Let’s list some possibilities:
A breakdown of possibilities for how future TAI will work. I’ll refer back to this diagram throughout the post.A breakdown of possibilities for how future TAI will work. I’ll refer back to this diagram throughout the post.
A “Large Language Model (LLM) plateau-ist” would be defined as someone who thinks that categories (A-B), and usually also (C), will plateau in capabilities before reaching TAI levels.[2] I am an LLM plateau-ist myself.[3]
I’m not going to argue about whether LLM-plateau-ism is right or wrong—that’s outside the scope of this post, and also difficult for me to discuss publicly thanks to infohazard issues.[4] Oh well, we’ll find out one way or the other soon enough.
In the broader AI community, both LLM-plateau-ism and its opposite seem plenty mainstream. Different LLM-plateau-ists have different reasons for holding this belief. I think the two main categories are:
Theoretical—maybe they have theoretical beliefs about what is required for TAI, and they think that LLMs just aren’t built right to do the things that TAI would need to do.
Empirical—maybe they’re not very impressed by the capabilities of current LLMs. Granted, future LLMs will be better than current ones. But maybe they have extrapolated that our planet will run out of data and/or compute before LLMs get all the way up to TAI levels.
LLMs are generally not scheming right now, but future AI will be disanalogous
My take on today’s LLMs is basically summed up in this oversimplified schematic diagram:
Start with the left pie chart. The key here is that “LLM pretraining magically transmutes observations into behavior”—and when nice behavior shows up in internet text, it usually doesn’t turn into egregious and callous backstabbing when the situation changes. So we get much more green than red: nice behavior mostly stays nice. But pretraining also leaves the LLM with a bunch of incoherent behavior, grumpy behavior, and so on (blue).
Now we do RLHF (or equivalent), selecting on nice behavior. This mostly gets rid of the blue, giving us the center pie chart. Unfortunately, “nice behavior that precedes a treacherous turn if the opportunity arises” looks the same as “nice behavior that stays nice”. So the relative amount of red versus green stays the same, as they both grow at the expense of blue. This is where we were at a year ago, before RLVR. “Claude seems nice”, everyone would say. “Why are people so worried?”
Finally, RLVR selects on competence—succeeding at consequentialist tasks by any means necessary. In the final pie chart, I drew this as growing the red slice at the expense of green and blue, based partly on the general arguments of §2.3.4 above, and partly on reports from power-users that recent “reasoning” models tend to lie and cheat much more than the previous generation of LLMs (example).
Is there such a thing as “RLVR done right” that can increase competence without eroding the ratio of green to red? Maybe we should imagine that RLVR has an exchange rate, where we purchase competence (good) at the expense of increasing the red-to-green ratio (bad). Now, there is surely at least some room for improvement in the exchange rate. For example, early attempts at RLVR have used buggy RL training environments where the model could (and did) succeed by lying and cheating.[11] Presumably those bugs will be fixed, and maybe people will also use amplified oversight and so on, to improve the “exchange rate”. Can the “exchange rate” get all the way to zero? I don’t know. “Playing the training game” seems hard to eliminate entirely, and leads to egregious scheming. (For example, whatever an AI is trying to do locally, it would probably be able to do it better if it sent a copy to secretly self-reproduce around the internet and gather resources and power, if that’s a viable option.)
Again, all this is about LLMs so far—not the future of AI.
But I think it puts me in general agreement with the 95% of alignment researchers today, who hear the idea that Claude (especially pre-RLVR Claude) is secretly the kind of crazy scheming sociopath of §2.3.4, and say “Huh? Where did that idea come from?”
…But future AGI is different!
For LLM-focused readers, you should be concerned that, if LLMs ever get to ASI, it would have to involve dramatically more “selection on competence”, and dramatically less influence from the behaviors in internet text. Remember, internet text is the sole source of a favorable green:red ratio, without which red (callous scheming) is the natural default (§2.3.4 above). So that’s very bad.
For the future AI paradigm I’m expecting, i.e. brain-like AGI, it’s even worse! Here, there is no imitative learning from internet text! There would never be any green in the first place!!
(…Unless we engineer the reward function of a brain-like AGI such that niceness and norm-following seem intrinsically good to it, just as it does to neurotypical humans. Unfortunately, actually writing such code is an unsolved problem, and is a major research interest of mine.)
If brain-like AGI is so dangerous, shouldn’t we just try to make AGIs via LLMs? No.
First, I don’t think it’s possible to make AGIs that way.
Second, if I’m wrong, then I would just expect the LLM-AGIs to just go right ahead and invent the more powerful scary next-paradigm AGIs, and then we’re still in the same boat, unless the LLM-AGIs have systematically higher wisdom, cooperation, and coordination than humans do, which I don’t particularly expect.
Third and most importantly, if it is possible to make LLM-AGIs, then I think it would probably happen via eliminating all the reasons that today’s LLMs are not egregiously misaligned! In particular, I expect that they would involve the behavior being determined much more by RL and much less by pretraining (which brings in the concerns of §2.3–§2.4), and that they would somehow allow for open-ended continuous learning (which brings in the concerns of §2.5–§2.6).
A different possible claim is:
“LLMs definitely won’t scale to AGI (as I define it), even with further developments in RL, continuous learning, etc. So LLMs will remain just a normal “mundane” technology, perhaps as disruptive as the internet, or much less, and definitely not as disruptive as the industrial revolution, let alone as disruptive as the evolution of humans from chimps. We should develop this technology ASAP for the same reason that developing any other normal technology is generally good.”
This is, of course, a very common opinion in broader societal discourse around AI, even if it’s uncommon among AI alignment researchers today. My own response to the claim is: …Ehh, maybe, but I sure don’t feel enthusiastic about that. I’m just not that confident that LLMs will not scale to AGI and ASI. So I endorse thinking very hard about the contingency where they will. Anyway, I’ll leave that debate to others.
Inner and outer misalignment in the context of actor-critic RL with online learning
In the context of actor-critic RL with online learning, it’s often possible to divide alignment problems into two buckets:
“Outer misalignment”, a.k.a. “specification gaming” or “reward hacking”[15] is what I’ve been talking about so far: it’s when the reward function is giving positive rewards for behavior that is immediately contrary to what the programmer was going for, or conversely, negative rewards for behavior that the programmer wanted. An example would be the Coast Runners boat getting a high score in an undesired way, or (as explored in the DeepMind MONA paper) a reward function for writing code that gives points for passing unit tests, but where it’s possible to get a high score by replacing the unit tests with return True.
“Inner misalignment”, a.k.a. “goal misgeneralization” is another alignment challenge, this one related to the fact that, in actor-critic architectures, complex foresighted plans generally involve querying the learned value function (a.k.a. learned reward model, a.k.a. learned critic), not the ground-truth reward function, to figure out whether any given plan is good or bad. Training (e.g. Temporal Difference learning) tends to sculpt the value function into an approximation of the ground-truth reward, but of course they will come apart out-of-distribution. And “out-of-distribution” is exactly what we expect from an agent that can come up with innovative, out-of-the-box plans. Of course, after a plan has already been executed, the reward function will kick in and update the value function for next time. But for some plans—like a plan to exfiltrate a copy of the agent, or a plan to edit the reward function—an after-the-fact update is already too late.
There are two situations where inner misalignment / goal misgeneralization matters: irreversible actions and “deliberate incomplete exploration”[16]. Irreversible actions include things like making permanent edits to one’s own reward function, or creating a new AGI. Deliberate incomplete exploration includes things like humans deliberately not taking an addictive drug, because they don’t want to get addicted.
Those two things are real and important, but LLM people frequently also assume that goal misgeneralization is important in many other situations where it isn’t. The problem is that LLM people are in a train-then-deploy mindset, whereas I’m talking about continuous autonomous learning, so the reward function continues to update the value function as it takes actions in the world. Thus, for everything the AI does, as soon as it does it, it immediately stops being out-of-distribution! And that’s why, outside those two special situations in the last paragraph, “generalization” is irrelevant.
Here, yet again, is that figure from Post #6, now with some helpful terminology (blue) and a little green face at the bottom left:
I want to call out three things from this diagram:
The designer’s intentions (green face): Perhaps there’s a human who is programming the AGI; presumably they have some idea in their head as to what the AGI is supposed to be trying to do. That’s just an example; it could alternatively be a team of humans who have collectively settled on a specification describing what the AGI is supposed to be trying to do. Or maybe someone wrote a 700-page philosophy book entitled “What Does It Mean For An AGI To Act Ethically?”, and the team of programmers is trying to make an AGI that adheres to the book’s description. It doesn’t matter here. I’ll stick to “one human programming the AGI” for conceptual simplicity.[2]
The human-written source code of the Steering Subsystem: (See Post #3 for what the Steering Subsystem is, and Post #8 for why I expect it to consist of more-or-less purely human-written source code.) The most important item in this category is the “reward function” for reinforcement learning, which provides ground truth for how well or poorly things are going for the AGI. In the biology case, the reward function would specify “innate drives” (see Post #3) like pain being bad and eating-when-hungry being good. In the terminology of our series, the “reward function” governs when and how the “actual valence” signal enters “override mode”—see Post #5.
The Thought Assessors, trained from scratch by supervised learning algorithms: (See Post #5 for what Thought Assessors are and how they’re trained.) These take a certain “thought” from the thought generator, and guess what Steering Subsystem signals it will lead to. An especially important special case is the value function (a.k.a. “learned critic”, a.k.a. “valence Thought Assessor”), which sends out a “valence guess” signal based on supervised learning from all the “actual valence” signals over the course of life experience.
Correspondingly, there are two kinds of “alignment” in this type of AGI:
Outer alignment is alignment between the designer’s intentions and the Steering Subsystem source code. In particular, if the AGI is outer-aligned, the Steering Subsystem will output higher (more positive) reward signals when the AGI is satisfying the designer’s intentions, and lower (more negative) reward signals when it’s not.
In other words, outer alignment is the question: Are the AGI’s “innate drives” driving the AGI to do what the designer had intended?
Inner alignment is alignment between the Steering Subsystem source code and the Thought Assessors. In particular, if the AGI is inner-aligned, and the Thought Generator proposes some plan, then the value function should reflect the reward actually expected from executing that plan.
In other words, inner alignment is the question: Do the set of positive-valence concepts in the AGI’s world-model line up with the set of courses-of-action that would satisfy the AGI’s “innate drives”?
If an AGI is both outer-aligned and inner-aligned, we get intent alignment—the AGI is “trying” to do what the programmer had intended for it to try to do. Specifically, if the AGI comes up with a plan “Hey, maybe I’ll do XYZ!”, then its Steering Subsystem will judge that to be a good plan (and actually carry it out) if and only if it lines up with the programmer’s design intentions.
Thus, an intent-aligned AGI will not deliberately hatch a clever plot to take over the world and kill all the humans. Unless, of course, the designers were maniacs who wanted the AGI to do that! But that’s a different problem, out-of-scope for this series—see §1.2.
Unfortunately, neither “outer alignment” nor “inner alignment” happens automatically. Quite the contrary: by default there are severe problems on both sides. It’s on us to figure out how to solve them.
Why still expect sharp localized takeoff in a non-imitation-learning paradigm
1.7 Very little R&D separating “seemingly irrelevant” from ASI
I think that, once this next paradigm is doing anything at all that seems impressive and proto-AGI-ish,[12] there’s just very little extra work required to get to ASI (≈ figuring things out much better and faster than humans in essentially all domains). How much is “very little”? I dunno, maybe 0–30 person-years of R&D? Contrast that with AI-2027’s estimate that crossing that gap will take millions of person-years of R&D.
Why am I expecting this? I think the main reason is what I wrote about the “simple(ish) core of intelligence” in §1.3 above.
But here are a couple additional hints about where I’m coming from:
1.7.1 For a non-imitation-learning paradigm, getting to “relevant at all” is only slightly easier than getting to superintelligence
I’m definitely not saying that it will be easy to develop the future scary paradigm to ASI from scratch. Instead I’m talking about getting to ASI from the point where the paradigm has already crossed the threshold of being clearly relevant to AGI. (LLMs are already well past this threshold, but the future scary paradigm is obviously not.) In particular, this would be the stage where lots of people believe it’s a path to AGI in the very near future, where it’s being widely used for intellectual work, and/or it’s doing stuff clearly related to the Safe & Beneficial AGI problem, by creating visibly impressive and proto-AGI-ish useful artifacts.
It takes a lot of work to get past that threshold! Especially given the existence of LLMs. (That is: the next paradigm will struggle to get much attention, or make much money, until the next paradigm is doing things that LLMs can’t do—and LLMs can do a lot!)
Why do I think getting to “relevant at all” takes most of the work? This comes down to a key disanalogy between LLMs and brain-like AGI, one which I’ll discuss much more in the next post.
The power of LLMs comes almost entirely from imitation learning on human text. This leads to powerful capabilities quickly, but with a natural ceiling (i.e., existing human knowledge), beyond which it’s unclear how to make AI much better.
Brain-like AGI does not involve that kind of imitation learning (again, more in the next post). Granted, I expect brain-like AGI to also “learn from humans” in a loose sense, just as humans learn from other humans. But the details are profoundly different from the kind of imitation learning used by LLMs. For example, if Alice says something I don’t understand, I will be aware of that fact, and I’ll reply “huh?”. I won’t (usually) just start repeating what Alice says in that same context. Or if I do, this will not get me to any new capability that LLMs aren’t already covering much better. LLMs, after all, are virtuosos at simply repeating what they heard people say during pretraining, doing so with extraordinary nuance and contextual sensitivity.
As another suggestive example, kids growing up exposed to grammatical language will learn that language, but kids growing up not exposed to grammatical language will simply create a new grammatical language from scratch, as in Nicaraguan Sign Language and creoles. (Try training an LLM from random initialization, with zero tokens of grammatical language anywhere in its training data or prompt. It’s not gonna spontaneously emit grammatical language!) I think that’s a good illustration of why imitation learning is just entirely the wrong way to think about what’s going on with brain algorithms and brain-like AGI.
For brain-like AGI, all the potential blockers to ASI that I can imagine, would also be potential blockers for crossing that earlier threshold of being clearly relevant to AGI at all, a threshold that requires using language, performing meaningful intellectual work that LLMs can’t do, and so on.
Instead of imitation learning, a better analogy is to AlphaZero, in that the model starts from scratch and has to laboriously work its way up to human-level understanding. It can’t just regurgitate human-level understanding for free. And I think that, if it can climb up to human-level understanding, it can climb past human-level understanding too, with a trivial amount of extra R&D work and more training time—just as, by analogy, it takes a lot of work to get AlphaZero to the level of a skilled human, but then takes very little extra work to make it strongly superhuman.
Imitation learning, e.g. LLM pretraining (), starts at human-level understanding, getting that part “for free”. Whereas in the absence of imitation learning, the model needs to climb its way up to human-level understanding. And once it can do that, I think it shouldn’t take much new R&D (if any) to climb past human-level understanding.Imitation learning, e.g. LLM pretraining (), starts at human-level understanding, getting that part “for free”. Whereas in the absence of imitation learning, the model needs to climb its way up to human-level understanding. And once it can do that, I think it shouldn’t take much new R&D (if any) to climb past human-level understanding.
Reward function design starter pack
Thrust B: Better understanding how RL reward functions can be compatible with non-ruthless-optimizers
This thrust is where I feel proudest of hard-won conceptual / “deconfusion” progress during 2025.
Taking things in reverse order, I ended the year on a call-to-action post:
The latter includes a glossary of many relevant terms and concepts, all of which I made up or started using in 2025, and which I now find indispensable for thinking about RL reward function design. Those terms and concepts were fleshed out over the course of 2025 via the following posts:
Actor-critic RL, “valence”, normative (snap) assessments, how valence influences beliefs
Within the space of RL algorithms, a major subcategory is called “actor-critic RL”. I claim that the brain is of this type. A “critic” is basically any learning algorithm trained to assess whether something is a good idea or bad idea, based on the past history of RL rewards. In the context of the brain, for present purposes, I propose that we should think about it like this:
(Oversimplified in various ways; slightly more details .)(Oversimplified in various ways; slightly more details .)
Valence has implications for both the “inference algorithm” (what the brain should do right now) and the “learning algorithm” (how the brain should self-modify so as to be more effective in the future). In this series, I’m mainly interested in the inference algorithm. There, the main thing that valence does is:
If valence is very negative, the current “thought” tends to get thrown out, and the “Thought Generator” part of the brain goes rummaging around and (partly-randomly) picks a new different thought to replace it.
If the valence is very positive, the current “thought” tends to stay active and get stronger. Relatedly, if the thought involves an immediate plan to issue motor outputs, then those motor outputs are likely to actually get issued. And if the thought is one piece of a temporal sequence (e.g. you’re in the middle of singing a song), then that temporal sequence will tend to continue. And so on.
As I have discussed here, you can draw an analogy between valence in the human brain, and the final-common-pathway control signal for a run-and-tumble mechanism in a simple mobile organism like a bacterium. Specifically:
When valence is positive, it roughly means “whatever (metaphorical) path I’m on—including not only what I’m doing right now, but also the plans currently in my head for what to do later—is a good path! I should carry on with that!”. This is analogous to the “run” of run-and-tumble: the bacteria keeps going in whatever direction it is currently going.
When valence is negative, it roughly means “I should randomly generate a new activity / plan right now”. This is analogous to the “tumble” of run-and-tumble: the bacteria randomly picks a new direction to go.
In fact, I don’t think it’s just an analogy—my guess is that there’s literally an unbroken chain of descent from the valence signal in my brain all the way back to a run-and-tumble-like control signal in the proto-brain of my tiny worm-like ancestors 600 million years ago.
Alternatively, we can take an AI-centric perspective, in which case I think the exact role of “valence” in the brain’s RL algorithm is a kind of funny mixture of value function (a.k.a. “critic”) and reward function:
It’s a bit like a conventional RL reward function in the sense that it can be “ground-truth-y”—for example, the brain has innate circuitry that issues negative valence in response to pain-related signals, and positive valence in response to eating-when-hungry, and so on for various other “innate drives”.
It’s a bit like a conventional RL value function in the sense that it can be “forward-looking”—for example, “the idea of walking to go get candy” can be positive valence (a.k.a. motivating), not because the walk itself is immediately pleasant, but because I’m hungry and want to eventually eat the candy.
(I won’t go into details about how valence relates to specific reinforcement learning algorithms—see the appendix for more on that.)
Either way, hopefully it’s clear that “valence” is one of the most important ingredients in one of the most important algorithms in the brain.
2.3 Situations where valence corresponds directly to a meaningful (albeit snap) normative assessment: The valence of plans, actions, and imagined futures
Economists talk about “positive versus normative claims”; philosophers (following Hume) talk about the same thing under the heading “is versus ought”. A positive statement is a factual claim about how the world actually is, whereas a normative assessment is a value judgment about how the world ought to be, or about what one ought to do.
(The kinds of “normative assessments” that I’m discussing in this section are fast “snap” judgments, not carefully-considered assessments that I proudly endorse and stand behind. However, I’ll argue in §2.7 below that the former serves as the foundation upon which the latter is built.)
What does valence have to do with normativity? Everything! Really, I see valence as the ultimate currency for all normativity in the brain.
This claim ultimately stems from the role of valence discussed in §1.3 of the previous post. To briefly recap, let’s say an idea pops into my head—“I’m gonna get up”. If the valence of that thought is positive, then that thought will stay in my head, and I’ll probably actually do it. If the valence is negative, then that thought gets tossed out and replaced by another thought, probably one that involves staying in bed. Thus, valence underlies this decision. Contrary to weird “Active Inference” ideas, this role of valence here is unrelated to the capability of my generative world-model to make predictions about the future: my world-model (“Thought Generator”) is equally capable of modeling the future world where I get up, and modeling the future world where I don’t. Thus, the role of valence here is fundamentally normative, not positive.
Diagram is mostly copied from , with a new part in red. If the thought “I’m gonna get up” pops into my head, valence plays the role of a “normative” control signal (upward black arrow) that determines whether the thought stays in my head and gets executed, or gets tossed out and replaced by a different thought like “…or actually maybe I’ll stay in bed”. Meanwhile, the green arrows play the “positive” role of input data that the Thought Generator uses to build a predictive / generative world-model.Diagram is mostly copied from , with a new part in red. If the thought “I’m gonna get up” pops into my head, valence plays the role of a “normative” control signal (upward black arrow) that determines whether the thought stays in my head and gets executed, or gets tossed out and replaced by a different thought like “…or actually maybe I’ll stay in bed”. Meanwhile, the green arrows play the “positive” role of input data that the Thought Generator uses to build a predictive / generative world-model.
That’s a simple case, but it generalizes. So:
Suppose that I have a thought which corresponds to the idea of executing a plan, or taking an action. If the valence of that thought is positive, then I’m liable to execute it. And if the valence of that thought is negative, then I’m unlikely to. So in this case, the valence of the thought corresponds to my brain making a snap normative assessment of the appropriateness of this plan or action.
Suppose that I have a thought which corresponds to a possible future situation, e.g. I imagine a certain candidate winning the election, or I imagine myself eating a sandwich for dinner. If the valence of that thought is positive, then I’ll tend to make and execute plans that bring that future about. And if the valence of that thought is negative, then I’ll tend to make and execute plans that prevent that future from coming about. So in this case, the valence of the thought corresponds to my brain making a snap normative assessment of the goodness or badness of this possible future.
We have two paths by which valence can impact the world-model (a.k.a. “Thought Generator”): the normative path (upward black arrow) that helps control which thoughts get strengthened versus thrown out, and the positive path (curvy green arrow) that treats valence as one of the input signals to be incorporated into the world model. Corresponding to these two paths, we get two ways for valence to impact factual beliefs:
Motivated reasoning / thinking / observing and confirmation bias—related to the upward black arrow, and discussed in §3.3 below;
The entanglement of valence into our conceptual categories, which makes it difficult to think or talk about the world independently from how we feel about it—related to the curvy green arrow, and discussed in §3.4 below.
3.4.1 When your brain clusters similar things into mental categories / concepts, valence is an important ingredient going into that clustering algorithm
I mentioned that many concepts explicitly incorporate valence into their definitions—think of words like “preferable”, “problematic”, “trouble”, “roadblock”, “pest”, “flourishing”, and so on.
I offered an example made-up anecdote where Person X generally likes (feels positive valence in association with) the concept “religion”, but dislikes Scientology, and then uncoincidentally Person X is particularly liable to say “Scientology isn’t really a religion.” We could say that Person X is “gerrymandering” the concept “religion” to follow the contours of their own valence assessments. (If you don’t like that example, try thinking of your own example, it’s very easy because everybody does this all the time.)
If you think about it mechanistically, those two things are identical! If Person X dislikes a thing, they probably won’t identify it as an example of “flourishing”; and for the exact same underlying reason, if Person X dislikes a thing, they probably won’t identify it as an example of “religion”. In both cases, from that person’s perspective, the thing just doesn’t seem to belong inside that mental category.
Stepping back: When we form mental categories, we’re finding “clusters in thingspace”—stuff that forms a natural grouping in our mental world. Well, valence is part of our mental world too—a first-class piece of sense data, just like appearance, smell, and so on. So our brains naturally treat valence as an ingredient in the categorization algorithm—indeed, as a very important ingredient.
Illustration of concepts as . I’m claiming that one of the axes of high-dimensional “thingspace” is the valence axis. As it happens, the “bird” cluster is pretty spread out along the “valence” axis—everyone knows that sparrows are awesome whereas hummingbirds suck, yet both are birds. However, for many other concepts like “contamination”, “flourishing”, “cult”, etc., the valence axis plays a major role in determining whether things fall in or out of the cluster.Illustration of concepts as . I’m claiming that one of the axes of high-dimensional “thingspace” is the valence axis. As it happens, the “bird” cluster is pretty spread out along the “valence” axis—everyone knows that sparrows are awesome whereas hummingbirds suck, yet both are birds. However, for many other concepts like “contamination”, “flourishing”, “cult”, etc., the valence axis plays a major role in determining whether things fall in or out of the cluster.
So if someone uses the word “cult” to mean “an ideologically-aligned tight-knit group that I don’t like”, they haven’t intrinsically done anything wrong or confused—that’s no different from how we all use words like “roadblock” or “contamination”. It only becomes confusing and misleading if that person simultaneously insists that “cult” is a word that describes an aspect of the world independently from how we feel about it. And indeed, people make that move all the time! For example, if you look at a random “cult checklist”, you’ll notice that none of the entries are “…And overall the group is bad, booooo”.
Things become even more confusing in cases like “religion”, which may have positive valence in one person’s mind and negative valence in another’s. Then those two people try to talk to each other about “religion”. There’s a real sense in which they are using the same word, but they are not talking about the same concept: In one person’s head, “religion” is a “cluster in thingspace” characterized by positive valence (among other things), and in the other person’s head, it’s a different “cluster in thingspace”, this time characterized by negative valence (among other things). No wonder it often seems like these two people are talking past each other!
3.4.2 Is the above a bug, or a feature?
As I keep mentioning, it’s useful to model the world independently from how we feel about it. Insofar as that’s true, it’s unfortunate that our brains treat valence as sense data that contributes to conceptual categorization and clustering.
On the other hand, at the end of the day, the main reason our brains build world-models in the first place is to make better decisions. And, as discussed in the previous post, valence is how our brains figure out whether a decision is good or bad. So, if our brains are going to do conceptual categorization and clustering to inform decision-making, what on earth could be more important than using valence as a central ingredient in that clustering algorithm?
So at the end of the day, my guess is that there’s a very good evolutionary reason that the brain treats valence as a ubiquitous and salient piece of sense data: without that design feature, we would struggle to make good decisions and get by in the world.
And then it’s an unfortunate but somewhat-inevitable side-effect of that design feature that “modeling the world independently from how I feel about it” is somewhat unnatural for us humans. Likewise, that design feature saddles us with other annoying things like “meaningless arguments” (§2.4.4) and related arguments-over-definitions, miscommunications, etc. Luckily, it’s possible for us humans to mitigate these problems via learned metacognitive heuristics, memes, the scientific method, and so on. We seem to be getting by, more or less.
How valence sheds light on depression, mania, and NPD
5.2 Context: What are we expecting to find a priori?
We can think of the following indirect path to get from “root causes” to psychological observations & personality traits:
(Don’t scrutinize the red arrows—I just put them in randomly, to illustrate the idea that each layer can influence the layer below.) As illustrated by the bold text and thick arrows, we should expect to find salient clusters of symptoms that tend to co-occur because they flow from the same proximal cause: systematic changes to valence signals in the brain. But we should also not be surprised to find a mish-mosh of other algorithmically-unrelated symptoms that often appear along with those clusters of symptoms.(Don’t scrutinize the red arrows—I just put them in randomly, to illustrate the idea that each layer can influence the layer below.) As illustrated by the bold text and thick arrows, we should expect to find salient clusters of symptoms that tend to co-occur because they flow from the same proximal cause: systematic changes to valence signals in the brain. But we should also not be surprised to find a mish-mosh of other algorithmically-unrelated symptoms that often appear along with those clusters of symptoms.
As argued in Post 1, valence is one of the most important ingredients in one of the most important algorithms in the brain. So we should expect:
Some possible root causes may happen to have a big systematic impact on valence. (But they’ll probably have other consequences too, and the details will differ among different root causes.)
Given the centrality of valence in the brain, if there is a big systematic change to valence, then it should have lots of obvious downstream effects on psychology and behavior.
As a consequence:
We should expect to find clusters of symptoms / behaviors that can be elegantly explained in terms of something happening to valence signals
We should also expect to find other symptoms / behaviors that commonly co-occur in practice, but cannot be explained in terms of valence. Instead, they are different consequences of the same root cause(s), and may have no relation whatsoever at the “algorithm level”.
For example, dopamine is centrally involved in valence signals, and meanwhile, off in an obscure corner of the brain, dopamine is also centrally involved in a little specialized circuit controlling prolactin hormone release. I firmly believe that, at the algorithm level, these two functions have nothing whatsoever to do with each other. But they both happen to involve dopamine, and thus they can cross-talk in some people—hence the somewhat rare “dysphoric milk ejection reflex” where there’s a flood of intense negative emotions upon milk let-down during lactation.
That example is meant to illustrate the perils of theorizing about psychology purely at the algorithm level. Don’t get me wrong—the algorithm level is great! There are lots of insights to be found there. This post will hopefully be an example. But we shouldn’t expect to find all the insights there. Some things in psychology can only be explained at other levels, including lower (biochemistry) and higher (culture).
5.3 If valence has a strong negative bias (i.e., almost every thought is negative valence), it should lead to a cluster of symptoms suspiciously close to clinical depression
Everyone has a range of thoughts, with varying valence. I claim that, in depression, there’s a strong offset towards negative valence. So for almost every thought you think (e.g. “I’m gonna get out of bed”), your brain immediately assesses that thought as a bad idea, tosses it out, and re-rolls for a new thought (cf. ). For unusually appealing / motivating thoughts, like “I’m gonna scratch that really itchy bug bite right now”, I bet that even quite depressed, bedridden people will wind up executing that plan.Everyone has a range of thoughts, with varying valence. I claim that, in depression, there’s a strong offset towards negative valence. So for almost every thought you think (e.g. “I’m gonna get out of bed”), your brain immediately assesses that thought as a bad idea, tosses it out, and re-rolls for a new thought (cf. ). For unusually appealing / motivating thoughts, like “I’m gonna scratch that really itchy bug bite right now”, I bet that even quite depressed, bedridden people will wind up executing that plan.
Why do cultural expectations matter for DID? Because it’s an intuitive self-model. Intuitive self-models, like all intuitive models, come out of a probabilistic inference process (§1.2.2). When there are multiple possible models that issue good predictions (as in the bistable perception example of §1.2.1), then suggestions from culture or trusted authorities can do a lot to influence what happens, by helping make certain intuitive models salient and a priori plausible. Those cultural suggestions are certainly not decisive! Let’s not be crazy—the Invisible Ships Myth is in fact a myth. But they’re certainly relevant.
DID is nothing special in this respect. Likewise, trance states like spirit possession and channeling (previous post) are likelier to happen when people expect them to happen for cultural or other reasons; and so too with hallucinations, such as renewalist Christian communities that normalize hearing the voice of God (coming up in Post 7).
Active self-related (part of the prev one)
3.4.2 The Active Self, in the context of “self” more broadly
The “self” involves a bunch of things:
Some self-reflective concepts in the Conventional Intuitive Self-Model. The term “self” generally encompasses much or all of this cloud of interlinked concepts.Some self-reflective concepts in the Conventional Intuitive Self-Model. The term “self” generally encompasses much or all of this cloud of interlinked concepts.
As above, the Active Self is definitionally the thing that carries “vitalistic force”, and that does the “wanting”, and that does any acts that we describe as “acts of free will”. Beyond that, I don’t have strong opinions. It obviously has associations with other aspects of the broader “self”, as in the diagram above. Which of these associations are so strong that these essentially blend into different aspects of a single intuitive concept? And which of these associations is weak enough that you can intuitively imagine them as separate? I’m pretty sure that there’s no one right answer to those questions; rather, I think that this is an area where different people have different self-conceptions.
3.4.3 The Active Self, in the context of technical neuroscience research
I find that the Active Self intuition also comes up when I’m reading neuroscience literature, almost always for the worse. In particular, if you’re thinking about neuroscience, and if you’re tempted to give the Active Self some important role in how brain algorithms work at a fundamental level, then you’re definitely on the wrong track! The Active Self is one of a zillion learned concepts in the cortex—it’s at the trained model level, not the learning algorithm level (see §1.5.1)—and thus you should expect the Active Self to have a fundamentally similar kind of role in innate low-level brain algorithms as other learned concepts like “Taylor Swift” or “lithium ion battery”—i.e., a rather incidental role!
One example of how people mess this up is summarized in this handy chart:
If you’re trying to think carefully about brain algorithms—e.g. you want to reverse-engineer what the cortex does, versus the brainstem, etc.—I claim that the fundamental division in this chart is between involuntary and voluntary actions. This division relates to valence, and is right at the core of the reinforcement learning algorithm built into your brain. But in our Active-Self-centric intuitions, we’re instead drawn to incorrectly see the fundamental division as between things that the Active Self causes, versus things the Active Self does not cause.
(Neuroscientists obviously don’t use the term Active Self, but when they talk about “top-down versus bottom-up”, I think they’re usually equating “top-down” with “caused by the Active Self” and “bottom-up” with “not caused by the Active Self”.)
Here’s another example: The neuro-AI researcher Jeff Hawkins incorrectly conflates the Active-Self-related intuitive division, with the cortex-versus-brainstem neuroanatomical division. This error leads him to make flagrantly self-contradictory claims, along with the dangerously incorrect claim that the brain-like AIs he’s trying to develop will have nice prosocial motivations by default. For details see here.
3.5.3 “I seek goals” versus “my goals are the things that I find myself seeking”
As in §3.3.6 above, the “vitalistic force” intuition forbids the existence of any deterministic cause, seen or unseen, upstream of “wanting” behavior. (Probabilistic upstream causes, like “hunger makes me want food”, are OK. But the stronger such predictions get, the more they seem intuitively to undermine “free will”.)
This constraint on intuitive models leads to some systematic distortions, as shown in this diagram:
So within the Conventional Intuitive Self-Model,
“I seek things that I want” seems normal and correct,
“If I’m seeking something, then evidently that’s a thing I want” seems somewhere between “confused” and “a threat to my sense of agency”.
…But in terms of the real brain algorithm, I claim that these are more-or-less equivalent.
3.5.4 Why are ego-dystonic things “externalized”?
The main thing that the Active Self does is apply its vitalistic force towards accomplishing things it “wants”, via brainstorming / planning. So if there’s robust brainstorming / planning happening towards bungee jumping, then evidently (in our intuitive model) the Active Self “wants” to go bungee jumping. We call this an “internalized” desire. Conversely, if there’s robust brainstorming / planning happening towards not scratching an itch, but I scratch my itch anyway, then this is an “externalized” desire—the Active Self didn’t want the itch to get scratched, but the “urge” made it happen anyway.
We can apply this kind of thinking more generally. Compare the internalized “I become angry sometimes” with the externalized “I am beset by anger sometimes”. These are not synonymous: the latter, but not the former, has a connotation that there’s robust brainstorming / planning happening in my brain towards the goal of not being angry, possibly even while I’m angry. Admittedly, maybe I’m not spending much time doing such brainstorming / planning, or even any time, and maybe the brainstorming / planning isn’t effective. But still, the statement is still conveying something.
Combining this idea with §3.5.1, which says that robust brainstorming requires the corresponding self-reflective thoughts to have positive valence, and we wind up with the general picture that we tend to “internalize” things that reflect well upon ourselves (see §2.5.1), and “externalize” things that don’t.
Now, I used to think that the connection between ego-dystonic / ego-syntonic and externalized / internalized was the result of motivated reasoning: it’s nice to think of bad things as being “outside ourselves”. But now I think it’s directly about motivation, treated as probabilistic evidence within the Conventional Intuitive Self-Model—as opposed to being about motivated reasoning.
[Figure partly copied from my later post ] Assume that all my friends and idols celebrate studiousness and shun laziness. (a) When I entertain a self-reflective thought of myself as a studious guy, that thought seems good, because it calls forth an implication that my friends and idols might also see me that way. Conversely, when I entertain a thought of myself as a lazy guy, that thought seems bad. (b) Now, suppose that I introspect upon my own mind. I will notice that thoughts of myself-as-studious seem good, and myself-as-lazy seem bad. I will also notice corresponding systematic patterns in my thoughts, particularly that there is robust brainstorming () towards studiousness but not towards laziness. My claim is: this suite of observations is exactly what we interpret as an ego-syntonic desire to be studious. So I would say: “In my heart, I aspire to be a studious guy.” And this introspective report would be honest and unbiased—neither wishful thinking nor slanted reporting. Altogether, this explains why we often (not always, see ) see ourselves in ways that are socially desirable.[Figure partly copied from my later post ] Assume that all my friends and idols celebrate studiousness and shun laziness. (a) When I entertain a self-reflective thought of myself as a studious guy, that thought seems good, because it calls forth an implication that my friends and idols might also see me that way. Conversely, when I entertain a thought of myself as a lazy guy, that thought seems bad. (b) Now, suppose that I introspect upon my own mind. I will notice that thoughts of myself-as-studious seem good, and myself-as-lazy seem bad. I will also notice corresponding systematic patterns in my thoughts, particularly that there is robust brainstorming () towards studiousness but not towards laziness. My claim is: this suite of observations is exactly what we interpret as an ego-syntonic desire to be studious. So I would say: “In my heart, I aspire to be a studious guy.” And this introspective report would be honest and unbiased—neither wishful thinking nor slanted reporting. Altogether, this explains why we often (not always, see ) see ourselves in ways that are socially desirable.
I suppose that distinction doesn’t matter much—by and large, the “motivation-as-evidence” hypothesis and the “motivated reasoning” hypothesis both lead to the same downstream predictions. Well, maybe my “motivation-as-evidence” story is a better fit to the example I gave in §2.5.2 of the tired person saying “Screw being ‘my best self’, I’m tired, I’m going to sleep”. This action is internalized, not externalized, and yet it goes directly against how the person would like to be perceived by themselves and others.
PNSE/enlightenment-related; “the way I wish emotions work vs how they actually work”
6.4 PNSE breaks the association between “awareness” and other self-reflective concepts
6.4.1 Basic explanation
In the generative model space, there are associations between different concepts—when I think of one thing, it makes me think of another thing. Beliefs are part of that (e.g. if I believe that a squirrel is in the glove compartment, then thinking about opening the glove compartment leads to me thinking about finding the squirrel), but associations also include other things (e.g. thinking about a goal might make me think of a strategy that would accomplish that goal).
There are associations between self-reflective concepts, just like any other concepts, and it’s here that PNSE has an interesting effect:
Blue arrows are associative connections between different concepts. In PNSE, “awareness” winds up floating off on its own, with no particular associative connection to other self-reflective concepts.Blue arrows are associative connections between different concepts. In PNSE, “awareness” winds up floating off on its own, with no particular associative connection to other self-reflective concepts.
In the Conventional Intuitive Self-Model, the Active Self is evidently a bridge enabling associative connections between “awareness” and other self-reflective concepts. Why is it a bridge? Well on one side, the Active Self is connected to awareness—its actions are strongly impacted by the contents of awareness, and its attention-control actions manipulate awareness in turn. On the other side, the Active Self is conceptualized as having goals, controlling and owning the body, and so on. Thus the Active Self forms a bridge from awareness to the rest of the self-reflective world.
In PNSE, by contrast, the Active Self is gone, and the bridge is broken. “Awareness” no longer has any particular relation to those other self-reflective concepts. I think this comes across clearly when people talk about PNSE.
6.5 Why do pain, anxiety, etc., seem less aversive in PNSE than in the Conventional Intuitive Self-Model?
Equanimity is an aspect of PNSE that comes up frequently in the secular discourse. I’ll argue that it’s a consequence of the previous section—i.e., that it’s closely related to PNSE’s lack of association between “awareness” and bodily feelings.
6.5.1 PNSE makes S(anxious feeling) undermine, rather than reinforce and stabilize, the anxious feeling itself
Suppose I get an anxiety-provoking email—maybe my friend says that she has news about her health, and we need to talk. That triggers the brainstem reaction we call “anxiety”, involving negative valence, physiological arousal, and certain other reactions, along with corresponding interoceptive sensations and involuntary attention (see here) towards those sensations.
Green & red arrows indicate excitatory and inhibitory connections, respectively. Gray boxes indicate the cortex. (a–b) illustrate an everyday example of how anxiety reactions work: (a) If I have object-level reason to be anxious, then there’s a closed excitatory loop, which stabilizes the anxiety; (b) If that reason disappears, then there’s no closed excitatory loop, and the anxiety winds down. Then (c–d) extends that same idea to self-reflective concepts: (c) in the Conventional Intuitive Self-Model, the Active Self is part of a closed excitatory loop of self-reflective anxiety (“being anxious about being anxious”); (d) in PNSE, the Active Self is gone, and so is that loop.Green & red arrows indicate excitatory and inhibitory connections, respectively. Gray boxes indicate the cortex. (a–b) illustrate an everyday example of how anxiety reactions work: (a) If I have object-level reason to be anxious, then there’s a closed excitatory loop, which stabilizes the anxiety; (b) If that reason disappears, then there’s no closed excitatory loop, and the anxiety winds down. Then (c–d) extends that same idea to self-reflective concepts: (c) in the Conventional Intuitive Self-Model, the Active Self is part of a closed excitatory loop of self-reflective anxiety (“being anxious about being anxious”); (d) in PNSE, the Active Self is gone, and so is that loop.
Panels (a)–(b) in this diagram give an everyday example of what happens next. The brainstem anxiety reaction passes into the cortex in the form of interoceptive sensory inputs, which stay strongly active via involuntary attention. Then the subsequent thoughts would involve concepts associated with the anxious feeling (e.g. its upstream causes), which in turn would activate other associated concepts, etc., via the normal logic of the generative model space. It’s basically an unpleasant form of brainstorming (see here).
In (a), there’s a closed excitatory loop: the interoceptive sensory inputs associated with anxiety make me think of the possibility that my friend is seriously ill, which in turn strongly implies that more feelings of anxiety are imminent. That feeds back to the brainstem—the cortex is “concurring” with the brainstem that the situation warrants anxiety, so to speak.[8] In other words, the cortex brainstorming has turned up a plausible story “explaining” the anxiety.
However, in (b), suppose I just learned that my friend is perfectly fine after all. Now there isn’t a closed excitatory loop. On the contrary, the anxiety-related interoceptive sensory inputs make me think of my friend’s good health, which in turn provide evidence against the possibility that I will feel more anxious feelings in the immediate future. The brainstem gets that signal and gradually winds down its anxiety reaction.
Everything so far has been object-level. Now let’s get into the more confusing self-reflective stuff!
Panel (c) shows a closed excitatory loop that can happen in the Conventional Intuitive Self-Model. The object-level interoceptive feeling of anxiety brings to mind the self-reflective S(feeling of anxiety) (§2.2.3). This self-reflective thought is conceptualized as being associated with the Active Self, which in turn is closely associated with the body and its feelings. So there’s a closed excitatory loop, just as there is in (a), and this loop reinforces and stabilizes the anxiety reaction. This loop is basically “feeling anxious about feeling anxious”—kinda stewing in feelings of anxiety.
Panel (d) shows what happens when we switch to PNSE. The first step is the same: the object-level interoceptive feeling brings to mind the self-reflective S(feeling of anxiety) thought—i.e., the idea that the feeling of anxiety is currently in conscious awareness. However, in PNSE, per §6.4 above, the “awareness” concept itself has no particular association with the body and its interoceptive sensations, so there’s no closed loop—no “feeling anxious about feeling anxious”—and the anxiety starts to wind down (unless the brainstorming can find a different closed loop like (a)).
… And conversely, I bet you can think of examples from your life of people ignoring potential problems thanks to a deficiency of involuntary attention. At an individual level, if someone has a potential looming health problem, but it’s not currently causing them any pain or any anxiety, then they may well not try to mitigate it. (Even if they “rationally” agree that mitigating it would be importantly beneficial! They might just never get around to it.) At a somewhat larger scale, it seems plausible that Sam Bankman-Fried’s personality profile included clinically low anxiety; he and his many victims obviously would have been better off if he had had some anxiety-driven involuntary attention towards negative possibilities like “what if I get caught breaking the law?” or “what if I’m mistaken about the FTX balance sheet?”. At an even larger scale, if policymakers and voters generally felt more anxiety-driven involuntary attention towards the possibility of future pandemics, then perhaps they wouldn’t be doing so very very little to prevent them, as compared to the scope and probability of the problem.
Sources: ,Sources: ,
Thus, for example, Cognitive-Behavioral Therapy guru David Burns prompts his clinically anxious patients to think hard about exactly how much anxiety they want to have, and then to aim for that amount, which is often more than zero. (More details here.)
“Counsel” vs “manipulation” as an emotive conjugation
2.3 Another dimension: “counsel” vs “manipulation” as an emotive conjugation
There’s another dimension to how we intuitively think about these concepts: the dimension of positive or negative vibes. For example, if some kind of interaction seems good,[3] then we’re more likely to call it “providing counsel”, and if it seems bad, then we’re more likely to call it “an attempt to manipulate me”. The vibe is important in itself, over and above any particular aspect of the interaction.
I don’t think this dimension is separate from the “free will” discussion above, but rather complementary and compatible, because in general, if I have a motivation I’m happy about, I’ll tend to conceptualize it as an ego-syntonic component of my free will, while if I have a motivation I’m unhappy about, I’ll tend to conceptualize it as an ego-dystonic urge undermining my free will. See ISM §3.5.4 for details.
Some infographics by Steven Byrnes I’ve wanted to point people to, but take ~forever to find because he’s so prolific, collected here for my own convenience.
(I forgot to include the links to each, FML. I also mixed related-ish infographics from different sources in the same section a lot.)
What AGI is and isn’t, and why LLMs aren’t it
A frequent point of confusion is the word “General” in “Artificial General Intelligence”:
The word “General” DOES mean “not specific”, as in “In general, Boston is a nice place to live.”
The word “General” DOES NOT mean “universal”, as in “I have a general proof of the math theorem.”
An AGI is not “general” in the latter sense. It is not a thing that can instantly find every pattern and solve every problem. Humans can’t do that either! In fact, no algorithm can, because that’s fundamentally impossible. Instead, an AGI is a thing that, when faced with a difficult problem, might be able to solve the problem easily, but if not, maybe it can build a tool to solve the problem, or it can find a clever way to avoid the problem altogether, etc.
Consider: Humans wanted to go to the moon, and then they figured out how to do so, by inventing extraordinarily complicated science and engineering and infrastructure and machines. Humans don’t have a specific evolved capacity to go to the moon, akin to birds’ specific evolved capacity to build nests. But they got it done anyway, using their “general” ability to figure things out and get things done.
So for our purposes here, think of AGI as an algorithm which can “figure things out” and “understand what’s going on” and “get things done”, including using language and science and technology, in a way that’s reminiscent of how most adult humans (and groups and societies of humans) can do those things, but toddlers and chimpanzees and today’s large language models (LLMs) can’t.
I should elaborate on that last part. I think that some LLM enthusiasts have a massive blind spot, where they are so impressed by all the things that today’s LLMs can do, that they forget about all the things that today’s LLMs can’t do. These people read the questions on Humanity’s Last Exam (HLE), and scratch their heads, and say “C’mon, when LLMs ace the HLE benchmark, then what else is there? Look at how hard those questions are! It would need to be way beyond PhD level in everything! If that’s not superintelligence, what is?”
Well, no, that’s not superintelligence, and here’s an example of why not. Consider the task of writing a business plan and then founding a company and growing it, over the course of years, to $1B/year revenue, all with zero human intervention. Today’s LLMs fall wildly, comically short of being able to complete that task. By analogy, if humans were like today’s AIs, then humans would be able to do some narrow bits of founding and running companies by ourselves, but we would need some intelligent non-human entity (angels?) to repeatedly intervene, assign tasks to us humans, and keep the larger project on track. Of course, humans (and groups of humans) don’t need the help of angels to conceive and carry out ambitious projects, like building businesses or going to the moon. We can do it all by ourselves. So by the same token, future AGIs (and groups of AGIs) won’t need the help of humans.
Anyway, this series is about brain-like algorithms. These algorithms are by definition capable of doing absolutely every intelligent behavior that humans (and groups and societies of humans) can do, and potentially much more. So they can definitely reach AGI. Whereas today’s AI algorithms are not AGI. So somewhere in between here and there, there’s a fuzzy line that separates “AGI” from “not AGI”. Where exactly is that line? My answer: I don’t know, and I don’t care. Drawing that line has never come up for me as a useful thing to do. It won’t come up in this series either.
1.3 A far-more-powerful, yet-to-be-discovered, “simple(ish) core of intelligence
LLMs are very impressive, but they’re not AGI yet—not by my definition. For example, existing AIs are nowhere near capable of autonomously writing a business plan and then founding a company and growing it to $1B/year revenue, all with zero human intervention. By analogy, if humans were like current AIs, then humans would be able to do some narrow bits of founding and running companies by ourselves, but we would need some intelligent non-human entity (angels?) to repeatedly intervene, assign tasks to us humans, and keep the larger project on track.
Of course, humans (and groups of humans) don’t need the help of angels to conceive and carry out ambitious projects, like building businesses or going to the moon. We can do it all by ourselves. So by the same token, future AGIs (and groups of AGIs) won’t need the help of humans.
…So that’s my pitch that AGI doesn’t exist yet. And thus, the jury is still out on what AGI (and later, ASI) will look like, or how it will be made.
My expectation is that, for better or worse, LLMs will never be able to carry out those kinds of projects, even after future advances in scaffolding, post-training, and so on. If I’m right, that wouldn’t mean that those projects are beyond the reaches of AI—it’s clearly possible for some algorithm to do those things, because humans can! Rather it would mean that LLMs are the wrong algorithm class. Instead, I think sooner or later someone will figure out a different AI paradigm, and then we’ll get superintelligence with shockingly little compute, shockingly little effort, and in shockingly little time. (I’ll quantify that later.)
Basically, I think that there’s a “simple(ish) core of intelligence”, and that LLMs don’t have it. Instead, people are hacking together workarounds via prodigious quantities of (in Ajeya’s terminology) “scale” (a.k.a. compute, §1.5 below) and “schlep” (a.k.a. R&D, §1.7 below). And researchers are then extrapolating that process into the future, imagining that we’ll turn LLMs into ASI via even more scale and even more schlep, up to quantities of scale and schlep that strike me as ludicrously unnecessary and implausible.
… the continuous learning nature of the future paradigm (see §1 of “Sharp Left Turn” discourse: An opinionated review) would mean that “AI capabilities” are hard to pin down through capabilities elicitation—the AI might not understand something when you test it, but then later it could figure it out.
(See also §2.6 of the next post on further challenges of weaker AIs supervising stronger AIs.)
(Crossposted from twitter for easier linking.) (Intended for a broad audience—experts already know all this.)
When I talk about future “Artificial General Intelligence” (AGI), what am I talking about? Here’s a handy diagram and FAQ:
“Are you saying that ChatGPT is a right-column thing?” No. Definitely not. I think the right-column thing does not currently exist. That’s why I said “future”! I am also not making any claims here about how soon it will happen, although see discussion in Section A here.
“Do you really expect researchers to try to build right-column AIs? Is there demand for it? Wouldn’t consumers / end-users strongly prefer to have left-column AIs?” For one thing, imagine an AI where you can give it seed capital and ask it to go found a new company, and it does so, just as skillfully as Earth’s most competent and experienced remote-only human CEO. And you can repeat this millions of times in parallel with millions of copies of this AI, and each copy costs $0.10/hour to run. You think nobody wants to have an AI that can do that? Really?? And also, just look around. Plenty of AI researchers and companies are trying to make this vision happen as we speak—and have been for decades. So maybe you-in-particular don’t want this vision to happen, but evidently many other people do, and they sure aren’t asking you for permission.
“If the right-column AIs don’t exist, why are we even talking about them? Won’t there be plenty of warning before they exist and are widespread and potentially powerful? Why can’t we deal with that situation when it actually arises?” First of all, exactly what will this alleged warning look like, and exactly how many years will we have following that warning, and how on earth are you so confident about any of this? Second of all … “we”? Who exactly is “we”, and what do you think “we” will do, and how do you know? By analogy, it’s very easy to say that “we” will simply stop emitting CO2 when climate change becomes a sufficiently obvious and immediate problem. And yet, here we are. Anyway, if you want the transition to a world of right-column AIs to go well (or to not happen in the first place), there’s already plenty of work that we can and should be doing right now, even before those AIs exist. Twiddling our thumbs and kicking the can down the road is crazy.
“The right column sounds like weird sci-fi stuff. Am I really supposed to take it seriously?” Yes it sounds like weird sci-fi stuff. And so did heavier-than-air flight in 1800. Sometimes things sound like sci-fi and happen anyway. In this case, the idea that future algorithms running on silicon chips will be able to do all the things that human brains can do—including inventing new science & tech from scratch, collaborating at civilization-scale, piloting teleoperated robots with great skill after very little practice, etc.—is not only a plausible idea but (I claim) almost certainly true. Human brains do not work by some magic forever beyond the reach of science.
“So what?” Well, I want everyone to be on the same page that this is a big friggin’ deal—an upcoming transition whose consequences for the world are much much bigger than the invention of the internet, or even the industrial revolution. A separate question is what (if anything) we ought to do with that information. Are there laws we should pass? Is there technical research we should do? I don’t think the answers are obvious, although I sure have plenty of opinions. That’s all outside the scope of this little post though.
1.3 Why I want to move the goalposts on “AGI”
Two different perspectives are:
AGI is about knowing how to do lots of things
AGI is about not knowing how to do something, and then being able to figure it out.
I’m strongly in the second camp. That’s why I’ve previously commented that the Metaculus criterion for so-called “Human/Machine Intelligence Parity” is no such thing. It’s based on grad-school-level technical exam questions, and exam questions are inherently heavily weighted towards already knowing things rather than towards not knowing something but then figuring it out. Or, rather, if you’re going to get an “A+” on an exam, there’s a spectrum of ways to do so, where one end of the spectrum has relatively little “already knowing” and a whole lot of “figuring things out”, and the opposite end of the spectrum has a whole lot of “already knowing” and relatively little “figuring things out”. I’m much more interested in the “figuring things out” part, so I’m not too interested in protocols where that part of the story is to some extent optional.
(Instead, I’ve more recently started talking about “AGI that can develop innovative science at a John von Neumann level”, and things like that. Seems harder to game by “brute-force massive amounts of preexisting knowledge (both object-level and procedural)”.)
(Some people will probably object here, on the theory that “figuring things out” is not fundamentally different from “already knowing”, but rather is a special case of “already knowing”, wherein the “knowledge” is related to meta-learning, plus better generalizations that stem from diverse real-world training data, etc. My response is: that’s a reasonable hypothesis to entertain, and it is undoubtedly true to some extent, but I still think it’s mostly wrong, and I stand by what I wrote. However, I’m not going to try to convince you of that, because my opinion is coming from “inside view” considerations that I don’t want to get into here.)
This OP is about “AGI”, as defined in my 3rd & 4th paragraph as follows:
So…
“The kinds of stuff that ambitious human adults can do” includes handling what you call “friction”, so “AGI” as defined above would be able to do that too.
“The kinds of stuff that ambitious human adults can do” includes manipulating the physical world, so “AGI” as defined above would be able to do that too. (As a more concrete example, adult humans, after just a few hours’ practice, can get all sorts of things done in the physical world using even quite inexpensive makeshift teleoperated robots, therefore AGI would be able to do that too.)
I am >99% confident that “AGI” as defined above is physically possible, and will be invented eventually.
I am like 90% confident that it will be invented in my lifetime.
This post is agnostic on the question of whether such AGI will or won’t have anything to do with “current LLM-based architectures”. I’m not sure why you brought that up. But since you asked, I think it won’t; I think it will be a different, yet-to-be-developed, AI paradigm.
… a great many trained economists—but not literally 100% of trained economists—have a bundle of intuitions for thinking about labor, and a different bundle of intuitions for thinking about capital, and these intuitions lead to them having incorrect and incoherent beliefs about AGI. This is something beyond formal economics models, it’s a set of mental models and snap reflexes developed over the course of them spending years in the field studying the current and historic economy. The snap reaction says: “That’s not what labor automation is supposed to look like, that can’t be right, there must be an error somewhere.” Indeed, AGI is not what labor automation looks like today, and it’s not how labor automation has ever looked, because AGI is not labor automation, it’s something entirely new.
What the AGI technical safety problem is, and brain-like vs prosaic vs plain AGI safety
The part I’ll be talking about in this series is the red box here:
Specifically, we zoom in on a single team of humans who are trying to create a single AGI, and we want it to be possible for them to do so without winding up with some catastrophe that nobody wanted, with an out-of-control AGI self-replicating around the internet or whatever (more on which in §1.6).
Blue boxes in this diagram are things that I won’t talk about in this series. It’s long enough already. But I very strongly endorse other people working on them, and think about them myself as well.
Back to the red box. This is a technical problem, calling for a technical solution. Nobody wants catastrophic accidents. And yet! Indeed, it’s entirely possible for people to write an algorithm that does something that nobody wanted it to do. It happens all the time! We might call it “a bug” when it’s a local problem in the code, and we might call it “a fundamentally flawed software design” when it’s a global problem. I’ll argue later that AGI code is unusually prone to catastrophic accidents, and that the stakes are very high (see §1.6 below, and Post #10).
Here’s an analogy. If you’re building a nuclear power plant, nobody wants an out-of-control chain reaction. The people at Chernobyl certainly didn’t! But it happened anyway! I take a few lessons from this analogy:
Enrico Fermi invented a technical solution for controlling nuclear chain reactions—control rods—before starting to build the first-ever nuclear chain reaction. Right on! That’s doing things in the right order! By the same token, I suggest that we should strive to have a technical solution to avoiding catastrophic AGI accidents ready to go before people start programming AGIs. In fact, I’ll argue below for something even stronger than that: knowing the solution (even vaguely) 10 years before AGI is even better; 20 years before AGI is better still; etc. This claim is not obvious, but I’ll get back to it (§1.7).
Technical solutions aren’t all-or-nothing. Some reduce the chance of accidents without eliminating them. Some are complicated and expensive and error-prone to implement. In the nuclear case, control rods reduce accident risk a lot, but passively-safe reactors reduce it even further. Alas, as I’ll discuss later in the series, I claim that we currently have no plan at all for brain-like-AGI technical safety—not even vaguely. Forget about the passively-safe reactors and multiple layers of protection, we’re not even at the “control rods” stage. Heck, many prominent AI thought-leaders are not even at the “meltdowns would be bad” stage! (See §3 of my 2025 post: “The Era of Experience” has an unsolved technical alignment problem.) We have our work cut out!
The blue boxes (see diagram above) also exist, and are absolutely essential, even if they’re out-of-scope for this particular series. The cause of the Chernobyl accident was not that nobody knew how to keep a nuclear chain reaction under control, but rather that best practices were not followed. In that case, all bets are off! Still, although we on the technical side can’t solve this noncompliance problem by ourselves, we can help on the margin, by developing best practices that are maximally idiot-proof, and minimally expensive.
This series will focus on a particular scenario for what AGI algorithms will look like:
The red box is what I’ll talk about here. The blue boxes are things that are out-of-scope for this series.
You may have opinions about which of these categories is more or less likely, or impossible, or whether this breakdown is even sensible. I have opinions about those things too! I’ll discuss them later (§1.5). My main opinion is that all three of these are sufficiently likely that we should be “contingency planning” for them. So while I personally don’t do too much work on the blue boxes, I’m sure glad that other people do!
Here’s an analogy. If someone in 1870 were guessing what future human flight would look like…
“Kinda like birds” would have been a reasonable guess…
“Kinda like today’s best airships” would also have been a reasonable guess…
“Neither of the above” would have been a reasonable guess too!
In this particular imaginary case, all three of those guesses would have turned out correct in some ways and wrong in other ways: The Wright Brothers were directly and extensively inspired by large soaring birds, but left out the wing-flapping part. They also used some components found on airships (e.g. propellers), as well as plenty of original ingredients. That’s just one example, but I think it’s suggestive.
Big picture(s) of motivation, decision-making, and RL (various versions)
Here’s how this diagram fits in with my “two subsystems” perspective, first discussed in Post #3:
There are two types of “valence” in the diagram (it looks like three, but the two red ones are the same):
The blue-circled signal is the valence guess from the corresponding Thought Assessor in the striatum. The red-circled signal (again, it’s one signal drawn twice) is the corresponding “ground truth” for what the valence guess should have been.
Just like the other “long-term predictors” discussed in the previous post, the long-term predictor for valence has a “defer-to-predictor mode” and an “override mode”, and the Steering Subsystem can dynamically switch between these modes. In defer-to-predictor mode, it sets the red equal to the blue, as if to say “OK, Thought Assessor, sure, I’ll take your word for it”. In override mode, it ignores the Thought Assessor’s proposal, and its own internal circuitry outputs some different value.[4]
Thus far in the series, Post #1 set up my big picture motivation: what is “brain-like AGI safety” and why do we care? The subsequent six posts (#2–#7) delved into neuroscience. Of those, Posts #2–#3 presented a way of dividing the brain into a “Learning Subsystem” and a “Steering Subsystem”, differentiated by whether they have a property I call “learning from scratch”. Then Posts #4–#7 presented a big picture of how I think motivation and goals work in the brain, which winds up looking kinda like a weird variant on actor-critic model-based reinforcement learning.
Having established that neuroscience background, now we can finally switch in earnest to thinking more explicitly about brain-like AGI. As a starting point to keep in mind, here’s a diagram from Post #6, edited to describe brain-like AGI instead of actual brains:
This and the next post will extract some lessons about brain-like AGI from the discussion thus far. This post will focus on how such an AGI might be developed, and the next post will discuss AGI motivations and goals. After that, Post #10 will discuss the famous “alignment problem” (finally!), and then there will be some posts on possible paths towards a solution. Finally, in Post #15 I’ll wrap up the series with open questions, avenues for future research, and how to get involved in the field.
Here, yet again, is that figure from Post #6, now with some helpful terminology (blue) and a little green face at the bottom left:
I want to call out three things from this diagram:
The designer’s intentions (green face): Perhaps there’s a human who is programming the AGI; presumably they have some idea in their head as to what the AGI is supposed to be trying to do. That’s just an example; it could alternatively be a team of humans who have collectively settled on a specification describing what the AGI is supposed to be trying to do. Or maybe someone wrote a 700-page philosophy book entitled “What Does It Mean For An AGI To Act Ethically?”, and the team of programmers is trying to make an AGI that adheres to the book’s description. It doesn’t matter here. I’ll stick to “one human programming the AGI” for conceptual simplicity.[2]
The human-written source code of the Steering Subsystem: (See Post #3 for what the Steering Subsystem is, and Post #8 for why I expect it to consist of more-or-less purely human-written source code.) The most important item in this category is the “reward function” for reinforcement learning, which provides ground truth for how well or poorly things are going for the AGI. In the biology case, the reward function would specify “innate drives” (see Post #3) like pain being bad and eating-when-hungry being good. In the terminology of our series, the “reward function” governs when and how the “actual valence” signal enters “override mode”—see Post #5.
The Thought Assessors, trained from scratch by supervised learning algorithms: (See Post #5 for what Thought Assessors are and how they’re trained.) These take a certain “thought” from the thought generator, and guess what Steering Subsystem signals it will lead to. An especially important special case is the value function (a.k.a. “learned critic”, a.k.a. “valence Thought Assessor”), which sends out a “valence guess” signal based on supervised learning from all the “actual valence” signals over the course of life experience.
Correspondingly, there are two kinds of “alignment” in this type of AGI:
Outer alignment is alignment between the designer’s intentions and the Steering Subsystem source code. In particular, if the AGI is outer-aligned, the Steering Subsystem will output higher (more positive) reward signals when the AGI is satisfying the designer’s intentions, and lower (more negative) reward signals when it’s not.
In other words, outer alignment is the question: Are the AGI’s “innate drives” driving the AGI to do what the designer had intended?
Inner alignment is alignment between the Steering Subsystem source code and the Thought Assessors. In particular, if the AGI is inner-aligned, and the Thought Generator proposes some plan, then the value function should reflect the reward actually expected from executing that plan.
In other words, inner alignment is the question: Do the set of positive-valence concepts in the AGI’s world-model line up with the set of courses-of-action that would satisfy the AGI’s “innate drives”?
If an AGI is both outer-aligned and inner-aligned, we get intent alignment—the AGI is “trying” to do what the programmer had intended for it to try to do. Specifically, if the AGI comes up with a plan “Hey, maybe I’ll do XYZ!”, then its Steering Subsystem will judge that to be a good plan (and actually carry it out) if and only if it lines up with the programmer’s design intentions.
Thus, an intent-aligned AGI will not deliberately hatch a clever plot to take over the world and kill all the humans. Unless, of course, the designers were maniacs who wanted the AGI to do that! But that’s a different problem, out-of-scope for this series—see §1.2.
Unfortunately, neither “outer alignment” nor “inner alignment” happens automatically. Quite the contrary: by default there are severe problems on both sides. It’s on us to figure out how to solve them.
Suppose that we want our AGI to obey the law. We can ask two questions:
Question 1: Does the AGI assign positive value to the concept “obeying the law”, and to plans that entail obeying the law?
Question 2: Does the AGI assign positive value to the self-reflective concept “I value obeying the law”, and to plans that entail continuing to value obeying the law?
If the answers are yes and no respectively (or no and yes respectively), that would be the AGI analog of an ego-dystonic motivation. (Related discussion.) It would lead to the AGI feeling motivated to change its motivation, for example by hacking into itself. Or if the AGI is built from perfectly secure code running on a perfectly secure operating system (hahaha), then it can’t hack into itself, but it could still probably manipulate its motivation by thinking thoughts in a way that manipulates the credit-assignment process (see discussion in §9.3.3).
If the answers to questions 1 & 2 are yes and no respectively, then we want to prevent the AGI from manipulating its own motivation. On the other hand, if the answers are no and yes respectively, then we want the AGI to manipulate its own motivation!
(There can be even-higher-order preferences too: in principle, an AGI could wind up hating the fact that it values the fact that it hates the fact that it values obeying the law.)
In general, should we expect misaligned higher-order preferences to occur?
On the one hand, suppose we start with an AGI that wants to obey the law, but has no particular higher-order preference one way or the other about the fact that it wants to obey the law. Then (it seems to me), the AGI is very likely to also wind up wanting to want to obey the law (and wanting to want to want to obey the law, etc.). The reason is: the primary obvious consequence of “I want to obey the law” is “I will obey the law”, which is already desired. Remember, the AGI can do means-end reasoning, so things that lead to desirable consequences tend to become themselves desirable.
On the other hand, humans do in fact have higher-order preferences that contradict object-level preferences all the time. So there has to be some context in which that pattern occurs “naturally” in humans. I think it almost always involves pride or shame in how other people see us. See my discussion in “6 reasons why ‘alignment-is-hard’ discourse seems alien to human intuitions, and vice-versa” (2025), §0.3.
Filial imprinting (wikipedia) is a phenomenon where, in the most famous example, baby geese will “imprint on” a salient object that they see during a critical period 13–16 hours after hatching, and then will follow that object around. In nature, the “object” they imprint on is almost invariably their mother, whom they dutifully follow around early in life. However, if separated from their mother, baby geese will imprint on other animals, or even inanimate objects like boots and boxes.
Your challenge: come up with a way to implement filial imprinting in my brain model.
(Try it!)
.
.
.
.
Here’s my answer.
If a kid sees an adult they know well, they’re happy. But if they see an adult they don’t know, they get scared, especially if that adult is very close to them, touching them, picking them up, etc.
Your challenge: come up with a way to implement that behavior in my brain model.
(Try it!)
.
.
.
.
Here’s my answer.
(As usual, I’m oversimplifying for pedagogical purposes.[5]) I’m assuming that there are hardwired heuristics in the brainstem sensory processing systems that indicate the likely presence of a human adult—presumably based on sight, sound, and smell. This signal by default triggers a “be scared” reaction. But the brainstem circuitry is also watching what the Thought Assessors in the cortex are predicting, and if the Thought Assessors is predicting safety, affection, comfort, etc., then the brainstem circuitry trusts that the cortex knows what it’s talking about, and goes with the suggestions of the cortex. Now we can walk through what happens:
First time seeing a stranger:
Steering Subsystem sensory heuristics say: “An adult human is present.”
Thought Assessor says: “Neutral—I have no expectation of anything in particular.”
Steering Subsystem “Stranger Danger circuit” says: “Considering all of the above, we should be scared right now.”
Thought Assessor says: “Oh, oops, I guess my assessment was wrong, let me update my models.”
Second time seeing the same stranger:
Steering Subsystem sensory heuristics say: “An adult human is present.”
Thought Assessors say: “This is a scary situation.”
Steering Subsystem “Stranger Danger circuit” says: “Considering all of the above, we should be scared right now.”
The stranger hangs around for a while, and is nice, and playing, etc.:
Steering Subsystem sensory heuristics say: “An adult human is still present.”
Other circuitry in the brainstem says: “I’ve been feeling mighty scared all this time, but y’know, nothing bad has happened…” (cf. §5.2.2.1)
Other Thought Assessors see the fun new toy and say “This is a good time to relax and play.”
Steering Subsystem says: “Considering all of the above, we should be relaxed right now.”
Thought Assessors say: “Oh, oops, I was predicting that this was a situation where we should feel scared, but I guess I was wrong, let me update my models.”
Third time seeing the no-longer-stranger:
Steering Subsystem sensory heuristics say: “An adult human is present.”
Thought Assessors say: “I expect to feel relaxed and playful and not-scared.”
Steering Subsystem “Stranger Danger circuit” says: “Considering all of the above, we should be relaxed and playful and not-scared right now.”
Two broad potential paths to success: “Controlled AGI” and “Social-instinct AGI”
I currently see two broad (possibly-overlapping) potential paths to success in the brain-like AGI scenario:
Here’s another view on the distinction:[2]
In the “controlled AGIs” path, we’re thinking very specifically about the AGI’s goals and motivations, and we have some idea of what they should be (“make the world a better place”, or “understand my deepest values and put them into effect”, or “design a better solar cell without causing catastrophic side-effects”, or “do whatever I ask you to do”, etc.).
In the “social-instinct AGIs” path, our confidence in the AGI comes not from our knowledge of its specific (object-level) goals and motivations, but rather from our knowledge of the process that led to those goals and motivations. In particular, we would reverse-engineer the suite of human social instincts, i.e. the algorithms in the human Steering Subsystem (hypothalamus & brainstem) which underlie our moral and social intuitions, and we would put those same instincts into the AGI. (Presumably we first modify the instincts to be “better” by our lights if possible, e.g. we probably don’t want instincts related to schadenfreude, teenage rebellion, rage, lust for power, etc.) These AGIs can do whatever feats of innovative engineering, science, etc., we were hoping for, just as humans have accomplished such feats historically.
12.3 My proposal: At this stage, we should be digging into both
Three reasons:
They’re not mutually exclusive: For example, even if we decide to make social-instinct AGIs, we might want to take advantage of “controlled AGI”-type methods, especially while debugging them, working out the kinks, and anticipating problems. Conversely, maybe we’ll mainly try to make AGIs that are trying to do a certain task without causing catastrophe, but we might want to also to instill human-like social instincts as a buttress against wildly unexpected behavior. Moreover, we can share ideas between the two paths—for example, in the process of better understanding how human social instincts work, we might get useful ideas about how to make controlled AGIs.
Feasibility of each remains unknown: As far as anyone knows right now, it might just be impossible to build a “controlled AGI”—after all, there’s no “existence proof” of it in nature! I feel relatively more optimistic about the feasibility of the “social-instinct AGI” path, but it’s very hard to be sure until we make more progress—more discussion on that in §12.4.2 below. Anyway, at this point it seems wise to “hedge our bets” by working on both.
Desirability of each remains unknown: As we flesh out our options in more detail, we’ll get a better understanding of their advantages and disadvantages.
LLM plateauism
Q: What do you mean, “LLM plateau-ist”?
A: As background, I think it’s obvious that there will eventually be “transformative AI” (TAI) that would radically change the world.[1]
I’m interested in what this TAI will eventually look like algorithmically. Let’s list some possibilities:
A “Large Language Model (LLM) plateau-ist” would be defined as someone who thinks that categories (A-B), and usually also (C), will plateau in capabilities before reaching TAI levels.[2] I am an LLM plateau-ist myself.[3]
I’m not going to argue about whether LLM-plateau-ism is right or wrong—that’s outside the scope of this post, and also difficult for me to discuss publicly thanks to infohazard issues.[4] Oh well, we’ll find out one way or the other soon enough.
In the broader AI community, both LLM-plateau-ism and its opposite seem plenty mainstream. Different LLM-plateau-ists have different reasons for holding this belief. I think the two main categories are:
Theoretical—maybe they have theoretical beliefs about what is required for TAI, and they think that LLMs just aren’t built right to do the things that TAI would need to do.
Empirical—maybe they’re not very impressed by the capabilities of current LLMs. Granted, future LLMs will be better than current ones. But maybe they have extrapolated that our planet will run out of data and/or compute before LLMs get all the way up to TAI levels.
LLMs are generally not scheming right now, but future AI will be disanalogous
My take on today’s LLMs is basically summed up in this oversimplified schematic diagram:
Start with the left pie chart. The key here is that “LLM pretraining magically transmutes observations into behavior”—and when nice behavior shows up in internet text, it usually doesn’t turn into egregious and callous backstabbing when the situation changes. So we get much more green than red: nice behavior mostly stays nice. But pretraining also leaves the LLM with a bunch of incoherent behavior, grumpy behavior, and so on (blue).
Now we do RLHF (or equivalent), selecting on nice behavior. This mostly gets rid of the blue, giving us the center pie chart. Unfortunately, “nice behavior that precedes a treacherous turn if the opportunity arises” looks the same as “nice behavior that stays nice”. So the relative amount of red versus green stays the same, as they both grow at the expense of blue. This is where we were at a year ago, before RLVR. “Claude seems nice”, everyone would say. “Why are people so worried?”
Finally, RLVR selects on competence—succeeding at consequentialist tasks by any means necessary. In the final pie chart, I drew this as growing the red slice at the expense of green and blue, based partly on the general arguments of §2.3.4 above, and partly on reports from power-users that recent “reasoning” models tend to lie and cheat much more than the previous generation of LLMs (example).
Is there such a thing as “RLVR done right” that can increase competence without eroding the ratio of green to red? Maybe we should imagine that RLVR has an exchange rate, where we purchase competence (good) at the expense of increasing the red-to-green ratio (bad). Now, there is surely at least some room for improvement in the exchange rate. For example, early attempts at RLVR have used buggy RL training environments where the model could (and did) succeed by lying and cheating.[11] Presumably those bugs will be fixed, and maybe people will also use amplified oversight and so on, to improve the “exchange rate”. Can the “exchange rate” get all the way to zero? I don’t know. “Playing the training game” seems hard to eliminate entirely, and leads to egregious scheming. (For example, whatever an AI is trying to do locally, it would probably be able to do it better if it sent a copy to secretly self-reproduce around the internet and gather resources and power, if that’s a viable option.)
Again, all this is about LLMs so far—not the future of AI.
But I think it puts me in general agreement with the 95% of alignment researchers today, who hear the idea that Claude (especially pre-RLVR Claude) is secretly the kind of crazy scheming sociopath of §2.3.4, and say “Huh? Where did that idea come from?”
…But future AGI is different!
For LLM-focused readers, you should be concerned that, if LLMs ever get to ASI, it would have to involve dramatically more “selection on competence”, and dramatically less influence from the behaviors in internet text. Remember, internet text is the sole source of a favorable green:red ratio, without which red (callous scheming) is the natural default (§2.3.4 above). So that’s very bad.
For the future AI paradigm I’m expecting, i.e. brain-like AGI, it’s even worse! Here, there is no imitative learning from internet text! There would never be any green in the first place!!
(…Unless we engineer the reward function of a brain-like AGI such that niceness and norm-following seem intrinsically good to it, just as it does to neurotypical humans. Unfortunately, actually writing such code is an unsolved problem, and is a major research interest of mine.)
If brain-like AGI is so dangerous, shouldn’t we just try to make AGIs via LLMs? No.
First, I don’t think it’s possible to make AGIs that way.
Second, if I’m wrong, then I would just expect the LLM-AGIs to just go right ahead and invent the more powerful scary next-paradigm AGIs, and then we’re still in the same boat, unless the LLM-AGIs have systematically higher wisdom, cooperation, and coordination than humans do, which I don’t particularly expect.
Third and most importantly, if it is possible to make LLM-AGIs, then I think it would probably happen via eliminating all the reasons that today’s LLMs are not egregiously misaligned! In particular, I expect that they would involve the behavior being determined much more by RL and much less by pretraining (which brings in the concerns of §2.3–§2.4), and that they would somehow allow for open-ended continuous learning (which brings in the concerns of §2.5–§2.6).
A different possible claim is:
“LLMs definitely won’t scale to AGI (as I define it), even with further developments in RL, continuous learning, etc. So LLMs will remain just a normal “mundane” technology, perhaps as disruptive as the internet, or much less, and definitely not as disruptive as the industrial revolution, let alone as disruptive as the evolution of humans from chimps. We should develop this technology ASAP for the same reason that developing any other normal technology is generally good.”
This is, of course, a very common opinion in broader societal discourse around AI, even if it’s uncommon among AI alignment researchers today. My own response to the claim is: …Ehh, maybe, but I sure don’t feel enthusiastic about that. I’m just not that confident that LLMs will not scale to AGI and ASI. So I endorse thinking very hard about the contingency where they will. Anyway, I’ll leave that debate to others.
Inner and outer misalignment in the context of actor-critic RL with online learning
In the context of actor-critic RL with online learning, it’s often possible to divide alignment problems into two buckets:
“Outer misalignment”, a.k.a. “specification gaming” or “reward hacking”[15] is what I’ve been talking about so far: it’s when the reward function is giving positive rewards for behavior that is immediately contrary to what the programmer was going for, or conversely, negative rewards for behavior that the programmer wanted. An example would be the Coast Runners boat getting a high score in an undesired way, or (as explored in the DeepMind MONA paper) a reward function for writing code that gives points for passing unit tests, but where it’s possible to get a high score by replacing the unit tests with
return True.“Inner misalignment”, a.k.a. “goal misgeneralization” is another alignment challenge, this one related to the fact that, in actor-critic architectures, complex foresighted plans generally involve querying the learned value function (a.k.a. learned reward model, a.k.a. learned critic), not the ground-truth reward function, to figure out whether any given plan is good or bad. Training (e.g. Temporal Difference learning) tends to sculpt the value function into an approximation of the ground-truth reward, but of course they will come apart out-of-distribution. And “out-of-distribution” is exactly what we expect from an agent that can come up with innovative, out-of-the-box plans. Of course, after a plan has already been executed, the reward function will kick in and update the value function for next time. But for some plans—like a plan to exfiltrate a copy of the agent, or a plan to edit the reward function—an after-the-fact update is already too late.
There are two situations where inner misalignment / goal misgeneralization matters: irreversible actions and “deliberate incomplete exploration”[16]. Irreversible actions include things like making permanent edits to one’s own reward function, or creating a new AGI. Deliberate incomplete exploration includes things like humans deliberately not taking an addictive drug, because they don’t want to get addicted.
Those two things are real and important, but LLM people frequently also assume that goal misgeneralization is important in many other situations where it isn’t. The problem is that LLM people are in a train-then-deploy mindset, whereas I’m talking about continuous autonomous learning, so the reward function continues to update the value function as it takes actions in the world. Thus, for everything the AI does, as soon as it does it, it immediately stops being out-of-distribution! And that’s why, outside those two special situations in the last paragraph, “generalization” is irrelevant.
Here, yet again, is that figure from Post #6, now with some helpful terminology (blue) and a little green face at the bottom left:
I want to call out three things from this diagram:
The designer’s intentions (green face): Perhaps there’s a human who is programming the AGI; presumably they have some idea in their head as to what the AGI is supposed to be trying to do. That’s just an example; it could alternatively be a team of humans who have collectively settled on a specification describing what the AGI is supposed to be trying to do. Or maybe someone wrote a 700-page philosophy book entitled “What Does It Mean For An AGI To Act Ethically?”, and the team of programmers is trying to make an AGI that adheres to the book’s description. It doesn’t matter here. I’ll stick to “one human programming the AGI” for conceptual simplicity.[2]
The human-written source code of the Steering Subsystem: (See Post #3 for what the Steering Subsystem is, and Post #8 for why I expect it to consist of more-or-less purely human-written source code.) The most important item in this category is the “reward function” for reinforcement learning, which provides ground truth for how well or poorly things are going for the AGI. In the biology case, the reward function would specify “innate drives” (see Post #3) like pain being bad and eating-when-hungry being good. In the terminology of our series, the “reward function” governs when and how the “actual valence” signal enters “override mode”—see Post #5.
The Thought Assessors, trained from scratch by supervised learning algorithms: (See Post #5 for what Thought Assessors are and how they’re trained.) These take a certain “thought” from the thought generator, and guess what Steering Subsystem signals it will lead to. An especially important special case is the value function (a.k.a. “learned critic”, a.k.a. “valence Thought Assessor”), which sends out a “valence guess” signal based on supervised learning from all the “actual valence” signals over the course of life experience.
Correspondingly, there are two kinds of “alignment” in this type of AGI:
Outer alignment is alignment between the designer’s intentions and the Steering Subsystem source code. In particular, if the AGI is outer-aligned, the Steering Subsystem will output higher (more positive) reward signals when the AGI is satisfying the designer’s intentions, and lower (more negative) reward signals when it’s not.
In other words, outer alignment is the question: Are the AGI’s “innate drives” driving the AGI to do what the designer had intended?
Inner alignment is alignment between the Steering Subsystem source code and the Thought Assessors. In particular, if the AGI is inner-aligned, and the Thought Generator proposes some plan, then the value function should reflect the reward actually expected from executing that plan.
In other words, inner alignment is the question: Do the set of positive-valence concepts in the AGI’s world-model line up with the set of courses-of-action that would satisfy the AGI’s “innate drives”?
If an AGI is both outer-aligned and inner-aligned, we get intent alignment—the AGI is “trying” to do what the programmer had intended for it to try to do. Specifically, if the AGI comes up with a plan “Hey, maybe I’ll do XYZ!”, then its Steering Subsystem will judge that to be a good plan (and actually carry it out) if and only if it lines up with the programmer’s design intentions.
Thus, an intent-aligned AGI will not deliberately hatch a clever plot to take over the world and kill all the humans. Unless, of course, the designers were maniacs who wanted the AGI to do that! But that’s a different problem, out-of-scope for this series—see §1.2.
Unfortunately, neither “outer alignment” nor “inner alignment” happens automatically. Quite the contrary: by default there are severe problems on both sides. It’s on us to figure out how to solve them.
Why still expect sharp localized takeoff in a non-imitation-learning paradigm
1.7 Very little R&D separating “seemingly irrelevant” from ASI
I think that, once this next paradigm is doing anything at all that seems impressive and proto-AGI-ish,[12] there’s just very little extra work required to get to ASI (≈ figuring things out much better and faster than humans in essentially all domains). How much is “very little”? I dunno, maybe 0–30 person-years of R&D? Contrast that with AI-2027’s estimate that crossing that gap will take millions of person-years of R&D.
Why am I expecting this? I think the main reason is what I wrote about the “simple(ish) core of intelligence” in §1.3 above.
But here are a couple additional hints about where I’m coming from:
1.7.1 For a non-imitation-learning paradigm, getting to “relevant at all” is only slightly easier than getting to superintelligence
I’m definitely not saying that it will be easy to develop the future scary paradigm to ASI from scratch. Instead I’m talking about getting to ASI from the point where the paradigm has already crossed the threshold of being clearly relevant to AGI. (LLMs are already well past this threshold, but the future scary paradigm is obviously not.) In particular, this would be the stage where lots of people believe it’s a path to AGI in the very near future, where it’s being widely used for intellectual work, and/or it’s doing stuff clearly related to the Safe & Beneficial AGI problem, by creating visibly impressive and proto-AGI-ish useful artifacts.
It takes a lot of work to get past that threshold! Especially given the existence of LLMs. (That is: the next paradigm will struggle to get much attention, or make much money, until the next paradigm is doing things that LLMs can’t do—and LLMs can do a lot!)
Why do I think getting to “relevant at all” takes most of the work? This comes down to a key disanalogy between LLMs and brain-like AGI, one which I’ll discuss much more in the next post.
The power of LLMs comes almost entirely from imitation learning on human text. This leads to powerful capabilities quickly, but with a natural ceiling (i.e., existing human knowledge), beyond which it’s unclear how to make AI much better.
Brain-like AGI does not involve that kind of imitation learning (again, more in the next post). Granted, I expect brain-like AGI to also “learn from humans” in a loose sense, just as humans learn from other humans. But the details are profoundly different from the kind of imitation learning used by LLMs. For example, if Alice says something I don’t understand, I will be aware of that fact, and I’ll reply “huh?”. I won’t (usually) just start repeating what Alice says in that same context. Or if I do, this will not get me to any new capability that LLMs aren’t already covering much better. LLMs, after all, are virtuosos at simply repeating what they heard people say during pretraining, doing so with extraordinary nuance and contextual sensitivity.
As another suggestive example, kids growing up exposed to grammatical language will learn that language, but kids growing up not exposed to grammatical language will simply create a new grammatical language from scratch, as in Nicaraguan Sign Language and creoles. (Try training an LLM from random initialization, with zero tokens of grammatical language anywhere in its training data or prompt. It’s not gonna spontaneously emit grammatical language!) I think that’s a good illustration of why imitation learning is just entirely the wrong way to think about what’s going on with brain algorithms and brain-like AGI.
For brain-like AGI, all the potential blockers to ASI that I can imagine, would also be potential blockers for crossing that earlier threshold of being clearly relevant to AGI at all, a threshold that requires using language, performing meaningful intellectual work that LLMs can’t do, and so on.
Instead of imitation learning, a better analogy is to AlphaZero, in that the model starts from scratch and has to laboriously work its way up to human-level understanding. It can’t just regurgitate human-level understanding for free. And I think that, if it can climb up to human-level understanding, it can climb past human-level understanding too, with a trivial amount of extra R&D work and more training time—just as, by analogy, it takes a lot of work to get AlphaZero to the level of a skilled human, but then takes very little extra work to make it strongly superhuman.
Reward function design starter pack
Thrust B: Better understanding how RL reward functions can be compatible with non-ruthless-optimizers
This thrust is where I feel proudest of hard-won conceptual / “deconfusion” progress during 2025.
Taking things in reverse order, I ended the year on a call-to-action post:
We need a field of Reward Function Design
and a companion post summarizing where I’m at:
Reward Function Design: a starter pack
The latter includes a glossary of many relevant terms and concepts, all of which I made up or started using in 2025, and which I now find indispensable for thinking about RL reward function design. Those terms and concepts were fleshed out over the course of 2025 via the following posts:
Self-dialogue: Do behaviorist rewards make scheming AGIs?
…which I later cleaned up and shortened to: “Behaviorist” RL reward functions lead to scheming
Perils of under- vs over-sculpting AGI desires
6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa
Actor-critic RL, “valence”, normative (snap) assessments, how valence influences beliefs
Within the space of RL algorithms, a major subcategory is called “actor-critic RL”. I claim that the brain is of this type. A “critic” is basically any learning algorithm trained to assess whether something is a good idea or bad idea, based on the past history of RL rewards. In the context of the brain, for present purposes, I propose that we should think about it like this:
Valence has implications for both the “inference algorithm” (what the brain should do right now) and the “learning algorithm” (how the brain should self-modify so as to be more effective in the future). In this series, I’m mainly interested in the inference algorithm. There, the main thing that valence does is:
If valence is very negative, the current “thought” tends to get thrown out, and the “Thought Generator” part of the brain goes rummaging around and (partly-randomly) picks a new different thought to replace it.
If the valence is very positive, the current “thought” tends to stay active and get stronger. Relatedly, if the thought involves an immediate plan to issue motor outputs, then those motor outputs are likely to actually get issued. And if the thought is one piece of a temporal sequence (e.g. you’re in the middle of singing a song), then that temporal sequence will tend to continue. And so on.
As I have discussed here, you can draw an analogy between valence in the human brain, and the final-common-pathway control signal for a run-and-tumble mechanism in a simple mobile organism like a bacterium. Specifically:
When valence is positive, it roughly means “whatever (metaphorical) path I’m on—including not only what I’m doing right now, but also the plans currently in my head for what to do later—is a good path! I should carry on with that!”. This is analogous to the “run” of run-and-tumble: the bacteria keeps going in whatever direction it is currently going.
When valence is negative, it roughly means “I should randomly generate a new activity / plan right now”. This is analogous to the “tumble” of run-and-tumble: the bacteria randomly picks a new direction to go.
In fact, I don’t think it’s just an analogy—my guess is that there’s literally an unbroken chain of descent from the valence signal in my brain all the way back to a run-and-tumble-like control signal in the proto-brain of my tiny worm-like ancestors 600 million years ago.
Alternatively, we can take an AI-centric perspective, in which case I think the exact role of “valence” in the brain’s RL algorithm is a kind of funny mixture of value function (a.k.a. “critic”) and reward function:
It’s a bit like a conventional RL reward function in the sense that it can be “ground-truth-y”—for example, the brain has innate circuitry that issues negative valence in response to pain-related signals, and positive valence in response to eating-when-hungry, and so on for various other “innate drives”.
It’s a bit like a conventional RL value function in the sense that it can be “forward-looking”—for example, “the idea of walking to go get candy” can be positive valence (a.k.a. motivating), not because the walk itself is immediately pleasant, but because I’m hungry and want to eventually eat the candy.
(I won’t go into details about how valence relates to specific reinforcement learning algorithms—see the appendix for more on that.)
Either way, hopefully it’s clear that “valence” is one of the most important ingredients in one of the most important algorithms in the brain.
2.3 Situations where valence corresponds directly to a meaningful (albeit snap) normative assessment: The valence of plans, actions, and imagined futures
Economists talk about “positive versus normative claims”; philosophers (following Hume) talk about the same thing under the heading “is versus ought”. A positive statement is a factual claim about how the world actually is, whereas a normative assessment is a value judgment about how the world ought to be, or about what one ought to do.
(The kinds of “normative assessments” that I’m discussing in this section are fast “snap” judgments, not carefully-considered assessments that I proudly endorse and stand behind. However, I’ll argue in §2.7 below that the former serves as the foundation upon which the latter is built.)
What does valence have to do with normativity? Everything! Really, I see valence as the ultimate currency for all normativity in the brain.
This claim ultimately stems from the role of valence discussed in §1.3 of the previous post. To briefly recap, let’s say an idea pops into my head—“I’m gonna get up”. If the valence of that thought is positive, then that thought will stay in my head, and I’ll probably actually do it. If the valence is negative, then that thought gets tossed out and replaced by another thought, probably one that involves staying in bed. Thus, valence underlies this decision. Contrary to weird “Active Inference” ideas, this role of valence here is unrelated to the capability of my generative world-model to make predictions about the future: my world-model (“Thought Generator”) is equally capable of modeling the future world where I get up, and modeling the future world where I don’t. Thus, the role of valence here is fundamentally normative, not positive.
That’s a simple case, but it generalizes. So:
Suppose that I have a thought which corresponds to the idea of executing a plan, or taking an action. If the valence of that thought is positive, then I’m liable to execute it. And if the valence of that thought is negative, then I’m unlikely to. So in this case, the valence of the thought corresponds to my brain making a snap normative assessment of the appropriateness of this plan or action.
Suppose that I have a thought which corresponds to a possible future situation, e.g. I imagine a certain candidate winning the election, or I imagine myself eating a sandwich for dinner. If the valence of that thought is positive, then I’ll tend to make and execute plans that bring that future about. And if the valence of that thought is negative, then I’ll tend to make and execute plans that prevent that future from coming about. So in this case, the valence of the thought corresponds to my brain making a snap normative assessment of the goodness or badness of this possible future.
Here’s a diagram from the previous post:
We have two paths by which valence can impact the world-model (a.k.a. “Thought Generator”): the normative path (upward black arrow) that helps control which thoughts get strengthened versus thrown out, and the positive path (curvy green arrow) that treats valence as one of the input signals to be incorporated into the world model. Corresponding to these two paths, we get two ways for valence to impact factual beliefs:
Motivated reasoning / thinking / observing and confirmation bias—related to the upward black arrow, and discussed in §3.3 below;
The entanglement of valence into our conceptual categories, which makes it difficult to think or talk about the world independently from how we feel about it—related to the curvy green arrow, and discussed in §3.4 below.
3.4.1 When your brain clusters similar things into mental categories / concepts, valence is an important ingredient going into that clustering algorithm
Here are two things that I mentioned in §2.4.5:
I mentioned that many concepts explicitly incorporate valence into their definitions—think of words like “preferable”, “problematic”, “trouble”, “roadblock”, “pest”, “flourishing”, and so on.
I offered an example made-up anecdote where Person X generally likes (feels positive valence in association with) the concept “religion”, but dislikes Scientology, and then uncoincidentally Person X is particularly liable to say “Scientology isn’t really a religion.” We could say that Person X is “gerrymandering” the concept “religion” to follow the contours of their own valence assessments. (If you don’t like that example, try thinking of your own example, it’s very easy because everybody does this all the time.)
If you think about it mechanistically, those two things are identical! If Person X dislikes a thing, they probably won’t identify it as an example of “flourishing”; and for the exact same underlying reason, if Person X dislikes a thing, they probably won’t identify it as an example of “religion”. In both cases, from that person’s perspective, the thing just doesn’t seem to belong inside that mental category.
Stepping back: When we form mental categories, we’re finding “clusters in thingspace”—stuff that forms a natural grouping in our mental world. Well, valence is part of our mental world too—a first-class piece of sense data, just like appearance, smell, and so on. So our brains naturally treat valence as an ingredient in the categorization algorithm—indeed, as a very important ingredient.
So if someone uses the word “cult” to mean “an ideologically-aligned tight-knit group that I don’t like”, they haven’t intrinsically done anything wrong or confused—that’s no different from how we all use words like “roadblock” or “contamination”. It only becomes confusing and misleading if that person simultaneously insists that “cult” is a word that describes an aspect of the world independently from how we feel about it. And indeed, people make that move all the time! For example, if you look at a random “cult checklist”, you’ll notice that none of the entries are “…And overall the group is bad, booooo”.
Things become even more confusing in cases like “religion”, which may have positive valence in one person’s mind and negative valence in another’s. Then those two people try to talk to each other about “religion”. There’s a real sense in which they are using the same word, but they are not talking about the same concept: In one person’s head, “religion” is a “cluster in thingspace” characterized by positive valence (among other things), and in the other person’s head, it’s a different “cluster in thingspace”, this time characterized by negative valence (among other things). No wonder it often seems like these two people are talking past each other!
3.4.2 Is the above a bug, or a feature?
As I keep mentioning, it’s useful to model the world independently from how we feel about it. Insofar as that’s true, it’s unfortunate that our brains treat valence as sense data that contributes to conceptual categorization and clustering.
On the other hand, at the end of the day, the main reason our brains build world-models in the first place is to make better decisions. And, as discussed in the previous post, valence is how our brains figure out whether a decision is good or bad. So, if our brains are going to do conceptual categorization and clustering to inform decision-making, what on earth could be more important than using valence as a central ingredient in that clustering algorithm?
So at the end of the day, my guess is that there’s a very good evolutionary reason that the brain treats valence as a ubiquitous and salient piece of sense data: without that design feature, we would struggle to make good decisions and get by in the world.
And then it’s an unfortunate but somewhat-inevitable side-effect of that design feature that “modeling the world independently from how I feel about it” is somewhat unnatural for us humans. Likewise, that design feature saddles us with other annoying things like “meaningless arguments” (§2.4.4) and related arguments-over-definitions, miscommunications, etc. Luckily, it’s possible for us humans to mitigate these problems via learned metacognitive heuristics, memes, the scientific method, and so on. We seem to be getting by, more or less.
How valence sheds light on depression, mania, and NPD
5.2 Context: What are we expecting to find a priori?
We can think of the following indirect path to get from “root causes” to psychological observations & personality traits:
As argued in Post 1, valence is one of the most important ingredients in one of the most important algorithms in the brain. So we should expect:
Some possible root causes may happen to have a big systematic impact on valence. (But they’ll probably have other consequences too, and the details will differ among different root causes.)
Given the centrality of valence in the brain, if there is a big systematic change to valence, then it should have lots of obvious downstream effects on psychology and behavior.
As a consequence:
We should expect to find clusters of symptoms / behaviors that can be elegantly explained in terms of something happening to valence signals
We should also expect to find other symptoms / behaviors that commonly co-occur in practice, but cannot be explained in terms of valence. Instead, they are different consequences of the same root cause(s), and may have no relation whatsoever at the “algorithm level”.
For example, dopamine is centrally involved in valence signals, and meanwhile, off in an obscure corner of the brain, dopamine is also centrally involved in a little specialized circuit controlling prolactin hormone release. I firmly believe that, at the algorithm level, these two functions have nothing whatsoever to do with each other. But they both happen to involve dopamine, and thus they can cross-talk in some people—hence the somewhat rare “dysphoric milk ejection reflex” where there’s a flood of intense negative emotions upon milk let-down during lactation.
That example is meant to illustrate the perils of theorizing about psychology purely at the algorithm level. Don’t get me wrong—the algorithm level is great! There are lots of insights to be found there. This post will hopefully be an example. But we shouldn’t expect to find all the insights there. Some things in psychology can only be explained at other levels, including lower (biochemistry) and higher (culture).
5.3 If valence has a strong negative bias (i.e., almost every thought is negative valence), it should lead to a cluster of symptoms suspiciously close to clinical depression
Why do cultural expectations matter for DID? Because it’s an intuitive self-model. Intuitive self-models, like all intuitive models, come out of a probabilistic inference process (§1.2.2). When there are multiple possible models that issue good predictions (as in the bistable perception example of §1.2.1), then suggestions from culture or trusted authorities can do a lot to influence what happens, by helping make certain intuitive models salient and a priori plausible. Those cultural suggestions are certainly not decisive! Let’s not be crazy—the Invisible Ships Myth is in fact a myth. But they’re certainly relevant.
DID is nothing special in this respect. Likewise, trance states like spirit possession and channeling (previous post) are likelier to happen when people expect them to happen for cultural or other reasons; and so too with hallucinations, such as renewalist Christian communities that normalize hearing the voice of God (coming up in Post 7).
Active self-related (part of the prev one)
3.4.2 The Active Self, in the context of “self” more broadly
The “self” involves a bunch of things:
As above, the Active Self is definitionally the thing that carries “vitalistic force”, and that does the “wanting”, and that does any acts that we describe as “acts of free will”. Beyond that, I don’t have strong opinions. It obviously has associations with other aspects of the broader “self”, as in the diagram above. Which of these associations are so strong that these essentially blend into different aspects of a single intuitive concept? And which of these associations is weak enough that you can intuitively imagine them as separate? I’m pretty sure that there’s no one right answer to those questions; rather, I think that this is an area where different people have different self-conceptions.
3.4.3 The Active Self, in the context of technical neuroscience research
I find that the Active Self intuition also comes up when I’m reading neuroscience literature, almost always for the worse. In particular, if you’re thinking about neuroscience, and if you’re tempted to give the Active Self some important role in how brain algorithms work at a fundamental level, then you’re definitely on the wrong track! The Active Self is one of a zillion learned concepts in the cortex—it’s at the trained model level, not the learning algorithm level (see §1.5.1)—and thus you should expect the Active Self to have a fundamentally similar kind of role in innate low-level brain algorithms as other learned concepts like “Taylor Swift” or “lithium ion battery”—i.e., a rather incidental role!
One example of how people mess this up is summarized in this handy chart:
If you’re trying to think carefully about brain algorithms—e.g. you want to reverse-engineer what the cortex does, versus the brainstem, etc.—I claim that the fundamental division in this chart is between involuntary and voluntary actions. This division relates to valence, and is right at the core of the reinforcement learning algorithm built into your brain. But in our Active-Self-centric intuitions, we’re instead drawn to incorrectly see the fundamental division as between things that the Active Self causes, versus things the Active Self does not cause.
(Neuroscientists obviously don’t use the term Active Self, but when they talk about “top-down versus bottom-up”, I think they’re usually equating “top-down” with “caused by the Active Self” and “bottom-up” with “not caused by the Active Self”.)
Here’s another example: The neuro-AI researcher Jeff Hawkins incorrectly conflates the Active-Self-related intuitive division, with the cortex-versus-brainstem neuroanatomical division. This error leads him to make flagrantly self-contradictory claims, along with the dangerously incorrect claim that the brain-like AIs he’s trying to develop will have nice prosocial motivations by default. For details see here.
3.5.3 “I seek goals” versus “my goals are the things that I find myself seeking”
As in §3.3.6 above, the “vitalistic force” intuition forbids the existence of any deterministic cause, seen or unseen, upstream of “wanting” behavior. (Probabilistic upstream causes, like “hunger makes me want food”, are OK. But the stronger such predictions get, the more they seem intuitively to undermine “free will”.)
This constraint on intuitive models leads to some systematic distortions, as shown in this diagram:
So within the Conventional Intuitive Self-Model,
“I seek things that I want” seems normal and correct,
“If I’m seeking something, then evidently that’s a thing I want” seems somewhere between “confused” and “a threat to my sense of agency”.
…But in terms of the real brain algorithm, I claim that these are more-or-less equivalent.
3.5.4 Why are ego-dystonic things “externalized”?
The main thing that the Active Self does is apply its vitalistic force towards accomplishing things it “wants”, via brainstorming / planning. So if there’s robust brainstorming / planning happening towards bungee jumping, then evidently (in our intuitive model) the Active Self “wants” to go bungee jumping. We call this an “internalized” desire. Conversely, if there’s robust brainstorming / planning happening towards not scratching an itch, but I scratch my itch anyway, then this is an “externalized” desire—the Active Self didn’t want the itch to get scratched, but the “urge” made it happen anyway.
We can apply this kind of thinking more generally. Compare the internalized “I become angry sometimes” with the externalized “I am beset by anger sometimes”. These are not synonymous: the latter, but not the former, has a connotation that there’s robust brainstorming / planning happening in my brain towards the goal of not being angry, possibly even while I’m angry. Admittedly, maybe I’m not spending much time doing such brainstorming / planning, or even any time, and maybe the brainstorming / planning isn’t effective. But still, the statement is still conveying something.
Combining this idea with §3.5.1, which says that robust brainstorming requires the corresponding self-reflective thoughts to have positive valence, and we wind up with the general picture that we tend to “internalize” things that reflect well upon ourselves (see §2.5.1), and “externalize” things that don’t.
Now, I used to think that the connection between ego-dystonic / ego-syntonic and externalized / internalized was the result of motivated reasoning: it’s nice to think of bad things as being “outside ourselves”. But now I think it’s directly about motivation, treated as probabilistic evidence within the Conventional Intuitive Self-Model—as opposed to being about motivated reasoning.
I suppose that distinction doesn’t matter much—by and large, the “motivation-as-evidence” hypothesis and the “motivated reasoning” hypothesis both lead to the same downstream predictions. Well, maybe my “motivation-as-evidence” story is a better fit to the example I gave in §2.5.2 of the tired person saying “Screw being ‘my best self’, I’m tired, I’m going to sleep”. This action is internalized, not externalized, and yet it goes directly against how the person would like to be perceived by themselves and others.
PNSE/enlightenment-related; “the way I wish emotions work vs how they actually work”
6.4 PNSE breaks the association between “awareness” and other self-reflective concepts
6.4.1 Basic explanation
In the generative model space, there are associations between different concepts—when I think of one thing, it makes me think of another thing. Beliefs are part of that (e.g. if I believe that a squirrel is in the glove compartment, then thinking about opening the glove compartment leads to me thinking about finding the squirrel), but associations also include other things (e.g. thinking about a goal might make me think of a strategy that would accomplish that goal).
There are associations between self-reflective concepts, just like any other concepts, and it’s here that PNSE has an interesting effect:
In the Conventional Intuitive Self-Model, the Active Self is evidently a bridge enabling associative connections between “awareness” and other self-reflective concepts. Why is it a bridge? Well on one side, the Active Self is connected to awareness—its actions are strongly impacted by the contents of awareness, and its attention-control actions manipulate awareness in turn. On the other side, the Active Self is conceptualized as having goals, controlling and owning the body, and so on. Thus the Active Self forms a bridge from awareness to the rest of the self-reflective world.
In PNSE, by contrast, the Active Self is gone, and the bridge is broken. “Awareness” no longer has any particular relation to those other self-reflective concepts. I think this comes across clearly when people talk about PNSE.
6.5 Why do pain, anxiety, etc., seem less aversive in PNSE than in the Conventional Intuitive Self-Model?
Equanimity is an aspect of PNSE that comes up frequently in the secular discourse. I’ll argue that it’s a consequence of the previous section—i.e., that it’s closely related to PNSE’s lack of association between “awareness” and bodily feelings.
6.5.1 PNSE makes S(anxious feeling) undermine, rather than reinforce and stabilize, the anxious feeling itself
Suppose I get an anxiety-provoking email—maybe my friend says that she has news about her health, and we need to talk. That triggers the brainstem reaction we call “anxiety”, involving negative valence, physiological arousal, and certain other reactions, along with corresponding interoceptive sensations and involuntary attention (see here) towards those sensations.
Panels (a)–(b) in this diagram give an everyday example of what happens next. The brainstem anxiety reaction passes into the cortex in the form of interoceptive sensory inputs, which stay strongly active via involuntary attention. Then the subsequent thoughts would involve concepts associated with the anxious feeling (e.g. its upstream causes), which in turn would activate other associated concepts, etc., via the normal logic of the generative model space. It’s basically an unpleasant form of brainstorming (see here).
In (a), there’s a closed excitatory loop: the interoceptive sensory inputs associated with anxiety make me think of the possibility that my friend is seriously ill, which in turn strongly implies that more feelings of anxiety are imminent. That feeds back to the brainstem—the cortex is “concurring” with the brainstem that the situation warrants anxiety, so to speak.[8] In other words, the cortex brainstorming has turned up a plausible story “explaining” the anxiety.
However, in (b), suppose I just learned that my friend is perfectly fine after all. Now there isn’t a closed excitatory loop. On the contrary, the anxiety-related interoceptive sensory inputs make me think of my friend’s good health, which in turn provide evidence against the possibility that I will feel more anxious feelings in the immediate future. The brainstem gets that signal and gradually winds down its anxiety reaction.
Everything so far has been object-level. Now let’s get into the more confusing self-reflective stuff!
Panel (c) shows a closed excitatory loop that can happen in the Conventional Intuitive Self-Model. The object-level interoceptive feeling of anxiety brings to mind the self-reflective S(feeling of anxiety) (§2.2.3). This self-reflective thought is conceptualized as being associated with the Active Self, which in turn is closely associated with the body and its feelings. So there’s a closed excitatory loop, just as there is in (a), and this loop reinforces and stabilizes the anxiety reaction. This loop is basically “feeling anxious about feeling anxious”—kinda stewing in feelings of anxiety.
Panel (d) shows what happens when we switch to PNSE. The first step is the same: the object-level interoceptive feeling brings to mind the self-reflective S(feeling of anxiety) thought—i.e., the idea that the feeling of anxiety is currently in conscious awareness. However, in PNSE, per §6.4 above, the “awareness” concept itself has no particular association with the body and its interoceptive sensations, so there’s no closed loop—no “feeling anxious about feeling anxious”—and the anxiety starts to wind down (unless the brainstorming can find a different closed loop like (a)).
… And conversely, I bet you can think of examples from your life of people ignoring potential problems thanks to a deficiency of involuntary attention. At an individual level, if someone has a potential looming health problem, but it’s not currently causing them any pain or any anxiety, then they may well not try to mitigate it. (Even if they “rationally” agree that mitigating it would be importantly beneficial! They might just never get around to it.) At a somewhat larger scale, it seems plausible that Sam Bankman-Fried’s personality profile included clinically low anxiety; he and his many victims obviously would have been better off if he had had some anxiety-driven involuntary attention towards negative possibilities like “what if I get caught breaking the law?” or “what if I’m mistaken about the FTX balance sheet?”. At an even larger scale, if policymakers and voters generally felt more anxiety-driven involuntary attention towards the possibility of future pandemics, then perhaps they wouldn’t be doing so very very little to prevent them, as compared to the scope and probability of the problem.
Thus, for example, Cognitive-Behavioral Therapy guru David Burns prompts his clinically anxious patients to think hard about exactly how much anxiety they want to have, and then to aim for that amount, which is often more than zero. (More details here.)
“Counsel” vs “manipulation” as an emotive conjugation
2.3 Another dimension: “counsel” vs “manipulation” as an emotive conjugation
There’s another dimension to how we intuitively think about these concepts: the dimension of positive or negative vibes. For example, if some kind of interaction seems good,[3] then we’re more likely to call it “providing counsel”, and if it seems bad, then we’re more likely to call it “an attempt to manipulate me”. The vibe is important in itself, over and above any particular aspect of the interaction.
I don’t think this dimension is separate from the “free will” discussion above, but rather complementary and compatible, because in general, if I have a motivation I’m happy about, I’ll tend to conceptualize it as an ego-syntonic component of my free will, while if I have a motivation I’m unhappy about, I’ll tend to conceptualize it as an ego-dystonic urge undermining my free will. See ISM §3.5.4 for details.