Neuroscience of human social instincts: a sketch
(For a PDF version of this post, go to: https://doi.org/10.5281/zenodo.17953592)
(Last update: May 2026. See changelog at the bottom.)
(If you’re in a hurry, you can just read the “Background and summary” section, and skip the other 85%.)
0. Background and summary
0.1 Background: What’s the problem and why should we care?
There’s a neuroscience problem which is centrally important for Artificial General Intelligence (AGI) safety, but which has had me stumped for as long as I’ve been in this field. Indeed, solving this problem is the main reason I got into neuroscience in the first place! In this post, I sketch an outline of a possible solution.[1]
What is this grand problem? As described in Intro to Brain-Like-AGI Safety, I believe the following:
We can divide the brain into a “Learning Subsystem” (cortex, striatum, amygdala, cerebellum, and a few other areas) that houses a bunch of randomly-initialized within-lifetime learning algorithms, and a “Steering Subsystem” (hypothalamus, brainstem, and a few other areas) that houses a bunch of specific, genetically-specified “business logic”. A major role of the Steering Subsystem is as the home for the brain’s “innate drives”, a.k.a. “primary rewards”, roughly equivalent to the reward function in reinforcement learning—things like eating-when-hungry being good (other things equal), pain being bad, and so on.
Some of those “innate drives” are related to human social instincts—a suite of reactions and drives that are upstream of things like compassion, friendship, love, spite, sense of fairness and justice, etc.
The grand problem is: how do those human social instincts work? Ideally, an answer to this problem would look like legible pseudocode that’s simultaneously compatible with behavioral observations (including everyday experience), with evolutionary considerations, and with a neuroscience-based story of how that pseudocode is actually implemented by neurons in the brain.[2]
Explaining how human social instincts work is tricky mainly because of the “symbol grounding problem”. In brief, everything we know—all the interlinked concepts that constitute our understanding of the world and ourselves—is created “from scratch” in the cortex by a learning algorithm, and thus winds up in the form of a zillion unlabeled data entries like “pattern 387294 implies pattern 579823 with confidence 0.184”, or whatever.[3] Yet certain activation states of these unlabeled entries—e.g., the activation state that encodes the fact that Jun just told me that Xiu thinks I’m cute—need to somehow trigger social instincts in the Steering Subsystem. So there must be some way that the brain can “ground” these unlabeled learned concepts. (See my earlier post Symbol Grounding and Human Social Instincts.)
A solution to this grand problem seems useful for Artificial General Intelligence (AGI) safety, since (for better or worse) someone someday might invent AGI that works by algorithms similar to the brain’s, and we’ll want to make those AGIs intrinsically care about people’s welfare. It would be a good jumping-off point to understand how humans wind up intrinsically caring about other people’s welfare sometimes. (Slightly longer version in §2.2 here; much longer version in Intro Series §12.)
0.2 Summary of the rest of the post
I’ll start by going through the four algorithmic ingredients we need for my hypothesis, one by one, in each case describing what it is algorithmically, why it’s useful evolutionarily, and where in the brain we might go looking to find the specific neurons that are running this (alleged) algorithm.
Here’s the roadmap:
Ingredient 1 is innate sensory heuristics in the Steering Subsystem (hypothalamus & brainstem)—previously discussed in Intro Series §3.2.1. An example would be some part of your brainstem that detects skittering spiders in your field-of-view.
Ingredient 1A is innate sensory heuristics for conspecific detection in particular. (Terminology note: “Conspecific” = “another member of the same species”.) This is a special case of Ingredient 1, but I think it’s an important and widespread special case. For example, humans have innate reactions to seeing humans (faces, gait, etc.), hearing human voices, and so on—just as mice have innate reactions to seeing and smelling other mice. I claim that these heuristics are combined to trigger a general “social attention reflex” in the Steering Subsystem.
Ingredient 2 is “short-term predictors”—previously discussed in Intro Series §5.4. These are supervised learning algorithms, mainly housed in the “extended striatum” (including amygdala), that search for connections between aspects of your rich understanding of the world (e.g. the learned concept “spider”) and Steering Subsystem reactions (e.g. feeling jittery). This allows generalization—for example, the “social attention reflex” can be triggered even when a conspecific is not standing right there.
Ingredient 3 is tailoring learned models via involuntary attention and learning rate. Basically, involuntary attention can sculpt large-scale information flows within the cortex, altering what the short-term predictors wind up learning. As an example, the orienting reflex to a skittering spider comes along with involuntary attention, which ensures that when your brainstem notices a spider, your cortex / “global workspace” / conscious attention jumps to the spider, and to things related to the spider, as opposed to continuing to daydream about Taylor Swift. That also enables what I call an “interoceptive concept finder”, a special kind of short-term predictor which helps label your unlabeled learned interoceptive concept space, via tracking correlations with ground-truth signals like physiological arousal.
Ingredient 4 is reading out transient empathetic simulations via a combination of all the above ingredients. Basically, the “social attention reflex” activates a transient involuntary lack of attention to your own raw interoceptive inputs. That clears the way for any “feeling”-related signal from the cortex at that moment to be interpreted (by the Steering Subsystem) as indicative of what that other person is feeling.
Then, I’ll go through an important (putative) example of social instincts built from these ingredients, which I call the “compassion / spite circuit”. This circuit leads to an innate drive to feel compassion towards people we like, and to feel spite and schadenfreude towards people we hate.
In an elegant twist, I claim that this very same “compassion / spite circuit” also leads to an innate “drive to feel liked / admired”—a drive that I hypothesized earlier and believe to be central to both status-seeking and norm-following. The trick in explaining how they’re related is:
“Drive for compassion” basically amounts to “I want Ahmed to feel pleasure”;
“Drive to feel liked / admired” basically amounts to “I want Ahmed to feel pleasure upon thinking about me”;
…and it turns out that, at the particular moments when the “compassion / spite circuit” gets strongly activated, Ahmed is very often thinking about me! An example would be if Ahmed and I are making eye contact while having a conversation.
Then I’ll go more briefly through some other possible social instincts, including a sketch of a possible “drive to feel feared” (whose existence I previously hypothesized here). For context, dual strategies theory talks about “prestige” and “dominance” as two forms of status; while the “drive to feel liked / admired” leads to prestige-seeking, the “drive to feel feared” correspondingly leads to dominance-seeking.
0.3 Confidence level
My confidence gradually decreases as you proceed through the article. The “Background” section above is rock-solid in my mind, as are Ingredients 1, 1A, and 2. Ingredients 3 and especially 4 are somewhat new to this post, but derive from ideas I’ve been playing around with for a year or two, and I feel pretty good about them. The specific putative examples of social instincts in §5–§7 are much newer and more speculative, and are oversimplified at best. But I’m optimistic that they’re on the right track, and that they’re at least a “foot in the door” towards future refinements.
0.4 Later work
UPDATE NOV. 2025: After you finish this post, see also my later follow-up posts Social drives 1: “Sympathy Reward”, from compassion to dehumanization & Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking, which further flesh out how my neuroscientific hypothesis (below) connects to everyday experiences and intuitions.
1. Ingredient 1: Innate sensory heuristics in the Steering Subsystem
The Steering Subsystem (brainstem and hypothalamus, more-or-less) takes sensory data, does innately-specified calculations on them, and uses the results to trigger innate reactions.
Think of things like seeing a slithering snake, or a skittering spider; smelling or tasting rotten food; male dogs smelling a female dog in heat; camouflaged animals recognizing the microenvironment where their bodies will blend in; and so on.
Note that these are all imperfect heuristics, anchored to innate circuitry, rather than developing along with our understanding of the world. We can call one such heuristic a venomous-spider-detector circuit, for example, noting that it evolved because venomous spiders were dangerous to early humans.[4] But if we do that, then we should acknowledge that it will have both false positives (e.g. centipedes, harmless spiders) and false negatives (e.g. funny-looking stationary venomous spiders), when compared to actual venomous spiders as we intelligently understand them. In vision especially, think of these heuristics as detecting relatively simple patterns of blobs and motion textures, as opposed to an “image classifier” / “video classifier” up to the standards of modern ML or human capabilities.
For more discussion of Ingredient 1, see Intro Series §3.2.1.
1.1 Ingredient 1A: Innate sensory heuristics for conspecific detection in particular
As a special case of Ingredient 1, I claim that, in pretty much all animals, there are sensory heuristics that are specifically designed by evolution to trigger on conspecifics. That would include one or more variations on: seeing a conspecific, hearing a conspecific, touching (or being touched by) a conspecific, smelling a conspecific, etc.
(I’m confident in this part because pretty much all animals have innate behaviors towards conspecifics that are different from their behaviors in other situations—mating, intermale aggression, parenting, being parented, herding, huddling, and so on.)
I claim that these all trigger a special Steering Subsystem innate behavior that I call “the social attention reflex”:
1.2 Neuroscience details
The sensory heuristics involve brainstem areas like the superior colliculus (for innate heuristic calculations on visual data), inferior colliculus (auditory data), gustatory nucleus of the medulla (taste data), and so on. (Again see Intro Series §3.2.1.)
In the case of visual sensory heuristics, I’m actually not 100% confident that these calculations are located in the superior colliculus proper; for all I know, they’re partly or entirely in the neighboring parabigeminal nucleus, or whatever. There are papers on this topic, but they can’t always be taken at face value—see for example me complaining about methodologies used in the literature here and here.
For the “social attention reflex”, it would be somewhere within the Steering Subsystem, but I don’t have any particular insight into exactly where. If I had to guess, I might guess that it’s one of the many little cell groups of the medial preoptic hypothalamus, since those often involve social interactions. If not that, then I’d guess it’s somewhere else in the hypothalamus, or (less likely) some other part of the Steering Subsystem.
If you want to experimentally find the cell group that orchestrates the “social attention reflex”, the conceptually-simplest method would be to first find one of the sensory heuristics for conspecific detection (e.g. the face detector), see what its efferent connections (downstream targets) are, and treat all those as top candidates to be studied one-by-one.
2. Ingredient 2: Generalization via short-term predictors
Ingredient 1 is a first step towards understanding, say, fear-of-spiders. But it’s not the whole story, because I don’t just get nervous when there is actually a large skittering spider in my field-of-view right now, but also when I imagine one, or when somebody tells me that there’s a spider behind me, etc. How does that work? The answer is: what I call the “short-term predictor”.
The “short-term predictor” is a learning algorithm that involves three ingredients—context, output, and supervisor. For definitions see this post; or in the ML supervised learning literature, you can substitute “context” = “trained model input”, “output” = “trained model output”, and “supervisor” = “label” (i.e., ground truth), which is subtracted from the trained model output to get an error that updates the model.[5]
The important points are that:
The short-term predictor will learn within your lifetime to associate otherwise-inscrutable world-model concepts—like the concept of “spider”, the word “spider”, the detailed visual appearance of spiders, the concept of “centipede”, etc.—with the physiological arousal brainstem reaction;
The “output” of the short-term predictor can itself trigger that brainstem reaction, in a kind of self-fulfilling prophecy that I call “defer-to-predictor mode” (see Intro Series §5).
Thus, this kind of story explains the fact that I viscerally react to learning that there’s a spider in my vicinity that I can’t immediately see or feel.
If we take the brainstem reaction and the short-term predictor together, they can function as what I call a long-term predictor; again, see Intro Series §5.
By the same token, the “social attention reflex” can trigger when I’m thinking of a conspecific, even if the conspecific is not standing right there, triggering my brainstem sensory heuristics right now.
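To make the context / output / supervisor story concrete, here is a minimal sketch of a short-term predictor as a linear supervised learner, plus a toy version of “defer-to-predictor mode”. Everything named here is an assumption for illustration, not something specified in the post: the class `ShortTermPredictor`, the function `steering_reaction`, the one-hot “concept” vectors, the delta-rule update, and the 0.5 threshold.

```python
import numpy as np


class ShortTermPredictor:
    """Toy linear supervised learner mapping context to a predicted supervisor signal."""

    def __init__(self, n_context, learning_rate=0.1):
        self.weights = np.zeros(n_context)
        self.learning_rate = learning_rate

    def predict(self, context):
        return float(self.weights @ context)

    def update(self, context, supervisor):
        # Error = output minus supervisor (ground truth); step the weights against it.
        error = self.predict(context) - supervisor
        self.weights -= self.learning_rate * error * context
        return error


def steering_reaction(innate_heuristic_fired, predictor_output, threshold=0.5):
    """The reaction fires on the innate heuristic, or on a confident prediction
    (a toy stand-in for "defer-to-predictor mode")."""
    return innate_heuristic_fired or predictor_output > threshold


# Hypothetical one-hot stand-ins for unlabeled learned concepts.
SPIDER = np.array([1.0, 0.0])
SWEATER = np.array([0.0, 1.0])

predictor = ShortTermPredictor(n_context=2)
for _ in range(50):
    predictor.update(SPIDER, supervisor=1.0)   # spider in view, arousal reaction fired
    predictor.update(SWEATER, supervisor=0.0)  # sweater in view, no reaction

# Later: merely thinking about a spider, with no spider in the field of view.
imagined_spider_reaction = steering_reaction(False, predictor.predict(SPIDER))
imagined_sweater_reaction = steering_reaction(False, predictor.predict(SWEATER))
```

In this toy version, the learned association alone is enough to trigger the reaction for the spider concept but not for the sweater concept, which is the generalization that Ingredient 2 is meant to provide.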
2.1 Neuroscience details
I think the short-term predictors that I’ll be talking about in this post are mostly centered around small clusters of medium spiny neurons somewhere in the amygdala, or the lateral septum, or the medial part of the nucleus accumbens shell. (I haven’t tried to pin them down in more detail than that. See Intro Series §5.5.4 for some more general neuroscience discussion of this topic.)
However, in some cases pyramidal neurons can play this short-term predictor role as well, such as in the cortex-like (basolateral) section of the amygdala, along with certain parts of cortex layer 5PT.
The supervisory signal (either ground truth or an error signal, I’m not sure) probably makes an intermediate stop (“relay”) at some little cluster of neurons on the fringes of the Ventral Tegmental Area (VTA), not shown in the diagram above, in which case the supervisory signal would ultimately arrive at the spiny neuron in the form of a dopamine signal. I think. (But there are also VTA GABA neurons that seem somehow related to these particular short-term predictors. I haven’t tried to make sense of that in detail.)
3. Ingredient 3: Tailoring learned models via involuntary attention and learning rate
3.1 Involuntary attention
Let’s talk more about what happens when you see a skittering spider out of the corner of your eye:
When the seeing-a-spider brainstem sensory heuristic triggers, I claim that one thing it does is trigger an “orienting reflex”. Part of that reflex involves moving the eyes, head, and body towards whatever triggered the heuristic. And another part of it involves involuntary attention towards the visual inputs in general, and the corresponding part of the field-of-view in particular.
The involuntary attention plays an important role in constraining what “thought” the cortex is thinking. If you’re daydreaming, imagining, remembering, etc., then your current “thought” has very little to do with current visual inputs. By contrast, involuntary attention towards vision forms a constraint that the thought must be “about” the visual inputs. It’s not completely constraining—the same thought can also contextualize those visual inputs by roping in presumed upstream causes, or expected consequences, or other associations, etc. But the visual inputs have to be a central part of the thought. In other words, you’re not only pointing your eyes at the spider, but you’re also actually thinking about the spider with your cortex (“global workspace”).
To be more specific about what’s going on, we need to be thinking about large-scale patterns of information flow within the cortex, as in the following toy example:
When you’re using visual imagination, your consciously-accessible visual areas of the cortex (e.g. the inferior temporal gyrus (IT)) are, in essence, disconnected from the immediate visual input. You can imagine Taylor Swift’s new dress while looking at a swamp. By contrast, when you’re paying attention to what you’re looking at, then there’s a consistency requirement: the visual models (i.e., generative models of visual data) in IT have to be consistent with the immediate visual input from your retina.
And my claim is that the Steering Subsystem has some control over this kind of large-scale information flow among different parts of the cortex, via its “involuntary attention”.
Incidentally, for this post, I’m less interested in vision than interoception, the “sense” of how we’re feeling. We can have a (generalized) “orienting reflex” towards interoceptive inputs just as we can towards visual inputs—an itchy bug bite will summon attention just as reliably as an unexpected noise will. So here’s the analogous diagram for the case of interoception, which we’ll expand on later:
3.1.1 Side note: Transient attentional gaps are more common, and harder to notice, than you realize
You might be wondering: Is it really true that, if I’m imagining Taylor Swift’s new dress, then my awareness is detached from immediate visual input? Don’t we continue to be aware of visual input even while imagining something else?
A few responses:
First, your cortex has lots of vision-related areas, and it’s possible for some visual areas to be yoked to immediate visual input while other visual areas are simultaneously yoked to episodic memory. I think this definitely happens to some extent.
Second, your attention can jump around between different things rather quickly, such that most people imagine themselves to have far more complete and continuous visual awareness than they actually do; see things like change blindness, or the selective attention test, or the fact that resolution and color perception are much poorer in peripheral vision than at the center of your field-of-view.
Third, the cortex tracks time-extended models, and accordingly has a general ability to pull up activation history from slightly (e.g. half a second) earlier, anywhere in the cortex. That makes it very hard to introspect upon exactly what you were or weren’t thinking at any given moment. For a much more detailed discussion of this point, with an example, see Intuitive Self-Models §2.3.
This is a general lesson, going beyond just vision: transient (fraction-of-a-second) attentional gaps and shifts are hard to notice, both as they happen and in hindsight. Don’t unthinkingly trust your intuitions on that topic. (I’ll be centrally relying on these transient attentional shifts in this post, so it’s important that you are thinking about them clearly.)
3.2 Combining attention with learning rate modulation
The Steering Subsystem can get an additional lever of control over some brain learning algorithm by adjusting its learning rate to different settings at different times, depending on the large-scale information flows in the cortex. This opens up a flexible design space that the genome exploits in a variety of ways.
As a worked example relevant to this post, let’s take the interoception diagram from §3.1 above, and add in a short-term predictor with learning rate modulation. And how exactly will its learning rate be modulated? However we want—it’s a design degree of freedom! But for this example, we’ll set the short-term predictor learning rate to zero unless you’re paying attention to actual interoceptive input. So here’s the newly-expanded diagram from above:
What’s the point of this setup? Well, it will transform this short-term predictor into what we might call an “interoceptive concept finder”, that can find and flag the idea of physiological arousal in your interoceptive concept space, more or less.[6]
Think of this setup as somewhat like “linear probes” in ML interpretability research: the short-term predictor simply finds correlations between the ground truth (actual physiological arousal in your Steering Subsystem) and your various unlabeled learned interoceptive concepts.
And then why do we need learning rate modulation? Because the correlations we’re looking for are only present when you’re paying attention to your own interoceptive inputs. If you’re not—e.g. if you’re reading a book and empathetically simulating the protagonist—the correlations get messed up. For example, if the book protagonist is feeling intense rage, then you might (transiently) experience actual anger yourself. But you also might not! And even if you do empathetically feel some anger on behalf of the protagonist, it would probably be more “mild anger” than “intense rage”. Either way, the active concepts in the cortex (based on the book text) would not match the actual state of your Steering Subsystem. (See Valence series §1.5.4–§1.5.5 for more on this point.) So during such times, it’s fine if this short-term predictor continues to be queried, but we don’t want it to be updated.
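Here is a toy simulation of why gating the learning rate matters, under loudly simplified assumptions of my own: a single “arousal concept” unit plus a bias term as context, a coin flip for whether attention is on your own interoceptive input, and a plain delta-rule update. The function name `train_concept_finder` and all of its parameters are hypothetical.

```python
import numpy as np


def train_concept_finder(gate_learning, n_steps=4000, base_lr=0.05, seed=0):
    """Train a linear probe from an interoceptive concept unit to actual arousal.

    If gate_learning is True, the learning rate is zero whenever attention is
    NOT on one's own interoceptive input (e.g. while absorbed in a book).
    """
    rng = np.random.default_rng(seed)
    weights = np.zeros(2)  # [weight on the arousal-concept unit, bias]
    for _ in range(n_steps):
        attending_to_self = rng.random() < 0.5
        actual_arousal = float(rng.random() < 0.5)  # ground truth (supervisor)
        if attending_to_self:
            # The concept unit tracks what the body is actually doing.
            concept_active = actual_arousal
        else:
            # The concept unit tracks the story, unrelated to the body's state.
            concept_active = float(rng.random() < 0.5)
        context = np.array([concept_active, 1.0])
        if gate_learning and not attending_to_self:
            lr = 0.0  # the predictor can still be queried, but is not updated
        else:
            lr = base_lr
        error = weights @ context - actual_arousal
        weights -= lr * error * context
    return weights


gated = train_concept_finder(gate_learning=True)
ungated = train_concept_finder(gate_learning=False)
```

In this sketch, the gated probe should put a weight close to 1 on the arousal-concept unit, whereas the ungated probe is trained partly on mismatched samples and should settle on a noticeably smaller weight. That is the sense in which the correlations “get messed up” without learning rate modulation.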
OK, so we can build an “interoceptive concept finder” by taking a short-term predictor and judiciously setting up its context data, learning rate modulation, temporal delay setting, and so on. Then what? Is an “interoceptive concept finder” setup the best way to build a short-term predictor for physiological arousal? …Wrong question! We don’t have to pick just one “best” short-term predictor for physiological arousal! The brain can have more than one short-term predictor for the same signal. They can be complementary. For example, I don’t think an “interoceptive concept finder” for physiological arousal would be the most effective way to react quickly and preemptively to dangerous situations—for that, you’d want a predictor that listens directly to exteroceptive inputs like vision and sound. But on the other hand, an “interoceptive concept finder” is probably helpful for planning, since it can tell the Steering Subsystem about the feelings that might result from a possible future plan (see “the interface problem” in Intro series §6.2.2).
Anyway, it turns out that our “interoceptive concept finder” is exactly what we need for our social instincts story. Let’s keep going:
3.3 Neuroscience details
For involuntary attention: There are probably multiple pathways working in conjunction. Probably cholinergic and/or adrenergic neurons are involved. More specifically, cholinergic projections to the cortex are probably part of this story, and so are the cholinergic projections to thalamic relay cells. I don’t know the details.
For adjusting learning rate: There are a bunch of ways this could work. If there’s an error signal coming from the Steering Subsystem (hypothalamus or brainstem) to a short-term predictor, it could be set to zero, and then there’s no learning. Or maybe there’s a separate signal for learning rate (maybe acetylcholine again?) coming from the Steering Subsystem, which could be turned off instead. There could also be some more indirect effect of lack-of-attention on the cortex side—like maybe the cortex representations are less active when they’re further removed from sensory input, and that indirectly reduces learning rate, or something. I don’t know.
Two short-term predictors for the same thing: I mentioned that for physiological arousal and similar innate state variables, I think there are (at least) two different short-term predictors of that same ground truth, one using exteroception-related data as context, the other (i.e. the “interoceptive concept finder”) using interoception-related data as context. My guess is that the former is in the amygdala. The latter is maybe somewhere in the medial prefrontal or cingulate cortex (or insula … or precuneus … or NAc medial shell … I really don’t know). (Clarification for the latter: I think most of the short-term predictors are medium spiny neurons in the “extended striatum”, and have been labeling my diagrams accordingly. But as I mentioned in §2.1 above, I do think there are places where pyramidal neurons play a short-term predictor role too, including in layer 5PT of certain parts of the cortex.)
4. Ingredient 4: Reading out transient empathetic simulations
If we apply the same kind of reasoning as above, it suggests a path to solving the symbol-grounding problem for somebody else’s feelings. A key ingredient we need is “involuntary LACK of attention towards interoceptive inputs”, triggered by the “social attention reflex” of Ingredient 1A—the right side of this diagram:
What is this “lack of attention” supposed to accomplish? Here’s a schematic diagram illustrating the flows of information / attention / constraints in a normal situation (left) and in a situation where one of the Ingredient 1A conspecific detection heuristics has just fired (right):
The involuntary lack of attention transiently disconnects the interoceptive models from what I’m feeling right now. Instead, the space of interoceptive models in the cortex will settle into whatever is most consistent with what’s happening in the visual, semantic, and other areas of the cortex (a.k.a. “global workspace”). And thanks to the orienting reflex, those other areas of the cortex are modeling Zoe.
And therefore, if any interoceptive models are active, they’re ones that have some semantic association with Zoe. Or more simply: they’re how Zoe seems to be feeling, from my perspective.
We’re almost there! I’ll pull out the right half of that figure, and attach an “interoceptive concept finder” (§3.2 above), and a gate that only opens precisely when the social attention reflex is active:
And bam, we have solved the symbol grounding problem for other people’s feelings! The signal at the bottom should occasionally be allowed through the gate, and when it is, it will carry information about how a different person seems to be feeling.
(I showed the example of physiological arousal, but the same logic applies to “being happy”, “being angry”, “being in pain”, etc.)
This step is built on the kind of “transient empathetic simulation” that I’ve discussed previously: the “interoceptive concept finder” short-term predictor is trained by supervised learning on instances of myself feeling physiological arousal, but right now it’s being triggered by thinking about someone else feeling physiological arousal.
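The gating logic of Ingredient 4 can be sketched in a few lines. This is only an illustration under my own simplifying assumptions: the `Moment` record, its field names, and the identity-function stand-in for a trained “interoceptive concept finder” are all hypothetical, and the real information flow would be far messier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Moment:
    own_arousal: float               # what my own body is actually doing
    arousal_implied_by_scene: float  # what my model of the scene (e.g. Zoe) implies
    social_attention_reflex: bool    # did a conspecific-detection heuristic just fire?


def interoceptive_model_state(m: Moment) -> float:
    # With the reflex active, attention is pulled away from my own interoceptive
    # input, so the interoceptive models settle on whatever fits the rest of cortex.
    if m.social_attention_reflex:
        return m.arousal_implied_by_scene
    return m.own_arousal


def concept_finder(interoceptive_state: float) -> float:
    # Stand-in for a short-term predictor already trained on my own arousal;
    # assumed here to be the identity for simplicity.
    return interoceptive_state


def gated_readout(m: Moment) -> Optional[float]:
    """Return the "conspecific seems to be feeling arousal" signal,
    or None while the gate is closed."""
    if not m.social_attention_reflex:
        return None
    return concept_finder(interoceptive_model_state(m))


# I am calm, Zoe looks stressed, and the reflex has just fired.
seeing_zoe = gated_readout(Moment(0.1, 0.9, True))
# Same scene a moment later, reflex inactive: nothing passes the gate.
not_attending = gated_readout(Moment(0.1, 0.9, False))
```

The point of the sketch is that the same predictor, trained only on my own feelings, yields a signal about someone else’s apparent feelings purely because of when it is read out.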
4.1 So, the “social attention reflex” is also a “this is an empathetic simulation” flag?
Well, kinda. But with some caveats.
The sense in which this is true is: both the interoceptive model space and the associated short-term predictors are trained in a circumstance where they relate exclusively to my own interoceptive inputs, but then they’re sometimes queried in a circumstance where they relate to someone else’s interoceptive inputs.
But in other senses, calling it an “empathetic simulation” flag might be a bit misleading.
First, it would be a transient empathetic simulation, lasting a fraction of a second, which is rather different from how we normally use the term “empathy”—more on that in Intro Series §13.5.2.
Arguably, even “transient empathetic simulation” is an overstatement—it’s just some learned semantic association between what I’m seeing and some feeling-related concept. The concept of Zoe seems to somehow imply the concept of stress, within my world-model. That’s all. I don’t really need to be “taking her perspective”, nor to be feeling Zoe’s simulated stress in Zoe’s simulated loins, or whatever.
Second, this reflex is exclusively related to empathetic simulations of what someone is feeling[7]—not empathetic simulations of what they’re thinking, seeing, etc. For example, if I’m curious whether Zoe can see the moon from where she’s standing, then I would do a quick empathetic simulation of what Zoe is seeing. The “social attention reflex” is not particularly related to that; indeed, if anything, this reflex is probably anticorrelated with that, since it innately activates in situations where orienting reflexes are pulling attention to our own exteroceptive sensory inputs.
Thus, my framework implies that social instincts can only involve reacting to someone’s (assumed) feelings. It cannot (directly) involve reacting to what someone is seeing, or thinking, etc. I think that claim rings true to everyday experience.
And there’s actually a deeper reason to believe that claim. If I take Zoe’s visual perspective and imagine that she’s looking at a saxophone, then my Steering Subsystem can’t do anything with that information. The Steering Subsystem doesn’t understand saxophones, or anything else about our big complicated world. But it does know the “meaning” of its suite of innate physiological state variables and signals—physiological arousal, body temperature, goosebumps, and so on. See my discussion of “the interface problem” in Intro Series §6.2.2.
Third, as mentioned above, only a subset of short-term predictors (those set up as “interoceptive concept finders”) will output transient empathetic simulation data during a social attention reflex. Other short-term predictors will not.
4.2 Neuroscience details
Involuntary lack-of-attention signal: Well, absence-of-attention might just involve suppressing presence-of-attention pathways, like the ones I mentioned under Ingredient 3 above (possibly involving acetylcholine). Or it might be a different system that pushes in the opposite direction—maybe involving serotonin? Or (more likely) multiple complementary signals that work in different ways. I don’t have any strong opinions here.
5. Hypothesis: a “compassion / spite circuit”
Everything so far was preliminaries—now we can start speculating about real social instincts! My main example is a possible innate drive circuit that would be upstream of compassion and spite. Start with another Steering Subsystem signal:
5.1 The “Conspecific seems to be feeling (dis)pleasure” signal
The first step is to get a “conspecific seems to be feeling pleasure / displeasure”[8] signal in the Steering Subsystem, as follows:
The purple box is yet another Steering Subsystem signal that I’m labeling “pleasure / displeasure”. This is closely related to valence—for details see Valence Series §A. Then the gray box would be an intermediate variable[9] in the Steering Subsystem which would, by design, track the extent to which I think of the conspecific as feeling pleasure / displeasure.
That was just the start. Next, how do we build a social instinct out of the gray “conspecific seems to be feeling pleasure / displeasure” box? We need another Steering Subsystem parameter!
5.2 The “friend (+) vs enemy (–)” parameter
Let me introduce another Steering Subsystem parameter called “friend (+) vs enemy (–)”. When this parameter is extremely negative, it indicates that whatever you’re thinking about (in this case, the conspecific) should be physically attacked, right now. If it’s mildly negative, then you probably won’t go that far, but you’ll still feel like they’re the enemy and you hate them. If it’s positive, you’ll feel “on the same team” as them.
Anyway, when the “friend (+) vs enemy (–)” parameter is positive, then “conspecific seems to be feeling pleasure / displeasure” causes positive / negative valence respectively. This innate drive would lead to compassion—we feel intrinsically motivated by the idea that the conspecific is feeling pleasure, and intrinsically demotivated by the idea that the conspecific is feeling displeasure.
…And if the “friend (+) vs enemy (–)” parameter is negative, we flip the sign: “conspecific seems to be feeling pleasure / displeasure” causes negative / positive valence respectively. This innate drive would lead to both spite and schadenfreude.
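The sign-flip logic in the last two paragraphs can be written as a toy Python sketch. The function name, the scalar encoding of both signals, and the use of a plain product are illustrative assumptions, not a claim about how neurons compute this.

```python
def compassion_spite_valence(friend_enemy: float, conspecific_pleasure: float) -> float:
    """Toy valence contribution of the hypothesized compassion / spite circuit.

    friend_enemy: > 0 means "friend", < 0 means "enemy" (assumed scalar encoding).
    conspecific_pleasure: > 0 means the conspecific seems to feel pleasure,
        < 0 means the conspecific seems to feel displeasure.
    """
    # Multiplying the two signed signals reproduces the sign flip:
    # friend x pleasure -> positive valence, enemy x pleasure -> negative valence.
    return friend_enemy * conspecific_pleasure


# The four qualitative cases described in the text.
cases = {
    "friend feels pleasure": compassion_spite_valence(+1.0, +0.5),
    "friend feels displeasure": compassion_spite_valence(+1.0, -0.5),
    "enemy feels pleasure": compassion_spite_valence(-1.0, +0.5),
    "enemy feels displeasure": compassion_spite_valence(-1.0, -0.5),
}
for label, valence in cases.items():
    print(label, "->", "positive" if valence > 0 else "negative", "valence")
```

A product is only the simplest function with the right signs; any function that is increasing in pleasure for friends and decreasing for enemies would fit the verbal description equally well.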
How is the “friend (+) vs enemy (–)” parameter itself calculated? By other social instincts outside the scope of this post—more on that in §7 below. Perhaps part of it is a different circuit that says: if thinking about a conspecific co-occurs with positive valence (i.e., if we like / admire them), then that probably shifts the friend/enemy parameter a bit more towards friend, and perhaps also conversely with negative valence. That’s not circular, because conspecifics can acquire positive or negative valence for all kinds of reasons, just like sweaters or computers or anything else can acquire positive or negative valence for all kinds of reasons, including non-social dynamics like if I’m hungry and the conspecific gives me yummy food. That’s a robust and flexible system that will leverage my rich understanding of the world to systematically assign “friend” status to conspecifics who lead to good things happening for me. That’s probably just one factor among many; I imagine that there are lots of innate circuits that can impact friend / enemy status in various circumstances. Of course, as usual, the friend / enemy parameter would be attached to one or more short-term predictors, enabling memory, generalization, and perhaps also transient empathetic simulations.
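The “one factor among many” mentioned above, where valence that co-occurs with thinking about a conspecific nudges the friend / enemy parameter, could look something like the following sketch. The update rule, the `rate` value, and the clamping range of [-1, 1] are all assumptions made for illustration.

```python
def update_friend_enemy(friend_enemy: float, valence: float, rate: float = 0.1) -> float:
    """Nudge the friend (+) vs enemy (-) parameter by the valence that
    co-occurs with thinking about the conspecific (hypothetical rule)."""
    updated = friend_enemy + rate * valence
    # Keep the parameter in an assumed bounded range of [-1, 1].
    return max(-1.0, min(1.0, updated))


# Example: a conspecific who repeatedly gives me yummy food while I'm hungry.
status = 0.0
for _ in range(5):
    status = update_friend_enemy(status, valence=+1.0)
print("friend/enemy status after five good experiences:", status)
```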
5.2.1 Evolution and zoological context
Evolutionary and zoological context box
Pretty much every complex social animal has innate, stereotyped behaviors for both helping and hurting conspecifics in different circumstances—e.g. attack behaviors, and companionship-type behaviors such as within families.
And evolutionarily, if it makes sense to help or hurt conspecifics through innate, stereotyped behaviors, then presumably it also makes sense to help or hurt conspecifics through the more powerful and flexible pathways that leverage within-lifetime learning, as would happen through a “compassion / spite circuit”. (See (Appetitive, Consummatory) ≈ (RL, reflex).)
Indeed, even in rodents, I think there’s clear evidence of more flexible, goal-oriented behaviors to (selectively) help conspecifics. For example, Márquez et al. 2015 find that rats help conspecifics via choice of arm in a T-shaped maze. And Bartal et al. 2014 find that rats release conspecifics from restraints, but only in situations where they feel friendly towards the conspecific. (See also: Kettler et al. 2021.) I don’t think either of these needs to be explained with my proposed “compassion / spite circuit” above involving transient empathetic simulation; for example, maybe rats squeak in a certain way when they’re happy, and hearing another rat make a happy squeak triggers a primary reward, or whatever. But anyway, as far as I can tell at a glance, the “compassion / spite circuit” is at least plausibly present even in rodents.
…Or maybe it’s just a “compassion” circuit for rodents. I can’t immediately find any evidence either way on whether rats display flexible, goal-oriented spite-type behavior towards other rats they hate. (They undoubtedly have inflexible, stereotyped, threat and attack postures and behaviors, but that’s different—again see (Appetitive, Consummatory) ≈ (RL, reflex).) Let me know if you’ve seen otherwise!
5.2.2 Neuroscience details
Neuroscience details box
I expect that friend-vs-enemy is two groups of neurons that are mutually inhibitory, as opposed to one that swings positive and negative compared to baseline. That’s how the hypothalamus handles hungry-vs-full, for example (see here). As for where those neuron groups are, I don’t know. Probably medial hypothalamus somewhere.
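For readers unfamiliar with the “two mutually inhibitory groups” motif, here is a minimal rate-model sketch of it. The equations, the inhibition strength, and the input values are generic textbook-style assumptions chosen for illustration; nothing here is fitted to hypothalamus data.

```python
def relu(x: float) -> float:
    """Firing rates cannot go below zero."""
    return x if x > 0.0 else 0.0


def settle(friend_drive: float, enemy_drive: float,
           inhibition: float = 2.0, dt: float = 0.1, steps: int = 200):
    """Euler-integrate two leaky populations that inhibit each other.

    Each population relaxes toward its input drive minus inhibition
    from the other population. All parameters are hypothetical.
    """
    friend, enemy = 0.0, 0.0
    for _ in range(steps):
        d_friend = -friend + relu(friend_drive - inhibition * enemy)
        d_enemy = -enemy + relu(enemy_drive - inhibition * friend)
        friend += dt * d_friend
        enemy += dt * d_enemy
    return friend, enemy


# A slightly stronger "friend" input ends up silencing the "enemy" group.
friend, enemy = settle(friend_drive=1.0, enemy_drive=0.8)
print("friend rate:", round(friend, 3), "enemy rate:", round(enemy, 3))
```

With strong enough mutual inhibition, the pair behaves like a winner-take-all switch: a modest difference in inputs becomes a near-total difference in outputs, which is one reason this design differs from a single population swinging around a baseline.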
5.3 Phasic physiological arousal
“Phasic” means that physiological arousal jumps up for a fraction of a second, in synchronization with noticing something, thinking a certain thought, etc. The opposite of “phasic” is “tonic”, like how I can have generally high arousal (alertness, excitement) in the morning and generally low arousal in the afternoon.
Now, one thing that my compassion / spite circuit above is missing is a notion that some interactions can feel more important / high-stakes to me than others. I think this is a separate axis of variation from the friend / enemy axis. For example, my neighbor and my boss are both solidly on the “friend” side of my friend / enemy spectrum—I feel “warmly” towards both, or something—but interactions with my boss feel much higher stakes, and correspondingly I react more strongly to their perceived feelings. So let’s refine the circuit above to fix that:
Basically, when I orient to a conspecific and then recognize them, the associated phasic arousal[10] tracks how important (high-stakes) this interaction with the conspecific is, from my perspective. Then we use that to scale the compassion / spite response up or down.
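As a toy version of that refinement, the sketch below multiplies the circuit output by phasic arousal. The multiplicative form and every number in the example are assumptions for illustration only.

```python
def scaled_compassion_spite(friend_enemy: float,
                            conspecific_pleasure: float,
                            phasic_arousal: float) -> float:
    """Compassion / spite response, scaled by how high-stakes the
    interaction feels (phasic arousal >= 0). Hypothetical encoding."""
    return phasic_arousal * friend_enemy * conspecific_pleasure


# Neighbor and boss are equally "friend", and both seem equally pleased,
# but the boss triggers more phasic arousal (made-up values).
neighbor = scaled_compassion_spite(0.8, 1.0, phasic_arousal=0.2)
boss = scaled_compassion_spite(0.8, 1.0, phasic_arousal=1.0)
print("neighbor:", neighbor, "boss:", boss)
```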
5.3.1 Neuroscience details
Neuroscience details box
I think the locus coeruleus, a tiny group of 30,000 neurons (in humans), is the high-level arousal-controller in your brain, and its activity can vary over short timescales (up and down within half a second, there’s a plot in Clayton et al. 2004). If you measure pupil dilation, then maybe you’ll miss some of the very fastest dynamics, but you will see the variation on a ≈1-second timescale. If you measure skin conductance, that’s slower still.
I’m generally assuming in this post that “arousal” is a scalar. That’s probably something of an oversimplification (see Poe et al. 2020 & Luskin et al. 2025) but good enough for present purposes.
I’ve been talking as if the role of phasic arousal is specific to the “compassion / spite circuit”, but a more elegant possibility is that it’s a special case of a very general interaction between arousal and valence, such that arousal makes all good things seem better, and makes all bad things seem worse, other things equal. After all, arousal is saying that a situation is high-stakes. So that kind of general dynamic seems evolutionarily plausible to me.
(For the record, I think the general interaction between arousal and valence is not just multiplicative. I think there’s also a thing that we call “being overwhelmed”, where sufficiently high arousal can cause negative valence all by itself. Basically, in a very high-stakes situation, the Steering Subsystem wants to say that things are either very good or very bad, and in the absence of positive evidence that things are very good, it treats “very bad” as a default.)
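The general arousal-valence interaction, including the “being overwhelmed” default, can be sketched as follows. The threshold, the penalty term, and the linear form are assumptions; the text only specifies the qualitative behavior.

```python
def arousal_adjusted_valence(base_valence: float, arousal: float,
                             overwhelm_threshold: float = 2.0,
                             overwhelm_penalty: float = 1.0) -> float:
    """Arousal amplifies valence, and very high arousal adds a negative
    term on its own (hypothetical functional form)."""
    # Multiplicative part: good things seem better, bad things seem worse.
    scaled = arousal * base_valence
    # "Overwhelmed" part: arousal beyond the threshold pushes toward
    # negative valence unless positive evidence outweighs it.
    excess = max(0.0, arousal - overwhelm_threshold)
    return scaled - overwhelm_penalty * excess


print(arousal_adjusted_valence(0.0, 3.0))  # neutral situation, very high arousal
print(arousal_adjusted_valence(1.0, 3.0))  # clearly good situation, very high arousal
```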
5.4 Generalization via short-term predictors
As usual, Steering Subsystem signals can serve as ground-truth supervision for short-term predictors, which supports generalization. Thanks to “defer-to-predictor mode” (see Intro Series §5), we wind up with Steering Subsystem social instincts activating in situations where nobody is in the room with me right now, but nevertheless I find myself intrinsically motivated by the idea of Zoe feeling good in general, and/or Zoe feeling good about me in particular.
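Here is a minimal sketch of that supervised-learning setup, using a linear predictor trained with the delta rule. The class, the two-element context vectors, and the learning rate are illustrative assumptions; the only thing taken from the text is the structure: ground truth trains the predictor when it is available, and the predictor's output is used when it is not.

```python
from typing import List, Optional


class ShortTermPredictor:
    """Linear predictor trained by the delta rule (assumed toy model)."""

    def __init__(self, n_features: int, learning_rate: float = 0.5) -> None:
        self.weights = [0.0] * n_features
        self.learning_rate = learning_rate

    def predict(self, context: List[float]) -> float:
        return sum(w * x for w, x in zip(self.weights, context))

    def train(self, context: List[float], supervisor: float) -> None:
        # error = supervisor - output, then a gradient step on the weights.
        error = supervisor - self.predict(context)
        self.weights = [w + self.learning_rate * error * x
                        for w, x in zip(self.weights, context)]


def steering_signal(predictor: ShortTermPredictor, context: List[float],
                    ground_truth: Optional[float] = None) -> float:
    if ground_truth is not None:
        # Ground truth available: use it, and use it to train the predictor.
        predictor.train(context, ground_truth)
        return ground_truth
    # No ground truth: defer to the predictor's guess.
    return predictor.predict(context)


# Hypothetical one-hot contexts: "thinking about Zoe" vs "thinking about a sweater".
ZOE, SWEATER = [1.0, 0.0], [0.0, 1.0]
predictor = ShortTermPredictor(n_features=2)
for _ in range(20):
    steering_signal(predictor, ZOE, ground_truth=1.0)  # Zoe is present and pleased

imagined_zoe = steering_signal(predictor, ZOE)          # Zoe is not in the room
imagined_sweater = steering_signal(predictor, SWEATER)
print(round(imagined_zoe, 3), imagined_sweater)
```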
6. The “compassion / spite circuit” also causes a “drive to feel liked / admired”
Let’s talk about the social instinct that I call “drive to feel liked / admired”, i.e., an innate drive that makes it so that, if I think highly of person X, then it’s inherently motivating to believe that person X thinks highly of me too. To make this work, one might think that we need another ingredient. It’s not enough for the Steering Subsystem to have strong evidence that my conspecific is feeling pleasure or displeasure, as above. The Steering Subsystem has to get strong evidence that my conspecific is feeling pleasure or displeasure with regard to me in particular. Where could such evidence come from?
Remarkably, my answer is: we already got it! We don’t need any other ingredients. It’s just an emergent consequence of the same circuit above!! Let me explain why:
6.1 Key idea: My “compassion / spite circuit” is disproportionately active and important while the conspecific is thinking about me-in-particular
Let’s say Zoe walks up to me and says “hey”. Or she’s having a conversation with me. Or she’s staring at me from across the room. These situations are quite common, and have two critical properties: (1) my “social attention reflex” is triggering like crazy, perhaps once a second or even more, and (2) Zoe is probably thinking about me-in-particular.
So what? Well…
Thanks to (1) and the “compassion / spite circuit”, I am very sensitive to whether Zoe is feeling pleasure or displeasure right now.
And thanks to (2), Zoe’s pleasure or displeasure right now is not just a generic marker of her overall mental health, but rather has an awful lot to do with how Zoe feels about me. Does she like or dislike me? Am I helping or hurting her? Am I doing things that she thinks are cool vs cringe? If we’re talking, is she enjoying the conversation? Etc.
Here’s a diagram illustrating this:
Thus, the compassion / spite circuit leads people to have a particular motivation for other people to have positive feelings about them. This is what I’ve called “the drive to feel liked / admired”.
(Note to readers: when I was initially writing this post, I was very focused on “drive to feel liked / admired”. Later on, I decided that “drive to feel liked / admired” is just one of numerous downstream impacts of this kind of reward signal. See my follow-up post: Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking.)
6.2 If the same circuit drives both compassion and “drive to feel liked / admired”, why aren’t they more tightly correlated across the population?
If the same innate circuit in the Steering Subsystem is upstream of both compassion and “drive to feel liked / admired”, then one might think that these two things should be yoked together. In other words, if that circuit’s output is generally strong in one person, then they should wind up with both drives being powerful influences on their behavior, and if it’s weak in another person, then they should wind up with neither drive being a powerful influence.
But in fact, in my everyday experience, these seem to be somewhat independent axes of variation, with some people apparently driven much more by one than the other. How does that work?
The answer is simple. If, in the course of life, the circuit often activates when the conspecific is thinking about me-in-particular, and rarely activates when they aren’t, then that would lead the circuit to mostly incentivize and generalize feeling liked / admired. And conversely, if the circuit rarely activates when the conspecific is thinking about me-in-particular, and often activates when they aren’t, then that would lead the circuit to mostly incentivize and generalize compassion.
As an example of the former, suppose Phoebe tends to react very weakly (low arousal, or perhaps not orienting at all) to seeing a person out of the corner of her eye, or to hearing someone’s voice in the distance as they talk to someone else, but Phoebe does reliably react to the more powerful stimuli of transient eye contact, or someone getting her attention to talk to her. Then Phoebe would wind up with a relatively strong drive to feel liked / admired relative to her compassion drive.[11]
As an example of the latter, let’s turn to autism. As I’ve discussed in Intense World Theory of Autism, autism involves many different suites of symptoms which don’t always go together (sensory sensitivity, “learning algorithm hyperparameters”, proneness to seizures, etc.). But a common social manifestation would be kinda the reverse of the above. Given their trigger-happy arousal system, they’ll respond robustly and frequently to things like noticing someone out of the corner of their eye, or hearing someone in the distance. But as for receiving eye contact, or someone deliberately trying to get their attention, they’ll find it so overwhelming that they’ll tend to avoid those situations in the first place,[12] or use other coping methods to limit their physiological arousal. So that’s my attempted explanation for why many autistic people have an especially weak “drive to feel liked / admired”, relative to their comparatively-more-typical levels of compassion and spite, if I understand correctly.
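The argument in this section can be made concrete with a toy tally. Each event below is a made-up pair (is the conspecific thinking about me?, how strongly did the circuit fire?), and the function simply reports what fraction of total circuit activity falls into each bucket. The event lists and arousal values are invented for illustration.

```python
def drive_mix(events):
    """Return (me-directed fraction, not-me-directed fraction) of total
    circuit activity. `events` is a list of (about_me, arousal) pairs."""
    me_directed = sum(arousal for about_me, arousal in events if about_me)
    other = sum(arousal for about_me, arousal in events if not about_me)
    total = me_directed + other
    if total == 0:
        return 0.0, 0.0
    return me_directed / total, other / total


# Profile like Phoebe: weak reactions to peripheral cues, strong to direct ones.
phoebe_like = [(True, 1.0)] * 5 + [(False, 0.1)] * 5
# Reverse profile: strong reactions to peripheral cues, direct contact mostly avoided.
reverse_profile = [(True, 0.2)] * 1 + [(False, 1.0)] * 5

print("Phoebe-like:", drive_mix(phoebe_like))
print("Reverse:", drive_mix(reverse_profile))
```

On this picture, the first fraction loosely tracks how much the circuit ends up training a drive to feel liked / admired, and the second how much it ends up training compassion, even though the circuit itself is identical in both profiles.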
6.3 Whose admiration do I crave?
I think it’s common sense that, in the “drive to feel liked / admired”, we’re driven to be liked / admired by some people much more than others. For example, think of a real person whom you greatly admire, more than almost anyone else, and imagine that they look you in the eye and say, “wow, I’m very impressed by you!” That would probably feel extremely exciting and motivating! Such events can be life-changing—see Mentorship, Management, and Mysterious Old Wizards. Next, imagine some random unimpressive person looks you in the eye and says the same thing. OK cool, maybe you’d be happy to receive the compliment. Or maybe not even that. It sure wouldn’t go down as a life-affirming memory to be treasured forever. More examples in footnote→[13]
I had previously written that, if Zoe likes / admires me, then that feels intrinsically motivating to the extent that I like / admire Zoe in turn. Whoops, I’ve changed my mind! Instead, I now think that it feels intrinsically motivating to the extent that interactions with Zoe seem important and high-stakes from my perspective, regardless of whether I like / admire her.[14] (However, if I see her as “enemy” rather than “friend”, then that would have an impact.) For example, if Zoe is my boss whom I only mildly like / admire, I think I would still react strongly to her approval. That’s what we get from the circuit above: the physiological arousal will respond to how high-stakes it feels for me to be interacting with Zoe, along with various other factors (e.g. receiving eye contact automatically causes extra arousal). I think my new theory is a better fit to everyday experience, but you can judge for yourself and let me know what you think.
There’s an additional question of what’s upstream of that—i.e., what leads to some people inducing physiological arousal (i.e. being “attention-grabbing”, “intimidating”, “larger-than-life”, etc.) more than others? I think it’s complicated—lots of things go into that. Some come straight from arousal-inducing innate reactions. For example, I think we have an innate reaction that induces arousal upon interacting with a tall person, just as many other animals have instincts to “size each other up”. The evolutionary logic is: Any interaction with a tall person is high-stakes because they could potentially beat us up. In other cases, the physiological arousal routes through within-lifetime learning. Is the person in a position to strongly impact my life?
Incidentally, if we compare my previous theory (that I’m driven to be liked / admired by Zoe in proportion to how much I like / admire Zoe in turn) to my current theory (that I’m driven to be liked / admired by Zoe in proportion to how much interactions with Zoe feel arousing, a.k.a. high-stakes), I think there’s some overlap in predictions, because there’s correlation between strongly liking / admiring Zoe, versus feeling like interactions with Zoe are high-stakes. I think the correlation comes from both directions. If I strongly like / admire Zoe, then as a consequence, my interactions with her can feel high-stakes. My liking / admiring her puts her in a position to impact my life. For example, if she spurns me, then I’ve lost access to something I enjoy; plus, I’ve implicitly given her the power to crush my self-esteem. In the other direction, if interactions with Zoe feel high-stakes, I think that can impact how much I like / admire Zoe, for various reasons, including the general valence-arousal interaction mentioned in §5.3.1.
7. Other examples of social instincts
I think the “compassion / spite circuit” above is an important piece of the puzzle of human social instincts. But there’s a whole lot more to social instincts beyond that! Really, I think there’s a bunch of interacting circuits and signals in the Steering Subsystem. How can we pin it down?
Experimentally, there’s a longstanding thread of work laboriously characterizing each of the hundreds of little neuron groups in the Steering Subsystem. More of that would obviously help. I mentioned at least one specific experiment above (§1.2). In parallel, perhaps we could try leapfrogging that process by measuring a complete connectome! My impression is that there are viable roadmaps to a full mouse connectome within years, not decades—much sooner than people seem to realize. Indeed, my guess is that getting a primate or even human connectome well before Artificial General Intelligence is totally a viable possibility, given appropriate philanthropic or other support. (See here.)
On the theory side, as we wait for that data, I think there’s still plenty of room for further careful armchair theorizing to come up with plausible hypotheses. A possible starting point for brainstorming is to look at the set of innate stereotyped (a.k.a. “consummatory”) behaviors towards conspecifics, to guess at some of the signals that might be internal to the Steering Subsystem. Doing that is a bit tricky for humans, since our behavioral repertoire comes disproportionately from learning and culture (excepting early childhood, I suppose). But for example, if a rodent sees another rodent, it might display:
(A) Aggressive behavior—e.g. threatening or attacking;
(B) Friendly, helpful behavior—e.g. grooming or snuggling;
(C) Submissive behavior—e.g. rolling on one’s back in response to a potential threat;
(D) Playful behavior—e.g. laughing or play-posture;
(Many more—see for example Panksepp’s seven categories.)
Of these:
I think the “friend (+) vs enemy (–)” parameter mentioned above is somehow connected to whatever signals are upstream of (B) and (A) respectively.
I offered a starting-point proposal for (D) previously at A Theory of Laughter.
…But (C) seems to be an important ingredient missing in what I’ve said so far.
So that brings us to:
7.1 “Drive to feel feared” (a.k.a. “drive to receive submission”)
Dual strategies theory (see my own discussion at Social status part 2/2: everything else) says that people can have “high status” in two different ways: “prestige” and “dominance”. If the “drive to feel liked / admired” above is upstream of seeking prestige for its own sake, then the “drive to feel feared” would be correspondingly upstream of seeking dominance for its own sake.
The “drive to feel feared” could also be called “drive to receive submission”—i.e., a drive for others to display submissive behavior towards me, as in those rats rolling onto their backs. I’m not sure which of those two terms is better. I figure there’s probably some Steering Subsystem signal that’s upstream of both a tendency towards submissive behavior and a tendency towards fear and flight behavior, and it’s this upstream signal that flows into the circuit.
Evolutionarily, it makes perfect sense for there to be a “drive to feel feared”. If someone submits to me, then I’m dominant, and I get first dibs on food and mates without having to fight.
Neuroscientifically, I think the circuit for “drive to feel feared” could be parallel to the “compassion / spite circuit” above. More specifically, the first step is using Ingredient 4 to get to “Conspecific seems to be feeling fear / submission”:
And then we combine that with physiological arousal to get a motivational effect:
And as before, this would fire especially strongly under eye contact or other signals that the conspecific is thinking of you-in-particular:
(As drawn, the circuit might (mis)fire when I notice my friend submitting to a bully who is also simultaneously threatening me. I think that would be solvable by gating the circuit such that it doesn’t fire if I myself am also feeling fear / submission. Let me know if you think of other examples where this proposal doesn’t work.)
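Putting the pieces of this section together, including the proposed gating fix, gives the following toy sketch. The function name, the threshold on my own fear, and the multiplicative combination with arousal are assumptions for illustration.

```python
def feel_feared_reward(conspecific_fear: float, arousal: float,
                       own_fear: float, own_fear_gate: float = 0.5) -> float:
    """Toy motivational output of a hypothesized "drive to feel feared" circuit.

    conspecific_fear: how much the conspecific seems to feel fear / submission.
    arousal: my own phasic arousal (how high-stakes the interaction feels).
    own_fear: my own fear / submission level, used only for gating.
    """
    # Gate: if I'm also afraid (e.g. a bully is threatening both of us),
    # the circuit stays silent.
    if own_fear > own_fear_gate:
        return 0.0
    # Otherwise the reward scales with arousal and with perceived fear.
    return arousal * max(0.0, conspecific_fear)


print(feel_feared_reward(1.0, 1.0, own_fear=0.0))  # they submit to me
print(feel_feared_reward(1.0, 1.0, own_fear=1.0))  # we both cower before a bully
```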
8. Conclusion
I feel like I have the big picture of a plausible nuts-and-bolts explanation of how the human brain solves the symbol grounding problem to implement social instincts. It might be wrong, and I’m happy for feedback.
Ingredients 1–4 constitute a kind of domain-specific language in which I think all of our social instincts are written. And then §5–§7 include an attempt to build two specific social instincts out of the elements of that language, out of a much larger collection of social instincts yet to be sorted out. I figure that the things I wrote down, while a bit sketchy and incomplete, are probably capturing at least some aspects of compassion, spite, schadenfreude, “drive to feel liked / admired”, and “drive to feel feared”, and I think these collectively capture a lot of the human social world. (See also my post A Theory of Laughter for how laughter and play work.)
If you think this post is totally on the wrong track, then please let me know, by email or the comments section below. If it’s on the right track, then that’s great, but we still obviously have tons of work left to do to really pin down human social instincts, possibly in conjunction with experiments, as discussed in §7 above.
In case anyone’s wondering, I think my next project going forward will be to spend a while pondering the very biggest picture of brain-like AGI safety—everything from reward functions and training environments and testing, to governance and deployment and society, in light of (what I hope is) my newfound understanding of how human social instincts generally work. My confusion on that topic has been a big blocker to my thinking and progress previous times that I tried to do that. After that, I guess I’ll figure out where to go from there! Should be interesting.
Thanks Seth Herd and Simon Skade for critical comments on earlier drafts, and thanks various commenters and especially Rif A. Saurous for critical feedback that informed later revisions.
Changelog
(Some previous versions of the post are archived at the DOI link. I can share even more fine-grained version history and list of changes upon request.)
2025-11-26: Since initial publication, I’ve added links to some later follow-up posts (search for “UPDATE” in the text), made some minor wording changes, replaced a secondary-source reference with the corresponding primary source, and added a reference.
2026-04-30: I changed terminology from “the ‘thinking of a conspecific’ flag” to “the social attention reflex”. I think the new term has better connotations, especially the way it invokes a parallel to “orienting reflex” and “startle reflex”, which likewise are associated with fast, transient, and involuntary changes in both attention and other innate signals like pleasure and arousal.
Relatedly, I deleted a few words suggesting that the social attention reflex is more likely to be in the medial hypothalamus than the lateral hypothalamus. My old term (“thinking of a conspecific” flag) suggested a social-related state variable, which struck me as more medial-ish. But now I’m thinking of it more as a fast reflex, which strikes me as more lateral-ish if anything. But I dunno, I’m just guessing.
I also dramatically shortened and simplified §6.1 (“Key idea: My ‘compassion / spite circuit’ is disproportionately active and important while the conspecific is thinking about me-in-particular”). I decided that this is a pretty straightforward point, and I was making it unnecessarily complicated.
Other minor wording tweaks (especially §3.2) for clarity.
2026-05-19: I rewrote §3–§5.1 to remove unnecessary complication, and clean up some errors and muddled thinking. More details:
In §3.2, I previously had a toy example of learning rate modulation in the thought assessors, where I was daydreaming about Taylor Swift, and then I suddenly orient to a spider jumping at me, and the learning rate modulation (I argued) was necessary to prevent learning that Taylor Swift is a risk factor for spiders jumping out at me. I do think that’s an actual solution to an actual problem, and that it’s implemented in the brain partly via the well-known “cholinergic interneuron pause” in response to (generalized) orienting reflexes. But I described this example poorly (and somewhat incorrectly), and more importantly it’s an example that’s not directly related to this post, and I think it was just causing unnecessary confusion (even I was confused when I re-read it). So I switched to a new example that overlaps much more with §4. I also deleted the discussion of learning rate modulation in the Thought Generator, which I decided was somewhat misleading and confusing as written, and off-topic anyway.
That change to §3, in turn, allowed me to shorten and streamline §4, including in ways that hopefully made §5.1 a bit clearer in turn.
The new version introduces and uses a new term I just made up, “interoceptive concept finder”, for a particular type of short-term predictor.
[1]
Some bits of text in this introductory section are copied from an earlier (wrong) post, “Spatial attention as a “tell” for empathetic simulation?”.
[2]
For a different (simpler) example of what I think it looks like to make progress towards that kind of pseudocode, see my post A Theory of Laughter.
[3]
Thanks to regional specialization across the cortex (roughly corresponding to “neural network architecture” in ML lingo), there can be a priori reason to believe that, for example, “pattern 387294” is a pattern in short-term auditory data whereas “pattern 579823” is a pattern in large-scale visual data, or whatever. But that’s not good enough. The symbol grounding problem for social instincts needs much more specific information than that. If Jun just told me that Xiu thinks I’m cute, then that’s a very different situation from if Jun just told me that Fang thinks I’m cute, leading to very different visceral reactions and drives. Yet those two possibilities are built from generally the same kinds of data.
[4]
Actually, this is an area where the evolutionary “design spec” can be pretty inscrutable. The (so-called) spider detector circuit, like any image classifier, triggers on all kinds of inputs, not all of which are spiders, including Bizarre Visual Input Type 74853 that has no relation to spiders and would occur on average once every 100 lifetimes in our ancestral environment. And maybe it just so happened that Bizarre Visual Input Type 74853 correlates with danger, such that noticing and recoiling from it was adaptive. Then that very fact would be part of the evolutionary pressure sculpting the (so-called) spider detector circuit, such that the term “spider detector circuit” is not a 100% perfect description of its evolutionary purpose.
[5]
My diagrams are drawn with the “supervisor” signal traveling from the Steering Subsystem to the short-term predictor, and then the subtraction step (“supervisor – output = error”) happening in the short-term predictor. But that’s just for illustration. I’m also open-minded to the possibility that the subtraction is performed in the Steering Subsystem, and that it’s the error signal that travels up to the short-term predictor. That’s more of a low-level implementation detail that I’m not too concerned with for the purpose of this post.
[6]
I’m oversimplifying. I think what would actually happen is: the predictor will flag your various interoceptive concepts in proportion to how much physiological arousal they entail. Note that some cultures probably don’t even have a “physiological arousal” concept per se; see my Lisa Feldman Barrett post.
[7]
For purposes of this discussion, things like sense-of-pain, sense-of-temperature, and “affective touch” (C-tactile receptors) count as interoception, not exteroception, despite the fact that you can in fact learn about the outside world via those signals. After all, the skin is an organ, and sensing the health and status of your organs is an interoception thing. See How Do You Feel by Bud Craig (2015) for detailed physiological evidence (nerve types, pathways in the spine and brain, etc.) that this is the right classification.
[8]
Here and elsewhere, I’m using English-language emotion words to refer to Steering Subsystem signals, because I don’t know how else to refer to them. But be warned that there is never a perfect correspondence between brainstem signals and emotion words (as we actually use them in everyday life). For more discussion of that point, see Lisa Feldman Barrett versus Paul Ekman on facial expressions & basic emotions.
[9]
As a general rule, there are multiple ways to turn pseudocode into neuroscientifically-plausible circuits. For example, the gray box is an intermediate variable in this calculation. I’m drawing it explicitly because it makes it easier to follow. But it might not be a separate cell group in the hypothalamus. Or conversely, it could be two cell groups, one for “pleasure” and the other for “displeasure”, with mutual inhibition. Or something else, who knows.
[10]
In terms of the Ingredient 4 discussion, this would be the actual phasic arousal in our own bodies, which is impacted by the exteroception-sensitive short-term predictors, but is not impacted by transient empathetic simulations of someone else’s phasic arousal.
[11]
I guess I’m predicting that people with constitutionally low arousal responses (extraverts, thrill-seekers, etc.) will tend to have a higher ratio of status drive to compassion drive. But I didn’t check that. It’s not a strong prediction—there are probably a bunch of other factors at play too.
[12]
Aversion to eye contact is common among autistic people. For example, John Elder Robison entitled his first memoir Look Me in the Eye, and discusses his aversion to eye contact in the prologue. And in the book excerpt I copied here, there are three quotes from autistic people about their experience of eye contact.
[13]
As an example, there’s an anecdote here of someone making a “feelgood” email folder for when she was feeling down, and most of the entries she mentions are basically compliments from people whom (I suspect) she sees as important and intimidating. As another example, my 9yo craves “impressing his parents” like a drug, and strives endlessly for us to laugh at his jokes, admire his knowledge and achievements, etc. But when we had regular visits with a 4yo who idolized him, he basically couldn’t care less.
[14]
Update Sept 2025: I think there’s an additional phenomenon where, if thoughts of Person X tend to induce physiological arousal in Person Y, then that contributes not only to Y wanting to feel liked / admired by X, but also (under certain conditions) to Y feeling sexually attracted to X, especially if Y is a cis woman. For more discussion see §3–§6 of my follow-up post Neuroscience of human sexual attraction triggers (3 hypotheses).
I feel inordinately proud of this post, probably because it tackles a problem that I had been confused about since 2019. I literally taught myself neuroscience in large part because I wanted to solve this problem, and I spent what amounts to several years of full-time effort building up an ability to tackle it … and this post represented the moment when I finally felt like I had my foot in the door towards a satisfying solution.
Granted, there’s still plenty more work to do, and indeed I’ve continued to follow up on this work in the past year since I wrote this post; but it now feels like I’m filling in gaps, and fleshing out details, and refactoring inelegant descriptions, whereas before it felt like I was trying to breach a wall of impenetrable mystery.