I think this was a badly written post, and it appropriately got a lot of pushback.
Let me briefly try again, clarifying what I was trying to communicate.
Evolution did not succeed at aligning humans to the sole outer objective function of inclusive genetic fitness.
There are multiple possible reasons why evolution didn’t succeed, and presumably multiple stacked problems.
But one thing that I’ve sometimes heard claimed or implied is that evolution couldn’t possibly have succeeded at instilling inclusive genetic fitness as a goal, because individual humans don’t have inclusive genetic fitness as a concept.
On this view, evolution could only have approximated that goal with a godshatter of adaptations to prefer various proxies for inclusive genetic fitness, where each proxy has to be close to the level of sensory evidence. E.g., evolution can shape humans to like the taste of sugar, or the feeling of orgasm, or to prefer sexy-looking people, or even to love their cousins (less than their brothers but more than their more distant relatives). But, it’s claimed, evolution can’t shape humans to desire their own inclusive genetic fitness directly, because it can’t instill goals that aren’t at the level of sensory evidence.
And so it’s not surprising that the proxies would completely deviate from the “intended” target, as soon as conditions changed.
Before the 20th century, not a single human being had an explicit concept of “inclusive genetic fitness”, the sole and absolute obsession of the blind idiot god. We have no instinctive revulsion of condoms or oral sex. Our brains, those supreme reproductive organs, don’t perform a check for reproductive efficacy before granting us sexual pleasure.
Why not? Why aren’t we consciously obsessed with inclusive genetic fitness? Why did the Evolution-of-Humans Fairy create brains that would invent condoms? “It would have been so easy,” thinks the human, who can design new complex systems in an afternoon.
The Evolution Fairy, as we all know, is obsessed with inclusive genetic fitness. When she decides which genes to promote to universality, she doesn’t seem to take into account anything except the number of copies a gene produces. (How strange!)
But since the maker of intelligence is thus obsessed, why not create intelligent agents—you can’t call them humans—who would likewise care purely about inclusive genetic fitness? Such agents would have sex only as a means of reproduction, and wouldn’t bother with sex that involved birth control. They could eat food out of an explicitly reasoned belief that food was necessary to reproduce, not because they liked the taste, and so they wouldn’t eat candy if it became detrimental to survival or reproduction. Post-menopausal women would babysit grandchildren until they became sick enough to be a net drain on resources, and would then commit suicide.
Supposedly, the claim is not just that evolution happened not to produce an inclusive genetic fitness maximizer, but that it can’t.
However, this story is undercut by an example in which evolution was able to make an abstract concept (not just a bunch of sensory correlates of that concept in the ancestral environment) an optimization target that the human will apply its full creative intelligence to achieving.
Social status seems like one such example.
It’s an abstract concept that many humans have as an actual long-term optimization target. They’ll implement plans over years to increase their prestige; they don’t just have a myopic status-grabbing heuristic. (One example: someone going to med school and residency for the prestigious role of being a doctor a decade later.)
And humans seem to have a desire for social status itself, or at least not just for a collection of sensory-evidence-level proxy measures that correlated with status in the ancestral environment and that break down entirely when the environment changes.
(If you doubt this, compare status-seeking behavior to male sexual preferences. In the latter case, it looks much more like evolution did instill a bunch of specific desires for close-to-sensory-level features that were proxies for fertility and health: big breasts, long legs, unwrinkled skin. Heterosexual men find those features desirable, and finding out that a particular sexy woman is actually infertile doesn’t change the desirability.
But in the case of status-seeking, I can’t write a list of near-sensory-level features that people desire, independently of actual social prestige. The markers of status are enormously varied, by culture and subculture, and constantly changing. I bet that Steve Byrnes can point out a bunch of specific sensory evidence that the brain uses to construct the status concept (stuff like gaze length of conspecifics or something?), but the human motivation system isn’t just optimizing for those physical proxy measures, or people wouldn’t be motivated to get prestige on internet forums where people have reputations but never see each other’s faces.)
This suggests that, at least in some circumstances, evolution actually can shape an organism to have a specific abstract concept as a long-term optimization target, and recruit the organism’s own intelligence to identify how that concept applies in many varied environments.
This is not to say that evolution succeeded at aligning humans. It didn’t. This also doesn’t imply that alignment is easy. Maybe it is, maybe it isn’t, but this argument doesn’t establish that.
But it is to say that the specific story for why evolution failed at aligning humans to inclusive genetic fitness that I believed in, say, 2020 is incorrect, or at least incomplete.
Seems like you’re mushing together several loosely related things, including what we might call model-based motivation, explicit long-term planning, unified purpose, and precisely targeted goals.
Model-based motivation: being motivated to do something in a way that relies on your internal models of the world, not just on direct sensory rewards.
Explicit long-term planning: being aware of your goal, explicitly planning ways to achieve it, following those plans including over periods of months or years.
Unified purpose: a person’s motivations and actions in a domain fitting together coherently to work towards a single purpose, even across contexts.
Precisely targeted goals: having the goal precisely match something that can be specified on other grounds besides what we can empirically observe that people aim for (like “inclusive genetic fitness” which is picked out by theory).
The godshatter post is mainly about the last two—people have a collection of fragmented motivations which helped towards the selected-for purpose in the contexts where we evolved. Your argument here is mainly about the first two.
I think that the first two are pretty common, and are found in human romantic/reproductive goals, e.g. long-term planning around having kids, or motivations to improve one’s appearance in ways that you expect potential partners to find attractive. I think that the last two are pretty rare, including for status: most people have a collection of somewhat-status-related motivations (though perhaps a small fraction of people (sociopaths?) have status as a more unified goal), and I haven’t seen anyone specify the “status” target well enough to even check if people’s motivations aim at that precise target.
I bet that Steve Byrnes can point out a bunch of specific sensory evidence that the brain uses to construct the status concept (stuff like gaze length of conspecifics, or something?), but the human motivation system isn’t just optimizing for those physical proxy measures, or people wouldn’t be motivated to get prestige on internet forums where people have reputations but never see each other’s faces.
Curious to see what Steven Byrnes would actually say here. I fed your comment and Byrnes’ two posts on social status to Opus 4.5; it thought for 3m 40s (!) and ended up arguing he’d disagree with your social status example:
Byrnes explicitly argues the opposite position in §2.2.2 of the second post. He writes:
“I don’t currently think there’s an innate drive to ‘mostly lead’ per se. Rather, I think there’s an innate drive that we might loosely describe as ‘a drive to feel liked / admired’… and also an innate drive that we might loosely describe as ‘a drive to feel feared’. These drives are just upstream of gaining an ability to ‘mostly lead’.”
And more pointedly:
“I’m avoiding a common thing that evolutionary psychologists do (e.g. Secret of Our Success by Henrich), which is to point to particular human behaviors and just say that they’re evolved—for example, they might say there’s an ‘innate drive to be a leader’, or ‘innate drive to be dominant’, or ‘innate drive to imitate successful people’, and so on. I think those are basically all ‘at the wrong level’ to be neuroscientifically plausible.”
So Byrnes is explicitly rejecting the claim that evolution installed “status-seeking” as a goal at the level Eli describes.
Byrnes proposes a three-layer architecture: first, there are primitive innate drives (“feel liked/admired” and “feel feared”), which are still at the feeling level, not the abstract-concept level. Second, there’s a very general learning mechanism (which he discusses extensively in his valence series, particularly §4.5–4.6) that figures out what actions and situations produce those feelings in the local environment. Third, there are some low-level sensory adaptations (like “an innate brainstem reflex to look at people’s faces”) that feed into the learning system. Status-seeking behavior emerges from this combination, but “status” itself isn’t the installed goal.
Why does this matter? Eli presents something like a dichotomy: either (A) evolution can only do sensory-level proxies that break in novel contexts (like male preferences for big breasts, which don’t update when you learn a woman is infertile), or (B) evolution can install abstract concepts as goals (like status). Byrnes’ model offers a third option: evolution installs feeling-level drives plus a general learning mechanism. The learning mechanism explains why status-seeking transfers to internet forums—the primitive drive to “feel liked/admired” still triggers when you get upvotes, and the learning system figures out how to get more of that—without requiring that “status” itself be the installed goal. This third option actually supports the original claim Eli is arguing against. Evolution didn’t need to install “status” as a concept; it installed feelings + learning, and the abstract behavior emerged.
(mods, let me know if this is slop and I’ll take it down)
If it helps, my take is in Neuroscience of human social instincts: a sketch and its follow-up Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking.
Sensory evidence is definitely involved, but kinda indirectly. As I wrote in the latter: “The central situation where Approval Reward fires in my brain, is a situation where someone else (especially one of my friends or idols) feels a positive or negative feeling as they think about and interact with me.” I think it has to start with in-person interactions with other humans (and associated sensory evidence), but then there’s “generalization upstream of reward signals” such that rewards also get triggered in semantically similar situations, e.g. online interactions. And it’s intimately related to the fact that there’s a semantic overlap between “I am happy” and “you are happy”, via both involving a “happy” concept. It’s a trick that works for certain social things but can’t be applied to arbitrary concepts like inclusive genetic fitness.
I stand by my nitpick in the other comment that you’re not using the word “concept” quite right. Or, hmm, maybe we can distinguish (A) “concept” = a latent variable in a specific human brain’s world-model, versus (B) “concept” = some platonic Natural Abstraction™ or whatever, whether or not any human is actually tracking it. Maybe I was confused because you’re using the (B) sense but I (mis)read it as the (A) sense? In AI alignment, we care especially about getting a concept in the (A) sense to be explicitly desired because that’s likelier to generalize out-of-distribution, e.g. via out-of-the-box plans. (Arguably.) There are indeed situations where the desires bestowed by Approval Reward come apart from social status as normally understood (cf. this section, plus the possibility that we’ll all get addicted to sycophantic digital friends upon future technological changes), and I wonder whether the whole question of “is Approval Reward exactly creating social status desire, or something that overlaps it but comes apart out-of-distribution?” might be a bit ill-defined via “painting the target around the arrow” in how we think about what social status even means.
(This is a narrow reply, not taking a stand on your larger points, and I wrote it quickly, sorry for errors.)
I think it’s true and valuable to say:
There is a natural abstraction of “human status” that covers play, work, love, friends, community and more.
Humans seek “human status” as an intrinsic motivation/reward/value, such that humans will seek status even at the cost of other goals and of generalized empowerment.
In humans, “human status” often functions as a long-term target (I think more satisficing than optimizing).
This proves that evolution is capable of creating intelligent agents that target abstract concepts as goals in ways that survive a massive training/deployment shift.
This is a small piece of evidence against one of many reasons that we are all going to die.
The top-voted objection is that this abstraction includes “human status” that commenters see as fake: internet forums, obscure hobbies, video games, and music fandom. I don’t find this compelling; it’s just pointing out that the drive is for “human status”, not “real world status”, “high class status”, “rationalist status”, or some other thing.
My best objection is that the order is reversed. Humans have genes that cause us to have various behaviors and seek various things, and then the natural abstraction of “human status” is something that we use to learn and describe what humans end up doing. If humans had ended up doing a slightly different thing, based on different genetics, and that resulted in different behaviors, then those behaviors would be what we called “human status”. There is another natural abstraction concept of “generic status” that abstracts all status-like concepts in all animals on Earth, and humans don’t target that. When I learned that grooming is a marker of status in some primates, that didn’t cause me to spend more time seeking opportunities to be groomed.
It would be more accurate to say that the natural abstraction of “human status” co-evolved with the genetics of humans, rather than them happening in either order. We see this with the natural abstraction of “Claude” co-evolving with the weights that make up various Claude models. I gave this +4 in the review. It would have been extremely valuable to post this twenty years ago, but today it seems obvious that we can grow artificial intelligent agents that target natural abstractions in the environment, including the Anthropic trick of making and targeting a natural abstraction at the same time.