Evolution is a bad analogy for AGI: inner alignment
TL;DR: The dynamics of human learning processes and reward circuitry are more relevant than evolution for understanding how inner values arise from outer optimization criteria.
This post is related to Steve Byrnes’ Against evolution as an analogy for how humans will create AGI, but more narrowly focused on how we should make inferences about values.
Thanks to Alex Turner, Charles Foster, and Logan Riggs for their feedback on a draft of this post.
Introduction
How should we expect AGI development to play out?
True precognition appears impossible, so we use various analogies to AGI development, such as evolution, current day humans, or current day machine learning. Such analogies are far from perfect, but we still may be able to extract useful information by carefully examining them.
In particular, we want to understand how inner values relate to the outer optimization criteria. Human evolution is one possible source of data on this question. In this post, I’ll argue that human evolution actually provides very little usable evidence on AGI outcomes. In contrast, analogies to the human learning process are much more fruitful.
Inner values versus outer optimization criteria
One way people motivate extreme levels of concern about inner misalignment is to reference the fact that evolution failed to align humans to the objective of maximizing inclusive genetic fitness. From Eliezer Yudkowsky’s AGI Ruin post:
16. Even if you train really hard on an exact loss function, that doesn’t thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about…
I don’t think that “evolution → human values” is the most useful reference class when trying to understand how outer optimization criteria relate to inner values. Evolution didn’t directly optimize over our values. It optimized over our learning process and reward circuitry. Once you condition on a particular human’s learning process + reward circuitry configuration + the human’s environment, you screen off the influence of evolution on that human’s values. So, there are really (at least) two classes of observations from which we can draw evidence:
(1) “evolution’s inclusive genetic fitness criteria → a human’s learned values” (as mediated by evolution’s influence over the human’s learning process + reward circuitry)
(2) “a particular human’s learning process + reward circuitry + training environment → the human’s learned values”
I will present five reasons why I think evidence from (2) “human learning → human values” is more relevant to predicting AGI.
1: Training an AI is more similar to human learning than to evolution
The relationship we want to make inferences about is:
“a particular AI’s learning process + reward function + training environment → the AI’s learned values”
I think that “AI learning → AI values” is much more similar to “human learning → human values” than it is to “evolution → human values”. Steve Byrnes makes this case in much more detail in his post on the matter. Two of the ways I think AI learning more closely resembles human learning, and not evolution, are:
The simple type signatures of the two processes. Evolution is a bi-level optimization process, with evolution optimizing over genes, and the genes specifying the human learning process, which then optimizes over human cognition. Evolution does not directly optimize over a human’s cognition. And because learned cognition is not directly accessible to the genome, evolution must use roundabout methods to influence human values through the genome.
In contrast, SGD directly optimizes over an AI’s cognition, just as human within-lifetime learning directly optimizes over human cognition. The human and AI learning processes are much closer to their respective cognitive structures, compared with evolution.
The differences between the parameter counts of the respective objects of optimization (the genome for evolution, the brain’s circuitry for human learning, and the AI’s parameters for AI training).
The genome has very few parameters compared to even current day neural networks, much less the brain or future AGIs. Our experience with ML scaling laws very strongly implies that parameter counts matter a lot for a system’s learning dynamics. Better to compare highly parameterized systems to other highly parameterized systems.
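The two-level structure described above can be sketched in a few lines. This is a toy illustration, not a model of real evolution or real SGD: the two-number `genome` (a learning rate and a reward target), the quadratic losses, and every function name here are hypothetical choices made for the example. The point it shows is only the type signature: the outer loop scores and mutates genomes, while the inner loop is the only thing that ever writes to the learned parameters.

```python
import random


def inner_learning(genome, steps=50, n_params=20):
    """Within-lifetime learning: gradient descent acts directly on the parameters.

    `genome` is a hypothetical 2-number spec: (learning_rate, reward_target).
    Parameters start from zero every lifetime; the genome only shapes how
    learning proceeds, it does not store the parameters themselves.
    """
    learning_rate, reward_target = genome
    params = [0.0] * n_params
    for _ in range(steps):
        # Gradient of the inner loss sum((p - reward_target)^2) for each p.
        grads = [2.0 * (p - reward_target) for p in params]
        params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params


def fitness(params, fitness_target=1.0):
    """Outer criterion, scored on the parameters a lifetime of learning produced."""
    return -sum((p - fitness_target) ** 2 for p in params)


def evolve(generations=30, population=20, seed=0):
    """Outer loop: mutate and select genomes. It never edits `params` directly."""
    rng = random.Random(seed)
    genomes = [
        (rng.uniform(0.01, 0.4), rng.uniform(-2.0, 2.0)) for _ in range(population)
    ]
    for _ in range(generations):
        ranked = sorted(
            genomes, key=lambda g: fitness(inner_learning(g)), reverse=True
        )
        parents = ranked[: population // 2]  # the fitter half survives unchanged
        children = [
            (min(max(lr + rng.gauss(0, 0.02), 0.01), 0.4), target + rng.gauss(0, 0.1))
            for lr, target in parents
        ]
        genomes = parents + children
    return max(genomes, key=lambda g: fitness(inner_learning(g)))


best_genome = evolve()
learned_params = inner_learning(best_genome)
```

Note the asymmetry in what each loop optimizes: `evolve` searches over 2 numbers, while `inner_learning` updates 20 (and in a real brain or network, vastly more). The outer loop can only steer the learned parameters indirectly, through the learning rule and reward target it specifies.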
“AI learning → AI values”, “human learning → human values”, and “evolution → human values” each represent very different optimization processes, with many specific dissimilarities between any pair of them. However, I think the balance of dissimilarities points to “human learning → human values” being the closer reference class for “AI learning → AI values”. As a result, I think the vast majority of our intuitions regarding the likely outcomes of inner goals versus outer optimization should come from looking at the “human learning → human values” analogy, not the “evolution → human values” analogy.
2: We have more total evidence from human outcomes
Additionally, I think we have a lot more total empirical evidence from “human learning → human values” than from “evolution → human values”. There are billions of instances of humans, and each of them presumably has a somewhat different learning process / reward circuit configuration / learning environment. Each of them represents a different data point regarding how inner goals relate to outer optimization. In contrast, the human species only evolved once. Thus, evidence from “human learning → human values” should account for even more of our intuitions regarding inner goals versus outer optimization than the difference in reference class similarities alone would indicate.
3: Human learning trajectories represent a broader sampling of the space of possible learning processes
One common objection is that “human learning” represents a tiny region in the space of all possible mind designs, and so we cannot easily generalize our observations of humans to minds in general. This is, of course, true, and it greatly limits the strength of any AI-related conclusions we can draw from looking at “human learning → human values”. However, I again hold that inferences from “evolution → human values” suffer from an even more extreme version of this same issue. “Evolution → human values” represents an even more restricted look at the general space of optimization processes than we get from the observed variations in different humans’ learning processes, reward circuit configurations, and learning environments.
4: Evidence from humans is more accessible than evidence from evolution
Human evolution happened hundreds of thousands of years ago. We are deeply uncertain about the details of the human ancestral environment and which traits were under what selection pressure. We are still unsure about what precise selection pressure led humans to be so generally intelligent at all. We are very far away from being able to precisely quantify all the potentially values-related selection pressures in the ancestral environment, or how those selection pressures changed our reward systems or our tendencies to form downstream values.
In contrast, human within-lifetime learning happens all the time right now. It’s available for analysis and even experimental intervention. Given two evidence sources about a phenomenon, where one is much more easily accessible than the other, the more accessible source should, all else equal, represent a greater fraction of our total information on the phenomenon. This is another reason why we should expect evidence from humans to account for a greater proportion of our total information about how inner values relate to outer optimization criteria.
5: Evolution could not have succeeded anyways
I think that a careful account of how evolution shaped our learning process in the ancestral environment implies that evolution had next to no chance of aligning humans with inclusive genetic fitness.
There are no features of the ancestral environment which would lead to an ancestral human learning about the abstract idea of inclusive genetic fitness. There were no ancestral humans that held an explicit representation of inclusive genetic fitness. So, there was never an opportunity for evolution to select for humans who attached their values to an explicit representation of inclusive genetic fitness.
Regardless of how difficult it is, in general, to get learning systems to form values around different abstract concepts, evolution could not have possibly gotten us to form a value around the particular abstraction of inclusive genetic fitness because we didn’t form such an abstraction in the ancestral environment. Ancestral humans had zero variance in their tendency to form values around inclusive genetic fitness. Evolution cannot select for traits that don’t vary across a population, so evolution could not have selected for humans that formed their values around inclusive genetic fitness.
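The claim that evolution cannot select for a trait with no variance can be checked with the selection term of the Price equation, Cov(fitness, trait) / mean fitness. The sketch below is a minimal illustration under stated assumptions: the function name and the toy numbers are made up for the example, and the transmission term of the full Price equation is ignored.

```python
from statistics import mean


def selection_response(trait_values, fitnesses):
    """Selection term of the Price equation: Cov(fitness, trait) / mean fitness.

    Ignores the transmission term, so this is the expected change in the
    population's mean trait value due to selection alone.
    """
    mean_w = mean(fitnesses)
    mean_z = mean(trait_values)
    covariance = mean(
        [(w - mean_w) * (z - mean_z) for w, z in zip(fitnesses, trait_values)]
    )
    return covariance / mean_w


# Hypothetical population of four individuals with unequal fitness.
fitnesses = [1, 1, 2, 2]

# A trait that varies and correlates with fitness responds to selection.
varying_trait = selection_response([0, 1, 2, 3], fitnesses)

# A trait nobody has (every value is 0) cannot respond, however fit anyone is.
absent_trait = selection_response([0, 0, 0, 0], fitnesses)
```

With every trait value identical, each deviation from the mean is zero, so the covariance is zero regardless of the fitness differences. That is the formal version of "zero variance in their tendency to form values around inclusive genetic fitness".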
In contrast, the sorts of things that we humans end up valuing are usually the sorts of things that are easy to form abstractions around. Thus, we are not doomed by the same difficulty that likely prevented evolution from aligning humans to inclusive genetic fitness.
This point is extremely important. I want to make sure to convey it correctly, so I will quote two previous expressions of this point by other sources:
Risks from Learned Optimization notes that the lack of environmental data related to inclusive genetic fitness effectively increases the description length complexity of specifying an intelligence that deliberately optimizes for inclusive genetic fitness:
…description cost is especially high if the learned algorithm’s input data does not contain easy-to-infer information about how to optimize for the base objective. Biological evolution seems to differ from machine learning in this sense, since evolution’s specification of the brain has to go through the information funnel of DNA. The sensory data that early humans received didn’t allow them to infer the existence of DNA, nor the relationship between their actions and their genetic fitness. Therefore, for humans to have been aligned with evolution would have required them to have an innately specified model of DNA, as well as the various factors influencing their inclusive genetic fitness. Such a model would not have been able to make use of environmental information for compression, and thus would have required a greater description length. In contrast, our models of food, pain, etc. can be very short since they are directly related to our input data.
From Alex Turner (in private communication):
If values form because reward sends reinforcement flowing back through a person’s cognition and reinforces the thoughts which (credit assignment judges to have) led to the reward, then if a person never thinks about inclusive reproductive fitness, they can never ever form a value shard around inclusive reproductive fitness. Certain abstractions, like lollipops or people, are convergently learned early in the predictive-loss-minimization process and thus are easy to form values around. But if there aren’t local mutations which make a person more probable to think thoughts about inclusive genetic fitness before/while the person gets reward, then evolution can’t instill this value. Even if the descendents of that person will later be able to think thoughts about fitness.
Total significance of evolution
There are many sources of empirical evidence that can inform our intuitions regarding how inner goals relate to outer optimization criteria. My current (not very deeply considered) estimate of how to weight these evidence sources is roughly:
~60% from “human learning → human values”
~4% from “evolution → human values”
~36% from various other evidence sources, which I won’t address further in this post, such as:
economics
microbial ecology
politics
current results in machine learning
game theory / multi-agent negotiation dynamics
Edit: since writing this post, I’ve learned a lot more about inductive biases and what deep learning theory we currently have, so my relative weightings have shifted quite a lot towards “current results in machine learning”.
Implications
I think that using “human learning → human values” as our reference class for inner goals versus outer optimization criteria suggests a much more straightforward relationship between the two, as compared to the (lack of a) relationship suggested by “evolution → human values”. Looking at the learning trajectories of individual humans, it seems like a given person’s values have a great deal in common with the sorts of experiences they’ve found rewarding in their lives up to that point in time. E.g., a person who grew up with and displayed affection for dogs probably doesn’t want a future totally devoid of dogs, or one in which dogs suffer greatly.
Please note that I am not arguing that humans are inner aligned, or that looking at humans implies inner alignment is easy. Humans are misaligned with maximizing their outer reward source (activation of reward circuitry). I operationalize this misalignment as: “After a distributional shift from their learning environment, humans frequently behave in a manner that predictably fails to maximize reward in their new environment, specifically because they continue to implement values they’d acquired from their learning environment which are misaligned to reward maximization in the new environment”.
For example, one way in which humans are inner misaligned is that, if you introduce a human into a new environment which has a button that will wirehead the human (thus maximizing reward in the new environment), but has other consequences that are extremely bad by light of the human’s preexisting values (e.g., killing a beloved family member), most humans won’t push the button.
I also think this regularity in inner values is reasonably robust to large increases in capabilities. If you take a human whose outer behavior suggests they like dogs, and give that human very strong capabilities to influence the future, I do not think they are at all likely to erase dogs from existence. It’s probably not as robust to your choice of which specific human to try this with. E.g., many people would screw themselves over with reckless self-modification. My point is that higher capabilities alone do not automatically render inner values completely alien to those demonstrated at lower capabilities.
(Part 2 will address whether the “sharp left turn” demonstrated by human capabilities with respect to evolution implies that we should expect a similar sharp left turn in AI capabilities.)
I’m confused. What is the outer optimization target for human learning?
My two top guesses below.
To me it looks like human values are the result of humans learning from their environment (which was shaped by earlier humans and includes current humans). So it’s kind of like human values are, by definition, what humans learned. So observing that humans learned human values doesn’t tell us anything.
Or maybe you mean something like parents / society / … teaching new humans their values? I see some other problems there:
I’m not sure what the success rate is, but values seem to be changing noticeably
There was a lot of time to test multiple methods of teaching new humans values, with humans not changing that much.
The outer optimization target for the human learning process is kind of indeterminate, but to the extent we can determine it, it’s something like “learn the things that causally contributed to IGF in the ancestral environment.” This isn’t the same as IGF itself. It would include cooperation, sex drive, a fear of death, a taste for sugary and fatty foods, etc. We seem to be pretty well aligned from that perspective.
Also, if you view evolution from a wider perspective, we’re not that misaligned, since it’s just trying to find sticky patterns that reproduce themselves a lot, and it seems likely that human civilization will conquer the lightcone in some form or another fairly soon (even if it’s misaligned AI doing it).
I feel like the lesson here is actually “if alignment is too hard, then lower your aim until you find a target that’s easy enough to align to.” I mean, why is the outer optimization target for the human learning process these imperfect proxies for IGF that must have sometimes contributed negatively or suboptimally to IGF even in the ancestral environment? Why not align to IGF itself or better proxies? Probably because that was too hard for both outer and inner alignment reasons?
From this perspective, evolution managed to find alignment targets that were both “easy enough” and “good enough” (at least until the recent advent of things like porn, birth control, processed foods), and solved outer and inner alignment for these targets, but it had millions or billions of “training runs” to play with, and little chance of global catastrophe if something went wrong during one of these “training runs”.
Given your overall optimism about AI alignment, you probably view this analogy more optimistically than I do. Would be interested to hear you spell out your own perspective more.
It’s very difficult to get any agent to robustly pursue something like IGF because it’s an inherently sparse and beyond-lifetime goal. Human values have been pre-densified for us: they are precisely the kinds of things it’s easy to get an intelligence to pursue fairly robustly. We get dense, repeated, in-lifetime feedback about stuff like sex, food, love, revenge, and so on. A priori, if you’re an agent built by evolution, you should expect to have values that are easy to learn— it would be surprising if it turned out that evolution did things the hard way. So evolution suggests alignment should be easy.
What if some humans actually value something that’s sparse and beyond-lifetime like IGF? For example, Nick Bostrom seems to value avoiding astronomical waste. How to explain that, if our values only come from “dense, repeated, in-lifetime feedback”?
See also this top-level comment which may be related. If some people value philosophy and following correct philosophical conclusions, that would explain Nick Bostrom, but I’m not sure what “valuing philosophy” is about exactly, or how to align AI to do that. Any thoughts on this?
People come to have sparse and beyond-lifetime goals through mechanisms that are unavailable to biological evolution— it took thousands of years of memetic evolution for people to even develop the concept of a long future that we might be able to affect with our short lives. We’re in a much better position to instill long-range goals into AIs, if we choose to do so— we can simply train them to imitate human thought processes which give rise to longterm-oriented behaviors.
trying to find patterns that reproduce themselves a lot with minimal change in the patterns (but still some change) seems like a better model of evolution to me, and by that metric, if we solve ai alignment with us, I think we’ll end up mostly solving our alignment with dna’s values. much of what dna valued has been lost, but those who care about the environment for its own sake and beauty will represent a high enough capability group to construct the repair process. if given the chance to do so by an AI that respects their values, anyway.
Many humans (especially me) value normativity (doing what is right) and philosophy, which are mysterious and/or contentious even to professional philosophers. Do you think it will be easy to align AIs to these values? If so, can you please go into some detail about this? For example, if there are many competing ideas for what normativity and philosophy really are, how can we ensure that AI will learn to value the right ones?
I’m curious what learnings you’re referring to. If you’ve written about it somewhere, please add a link to the post?
I think that a good analogy would be to compare the genome with the hyperparameters of neural networks. It’s not perfect, the genome influences human “training” in a much more indirect way (brain design, neurotransmitters) than hyperparameters, but it shows that evolutionary optimization of the genome (hyperparameters) happens on a different level than actual learning (human learning and training).
I would consider that you cannot weight these things along a single metric. Say evolution → human values really is only 4% of your value alignment, if that 4% is the fundamental core then it’s not part of the sum of all values, but a coefficient or a base where the other stuff is the exponent. It’s the hardware the software has to be loaded on, but not totally tabula rasa either.
Correct me if I’m wrong, but this would assume that if you could somehow make a human level intelligence snake and raise it in human society (let’s pretend nobody considers it weird that there’s a snake taking Chemistry class with them), then that snake would be 96% aligned with humanity?
My intuition would be along the lines of the parable of the scorpion and the frog.
The “4%” wasn’t addressing the question of “where do humans get their values from?” It was addressing “When trying to make predictions about AGI outcomes, how much weight should we assign these various sources of evidence?”
My perspective isn’t blank slatism. The genome has various levers by which it can influence the sorts of values that a human forms. E.g., the snake wouldn’t have human-like reward circuitry, so it would probably learn to value very different things than a human who went through the same experiences. For more on this, see: “Learning from scratch” in the brain.
So in this case I think we then agree. But it seems a bit at odds with the 4% weighting of genetic roots. If we agree the snake would exhibit very different values despite experiencing the ‘human learning’ part then shouldn’t this adjust the 60% weight you grant that? Seems the evolutionary roots made all the difference for the snake. Which is the whole point about initial AGI alignment having to be exactly right.
Otherwise I understand your post to be ‘for humans, how much of human value is derived from evolution vs learning’. But that’s using humans as evidence who are human to begin with.
This is a neat distillation of Steve’s piece, and also a helpful and persuasive extension. I appreciated the arguments 2, 3, and 4 in particular (‘2. We have more total evidence from human outcomes’, ‘3. Human learning trajectories represent a broader sampling of the space of possible learning processes’, ‘4. Evidence from humans are more accessible than evidence from evolution’).
I wanted to raise two counterpoints, without a strong opinion on how much weight they deserve.
You’re contrasting human learning → human values with evolution → human values. But without here strongly justifying, I think it makes sense to look more broadly at evolution → animal values (and potentially even broader than that). After all, the field of RL had its beginnings as a descriptive theory for animal and human behaviour more broadly.
a counter counter is that humans are the most obvious (maybe only) general-ish and reflective intelligence we know about
There might be an explanatory gap here for people caring quite so much about things like lineage, legacy, having children and descendants
maybe this is explained by within-lifetime reinforcement due to social norms, which are themselves subject to selection? (but this is just another evolution, right?) I don’t know if this is enough.
so I’m not sure it’s entirely true to say ‘5. Evolution could not have succeeded anyways’
unless you think the original optimization daemon point is completely inevitable
Evolution had to succeed. In order for evolution to be noticed and/or modeled by anything, the patterns of neurons had to align perfectly, even if there was a one-in-a-trillion chance of something like neurons randomly forming the correct general intelligence, anywhere, ever. The fact that we came from neuron brute forcing doesn’t tell us that much about whether neuron brute forcing can create general intelligence.
Animals and insects aren’t evidence at all; given that intelligence evolved, there would be plenty of offshoots.
By “evolution succeeds,” the OP means “succeeds at aligning humans with caring about inclusive genetic fitness” – not at creating general intelligence.
The link you include mentions that anthropic updating on our observations can sometimes give us evidence on how hard something was likely to be initially (e.g., the cold war example where survival is evidence that things were less dangerous than we might have thought, all else equal). You can do something similar with the evolution of intelligence: This paper argues that if the evolution of human-level intelligence had been very unlikely, we’d be closer to the extremes of when Earth is no longer hospitable to big-brained life forms. The fact that the sun isn’t going to expand for a while longer (and make Earth uninhabitable) or that asteroid risks aren’t massively overdue for us compared to evolutionary timescales suggests that the evolution of general intelligence on earth wasn’t some freak accident that would almost never happen again under similar circumstances.