I agree with most of the factual claims made in this post about evolution. I agree that “IGF is the objective” is somewhat sloppy shorthand. However, after diving into the specific ways the object level details of “IGF is the objective” play out, I am confused about why you believe this implies the things you claim they imply about the sharp left turn / inner misalignment. Overall, I still believe that natural selection is a reasonable analogy for inner misalignment.
I agree fitness is not a single stationary thing. I agree this is prima facie unlike supervised learning, where the objective is typically stationary. However, it is pretty analogous to RL, and especially multi agent RL, and overall I don’t think of the inner misalignment argument as depending on stationarity of the environment in either direction. AlphaGo might early in training select for policies that do tactic X initially because it’s a good tactic to use against dumb Go networks, and then once all the policies in the pool learn to defend against that tactic it is no longer rewarded. Therefore I don’t see any important disanalogy between evolution and multi agent RL. I have various thoughts on why language models do not make RL analogies irrelevant that I can explain but that’s another completely different rabbit hole.
I agree that humans (to a first approximation) still have the goals/drives/desires we were selected for. I don’t think I’ve heard anyone claim that humans suddenly have an art creating drive that suddenly appeared out of nowhere recently, nor have I heard any arguments about inner alignment that depend on an evolution analogy where this would need to be true. The argument is generally that the ancestral environment selected for some drives that in the ancestral environment reliably caused something that the ancestral environment selected for, but in the modern environment the same drives persist but their consequences in terms of [the amount of that which the ancestral environment was selecting for] now changes, potentially drastically. I think the misconception may arise from a closely related claim that some make, which is that AI systems might develop weird arbitrary goals (tiny metallic squiggles) because any goal with sufficient intelligence implies playing the training game and then doing a sharp left turn. However, the claim here is not that the tiny metallic squiggles drive will suddenly appear at some point and replace the “make humans really happy” drive that existed previously. The claim is that the drive for tiny metallic squiggles was always, from the very beginning, the reason why [make humans really happy] was the observed behavior in environment [humans can turn you off if they aren’t happy with you], and therefore in a different environment [humans can no longer turn you off], the observed behavior is [kill everyone and make squiggles].
I agree that everything is very complex always. I agree that there are multiple different goals/drives/desires in humans that result in children, of which the sex drive is only one. I agree that humans still have children sometimes, and still want children per se sometimes, but in practice this results in less and less children than in the ancestral environment over time (I bet even foragers are at least above replacement rate) for exactly the reason that the drives that we have always had for the reason that they caused us to survive/reproduce in the past now correspond much less well. I also agree that infanticide exists and occurs (but in the ancestral environment, there are counterbalancing drives like taboos around infanticide). In general, in many cases, simplifying assumptions totally break the analogy and make the results meaningless. I don’t think I’ve been convinced that this is one of those cases.
I don’t really care about defending the usage of “fitness as the objective” specifically, and so I don’t think the following is a crux and am happy to concede some of the points below for the sake of argument about the object facts of inner alignment. However, for completeness, my take on when “fitness” can be reasonably described as the objective, and when it can’t be:
I agree that couched in terms of the specific traits, the thing that evolution does in practice is sometimes favoring some traits and sometimes favoring other traits. However, I think there’s an important sense in which these traits are not drawn from a hat- natural selection selects for lighter/darker moths because it makes it easier for the moths to survive and reproduce! If lighter moths become more common whenever light moths survive and reproduce better, and vice versa for dark moths, as opposed to moths just randomly becoming more light or more dark in ways uncorrelated to survival/reproduction, it seems pretty reasonable to say that survival/reproduction is closer to the thing being optimized than some particular lightness/darkness function that varies between favoring lightness and darkness.
I agree it is possible to do artificial selection for some particular trait like moth color and in this case saying that the process optimizes “fitness” (or survival/reproduction) collapses to saying the same thing as the process optimizes moth lightness/darkness. I agree it would be a little weird to insist that “fitness” is the goal in this case, and that the color is the more natural goal. I also agree that the evolutionary equations plays out the same way whether the source of pressure is artificial human selection or birds eating the moths. Nonetheless, I claim the step where you argue the two cases are equivalent for the purposes of whether we can consider fitness the objective is the step that breaks down. I think the difference between this case and the previous case is that the causality flows differently. We can literally draw from a hat whether we want light moths or dark moths, and then reshape the environment until fitness lines up with our preference for darkness, whereas in the other case, the environment is drawn from a hat and the color selection is determined downstream of that.
Thank you, I like this comment. It feels very cooperative and like some significant effort went into it, and it also seems to touch the core of some important consideratios.
I notice I’m having difficulty responding, in that I disagree with some of what you said, but then have difficulty figuring out my reasons for that disagreement. I have the sense there’s a subtle confusion going on, but trying to answer you makes me uncertain whether others are the ones with the subtle confusion or if I am.
I’ll think about it some more and get back to you.
So I think the issue is that when we discuss what I’d call the “standard argument from evolution”, you can read two slightly different claims into it. My original post was a bit muddled because I think those claims are often conflated, and before writing this reply I hadn’t managed to explicitly distinguish them.
The weaker form of the argument, which I interpret your comment to be talking about, goes something like this:
The original evolutionary “intent” of various human behaviors/goals was to increase fitness, but in the modern day these behaviors/goals are executed even though their consequences (in terms of their impact on fitness) are very different. This tells us that the intent of the process that created a behavior/goal does not matter. Once the behavior/goal has been created, it will just do what it does even if the consequences of that doing deviate from their original purpose. Thus, even if we train an AI so that it carries out goal X in a particular context, we have no particular reason to expect that it would continue to automatically carry out the same (intended) goal if the context changes enough.
I agree with this form of the argument and have no objections to it. I don’t think that the points in my post are particularly relevant to that claim. (I’ve even discussed a form of inner optimization in humans that causes value drift that I don’t recall anyone else discussing in those terms before.)
However, I think that many formulations are actually implying, if not outright stating a stronger claim:
In the case of evolution, humans were originally selected for IGF but are now doing things that are completely divorced from that objective. Thus, even if we train an AI so that it carries out goal X in a particular context, we have a strong reason to expect that its behavior would deviate so much from the goal as to become practically unrecognizable.
So the difference is something like the implied sharpness of the left turn. In the weak version, the claim is just that the behavior might go some unknown amount to the left. We should figure out how to deal with this, but we don’t yet have much empirical data to estimate exactly how much it might be expected to go left. In the strong version, the claim is that the empirical record shows that the AI will by default swerve a catastrophic amount to the left.
(Possibly you don’t feel that anyone is actually implying the stronger version. If you don’t and you would already disagree with the stronger version, then great! We are in agreement. I don’t think it matters whether the implication “really is there” in some objective sense, or even whether the original authors intended it or not. I think the relevant thing is that I got that implication from the posts I read, and I expect that if I got it, some other people got it too. So this post is then primarily aimed at the people who did read the strong version to be there and thought it made sense.)
You wrote:
I agree that humans (to a first approximation) still have the goals/drives/desires we were selected for. I don’t think I’ve heard anyone claim that humans suddenly have an art creating drive that suddenly appeared out of nowhere recently, nor have I heard any arguments about inner alignment that depend on an evolution analogy where this would need to be true. The argument is generally that the ancestral environment selected for some drives that in the ancestral environment reliably caused something that the ancestral environment selected for, but in the modern environment the same drives persist but their consequences in terms of [the amount of that which the ancestral environment was selecting for] now changes, potentially drastically.
If we are talking about the weak version of the argument, then yes, I agree with everything here. But I think the strong version—where our behavior is implied to be completely at odds with our original behavior—has to implicitly assume that things like an art-creation drive are something novel.
Now I don’t think that anyone who endorses the strong version (if anyone does) would explicitly endorse the claim that our art-creation drive just appeared out of nowhere. But to me, the strong version becomes pretty hard to maintain if you take the stance that we are mostly still executing all of the behaviors that we used to, and it’s just that their exact forms and relative weightings are somewhat out of distribution. (Yes, right now our behavior seems to lead to falling birthrates and lots of populations at below replacement rates, which you could argue was a bigger shift than being “somewhat out of distribution”, but… to me that intuitively feels like it’s less relevant than the fact that most individual humans still want to have children and are very explicitly optimizing for that, especially since we’ve only been in the time of falling birthrates for a relatively short time and it’s not clear whether it’ll continue for very long.)
I think the strong version also requires one to hold that evolution does, in fact, consistently and predominantly optimize for a single coherent thing. Otherwise, it would mean that our current-day behaviors could be explained by “evolution doesn’t consistently optimize for any single thing” just as well as they could be explained by “we’ve experienced a left turn from what evolution originally optimized for”.
However, it is pretty analogous to RL, and especially multi agent RL, and overall I don’t think of the inner misalignment argument as depending on stationarity of the environment in either direction. AlphaGo might early in training select for policies that do tactic X initially because it’s a good tactic to use against dumb Go networks, and then once all the policies in the pool learn to defend against that tactic it is no longer rewarded.
I agree that there are contexts where it would be analogous to that. But in that example, AlphaGo is still being rewarded for winning games of Go, and it’s just that the exact strategies it needs to use differ. That seems different than e.g. the bacteria example, where bacteria are selected for exactly the opposite traits—either selected for producing a toxin and an antidote, or selected for not producing a toxin and an antidote. That seems to me more analogous to a situation where AlphaGo is initially being rewarded for winning at Go, then once it starts consistently winning it starts getting rewarded for losing instead, and then once it starts consistently losing it starts getting rewarded for winning again.
And I don’t think that that kind of a situation is even particularly rare—anything that consumes energy (be it a physical process such as producing a venom or a fur, or a behavior such as enjoying exercise) is subject to that kind of an “either/or” choice.
Now you could say that “just like AlphaGo is still rewarded for winning games of Go and it’s just the strategies that differ, the organism is still rewarded for reproducing and it’s just the strategies that differ”. But I think the difference is that for AlphaGo, the rewards are consistently shaping its “mind” towards having a particular optimization goal—one where the board is in a winning state for it.
And one key premise on which the “standard argument from evolution” rests is that evolution has not consistently shaped the human mind in such a direct manner. It’s not that we have been created with “I want to have surviving offspring” as our only explicit cognitive goal, with all of the evolutionary training going into learning better strategies to get there by explicit (or implicit) reasoning. Rather we have been given various motivations that exhibit varying degrees of directness in how useful they are for that goal—from “I want to be in a state where I produce great art” (quite indirect) to “I want to have surviving offspring” (direct), with the direct goal competing with all the indirect ones for priority. Unlike AlphaGo, which does have the cognitive capacity for direct optimization toward its goal being the sole reward criteria all along.
This is also a bit hard to put a finger on, but I feel like there’s some kind of implicit bait-and-switch happening with the strong version of the standard argument. It correctly points out that we have not had IGF as our sole explicit optimization goal because we didn’t start by having enough intelligence for that to work. Then it suggests that because of this, AIs are likely to also be misaligned… even though, unlike with human evolution, we could just optimize them for one explicit goal from the beginning, so we should expect our AIs to be much more reliably aligned with that goal!
I think the main crux is that in my mind, the thing you call the “weak version” of the argument simply is the only and sufficient argument for inner misalignment and very sharp left turn. I am confused precisely what distinction you draw between the weak and strong version of the argument; the rest of this comment is an attempt to figure that out.
My understanding is that in your view, having the same drive as before means also having similar actions as before. For example, if humans have a drive for making art, in the ancestral environment this means drawing on cave walls (maybe this helped communicate the whereabouts of food in the ancestral environment). In the modern environment, this may mean passing up a more lucrative job opportunity to be an artist, but it still means painting on some other surface. Thus, the art drive, taking almost the same kinds of actions it ever did (maybe we use acrylic paints from the store instead of grinding plants into dyes ourselves), no longer results in the same consequences in amount of communicating food locations or surviving and having children or whatever it may be. But this is distinct from a sharp left turn, where the actions also change drastically (from helping humans to killing humans).
I agree this is more true for some drives. However, I claim that the association between drives and behaviors is not true in general. I claim humans have a spectrum of different kinds of drives, which differ in how specifically the drive specifies behavior. At one end of the spectrum, you can imagine stuff like breathing or blinking where it’s kind of hard to even say whether we have a “breathing goal” or a clock that makes you breath regularly—the goal is the behavior, in the same way a cup has the “goal” of holding water. At this end of the spectrum it is valid to use goal/drive and behavior interchangeably. At the other end of the spectrum are goals/drives which are very abstract and specify almost nothing about how you get there: drives like desire for knowledge and justice and altruism and fear of death.
The key thing that makes these more abstract drives special is that because they do not specifically prescribe actions, the behaviors are produced by the humans reasoning about how to achieve the drive, as opposed to behaviors being selected for by evolution directly. This means that a desire for knowledge can lead to reading books, or launching rockets, or doing crazy abstract math, or inventing Anki, or developing epistemology, or trying to build AGI, etc. None of these were specifically behaviors that evolution could have reinforced in us—the behaviors available in the ancestral environment were things like “try all the plants to see which ones are edible”. Evolution reinforced the abstract drive for knowledge, and left it up to individual human brains to figure out what to do, using the various Lego pieces of cognition that evolution built for us.
This means that the more abstract drives can actually suddenly just prescribe really different actions when important facts in the world change, and those actions will look very different from the kinds of actions previously taken. To take a non-standard example, for the entire history of the existence of humanity up until quite recently, it just simply has not been feasible for anyone to contribute meaningfully to eradicating entire diseases (indeed, for most of human history there was no understanding of how diseases actually worked, and people often just attributed it to punishment of the gods or otherwise found some way to live with it, and sometimes, as a coping mechanism, to even think the existence of disease and death necessary or net good). From the outside it may appear as if for the entire history of humanity there was no drive for disease eradication, and then suddenly in the blink of an evolutionary timescale eye a bunch of humans developed a disease eradication drive out of nowhere, and then soon thereafter suddenly smallpox stopped existing (and soon potentially malaria and polio). These will have involved lots of novel (on evolutionary timescale) behaviors like understanding and manufacturing microscopic biological things at scale, or setting up international bodies for coordination. In actuality, this was driven by the same kinds of abstract drives that have always existed like curiosity and fear of death and altruism, not some new drive that popped into being, but it involved lots of very novel actions steering towards a very difficult target.
I don’t think any of these arguments depend crucially on whether there is a sole explicit goal of the training process, or if the goal of the training process changes a bunch. The only thing the argument depends on is whether there exist such abstract drives/goals (and there could be multiple). I think there may be a general communication issue where there is a type of person that likes to boil problems down to their core, which is usually some very simple setup, but then neglects to actually communicate why they believe this particular abstraction captures the thing that matters.
I am confused by your AlphaGo argument because “winning states of the board” looks very different depending on what kinds of tactics your opponent uses, in a very similar way to how “surviving and reproducing” looks very different depending on what kinds of hazards are in the environment. (And winning winning states of the board always looking like having more territory encircled seems analogous to surviving and reproducing always looking like having a lot of children)
I think there is also a disagreement about what AlphaGo does, though this is hard to resolve without better interpretability—I predict that AlphaGo is actually not doing that much direct optimization in the sense of an abstract drive to win that it reasons about, but rather has a bunch of random drives piled up that cover various kinds of situations that happen in Go. In fact, the biggest gripe I have with most empirical alignment research is that I think models today fail to have sufficiently abstract drives, quite possibly for reasons related to why they are kind of dumb today and why things like AutoGPT mysteriouly have failed to do anything useful whatsoever. But this is a spicy claim and I think not that many other people would endorse this.
I don’t think any of these arguments depend crucially on whether there is a sole explicit goal of the training process, or if the goal of the training process changes a bunch. The only thing the argument depends on is whether there exist such abstract drives/goals
I agree that they don’t depend on that. Your arguments are also substantially different from the ones I was criticizing! The ones I was responding were ones like the following:
The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn’t make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it’s not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can’t yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don’t suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities. (A central AI alignment problem: capabilities generalization, and the sharp left turn)
15. [...] We didn’t break alignment with the ‘inclusive reproductive fitness’ outer loss function, immediately after the introduction of farming—something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection. Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game. [...]
16. Even if you train really hard on an exact loss function, that doesn’t thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction.(AGI Ruin: A List of Lethalities)
Those arguments are explicitly premised on humans having been optimized for IGF, which is implied to be a single thing. As I understand it, your argument is just that humans now have some very different behaviors from the ones they used to have, omitting any claims of what evolution originally optimized us for, so I see it as making a very different sort of claim.
To respond to your argument itself:
I agree that there are drives for which the behavior looks very different from anything that we did in the ancestral environment. But does very different-looking behavior by itself constitute a sharp left turn relative to our original values?
I would think that if humans had experienced a sharp left turn, then the values of our early ancestors should look unrecognizable to us, and vice versa. And certainly, there do seem to be quite a few things that our values differ on—modern notions like universal human rights and living a good life while working in an office might seem quite alien and repulsive to some tribal warrior who values valor in combat and killing and enslaving the neighboring tribe, for instance.
At the same time… I think we can still basically recognize and understand the values of that tribal warrior, even if we don’t share them. We do still understand what’s attractive about valor, power, and prowess, and continue to enjoy those kinds of values in less destructive forms in sports, games, and fiction. We can read Gilgamesh or Homer or Shakespeare and basically get what the characters are motivated by and why they are doing the things they’re doing. An anthropologist can go to a remote tribe to live among them and report that they have the same cultural and psychological universals as everyone else and come away with at least some basic understanding of how they think and why.
It’s true that humans couldn’t eradicate diseases before. But if you went to people very far back in time and told them a story about a group of humans who invented a powerful magic that could destroy diseases forever and then worked hard to do so… then the people of that time would not understand all of the technical details, and maybe they’d wonder why we’d bother bringing the cure to all of humanity rather than just our tribe (though Prometheus is at least commonly described as stealing fire for all of humanity, so maybe not), but I don’t think they would find it a particularly alien or unusual motivation otherwise. Humans have hated disease for a very long time, and if they’d lost any loved ones to the particular disease we were eradicating they might even cheer for our doctors and want to celebrate them as heroes.
Similarly, humans have always gone on voyages of exploration—e.g. the Pacific islands were discovered and settled long ago by humans going on long sea voyages—so they’d probably have no difficulty relating to a story about sorcerers going to explore the moon, or of two tribes racing for the glory of getting there first. Babylonians had invented the quadratic formula by 1600 BC and apparently had a form of Fourier analysis by 300 BC, so the math nerds among them would probably have some appreciation of modern-day advanced math if it was explained to them. The Greek philosophers argued over epistemology, and there were apparently instructions on how to animate golems (arguably AGI-like) around by the late 12th/early 13th century.
So I agree that the same fundamental values and drives can create very different behavior in different contexts… but if it is still driven by the same fundamental values and drives in a way that people across time might find relatable, why is that a sharp left turn? Analogizing that to AI, it would seem to imply that if the AI generalized its drives in that kind of way when it came to novel contexts, then we would generally still be happy about the way it had generalized them.
This still leaves us with that tribal warrior disgusted with our modern-day weak ways. I think that a lot of what is going on with him is that he has developed particular strategies for fulfilling his own fundamental drives—being a successful warrior was the way you got what you wanted back in that day—and internalized them as a part of his aesthetic of what he finds beautiful and what he finds disgusting. But it also looks to me like this kind of learning is much more malleable than people generally expect. One’s sense of aesthetics can be updated by propagating new facts into it, and strongly-held identities (such as “I am a technical person”) can change in response to new kinds of strategies becoming viable, and generally many (I think most) deep-seated emotional patterns can at least in principle be updated. (Generally, I think of human values in terms of a two-level model, where the underlying “deep values” are relatively constant, with emotional responses, aesthetics, identities, and so forth being learned strategies for fulfilling those deep values. The strategies are at least in principle updatable, subject to genetic constraints such as the person’s innate temperament that may be more hardcoded.)
I think that the tribal warrior would be disgusted by our society because he would rightly recognize that we have the kinds of behavior patterns that wouldn’t bring glory in his society and that his tribesmen would find it shameful to associate with, and also that trying to make it in our society would require him to unlearn a lot of stuff that he was deeply invested in. But if he was capable of making the update that there were still ways for him to earn love, respect, power, and all the other deep values that his warfighting behavior had originally developed to get… then he might come to see our society as not that horrible after all.
I am confused by your AlphaGo argument because “winning states of the board” looks very different depending on what kinds of tactics your opponent uses, in a very similar way to how “surviving and reproducing” looks very different depending on what kinds of hazards are in the environment.
I don’t think the actual victory states look substantially different? They’re all ones where AlphaGo has more territory than the other player, even if the details of how you get there are going to be different.
I predict that AlphaGo is actually not doing that much direct optimization in the sense of an abstract drive to win that it reasons about, but rather has a bunch of random drives piled up that cover various kinds of situations that happen in Go.
Yeah, I would expect this as well, but those random drives would still be systematically shaped in a consistent direction (that which brings you closer to a victory state).
I agree with most of the factual claims made in this post about evolution. I agree that “IGF is the objective” is somewhat sloppy shorthand. However, after diving into the specific ways the object level details of “IGF is the objective” play out, I am confused about why you believe this implies the things you claim they imply about the sharp left turn / inner misalignment. Overall, I still believe that natural selection is a reasonable analogy for inner misalignment.
I agree fitness is not a single stationary thing. I agree this is prima facie unlike supervised learning, where the objective is typically stationary. However, it is pretty analogous to RL, and especially multi agent RL, and overall I don’t think of the inner misalignment argument as depending on stationarity of the environment in either direction. AlphaGo might early in training select for policies that do tactic X initially because it’s a good tactic to use against dumb Go networks, and then once all the policies in the pool learn to defend against that tactic it is no longer rewarded. Therefore I don’t see any important disanalogy between evolution and multi agent RL. I have various thoughts on why language models do not make RL analogies irrelevant that I can explain but that’s another completely different rabbit hole.
I agree that humans (to a first approximation) still have the goals/drives/desires we were selected for. I don’t think I’ve heard anyone claim that humans suddenly have an art creating drive that suddenly appeared out of nowhere recently, nor have I heard any arguments about inner alignment that depend on an evolution analogy where this would need to be true. The argument is generally that the ancestral environment selected for some drives that in the ancestral environment reliably caused something that the ancestral environment selected for, but in the modern environment the same drives persist but their consequences in terms of [the amount of that which the ancestral environment was selecting for] now changes, potentially drastically. I think the misconception may arise from a closely related claim that some make, which is that AI systems might develop weird arbitrary goals (tiny metallic squiggles) because any goal with sufficient intelligence implies playing the training game and then doing a sharp left turn. However, the claim here is not that the tiny metallic squiggles drive will suddenly appear at some point and replace the “make humans really happy” drive that existed previously. The claim is that the drive for tiny metallic squiggles was always, from the very beginning, the reason why [make humans really happy] was the observed behavior in environment [humans can turn you off if they aren’t happy with you], and therefore in a different environment [humans can no longer turn you off], the observed behavior is [kill everyone and make squiggles].
I agree that everything is very complex always. I agree that there are multiple different goals/drives/desires in humans that result in children, of which the sex drive is only one. I agree that humans still have children sometimes, and still want children per se sometimes, but in practice this results in less and less children than in the ancestral environment over time (I bet even foragers are at least above replacement rate) for exactly the reason that the drives that we have always had for the reason that they caused us to survive/reproduce in the past now correspond much less well. I also agree that infanticide exists and occurs (but in the ancestral environment, there are counterbalancing drives like taboos around infanticide). In general, in many cases, simplifying assumptions totally break the analogy and make the results meaningless. I don’t think I’ve been convinced that this is one of those cases.
I don’t really care about defending the usage of “fitness as the objective” specifically, and so I don’t think the following is a crux and am happy to concede some of the points below for the sake of argument about the object facts of inner alignment. However, for completeness, my take on when “fitness” can be reasonably described as the objective, and when it can’t be:
I agree that couched in terms of the specific traits, the thing that evolution does in practice is sometimes favoring some traits and sometimes favoring other traits. However, I think there’s an important sense in which these traits are not drawn from a hat- natural selection selects for lighter/darker moths because it makes it easier for the moths to survive and reproduce! If lighter moths become more common whenever light moths survive and reproduce better, and vice versa for dark moths, as opposed to moths just randomly becoming more light or more dark in ways uncorrelated to survival/reproduction, it seems pretty reasonable to say that survival/reproduction is closer to the thing being optimized than some particular lightness/darkness function that varies between favoring lightness and darkness.
I agree it is possible to do artificial selection for some particular trait like moth color and in this case saying that the process optimizes “fitness” (or survival/reproduction) collapses to saying the same thing as the process optimizes moth lightness/darkness. I agree it would be a little weird to insist that “fitness” is the goal in this case, and that the color is the more natural goal. I also agree that the evolutionary equations plays out the same way whether the source of pressure is artificial human selection or birds eating the moths. Nonetheless, I claim the step where you argue the two cases are equivalent for the purposes of whether we can consider fitness the objective is the step that breaks down. I think the difference between this case and the previous case is that the causality flows differently. We can literally draw from a hat whether we want light moths or dark moths, and then reshape the environment until fitness lines up with our preference for darkness, whereas in the other case, the environment is drawn from a hat and the color selection is determined downstream of that.
Thank you, I like this comment. It feels very cooperative and like some significant effort went into it, and it also seems to touch the core of some important consideratios.
I notice I’m having difficulty responding, in that I disagree with some of what you said, but then have difficulty figuring out my reasons for that disagreement. I have the sense there’s a subtle confusion going on, but trying to answer you makes me uncertain whether others are the ones with the subtle confusion or if I am.
I’ll think about it some more and get back to you.
So I think the issue is that when we discuss what I’d call the “standard argument from evolution”, you can read two slightly different claims into it. My original post was a bit muddled because I think those claims are often conflated, and before writing this reply I hadn’t managed to explicitly distinguish them.
The weaker form of the argument, which I interpret your comment to be talking about, goes something like this:
The original evolutionary “intent” of various human behaviors/goals was to increase fitness, but in the modern day these behaviors/goals are executed even though their consequences (in terms of their impact on fitness) are very different. This tells us that the intent of the process that created a behavior/goal does not matter. Once the behavior/goal has been created, it will just do what it does even if the consequences of that doing deviate from their original purpose. Thus, even if we train an AI so that it carries out goal X in a particular context, we have no particular reason to expect that it would continue to automatically carry out the same (intended) goal if the context changes enough.
I agree with this form of the argument and have no objections to it. I don’t think that the points in my post are particularly relevant to that claim. (I’ve even discussed a form of inner optimization in humans that causes value drift that I don’t recall anyone else discussing in those terms before.)
However, I think that many formulations are actually implying, if not outright stating a stronger claim:
In the case of evolution, humans were originally selected for IGF but are now doing things that are completely divorced from that objective. Thus, even if we train an AI so that it carries out goal X in a particular context, we have a strong reason to expect that its behavior would deviate so much from the goal as to become practically unrecognizable.
So the difference is something like the implied sharpness of the left turn. In the weak version, the claim is just that the behavior might go some unknown amount to the left. We should figure out how to deal with this, but we don’t yet have much empirical data to estimate exactly how much it might be expected to go left. In the strong version, the claim is that the empirical record shows that the AI will by default swerve a catastrophic amount to the left.
(Possibly you don’t feel that anyone is actually implying the stronger version. If you don’t and you would already disagree with the stronger version, then great! We are in agreement. I don’t think it matters whether the implication “really is there” in some objective sense, or even whether the original authors intended it or not. I think the relevant thing is that I got that implication from the posts I read, and I expect that if I got it, some other people got it too. So this post is then primarily aimed at the people who did read the strong version to be there and thought it made sense.)
You wrote:
If we are talking about the weak version of the argument, then yes, I agree with everything here. But I think the strong version—where our behavior is implied to be completely at odds with our original behavior—has to implicitly assume that things like an art-creation drive are something novel.
Now I don’t think that anyone who endorses the strong version (if anyone does) would explicitly endorse the claim that our art-creation drive just appeared out of nowhere. But to me, the strong version becomes pretty hard to maintain if you take the stance that we are mostly still executing all of the behaviors that we used to, and it’s just that their exact forms and relative weightings are somewhat out of distribution. (Yes, right now our behavior seems to lead to falling birthrates and lots of populations at below replacement rates, which you could argue was a bigger shift than being “somewhat out of distribution”, but… to me that intuitively feels like it’s less relevant than the fact that most individual humans still want to have children and are very explicitly optimizing for that, especially since we’ve only been in the time of falling birthrates for a relatively short time and it’s not clear whether it’ll continue for very long.)
I think the strong version also requires one to hold that evolution does, in fact, consistently and predominantly optimize for a single coherent thing. Otherwise, it would mean that our current-day behaviors could be explained by “evolution doesn’t consistently optimize for any single thing” just as well as they could be explained by “we’ve experienced a left turn from what evolution originally optimized for”.
I agree that there are contexts where it would be analogous to that. But in that example, AlphaGo is still being rewarded for winning games of Go, and it’s just that the exact strategies it needs to use differ. That seems different than e.g. the bacteria example, where bacteria are selected for exactly the opposite traits—either selected for producing a toxin and an antidote, or selected for not producing a toxin and an antidote. That seems to me more analogous to a situation where AlphaGo is initially being rewarded for winning at Go, then once it starts consistently winning it starts getting rewarded for losing instead, and then once it starts consistently losing it starts getting rewarded for winning again.
And I don’t think that that kind of a situation is even particularly rare—anything that consumes energy (be it a physical process such as producing a venom or a fur, or a behavior such as enjoying exercise) is subject to that kind of an “either/or” choice.
Now you could say that “just like AlphaGo is still rewarded for winning games of Go and it’s just the strategies that differ, the organism is still rewarded for reproducing and it’s just the strategies that differ”. But I think the difference is that for AlphaGo, the rewards are consistently shaping its “mind” towards having a particular optimization goal—one where the board is in a winning state for it.
And one key premise on which the “standard argument from evolution” rests is that evolution has not consistently shaped the human mind in such a direct manner. It’s not that we have been created with “I want to have surviving offspring” as our only explicit cognitive goal, with all of the evolutionary training going into learning better strategies to get there by explicit (or implicit) reasoning. Rather we have been given various motivations that exhibit varying degrees of directness in how useful they are for that goal—from “I want to be in a state where I produce great art” (quite indirect) to “I want to have surviving offspring” (direct), with the direct goal competing with all the indirect ones for priority. Unlike AlphaGo, which does have the cognitive capacity for direct optimization toward its goal being the sole reward criteria all along.
This is also a bit hard to put a finger on, but I feel like there’s some kind of implicit bait-and-switch happening with the strong version of the standard argument. It correctly points out that we have not had IGF as our sole explicit optimization goal because we didn’t start by having enough intelligence for that to work. Then it suggests that because of this, AIs are likely to also be misaligned… even though, unlike with human evolution, we could just optimize them for one explicit goal from the beginning, so we should expect our AIs to be much more reliably aligned with that goal!
I think the main crux is that in my mind, the thing you call the “weak version” of the argument simply is the only and sufficient argument for inner misalignment and very sharp left turn. I am confused precisely what distinction you draw between the weak and strong version of the argument; the rest of this comment is an attempt to figure that out.
My understanding is that in your view, having the same drive as before means also having similar actions as before. For example, if humans have a drive for making art, in the ancestral environment this means drawing on cave walls (maybe this helped communicate the whereabouts of food in the ancestral environment). In the modern environment, this may mean passing up a more lucrative job opportunity to be an artist, but it still means painting on some other surface. Thus, the art drive, taking almost the same kinds of actions it ever did (maybe we use acrylic paints from the store instead of grinding plants into dyes ourselves), no longer results in the same consequences in amount of communicating food locations or surviving and having children or whatever it may be. But this is distinct from a sharp left turn, where the actions also change drastically (from helping humans to killing humans).
I agree this is more true for some drives. However, I claim that the association between drives and behaviors is not true in general. I claim humans have a spectrum of different kinds of drives, which differ in how specifically the drive specifies behavior. At one end of the spectrum, you can imagine stuff like breathing or blinking where it’s kind of hard to even say whether we have a “breathing goal” or a clock that makes you breath regularly—the goal is the behavior, in the same way a cup has the “goal” of holding water. At this end of the spectrum it is valid to use goal/drive and behavior interchangeably. At the other end of the spectrum are goals/drives which are very abstract and specify almost nothing about how you get there: drives like desire for knowledge and justice and altruism and fear of death.
The key thing that makes these more abstract drives special is that because they do not specifically prescribe actions, the behaviors are produced by the humans reasoning about how to achieve the drive, as opposed to behaviors being selected for by evolution directly. This means that a desire for knowledge can lead to reading books, or launching rockets, or doing crazy abstract math, or inventing Anki, or developing epistemology, or trying to build AGI, etc. None of these were specifically behaviors that evolution could have reinforced in us—the behaviors available in the ancestral environment were things like “try all the plants to see which ones are edible”. Evolution reinforced the abstract drive for knowledge, and left it up to individual human brains to figure out what to do, using the various Lego pieces of cognition that evolution built for us.
This means that the more abstract drives can actually suddenly just prescribe really different actions when important facts in the world change, and those actions will look very different from the kinds of actions previously taken. To take a non-standard example, for the entire history of the existence of humanity up until quite recently, it just simply has not been feasible for anyone to contribute meaningfully to eradicating entire diseases (indeed, for most of human history there was no understanding of how diseases actually worked, and people often just attributed it to punishment of the gods or otherwise found some way to live with it, and sometimes, as a coping mechanism, to even think the existence of disease and death necessary or net good). From the outside it may appear as if for the entire history of humanity there was no drive for disease eradication, and then suddenly in the blink of an evolutionary timescale eye a bunch of humans developed a disease eradication drive out of nowhere, and then soon thereafter suddenly smallpox stopped existing (and soon potentially malaria and polio). These will have involved lots of novel (on evolutionary timescale) behaviors like understanding and manufacturing microscopic biological things at scale, or setting up international bodies for coordination. In actuality, this was driven by the same kinds of abstract drives that have always existed like curiosity and fear of death and altruism, not some new drive that popped into being, but it involved lots of very novel actions steering towards a very difficult target.
I don’t think any of these arguments depend crucially on whether there is a sole explicit goal of the training process, or if the goal of the training process changes a bunch. The only thing the argument depends on is whether there exist such abstract drives/goals (and there could be multiple). I think there may be a general communication issue where there is a type of person that likes to boil problems down to their core, which is usually some very simple setup, but then neglects to actually communicate why they believe this particular abstraction captures the thing that matters.
I am confused by your AlphaGo argument because “winning states of the board” looks very different depending on what kinds of tactics your opponent uses, in a very similar way to how “surviving and reproducing” looks very different depending on what kinds of hazards are in the environment. (And winning winning states of the board always looking like having more territory encircled seems analogous to surviving and reproducing always looking like having a lot of children)
I think there is also a disagreement about what AlphaGo does, though this is hard to resolve without better interpretability—I predict that AlphaGo is actually not doing that much direct optimization in the sense of an abstract drive to win that it reasons about, but rather has a bunch of random drives piled up that cover various kinds of situations that happen in Go. In fact, the biggest gripe I have with most empirical alignment research is that I think models today fail to have sufficiently abstract drives, quite possibly for reasons related to why they are kind of dumb today and why things like AutoGPT mysteriouly have failed to do anything useful whatsoever. But this is a spicy claim and I think not that many other people would endorse this.
I agree that they don’t depend on that. Your arguments are also substantially different from the ones I was criticizing! The ones I was responding were ones like the following:
Those arguments are explicitly premised on humans having been optimized for IGF, which is implied to be a single thing. As I understand it, your argument is just that humans now have some very different behaviors from the ones they used to have, omitting any claims of what evolution originally optimized us for, so I see it as making a very different sort of claim.
To respond to your argument itself:
I agree that there are drives for which the behavior looks very different from anything that we did in the ancestral environment. But does very different-looking behavior by itself constitute a sharp left turn relative to our original values?
I would think that if humans had experienced a sharp left turn, then the values of our early ancestors should look unrecognizable to us, and vice versa. And certainly, there do seem to be quite a few things that our values differ on—modern notions like universal human rights and living a good life while working in an office might seem quite alien and repulsive to some tribal warrior who values valor in combat and killing and enslaving the neighboring tribe, for instance.
At the same time… I think we can still basically recognize and understand the values of that tribal warrior, even if we don’t share them. We do still understand what’s attractive about valor, power, and prowess, and continue to enjoy those kinds of values in less destructive forms in sports, games, and fiction. We can read Gilgamesh or Homer or Shakespeare and basically get what the characters are motivated by and why they are doing the things they’re doing. An anthropologist can go to a remote tribe to live among them and report that they have the same cultural and psychological universals as everyone else and come away with at least some basic understanding of how they think and why.
It’s true that humans couldn’t eradicate diseases before. But if you went to people very far back in time and told them a story about a group of humans who invented a powerful magic that could destroy diseases forever and then worked hard to do so… then the people of that time would not understand all of the technical details, and maybe they’d wonder why we’d bother bringing the cure to all of humanity rather than just our tribe (though Prometheus is at least commonly described as stealing fire for all of humanity, so maybe not), but I don’t think they would find it a particularly alien or unusual motivation otherwise. Humans have hated disease for a very long time, and if they’d lost any loved ones to the particular disease we were eradicating they might even cheer for our doctors and want to celebrate them as heroes.
Similarly, humans have always gone on voyages of exploration—e.g. the Pacific islands were discovered and settled long ago by humans going on long sea voyages—so they’d probably have no difficulty relating to a story about sorcerers going to explore the moon, or of two tribes racing for the glory of getting there first. Babylonians had invented the quadratic formula by 1600 BC and apparently had a form of Fourier analysis by 300 BC, so the math nerds among them would probably have some appreciation of modern-day advanced math if it was explained to them. The Greek philosophers argued over epistemology, and there were apparently instructions on how to animate golems (arguably AGI-like) around by the late 12th/early 13th century.
So I agree that the same fundamental values and drives can create very different behavior in different contexts… but if it is still driven by the same fundamental values and drives in a way that people across time might find relatable, why is that a sharp left turn? Analogizing that to AI, it would seem to imply that if the AI generalized its drives in that kind of way when it came to novel contexts, then we would generally still be happy about the way it had generalized them.
This still leaves us with that tribal warrior disgusted with our modern-day weak ways. I think that a lot of what is going on with him is that he has developed particular strategies for fulfilling his own fundamental drives—being a successful warrior was the way you got what you wanted back in that day—and internalized them as a part of his aesthetic of what he finds beautiful and what he finds disgusting. But it also looks to me like this kind of learning is much more malleable than people generally expect. One’s sense of aesthetics can be updated by propagating new facts into it, and strongly-held identities (such as “I am a technical person”) can change in response to new kinds of strategies becoming viable, and generally many (I think most) deep-seated emotional patterns can at least in principle be updated. (Generally, I think of human values in terms of a two-level model, where the underlying “deep values” are relatively constant, with emotional responses, aesthetics, identities, and so forth being learned strategies for fulfilling those deep values. The strategies are at least in principle updatable, subject to genetic constraints such as the person’s innate temperament that may be more hardcoded.)
I think that the tribal warrior would be disgusted by our society because he would rightly recognize that we have the kinds of behavior patterns that wouldn’t bring glory in his society and that his tribesmen would find it shameful to associate with, and also that trying to make it in our society would require him to unlearn a lot of stuff that he was deeply invested in. But if he was capable of making the update that there were still ways for him to earn love, respect, power, and all the other deep values that his warfighting behavior had originally developed to get… then he might come to see our society as not that horrible after all.
I don’t think the actual victory states look substantially different? They’re all ones where AlphaGo has more territory than the other player, even if the details of how you get there are going to be different.
Yeah, I would expect this as well, but those random drives would still be systematically shaped in a consistent direction (that which brings you closer to a victory state).
It’s used a lot in the comment sections. Do you know a better refutal than this post?