Evolution Solved Alignment (what sharp left turn?)
Some people like to use the evolution of homo sapiens as an argument by analogy concerning the apparent difficulty of aligning powerful optimization processes:
And in the same stroke that its capabilities leap forward, its alignment properties are revealed to be shallow, and to fail to generalize.
The much-confused framing of this analogy has led to a protracted debate about its applicability.
The core issue is just misaligned mesaoptimization. We have a powerful optimization process optimizing world stuff according to some utility function. The concern is that a sufficiently powerful optimization process will (inevitably?) lead to internal takeover by a selfish mesa-optimizer unaligned to the outer utility function, resulting in a bad (low or zero utility) outcome.
In the AGI scenario, the outer utility function is CEV, or external human empowerment, or whatever (insert placeholder, not actually relevant). The optimization process is the greater tech economy and AI/ML research industry. The fear is that this optimization process, even if outer aligned, could result in AGI systems unaligned to the outer objective (humanity’s goals), leading to doom (humanity’s extinction). Success here would be largenum utility, and doom/extinction is 0. So the claim is mesaoptimization inner alignment failure leads to 0 utility outcomes: complete failure.
For the evolution of human intelligence, the optimizer is just evolution: biological natural selection. The utility function is something like fitness: e.g. gene replication count (of the human-defining genes)[1]. And by any reasonable measure, it is obvious that humans are enormously successful. If we normalize so that a utility score of 1 represents a mild success (the expectation for a typical draw of a great ape species), then humans' score is >4 OOM larger: completely off the charts.[2]
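For concreteness, here is one back-of-the-envelope way to get that number, using raw population count as the (admittedly crude) proxy and assuming a typical extant great-ape species numbers on the order of a couple hundred thousand individuals:

```latex
\frac{N_{\text{humans}}}{N_{\text{typical great ape}}}
  \approx \frac{8\times 10^{9}}{2\times 10^{5}}
  = 4\times 10^{4},
\qquad \log_{10}\!\left(4\times 10^{4}\right) \approx 4.6 \;\text{OOM}
```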
So the evolution of human intelligence is an interesting example of alignment success. The powerful runaway recursive criticality that everyone feared actually resulted in an enormous, anomalously high positive utility return, at least in this historical example. Human success, if translated into the AGI scenario, corresponds to the positive singularity of our wildest dreams.
Did it have to turn out this way? No!
Due to observational selection effects, we naturally wouldn't be here if mesaoptimization failure during brain evolution were too common across the multiverse.[3] But we could still have found ourselves in a world with many archaeological examples of species achieving human-level general technocultural intelligence and then going extinct: not due to AGI, of course, but simply due to becoming too intelligent to reproduce. As far as I know, we see no such examples.
And that is exactly what we'd expect to see in the historical record if mesaoptimization inner misalignment were a common failure mode: intelligent dinosaurs that suddenly went extinct, the ruins of proto-pachyderm cities, the traces of a long-forgotten underwater cetacean Atlantis, etc.
So evolution solved alignment in the only sense that actually matters: according to its own utility function, the evolution of human intelligence enormously increased utility, rather than imploding it to 0.
So back to the analogy—where did it go wrong?
The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it's not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can't yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don't suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms[4]…
Nate's critique is an example of the naive engineer fallacy. Nate is critiquing a specific detail of evolution's solution, but failing to notice that all that matters is the score, and humans are near an all-time high score[5]. Evolution didn't make humans explicitly just optimize mentally for IGF because that—by itself—probably would have been a stupid failure of a design, and evolution is a superhuman optimizer whose designs are subtle, mysterious, and often beyond human comprehension.
Instead, evolution created a solution with many layers and components: a defense in depth against mesaoptimization misalignment. And even though all of those components will inevitably fail in many individuals (even most!), that is completely irrelevant at the species level, and is in fact just part of the design of how evolution explores the state space.
And finally, if all else fails, evolution did in fact find some weird way to create humans who rather obviously consciously optimize for IGF! And so if the other mechanisms had all started to fail too frequently, the genes responsible for that phenotype would inevitably become more common.
On further reflection, much of premodern history already does look like at least some humans consciously optimizing for something like IGF: after all, “be fruitful and multiply” is hardly a new concept. What do you think was really driving the nobility of old, with all their talk of bloodlines and legacies? There already is some deeper drive to procreate at work in our psyche (to varying degrees); we are clearly not all just mere byproducts of pleasure’s pursuit[6].
The central takeaway is that evolution adapted the brain's alignment mechanisms/protections in tandem with our new mental capabilities, such that the sharp left turn led to an enormous runaway alignment success.
[1] Nitpick arguments about how you define this specifically are irrelevant and uninteresting. Homo sapiens is enormously successful! If you really think you know the true utility function of evolution, and humans are a failure according to that metric, you have simply deluded yourself. My argument here does not depend on the details of the evolutionary utility function.
[2] We are unarguably the most successful recent species, probably the most successful mammal species ever, and all that despite arriving in a geological blink of an eye. The dU/dt for homo sapiens is probably the highest ever, so we are tracking to be the most successful species ever, if current trends continue (which of course is another story).
[3] Full consideration of the observational selection effects also leads to an argument for alignment success via the simulation argument, as future alignment success probably creates many historical sims, whereas failures do not.
[4] Condom analogs are at least 5,000 years old; there is ample evidence that contraception was understood and used in various ancient civilizations, and many premodern tribal peoples understood herbal methods, so humans have probably had this knowledge since the beginning, in one form or another. (Although memetic evolution would naturally apply optimization pressure against wide usage.)
[5] Be careful anytime you find yourself defining peak evolutionary fitness as anything other than the species currently smiling from atop a giant heap of utility.
[6] I say this as I am about to have a child myself, planned for reasons I cannot yet fully articulate.
You write that the utility function is "something like fitness: e.g. gene replication count (of the human-defining genes)", and footnote 1 says that nitpick arguments about how it is defined are irrelevant.
Excuse me, what? This is not evolution’s utility function. It’s not optimizing for gene count. It does one thing, one thing only, and it does it well: it promotes genes that increase their RELATIVE FREQUENCY in the reproducing population.
The failure of alignment is witnessed by the fact that humans very very obviously fail to maximize the relative frequency of their genes in the next generation, given the opportunities available to them; and they are often aware of this; and they often choose to do so anyway. The whole argument in this post is totally invalid.
I don't understand why you say that promoting genes' relative frequency is how it should be defined. Wouldn't a gene-drive-like thing then max out that measure?
Promoting genes that caused species extinction would also count as a win by that metric. I think that can happen sometimes: e.g. larger individuals are more mate-worthy and the species gets ever bigger (or suchlike) until it no longer fits its niche and goes extinct. These seem like failure modes rather than the utility function. Are there protections against gene drives in organisms' populations?
IIUC, a lot of the DNA in a lot of species consists of gene-drive-like things.
By what standard are you judging when something is a failure mode or a desired outcome? I’m saying that what evolution is, is a big search process for genes that increase their relative frequency given the background gene pool. When evolution built humans, it didn’t build agents that try to promote the relative frequency of the genes that they are carrying. Hence, inner misalignment and sharp left turn.
Yes, gene drives have very high (gene-level) fitness. Genes that don’t get carried along with the drive can improve their own gene-level fitness by preventing gene drives from taking hold, so I’d expect there to also be machinery to suppress gene drives, if that’s easy enough to evolve.
If gene drives having high gene-level fitness seems wrong to you, then read this: https://www.lesswrong.com/posts/gDNrpuwahdRrDJ9iY/evolving-to-extinction. Or, if you have more time, Dawkins’s The Selfish Gene is quite good. Evolution is not anthropomorphic, and doesn’t try to avoid “failure modes” like extinction, it’s just a result of selection on a reproducing population.
I have read The Selfish Gene. I think the relative-frequency metric doesn't work. A gene that reduced the population from 1 million to 10 but increased its abundance to 100% would max out that metric. If it made the entire species go extinct, what does the metric even say?
Obviously the common understanding is that this is an evolutionary failure, but the metric disagrees. I'm not sure what kind of argument you would accept against your metric capturing the essence of evolution.
If you’ve read The Selfish Gene, then would you agree that under Dawkins’s notion of gene-level fitness, the genes composing a gene drive have high gene-level fitness? If not, why?
Always a good question to ask. In this comment, TekhneMakre gives a good example of a situation where the two metrics disagree: the gene A / gene B scenario quoted further below.
If the relative frequency metric captures the essence of evolution, then gene A should be successful and gene B should be unsuccessful. Conversely, the total-abundance metric you suggest implies that gene B should be successful while gene A should be unsuccessful.
So one argument that would change my mind is if you showed (e.g. via simulation, or just by a convincing argument) that gene B becomes prevalent in this hypothetical scenario.
Yes thanks, that thread goes over it in more detail than I could.
I don’t see how this detail is relevant. The fact remains that humans are, in evolutionary terms, much more successful than most other mammals.
Currently the world fertility rate is 2.3 children per woman. This is more than the often quoted 2.1 for a stable population. It is true that many developed countries are currently below that value. But this only means that these subpopulations will go extinct, while other subpopulations with higher fertility inherit the world. E.g. in Africa, Muslim countries, the Amish etc.
For example, in women, high fertility is associated with, and likely caused by, low IQ and low education. The popular theory is that smart, well-educated women have profitable and satisfying careers and consequently see a much higher opportunity cost for having children instead, so they have fewer or none. And people with cognitive properties that cause them to have more children are likely becoming more frequent in the population. The same goes for men who tend to forget to use condoms, etc. These outlooks are not rosy from an ethical perspective, since properties like low intelligence and low conscientiousness are associated with a lot of human hardship such as poverty and violence, but they are totally in line with what evolution optimizes for.
What do you mean by “in evolutionary terms, much more successful”?
Good question. I think a good approximation is total body mass of the population. By this measure, only cattle are more successful than humans, and obviously only because we deliberately breed them for food: https://xkcd.com/1338/
In other words, you’re pointing out that the people who have the most ability to choose how many children to have, choose on average to have fewer and therefore to reduce their genes’ relative frequencies in the next generation. They also have longer generation times, amplifying the effect. This is equivalent to “humans defect against evolution’s goals as soon as they have the opportunity to do so.”
Subpopulations which do this are expected to disappear relatively quickly on evolutionary time scales. Natural selection is error-correcting. This can mean people get less intelligent again, or they start to really love having children rather than just enjoying sex.
Correct. In a few handfuls of generations the population shifts in those directions. This holds as long as evolutionary timescales are the important ones, but it is not at all clear to me that this is what matters today. If the subpopulations that do this move faster than even short evolutionary timescales, then the selection pressure is light enough that they can oppose it.
I have no problem with a world where people evolve towards wanting more children or caring less about sex itself. I think if it were easy for evolution to achieve that it would have happened a long time ago, which means I think in practice we’re selecting among subpopulations for cultural factors more than biological ones. That, in turn, means those subpopulations are susceptible to outside influence… which is mostly what we’ve been seeing for the whole timeline of fertility dropping as nations develop economically. Some communities are more resistant because they start with stronger beliefs on this front, but none are immune. And frankly I think the only way they could be immune is by enforcing kinds of rigidity that the larger world won’t want to permit, seeing as they’d be seen as abusive, especially to children.
As far as intelligence goes: a world where the average person gets dumber while a small elite becomes smarter and more powerful and wealthier (by starting companies, inventing technologies, controlling policy making, and adopting things like life extension and genetic screening as they become available and viable) is an unstable powder keg. Eventually there’s conflict. Who wins? That depends on how big the gap is. Today, the smart people would probably lose, overwhelmed by numbers and lack of coordination ability. In the future, when the boundaries between populations are stark and stable cultural divides? Well by then the smarter subpopulation has all sorts of options. Like a custom virus that targets genetic characteristics of the low-intelligence subpopulation and causes infertility. Or armies of robots to supplement their numbers. Or radically longer lives so that they at least increase in absolute numbers over time and can have more children per lifetime (just slower) if they want to (note: this would enable people who love having kids to have even more of them!). Or carefully targeted propaganda campaigns (memetic warfare) to break down the cultural wall that’s sustaining the differences.
It is completely irrelevant to my larger point unless you are claiming that changing out the specific detailed choice of evolutionary utility function would change the high level outcome of humans being successful (relatively high scoring) according to that utility function, or more importantly—simply not being a failure.
My analogy is between:
humans optimizing AI according to human utility function which actually results in extinction (0 score according to that utility function) due to inner misalignment
and
evolution optimizing brains according to some utility function which actually results in extinction (0 score according to that utility function) due to inner misalignment
No part of my argument depends on the specific nitpicked detail of the evolution utility function, other than that it outputs a non-failure score for human history to date.
So are you actually arguing that humans are an evolutionary failure? Or that we simply got lucky as the development of early technoculture usually results in extinction? Any doomer predictions of our future extinction can’t be used as part of a doom argument—that’s just circular reasoning. The evidence to date is that brains were an enormous success according to any possible reasonable evolutionary utility function.
Or maybe you simply dislike the analogy because it doesn’t support your strongly held beliefs? Fine, but at least make that clear.
Evolution optimizes replicators whenever there is variation and selection over the replicated information. The units of replication are ultimately just bit sequences, some replicate more successfully than others, and replication count is thus the measure of replication success.
You can then consider individual genes or more complex distributions over gene sets that define species, but either way, some of these things replicate more successfully than others.
Sigh—what?
The distribution over genes changes over time, and the process driving that change is obviously an evolutionary optimization process, because there are 1.) mechanisms causing gene variation (mutation etc), and 2.) selection—some genes replicate more than others.
Thus changes which increase gene count are more frequent/likely, creating a clear direction favored by evolutionary optimization.
Say you have a species. Say you have two genes, A and B.
Gene A has two effects:
A1. Organisms carrying gene A reproduce slightly MORE than organisms not carrying A.
A2. For every copy of A in the species, every organism in the species (carrier or not) reproduces slightly LESS than it would have if not for this copy of A.
Gene B has two effects, the reverse of A:
B1. Organisms carrying gene B reproduce slightly LESS than organisms not carrying B.
B2. For every copy of B in the species, every organism in the species (carrier or not) reproduces slightly MORE than it would have if not for this copy of B.
So now what happens with this species? Answer: A is promoted to fixation, whether or not this causes the species to go extinct; B is eliminated from the gene pool. Evolution doesn’t search to increase total gene count, it searches to increase relative frequency. (Note that this is not resting specifically on the species being a sexually reproducing species. It does rest on the fixedness of the niche capacity. When the niche doesn’t have fixed capacity, evolution is closer to selecting for increasing gene count. But this doesn’t last long; the species grows to fill capacity, and then you’re back to zero-sum selection.)
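To make the fixed-capacity case concrete, here is a minimal Python sketch of this scenario; the population size and effect sizes are arbitrary illustrative values I've picked:

```python
import random

N = 1000        # fixed niche capacity (population size)
GENS = 300
s_ind = 0.05    # individual-level advantage of carrying A (and disadvantage of B)
s_pop = 1e-4    # per-copy effect of A (harms everyone) / B (helps everyone)

# Each individual either carries or lacks gene A and gene B; start both at ~10%.
pop = [{'A': random.random() < 0.1, 'B': random.random() < 0.1} for _ in range(N)]

for _ in range(GENS):
    nA = sum(ind['A'] for ind in pop)
    nB = sum(ind['B'] for ind in pop)
    # Effects A2/B2: a population-wide multiplier applied to every individual.
    # Under a fixed niche capacity it cancels out of *relative* fitness.
    common = max(1e-9, 1.0 - s_pop * nA + s_pop * nB)
    weights = [common
               * (1 + s_ind if ind['A'] else 1.0)   # effect A1: carriers of A do better
               * (1 - s_ind if ind['B'] else 1.0)   # effect B1: carriers of B do worse
               for ind in pop]
    # Next generation: fitness-proportional resampling back to capacity N.
    pop = [dict(p) for p in random.choices(pop, weights=weights, k=N)]

print(f"freq(A) = {sum(i['A'] for i in pop) / N:.2f}, "
      f"freq(B) = {sum(i['B'] for i in pop) / N:.2f}")
# Typical result: freq(A) = 1.00, freq(B) = 0.00. A sweeps to fixation and B is
# eliminated, because with the capacity fixed only relative fitness matters.
```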
Sure, genes and species (defined as distributions over gene packages) are separate replicators. Both replicate according to variation and selection, so both evolve.
Notice in your example gene A could actually fail in the long run if it’s too successful, which causes optimization pressure at the larger system/package level to protect against these misalignments (see all the various defenses against transposons, retroviruses, etc).
Ok so the point is that the vast vast majority of optimization power coming from {selection over variation in general} is coming more narrowly from {selection for genes that increase their relative frequency in the gene pool} and not from {selection between different species / other large groups}. In arguments about misalignment, evolution refers to {selection for genes that increase their relative frequency in the gene pool}.
If you run a big search process, and then pick a really extreme actual outcome X of the search process, and then go back and say “okay, the search process was all along a search for X”, then yeah, there’s no such thing as misalignment. But there’s still such a thing as a search process visibly searching for Y and getting some extreme and non-Y-ish outcome, and {selection for genes that increase their relative frequency in the gene pool} is an example.
No. Selection is over the distribution that defines the species set (and recursively over the fractal clusters within that, down to individuals), and operates at the granularity of complete gene packages (individuals), not individual genes.
The search process is just searching for designs that replicate well in the environment. There could be misalignment in theory—as I discussed, that would manifest as species tending to go extinct right around the early technocultural transition, when you have a massive sudden capability gain due to the exploding power of within-lifetime learning/optimization.
So the misalignment is possible in theory, but we do not have evidence of that in the historical record. We don’t live in that world.
This is a retcon, as I described here:
You're speaking as though humanity is the very first example of a species that reproduced a lot, but it's always been the case that some species reproduced more than others and left more descendant species—the ancestor of mammals or eukaryotes, for example. This force has been constant and significant for as long as evolution has been a thing (more selection happens at the within-species level, sure, but that doesn't mean between-species selection is completely unprecedented).
Within an organism there are various forms of viral genes which reproduce themselves largely at the expense of cells, organs, or the whole organism. A species is actually composed recursively of groups with geographically varying gene distributions and some genes can grow within local populations at the expense of that local population. But that is counteracted by various mechanisms selecting at larger scales, and all of this is happening at many levels simultaneously well beyond that of just gene and species. A species decomposes fractally geographically into many diverse subgroups with some but limited gene flow (slowing the spread of faster ‘defecting’ viral like genes) and which are all in various forms of competition over time, but still can interbreed and thus are part of the same species.
The world is composed of patterns. Some patterns replicate extensively, so pattern measure varies over many OOM. The patterns both mutate and variably replicate, so over time the distribution over patterns changes.
In full generality, an optimization process applies computation to update the distribution over patterns. This process has a general temporal arrow—a direction/gradient. The 'utility function' is thus simply the function F such that dF/dt describes the gradient. For evolution in general, this is obviously pattern measure, and truly cannot be anything else.
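One way to write down this framing in symbols (my own minimal notation, just to make the claim explicit):

```latex
% Let m_t(x) be the measure (replication count) of pattern x at time t,
% and for a family of patterns S define the total measure
F_S(t) = \sum_{x \in S} m_t(x).
% The claim is that variation plus differential replication gives the process
% a persistent direction: for the pattern families that replicate well, on average
\mathbb{E}\!\left[\frac{dF_S}{dt}\right] > 0,
% so total pattern measure plays the role of the implicit utility function.
```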
Typically most individual members of a species will fail to replicate at all, but that is irrelevant on several levels: their genes are not limited to the one physical soma, and the species is not any individual regardless. In fact if every individual reproduced at the same rate, the species would not evolve—as selection requires differential individual failure to drive species success.
Alignment in my analogy has a precise definition and measurable outcome: species success. Any inner/outer alignment failure results in species extinction, or it wasn’t an alignment failure, period. This definition applies identically to the foom doom scenario (where AGI causes human extinction), and the historical analogy (where evolution of linguistic intelligence could cause a species to go extinct because they decide to stop reproducing).
This sure sounds like my attempt elsewhere to describe your position:
Which you dismissed.
One evolutionary process but many potential competing sub-components. Of course there is always misalignment.
The implied optimization gradient of any two different components of the system can never be perfectly aligned (as otherwise they wouldn’t be different).
The foom doom argument is that humanity and AGI will be very misaligned such that the latter’s rise results in the extinction of the former.
The analogy from historical evolution is the misalignment between human genes and human minds, where the rise of the latter did not result in extinction of the former. It plausibly could have, but that is not what we observe.
The analogy is that the human genes thing produces a thing (human minds) which wants stuff, but the stuff it wants is different from what the human genes want. From my perspective you're strawmanning and failing to track the discourse here to a sufficient degree that I'm bowing out.
Not nearly different enough to prevent the human genes from getting what they want in excess.
If we apply your frame of the analogy to AGI, we have slightly misaligned AGI which doesn’t cause human extinction, and instead enormously amplifies our utility.
From my perspective you persistently ignore, misunderstand, or misrepresent my arguments, overfocus on pedantic details, and refuse to update or agree on basics.
I think there’s benefit from being more specific about what we’re arguing about.
CLAIM 1: If there’s a learning algorithm whose reward function is X, then the trained model will not necessarily explicitly desire X.
I think everyone agrees that this is true, and that evolution provides an example. Most people don’t even know what inclusive genetic fitness is, and those who do, and who also know that donating eggs / sperm would score highly on that metric, nevertheless often don’t donate eggs / sperm.
CLAIM 2: If there’s a learning algorithm whose reward function is X, then the trained model cannot possibly explicitly desire X.
I think everyone agrees that this is false—neither Nate nor almost anyone else (besides Yampolskiy) thinks perfect AGI alignment is impossible. I think everyone probably also agrees that evolution provides a concrete counterexample—it's a big world, people have all kinds of beliefs and desires, there's almost certainly at least one person somewhere who knows what IGF is and explicitly wants to maximize theirs.
CLAIM 3: If there’s a learning algorithm whose reward function is X, and no particular efforts are taken to ensure alignment (e.g. freezing the model occasionally and attempting mechanistic interpretability, or simbox testing, or whatever), then the probability that the trained model will catastrophically fail to achieve X at all is: [FILL IN THE BLANK].
Here there’s some disagreement about how to fill in the blank in the context of AGI. (I’ll leave aside evolution for the moment, I’ll get back to it after these bullet points.)
Nate’s probability for claim 3 in the context of AGI alignment is, IIUC, maybe upwards of 90% (but no more). And then he gets to >>90% probability of doom via disjunctive stuff.
I say “No, I refuse to fill in the blank at the end of claim 3, until you need to tell me more specifics about this learning algorithm”.
That said, for the AGI learning algorithms that I personally expect to be used (which are importantly disanalogous to evolution, and also importantly disanalogous to LLM+RLHF), I would fill in the blank at, I dunno, maybe 50%. (And then, like Nate, I get to much higher probability of doom for other reasons.)
I’m not sure what OP thinks.
Then a separate question is whether evolution provides evidence on how to fill in that blank.
I agree with OP that humans today are doing clearly better than the very low bar of “catastrophically fail to achieve IGF at all”. I think the jury is still out about what will happen in the future. Maybe future humans will all upload their brains into computers, which would presumably count as zero IGF, if we define IGF via literal DNA molecules (which we don’t have to, but that’s another story). Whatever, I dunno.
But I mostly don’t care either way, because again, I don’t expect AGI learning algorithms to resemble evolution-of-humans tightly enough to make evolution a useful source of evidence on claim 3. I think different learning algorithms call for filling in the blank differently.
Noting that I think this is not “perfect” AGI alignment, and I think it’s actually a pretty horrible outcome.
Whether it’s horrible or not depends on the X, right? Which I didn’t really define, and which is a whole can of worms (reward misgeneralization, use/mention distinction, etc.). (Like, if we send RL reward based on stock market price, is claim 2 talking about “trained model explicitly wants high stock price”, or “trained model explicitly wants high reward function output”, or “we get to define X to be either of those and many more things besides”? I didn’t really specify.)
But yeah, for X's that we can plausibly operationalize into a traditional RL reward function†, I'm inclined to say that we probably don't want an AI that explicitly wants X. I think we're in agreement on that.
† (as opposed to weirder things like “the reward depends in part on the AI’s thought process, and not just its outputs”)
I mostly agree of course, but will push back a bit on claim 1:
Not really—if anything, evolution provides the counterexample. In general, creatures do 'explicitly desire' reproductive fitness, to the reasonable extent that 'explicitly desire' makes sense or translates to that organism's capability set. Humans absolutely do desire reproductive fitness explicitly—to varying degrees, and have since forever. The claim is only arguably true if you pedantically define "explicitly desire reproductive fitness" as something useless like "express the detailed, highly specific, currently accepted concept from evolutionary biology in English", which is ridiculous.
A Sumerian in 1000 BC who lived for the purpose of ensuring the success of their progeny and kin absolutely is an example of explicitly desiring and optimizing for the genetic utility function of “be fruitful and multiply”.
Of course—and we'd hope that there is some decoupling eventually! Otherwise it's just "be fruitful and multiply", forever.
This “we’d hope” is misalignment with evolution, right?
Naturally the fact that bio evolution has been largely successful at alignment so far doesn’t mean that will continue indefinitely into the future after further metasystems transitions.
But that is future speculation, and it's obviously tautological to argue "bio evolution will fail at alignment in the future" as part of your future model and argument for why alignment is hard in general—we haven't collected that evidence yet!
Moreover, it's not so much that we are misaligned with evolution, and more that bio evolution is being usurped by memetic/info evolution—the future replicator patterns of merit are increasingly no longer just physical genes.
I’m saying that the fact that you, an organism built by the evolutionary process, hope to step outside the evolutionary process and do stuff that the evolutionary process wouldn’t do, is misalignment with the evolutionary process.
I’m saying you didn’t seem to grasp my distinction between systemic vs bio evolution. I do not “hope to step outside the evolutionary process”. The decoupling is only with bio-evolution and genes. The posthuman goal is to move beyond biology, become substrate independent etc, but that is hardly the end of evolution.
A different way to maybe triangulate here: Is misalignment possible, on your view? Like does it ever make sense to say something like “A created B, but failed at alignment and B was misaligned with A”? I ask because I could imagine a position, that sort of sounds a little like what you’re saying, which goes:
I literally provided examples of what misalignment with bio-evolution would look like:
If we haven’t seen such an extinction in the archaeological record, it can mean one of several things:
misalignment is rare, or
misalignment is not rare once the species becomes intelligent, but intelligence is rare or
intelligence usually results in transcendence, so there’s only one transition before the bio becomes irrelevant in the lightcone (and we are it)
We don’t know which. I think it’s a combination of 2 and 3.
The original argument that your OP is responding to is about “bio evolution”. I understand the distinction, but why is it relevant? Indeed, in the OP you say:
So we’re talking about bio evolution, right?
The OP is talking about history and thus bio evolution, and this thread shifted into the future (where info-evolution dominates) here:
I’m saying that you, a bio-evolved thing, are saying that you hope something happens, and that something is not what bio-evolution wants. So you’re a misaligned optimizer from bio-evolution’s perspective.
If you narrowly define the utility function as "IGF via literal DNA molecules" (which obviously is the relevant context for my statement "hope that there is some decoupling eventually"), then obviously I'm somewhat misaligned to that utility function (but not completely; I am having children). And I'm increasingly aligned with the more general utility functions.
None of this is especially relevant, because I am not a species.
Well, there is at least one human ever whose explicit desires are strongly counter to inclusive genetic fitness (e.g. a person who murders their whole family and then themselves), and there is at least one human ever whose explicit desires are essentially a perfect match to inclusive genetic fitness. I hope we can at least agree about that extremely weak claim. :)
Anyway, I think you raise another interesting question along the lines of:
Claim 4: If we train an AGI using RL with reward function X, and the trained model winds up explicitly wanting Y which is not exactly X but has a lot to do with X and correlates with X, then the trained model ultimately winds up doing lots of X.
You seem to believe this claim, right?
E.g. your comment is sending the vibe that, if we notice that an ancient Sumerian explicitly wanted success of their progeny and kin—which is not exactly IGF, but about as closely related as one could reasonably hope for under the circumstances—then we should be happy, i.e. we should treat that fact as a reason to be optimistic. If so, that vibe seems to flow through the validity of claim 4, right?
I.e., if Claim 4 is false, we would instead have to say “Well yeah sure, the Sumerian’s motivations are as good as one could reasonably hope for under the circumstances…but that’s still not good enough!”.
Anyway, what opinion should we have about Claim 4? Mine is: I mostly want to know more about X & Y (and the exact nature of the learning algorithm) before saying more. I can think of X & Y pairs where it’s plausibly OK. But I strongly believe there are also lots of cases where Claim 4 spectacularly fails by Goodhart’s law.
Basically, the more powerful the AGI gets, and the more it can do things like “invent exotic new technology, reshape itself, reshape humans, reshape the world, etc.”, then the more room there is for the AGI to get into way-out-of-distribution situations where X & Y strongly come apart (cf. last few paragraphs of here).
If humans eventually stop using literal DNA molecules because we start uploading our brains or find some other high-tech alternative, that would be an example of Claim 4 failing, I think.
Obviously true, but also weird that anyone would think that's relevant? Even for a population which is perfectly aligned in expectation, you wouldn't expect any individual to be perfectly aligned.
How could the Sumerian literally be optimizing explicitly for IGF when the term/concept didn’t exist until 1964? IGF is also almost certainly a crude approximation of the ‘true’ utility function of genetic evolution.
So given that we have uncertainty over the utility function of evolution, and moreover that explicitly encoding a very complex utility function precisely into an (extremely resource-constrained) brain would be a stupid engineering idea regardless—clearly suboptimal for actually optimizing said utility function—the analogy argument doesn't depend on the explicit form of evolution's utility function at all.
For any reasonable choice of a genetic utility function, homo sapiens is successful. Thus the evolution of human brains is an example of successful alignment: the exploding power of our new inner optimizers (brains) did not result in species failure. It is completely irrelevant whether individual humans are aligned or not, and ridiculous to expect them to be. The success criterion is a measure over the species, not individuals.
Mostly agree yes it depends on many details, but this post was specifically about updating on the fact that we do not observe evolution having trouble with alignment—and again there is only a singular correct way to measure that (species level success/failure).
Maybe? But that hasn't happened yet. Also, what if all the DNA information still exists, but stored more compactly on digital computers, which then replicate enormously in simulation, and the simulations themselves expand as our compute spreads a bit into space and advances further?
The argument “but homo sapiens obviously will go extinct in the future, cuz doom” obviously has no room as part of an argument for doom in the first place.
Some people think that misalignment in RL is fundamentally impossible. I’m pretty sure you’re not in that category. But for such people, it’s nice to have an example of “RL algorithm with reward function R → at least one trained model that scores catastrophically poorly on R as soon as it goes out of distribution”.
The CoinRun thing could also work in this context, but is subject to the (misguided) rebuttal “oh but that’s just because the model’s not smart enough to understand R”. So human examples can also be useful in the course of such conversations.
Here are a bunch of claims that are equally correct:
The history of evolution to date is exactly what you’d get from an RL algorithm optimizing genes for how many future copies are encoded in literal DNA molecules in basement reality.
The history of evolution to date is exactly what you’d get from an RL algorithm optimizing genes for how many future copies are encoded in literal DNA molecules in either basement reality or accurate simulations.
The history of evolution to date is exactly what you’d get from an RL algorithm optimizing genes for how many future copies are encoded in DNA molecules, or any other format that resembles it functionally (regardless of whether it resembles it chemically or mechanistically).
[infinitely many more things like that]
(Related to “goal misgeneralization”). If humans switch from DNA to XNA or upload themselves into simulations or whatever, then the future would be high-reward according to some of those RL algorithms and the future would be zero-reward according to others of those RL algorithms. In other words, one “experiment” is simultaneously providing evidence about what the results look like for infinitely many different RL algorithms. Lucky us.
You're welcome to argue that the evidence from some of these "experiments" is more relevant to AI alignment than the evidence from others of these experiments. (I figure that's what you mean when you say "for any reasonable choice of a genetic utility function…"?) But to make that argument, I think you'd have to start talking more specifically about what you expect AI alignment to look like and thus why some of those bullet-point RL algorithms are more disanalogous to AI alignment than others. You seem not to be doing that, unless I missed it; instead you seem to be just talking intuitively about what's "reasonable" (not what's a tight analogy).
(Again, my own position is that all of those bullet points are sufficiently disanalogous to AI alignment that it’s not really a useful example, with the possible exception of arguing with someone who takes the extreme position that misalignment is impossible in RL as discussed at the top of this comment.)
I don't actually believe that, but I also suspect that for those who do, the example isn't relevant, because evolution is not an RL algorithm.
Evolution is an outer optimizer that uses RL as the inner optimizer. Evolution proceeds by performing many experiments in parallel, all of which are imperfect/flawed, and the flaws are essential to the algorithm's progress. The individual inner RL optimizers can't be perfectly aligned (if they were, the outer optimizer wouldn't actually have worked in the first place), and the outer optimizer can be perfectly aligned even though every single instance of the inner optimizer is at least somewhat misaligned. This is perhaps less commonly understood than I initially assumed, but true nonetheless.
Technology evolves according to an evolutionary process that is more similar/related to genetic evolution (but with greatly improved update operators). The analogy is between genetic evolution over RL brains <-> tech evolution over RL AGIs.
Again none of those future scenarios have played out, they aren’t evidence yet, just speculation.
I made this diagram recently—I just colored some things red to flag areas where I think we’re disagreeing.
The first thing is you think “evolution is not a RL algorithm”. I guess you think it’s a learning algorithm but not a reinforcement learning algorithm? Or do you not even think it’s a learning algorithm? I’m pretty confused by your perspective. Everyone calls PPO an RL algorithm, and I think of evolution as “kinda like PPO but with a probably-much-less-efficient learning rule because it’s not properly calculating gradients”. On the other hand, I’m not sure exactly what the definition of RL is—I don’t think it’s perfectly standardized.
(We both agree that, as a learning algorithm, evolution is very weird compared to the learning algorithms that people have been programming in the past and will continue programming in the future. For example, ML runs normally involve updating the trained model every second or sub-second or whatever, not “one little trained model update step every 20 years”. In other words, I think we both agree that the right column is a much much closer analogy for future AGI training than the middle column.)
(But as weird a learning algorithm as the middle column is, it’s still a learning algorithm. So if we want to make and discuss extremely general statements that apply to every possible learning algorithm, then it’s fair game to talk about the middle column.)
The second thing is that, on the bottom row, I get the impression that:
you want to emphasize that the misalignment could be much higher (e.g. if nobody cared about biological children & kin),
I want to emphasize that the misalignment could be much less (e.g. if men were paying their life savings for the privilege of being a sperm donor).
But that’s not really a disagreement, because both of those things are true. :)
To a certain extent this is taxonomy/definitions, but there is an entire literature on genetic/evolutionary optimization algorithms and it does seem odd to categorize them under RL—I've never heard/seen those authors categorize it that way, and RL is a somewhat distant branch of ML. In my mental taxonomy (which I do believe is more standard/canonical), the genetic/evolutionary search family of algorithms are combinatoric/discrete algorithms that evaluate many candidate solutions in parallel and use separate heuristics for updating params (typically not directly related to the fitness function), and instead use fitness function evaluation to select the replication factor for the next generation. They approximate the full Bayesian posterior with large-scale but crude sampling.
This has advantages over gradient methods for exploring highly compressed/combinatoric parameter spaces, but scales poorly to higher dimensions, as it doesn't solve fine-grained credit assignment at all.
RL on the other hand is just a slight variation of UL/SL, and is usually (almost always?) a continuous/gradient method.
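As a toy illustration of that family (and of the contrast with gradient methods), here is a minimal genetic-algorithm sketch in Python; the OneMax fitness function and all the parameter values are arbitrary choices of mine, just to show the parallel-evaluation, fitness-sets-replication-factor structure described above:

```python
import random

POP, LEN, GENS, MUT = 50, 32, 40, 0.02   # arbitrary toy parameters

def fitness(genome):
    # Toy "OneMax" fitness: count of 1-bits. Stands in for "evaluate the
    # candidate solution" (e.g. simulate the phenotype and score it).
    return sum(genome)

# Initial population of random bit-string genomes.
pop = [[random.randint(0, 1) for _ in range(LEN)] for _ in range(POP)]

for _ in range(GENS):
    scores = [fitness(g) for g in pop]        # evaluate all candidates "in parallel"
    # Fitness sets the replication factor: fitter genomes are sampled more often
    # as parents of the next generation (selection).
    parents = random.choices(pop, weights=scores, k=POP)
    # Separate update heuristic (here: point mutation), not a gradient step.
    pop = [[(1 - b) if random.random() < MUT else b for b in p] for p in parents]

print("best fitness:", max(fitness(g) for g in pop))   # climbs toward LEN = 32
```

Note that there is no per-parameter gradient anywhere: credit assignment happens only through which whole genomes get replicated more, which is the scaling limitation mentioned above.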
So anyway the useful analogy is not between genetic evolution and within-lifetime learning, it’s between (genetic evolution, brain within-lifetime learning) and (tech/memetic evolution, ML training).
Random @nostalgebraist blog post: ““Reinforcement learning” (RL) is not a technique. It’s a problem statement, i.e. a way of framing a task as an optimization problem, so you can hand it over to a mechanical optimizer. … Life itself can be described as an RL problem”
Russell & Norvig 3rd edition: “The task of reinforcement learning is to use observed rewards to learn an optimal (or nearly optimal) policy for the environment. …Reinforcement learning might be considered to encompass all of AI.” (emphasis in original)
If we take the Russell & Norvig definition, and plug in “reward = IGF” and “policy = a genome that (in the right environment) unfolds into a complete brain and body”, then evolution by natural selection is “using observed rewards to learn an optimal (or nearly optimal) policy for the environment”, thus qualifying as RL according to that definition. Right?
Obviously I agree that if we pick a paper on arxiv that has “RL” in its title, it is almost certainly not talking about evolutionary biology :-P
I’ll also reiterate that I think we both agree that, even if the center column of that chart picture above is “an RL algorithm” by definition, it’s mechanistically pretty far outside the distribution of RL algorithms that human AI programmers are using now and are likely to use in the future, whereas within-lifetime learning is a more central example in almost every way. I’m definitely not disagreeing about that.
One important difference between RL and SL / SSL is that in RL, after you take an action, there’s no ground truth about what action you counterfactually should have taken instead. In other words, SL / SSL gives you a full error gradient “for free” with each query, whereas RL doesn’t. (You can still get an error gradient in policy-optimization, but it’s more expensive—you need to query the policy a bunch of times to get one gradient, IIUC.) Thus, RL algorithms involve random exploration then exploitation, whereas things like ImageNet training and LLM pretraining do not involve any random exploration.
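For reference, the standard score-function (REINFORCE) identity is one way to see that extra cost: each policy-gradient estimate is an average over K sampled trajectories and their rewards, rather than a free per-example error signal:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right]
  \;\approx\; \frac{1}{K} \sum_{k=1}^{K} R(\tau_k)\, \nabla_\theta \log \pi_\theta(\tau_k)
```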
(Notice here that evolution-by-natural-selection does involve a version of explore-then-exploit: mutations are random, and are initially trialed in a small number of individual organisms, and only if they’re helpful are they deployed more and more widely, eventually across the whole species.)
If we define RL broadly as “learning/optimizing an algorithm/policy towards optimality according to some arbitrary reward/objective function”, then sure it becomes indistinguishable from general optimization.
However to me it is specifically the notion of reward in RL that distinguishes it, as you say here:
Which already makes it less general than general optimization. A reward function is inherently a proxy—designed by human engineers to approximate the more complex unknown utility, or evolved by natural selection as a practical approximate proxy for something like IGF.
Evolution by natural selection doesn't have any proxy reward function, so it lacks that distinguishing feature of RL. The 'optimization objective' of biological evolution is simply something like an emergent telic arrow of physics: replicators tend to replicate, similar to how net entropy always increases, etc. When humans use evolutionary algorithms as evolutionary simulations, the fitness function is more or less approximating the fitness of a genotype if the physics were simulated out in more detail.
So anyway I think of RL as always an inner optimizer, using a proxy reward function that approximates the outer optimizer’s objective function.
I don’t think that, in order for an algorithm to be RL, its reward function must by definition be a proxy for something else more complicated. For example, the RL reward function of AlphaZero is not an approximation of a more complex thing—the reward function is just “did you win the game or not?”, and winning the game is a complete and perfect description of what DeepMind programmers wanted the algorithm to do. And everyone agrees that AlphaZero is an RL algorithm, indeed a central example. Anyway, AlphaZero would be an RL algorithm regardless of the motivations of the DeepMind programmers, right?
True, in games the environment itself can directly provide a reward channel, such that the perfect 'proxy' simplifies to the trivial identity connection on that channel. But that's hardly an interesting case, right? A human ultimately designed the reward channel for that engineered environment, often as a proxy for some human concept.
The types of games/sims that are actually interesting for AGI, or even just general robots or self driving cars, are open ended where designing the correct reward function (as a proxy for true utility) is much of the challenge.
It's a very weird notion of what constitutes evidence. If you built AGI and your interpretability tools showed you that the AGI was plotting to kill you, it would be pretty hard evidence in favor of the sharp left turn, even if you were still alive.
Not at all—just look at the definition of Solomonoff induction: the distribution over world models/theories is updated strictly on new historical observation bits, and never on future predicted observations. If you observe mental states inside the AGI, that is naturally valid observational evidence from your pov. But that is very different from you predicting that the AGI is going to kill you, and then updating your world model based on those internal predictions—those feedback loops rapidly diverge from reality (and may be related to schizophrenia).
I’m talking about observable evidence, like, transhumanists claiming they will drop their biological bodies on first possibility.
I think Nate’s post “Humans aren’t fitness maximizers” discusses this topic more directly than does his earlier “Sharp Left Turn” post. It also has some lively discussion in the comments section.
I won’t argue with the basic premise that at least on some metrics that could be labeled as evolution’s “values”, humans are currently doing very well.
But, the following are also true:
Evolution has completely lost control. Whatever happens to human genes from this point forward is entirely dependent on the whims of individual humans.
We are almost powerful enough to accidentally cause our total extinction in various ways, which would destroy all value from evolution's perspective.
There are actions that humans could take, and might take once we get powerful enough, that would seem fine to us but would destroy all value from evolution’s perspective.
Examples of such actions in (3) could be:
We learn to edit the genes of living humans to gain whatever traits we want. This is terrible from evolution's perspective, if evolution is concerned with maximizing the prevalence of existing human genes.
We learn to upload our consciousness onto some substrate that does not use genes. This is also terrible from a gene-maximizing perspective.
None of those actions is guaranteed to happen. But if I were creating an AI, and I found that it was enough smarter than me that I no longer had any way to control it, and if I noticed that it was considering total-value-destroying actions as reasonable things to maybe do someday, then I would be extremely concerned.
If the claim is that evolution has “solved alignment”, then I’d say you need to argue that the alignment solution is stable against arbitrary gains in capability. And I don’t think that’s the case here.
All 3 of your points are future speculations, and as such are not evidence yet. The evidence we have to date is that homo sapiens are an anomalously successful species, despite the criticality phase transition of a runaway inner optimization process (brains).
So all we can say is that the historical evidence gives us an example of a two stage optimization process (evolutionary outer optimization and RL/UL within lifetime learning) producing AGI/brains which are roughly sufficiently aligned at the population level such that the species is enormously successful (high utility according to the outer utility function, even if there is misalignment between that and the typical inner utility function of most brains).
There exists a clear misalignment between the principles of evolution and human behavior. This discrepancy is evident in humans’ inclination towards pursuing immediate gratification, often referred to as “wireheading,” rather than prioritizing the fundamental goals of replication and survival.
An illustrative example of this misalignment can be observed in the unfortunate demise of certain indigenous communities due to excessive alcohol consumption. This tragic outcome serves as a poignant reminder of the consequences of prioritizing immediate pleasure over long-term sustainability.
In contemporary society, the majority of humans exhibit a preference for a high-quality and pleasurable life over the singular focus on reproduction. This shift in priorities has begun to exert a discernible impact on global fertility rates.
In fact, one could view the historical trajectory of civilization as a progression towards increasingly effective methods of wireheading, essentially the stimulation of pleasure centers in the human brain. Initially, the negative repercussions of such behavior were confined to specific individuals and groups, exerting limited global effects. However, as our capacity for wireheading continues to advance, there is a growing concern that it may gradually influence the entirety of human civilization.
In essence, the tension between our evolutionary legacy and our pursuit of immediate gratification is an issue that transcends individual actions and has the potential to reshape the course of our entire society.
But natural selection favors people who prefer having children, and various forms of wireheading eventually stop working if they decrease fitness.
“The training loop is still running” is not really a counterargument to “the training loop accidentally spat out something smarter than the training loop, that wants stuff other than the stuff the training loop is optimizing for, that iterates at a much faster speed than the training, and looks like it’s just about to achieve total escape velocity”.
I don’t think it looks like we are about to achieve escape velocity. The broader laws of natural selection and survival of the fittest even apply to future AI and Grabby Aliens. Essentially the argument from “Meditations On Moloch”.
These are distinct from biological evolution, so if our descendants end up being optimized over by them rather than by biological evolution, that feels like it’s conceding the argument (that we’ll have achieved escape velocity).
If Evolution had a lot more time to align humans to relative-gene-replication-count, before humans put an end to biological life, then sure, seems plausible that Evolution might be able to align humans very robustly. But Evolution does not have infinite time or “retries”—humanity is in the process of executing something like a “sharp left turn”, and seems likely to succeed long before the human gene pool is taken over by sperm bank donors and such.
Humans have not put an end to biological life.
Your doom predictions aren't evidence, and can't be used in any way in this analogy. To do so is just circular reasoning: "Sure, brains haven't demonstrated misalignment yet, but they are about to, because doom is coming! Therefore evolution fails at alignment and thus doom is likely!"
For rational minds, evidence is strictly historical[1]. The evidence we have to date is that humans are enormously successful, despite any slight misalignment or supposed “sharp left turn”.
There are many other scenarios where DNA flourishes even after a posthuman transition.
[1] Look closely at how Solomonoff induction works, for example. Its world model is updated strictly from historical evidence, not its own future predictions.
Yup. I, too, have noticed that.
C’mon, man, that’s obviously a misrepresentation of what I was saying. Or maybe my earlier comment failed badly at communication? In case that’s so, here’s an attempted clarification (bolded parts added):
Point being: Sure, Evolution managed to cough up some individuals who explicitly optimize for IGF. But they’re exceptions, not the rule; and humanity seems (based on past observations!) to be on track to (mostly) end DNA-based life. So it seems premature to say that Evolution succeeded at aligning humanity.
In case you’re wondering what past observations lead me to think that humans are unaligned[2] w.r.t. IGF and on track to end (or transcend) biological life, here are some off the top of my head:
Of the people whose opinions on the subject I’m aware of (including myself), nearly all would like to transcend (or end) biological life.[3]
Birth rates in most developed nations have been low or below replacement for a long time.[4] There seems to be a negative correlation between wealth/education and number of offspring produced. That matches my impression that as people gain wealth, education, and empowerment in general, most choose to spend it mostly on something other than producing offspring.
Diligent sperm bank donors are noteworthy exceptions. Most people are not picking obvious low-hanging fruit for increasing their IGF. Rich people waste money on yachts and stuff, instead of using it to churn out as many high-fitness offspring as possible; etc.
AFAIK, most of the many humans racing to build ASI are not doing so with the goal of increasing their IGF. And absent successful attempts to align ASI specifically to producing lots of DNA-based replicators, I don’t see strong reason to expect the future to be optimized for quantity of DNA-based replicators.
Perhaps you disagree with the last point above?
Interesting. Could you list a few of those scenarios?
Note: I wasn’t even talking (only) about doom; I was talking about humanity seemingly being on track to end biological life. I think the “good” outcomes probably also involve transcending biology/DNA-based replicators.
to the extent that it even makes sense to talk about incoherent things like humans being “(mis/un)aligned” to anything.
My sample might not be super representative of humanity as a whole. Maybe somewhat representative of people involved in AI, though?
At least according to sources like this: https://en.wikipedia.org/wiki/Total_fertility_rate
Evolution has succeeded at aligning homo sapiens brains to date[1] - that is the historical evidence we have.
I don’t think most transhumanists explicitly want to end biological life, and most would find that abhorrent. Transcending to a postbiological state probably doesn’t end biology any more/less than biology ended geology.
The future is complex and unknown. Is the ‘DNA’ we are discussing the information content or the physical medium? Seems rather obvious it’s the information that matters, not the medium. Transcendence to a posthuman state probably involves vast computation, some of which is applied to ancestral simulations (which we may already be in), and that would enormously preserve and multiply the info content of the DNA.
Not perfectly of course but evolution doesn’t do anything perfectly. It aligned brains well enough such that the extent of any misalignment was insignificant compared to the enormous utility our brains provided.
I’m guessing we agree on the following:
Evolution shaped humans to have various context-dependent drives (call them Shards) and the ability to mentally represent and pursue complex goals. Those Shards were good proxies for IGF in the EEA[1].
Those Shards were also good[2] enough to produce billions of humans in the modern environment. However, it is also the case that most modern humans spend at least part of their optimization power on things orthogonal to IGF.
I think our disagreement here maybe boils down to approximately the following question:
With what probability are we in each of the following worlds?
(World A) The Shards only work[2:1] conditional on the environment being sufficiently similar to the EEA, and humans not having too much optimization power. If the environment changes too far OOD, or if humans were to gain a lot of power[3], then the Shards would cease to be good[2:2] proxies.
In this world, we should expect the future to contain only a small fraction[4] of the “value” it would have, if humanity were fully “aligned”[2:3]. I.e. Evolution failed to “(robustly) align humanity”.
(World B) The Shards (in combination with other structures in human DNA/brains) are in fact sufficiently robust that they will keep humanity aligned[2:4] even in the face of distributional shift and humans gaining vast optimization power.
In this world, we should expect the future to contain a large fraction of the “value” it would have, if humanity were fully “aligned”[2:5]. I.e. Evolution succeeded in “(robustly) aligning humanity”.
(World C) Something else?
I think we’re probably in (A), and IIUC, you think we’re most likely in (B). Do you consider this an adequate characterization?
If yes, the obvious next question would be: What tests could we run, what observations could we make,[5] that would help us discern whether we’re in (A) or (B) (or (C))?
(For example: I think the kinds of observations I listed in my previous comment are moderate-to-strong evidence for (A); and the existence of some explicit-IGF-maximizing humans is weak evidence for (B).)
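One way to spell out what “evidence for (A)” means here is just the odds form of Bayes’ rule (a standard identity, not anything specific to this thread): for any observation $o$,

$$
\frac{P(A \mid o)}{P(B \mid o)} \;=\; \frac{P(o \mid A)}{P(o \mid B)} \cdot \frac{P(A)}{P(B)},
$$

so an observation discriminates between the worlds exactly to the extent that the two worlds assign it different likelihoods; observations both worlds predict equally well leave the odds untouched.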
Environment of evolutionary adaptedness. For humans: hunter-gatherer tribes on the savanna, or maybe primitive subsistence agriculture societies.
in the sense of optimizing for IGF, or whatever we’re imagining Evolution to “care” about.
e.g. ability to upload their minds, construct virtual worlds, etc.
Possibly (but not necessarily) still a large quantity in absolute terms.
Without waiting a possibly-long time to watch how things in fact play out.
I agree with your summary of what we agree on—that evolution succeeded at aligning brains to IGF so far. That was the key point of the OP.
Before getting into World A vs World B, I need to clarify again that my standard for “success at alignment” is a much weaker criterion than you may be assuming. You seem to consider success to require getting near the maximum possible (ie large fraction) utility, which I believe is uselessly unrealistic. By success I simply mean not a failure, as in not the doom scenario of extinction or near zero utility.
So World A is still a partial success if there is some reasonable population of humans (say even just on the order of millions) in bio bodies or in detailed sims.
I don’t agree with this characterization—the EEA ended ~10k years ago and human fitness has exploded since then rather than collapsed to zero. It is a simple fact that according to any useful genetic fitness metric, human fitness has exploded with our exploding optimization power so far.
I believe this is the dominant evidence, and it indicates:
If tech evolution is similar enough to bio evolution then we should roughly expect tech evolution to have a similar level of success
Likewise doom is unlikely unless the tech evolution process producing AGI has substantially different dynamics from the gene evolution process which produced brains
See this comment for more on the tech/gene evolution analogy and potential differences.
I don’t think your evidence from “opinions of people you know” is convincing for the same reasons I don’t think opinions from humans circa 1900 were much useful evidence for predicting the future of 2023.
I don’t think “humans explicitly optimizing for the goal of IGF” is even the correct frame to think of how human value learning works (see shard theory).
As a concrete example, Elon Musk seems to be on track for high long term IGF, without consciously optimizing for IGF.
(Ah. Seems we were using the terms “(alignment) success/failure” differently. Thanks for noting it.)
In-retrospect-obvious key question I should’ve already asked: Conditional on (some representative group of) humans succeeding at aligning ASI, what fraction of the maximum possible value-from-Evolution’s-perspective do you expect the future to attain? [1]
My modal guess is that the future would attain ~1% of maximum possible “Evolution-value”.[2]
Seems like a reasonable (albeit very preliminary/weak) outside view, sure. So, under that heuristic, I’d guess that the future will attain ~1% of max possible “human-value”.
setting completely aside whether to consider the present “success” or “failure” from Evolution’s perspective.
I’d call that failure on Evolution’s part, but IIUC you’d call it partial success? (Since the absolute value would still be high?)
In general I think maximum values are weird because they are potentially nearly unbounded, but it sounds like we may then be in agreement, terminology aside.
But in general I do not think of anything “less than 1% of the maximum value” as failure in most endeavors. For example the maximum attainable wealth is perhaps $100T or something, but I don’t think it’d be normal/useful to describe the world’s wealthiest people as failures at being wealthy because they only have ~$100B or whatever.
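Spelling out the arithmetic on those (admittedly made-up, illustrative) numbers:

$$
\frac{\$100\text{B}}{\$100\text{T}} \;=\; \frac{10^{11}}{10^{14}} \;=\; 0.1\%,
$$

i.e. the wealthiest individuals sit at roughly a tenth of a percent of the hypothetical maximum, and nobody describes that as failure at being wealthy.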
And regardless, the standard doom arguments from EY/MIRI etc. are very much “AI will kill us all!”, and not “AI will prevent us from attaining over 1% of maximum future utility!”
I agree that a successful post-human world would probably involve a large amount[1] of resources spent on simulating (or physically instantiating) things like humans engaging in play, sex, adventure, violence, etc. IOW, engaging in the things for which Evolution installed Shards in us. However, I think that is not the same as [whatever Evolution would care about, if Evolution could care about anything]. For the post-human future to be a success from Evolution’s perspective, I think it would have to be full of something more like [programs (sentient or not, DNA or digital) striving to make as many copies of themselves as possible].
(If we make the notion of “DNA” too broad/vague, then we could interpret almost any future outcome as “success for Evolution”.)
a large absolute amount, but maybe not a large relative amount.
Any ancestral simulation will naturally be full of that, so it boils down to the simulation argument.
The natural consequence of “postbiological humans” is the effective disempowerment, if not extinction, of humanity as a whole.
Such “transhumanists” clearly do not find the eradication of biology abhorrent, any more than any normal person would find the idea of “substrate independence” (the death of all love and life) to be abhorrent.
I agree that the optimization process of natural selection likely irons out any cases of temporary misalignment (some human populations having low fertility) over a medium time span. People who tend to not have children eventually get replaced by people who love children, or people who tend to forget to use contraceptives etc. This is basically the force Scott Alexander calls Moloch.
Unfortunately this point doesn’t obviously generalize to AI alignment. Natural selection is a simple, natural optimization process, which optimizes a simple “goal”. But getting an AI to align with human ethics might be much harder, because human ethics might be a target much easier to miss, and not one that nature automatically “aims” for by itself, as it does for IGF.
I don’t think human ethics is really the right alignment goal? We want AI aligned to true human utility (individually, and in aggregate). The human brain utility/value function is actually a proxy of IGF, but aligning a proxy to that proxy doesn’t seem much more difficult than aligning a proxy to IGF (especially given lower-complexity universal proxies like external empowerment).
Much of the human brain alignment mechanism already is empathic/altruistic inter-brain alignment, which evolved because IGF acts over genes shared across kin/family/tribe, in tension with personal selfish empowerment. We can intentionally reverse engineer those mechanisms, and even without doing so, memetic evolution may apply a similar pressure (both because AI is trained on human thoughts, and because of the parallel between code/meme and gene evolution).
My point is that organisms become, over generations, aligned to IGF automatically, simply because organisms who have more offspring outcompete others in the future. But no such automatism exists for purposefully engineering systems that optimize for other aims. Which does strongly suggest it is more likely to go wrong.
My analogy is between genetic evolution and technological/memetic evolution, which are both just specific instances of general systemic replicating pattern evolution. At a high level AGI will be created by humanity as a technocultural system, which is guided by memetic evolution.
Ideas which are successful (ie the transformer meme in DL) can quickly replicate massively, and your automatic alignment argument also translates to memetic evolution. The humans working on AGI do not have random values, they have values that are the end result of millennia of cultural evolution.
Hmm… Reading your last comment again, it seems you make a point similar to Robin Hanson’s “Poor Folks Do Smile”: an argument that evolution optimizes, to some limited extent, for happiness or satisfied preferences, to a degree significantly above neutral-welfare lives. Add to that Hanson’s commitment to total (rather than average) utilitarianism, and evolution comes out as pretty aligned with ethics, since it optimizes for a high number of positive-welfare individuals. I’m not sure how his argument is supposed to generalize to a world with AI though. Presumably we humans don’t want to be replaced with even fitter AI, while evolutionary pressures arguably do “want” this. And total utilitarianism (the Repugnant Conclusion) seems unreasonable to me anyway.
But apart from your similarity to Hanson: I think the goal of AI alignment really is ethics. Not some specific human values shaped by millennia of biological or even cultural evolution. First, I don’t think cultural evolution influences our terminal values. And while biological evolution does indeed shape terminal values, it operates fundamentally on the level of individual agents. But ethics is about aggregate welfare. A society of Martians may have different terminal values from us (they may not fear death after procreation, for example), but they could still come up with the same forms of utilitarianism as we do. Ethics, at least if some sort of utilitarianism is right, isn’t about labelling as “good” some idiosyncratic collection of terminal values that we historically evolved to have, but rather about maximizing an aggregate (like average or sum) of the terminal values (happiness or desire satisfaction) of individuals, whatever those individual values may be. I think biological evolution doesn’t very strongly optimize for the total aggregate (though Hanson may disagree), and hardly at all for the average.
Though admittedly I don’t fully understand your position here.
It seems straightforwardly obvious that it does, for reasonable definitions of terminal values: for example, a dedicated jihadi who sacrifices their mortal life for reward in the afterlife is rather obviously pursuing complex cultural (linguistically programmed) terminal values.
I don’t believe in ethics in that sense. Everything we do as humans is inevitably unavoidably determined by our true values shaped by millennia of cultural evolution on top of eons of biological evolution. Ethics is simply a system of low complexity simplified inter agent negotiation protocols/standards—not our true values at all. Our true individual values can only be understood through deep neuroscience/DL ala shard theory.
Individual humans working on AGI are motivated by their individual values, but the aggregate behavior of the entire tech economy is best understood on its own terms at the system level as an inhuman optimization process optimizing for some utility function ultimately related to individual human values and their interactions, but not in obvious ways, and not really related to ethics.
But that’s clearly an instrumental value. The expected utility of sacrificing his life may be high if he believes he will have an afterlife as a result. But if he just stops believing that, his beliefs (his epistemics) have changed, and so the expected utility calculation changes, and so the instrumental value of sacrificing his life has changed. His value for the afterlife doesn’t need to change at all.
You can imagine a jihadist who reads some epistemology and science books and as a result comes to believe that the statements in the Quran weren’t those of God but rather of some ordinary human without any prophetic abilities, which reduces their credibility to that of an arbitrary work of fiction. So he may stop believing in God altogether. Even if only his beliefs have changed, it’s very unlikely he will continue to see any instrumental value in being a jihadist.
It’s like drinking a glass of clear liquid because you want to quench your thirst. Unfortunately the liquid contains poison, while you assumed it was pure water. Drinking poison here doesn’t mean you terminally value poison, it just means you had a false belief.
Regarding ethics/alignment: You talk about what people are actually motivated by. But this is arguably a hodgepodge of mostly provisional instrumental values that we would significantly modify if we changed our beliefs about the facts. As we have done in the past decades or centuries. First we may believe doing X is harmful, and so instrumentally disvalue X, then we may learn more about the consequences of doing X, such that we now believe X is beneficial or just harmless. Or the other way round. It would be no good to optimize an AI that is locked in with increasingly outdated instrumental values.
And an AI that always blindly follows our current instrumental values is also suboptimal: It may be much smarter than us, so it would avoid many epistemic mistakes we make. A child may value some amount of autonomy, but it also values being protected from bad choices by its smarter parents, choices which would harm rather than benefit it. An aligned ASI should act like such a parent: It should optimize for what we would want if we were better informed. (We may even terminally value some autonomy at the cost of making some avoidable instrumental mistakes, though only to some extent.) For this the AI has to find out what our terminal values are, and it has to use some method of aggregating them, since our values might conflict in some cases. That’s what normative ethics is about.
There are several endgame scenarios for evolution:
We create misaligned AI and go extinct along with the rest of the biosphere — a catastrophic failure for evolution.
We create aligned AI and proceed into a glorious transhuman future where everyone is either a cyborg, an uploaded mind, or a nanobot swarm with no trace of DNA — another catastrophic failure for evolution.
We create aligned AI and proceed into a glorious transhuman future, but some people choose to remain “natural” for nostalgic reasons. I don’t think we can extrapolate transhuman ethics to determine whether this will happen.
From the POV of evolution, it’s as if we initiated ASI and thought, “Well, it hasn’t killed us yet.”
Was creating photosynthesis and destroying most of the biosphere with oxygen a “catastrophic failure for evolution”?
Yes, because for most of the evolved lineages at the time, their values of gene preservation were suddenly zeroed. The later increase in diversity doesn’t change the fact that it was a big dip.
Hardly catastrophic if it was a dip followed by a great increase in diversity&complexity. More akin to a dip during a cyclic learning rate shift that partially resets the board to jump the system out of a local equilibrium.
What of scenarios where both organic life and ‘AI’ go ‘extinct’? e.g. a gamma ray burst, really big asteroid impact, etc...
Seems like humans are soon going to put an end to DNA-based organisms, or at best relegate them to some small fraction of all “life”. I.e., seems to me that the future is going to score very poorly on the gene-replication-count utility function, relative to what it would score if humanity (or individual humans) were actually aligned to gene-replication-count.
Do you disagree? (Do you expect the post-ASI future to be tiled with human DNA?)
Obviously Evolution doesn’t actually have a utility function, and if it did, gene-replication-count is probably not it, as TekhneMakre points out. But, let’s accept that for now, arguendo.
Given that we’re not especially powerful optimizers relative to what’s possible (we’re only powerful relative to what exists on Earth… for now), this is at best an existence proof for the possibility of alignment for optimizers of fairly limited power. That is to say, I don’t think this result is very relevant to the discussion of a sharp left turn in AI, because even if someone buys your argument, AIs are not necessarily like humans in the relevant ways that would make them likely to be aligned with anything in particular.
Really? Would your argument change if we could demonstrate a key role for sexual selection, primate wars or the invention of cooking over fire?