I mostly agree of course, but will push back a bit on claim 1:
CLAIM 1: If there’s a learning algorithm whose reward function is X, then the trained model will not necessarily explicitly desire X.
I think everyone agrees that this is true, and that evolution provides an example.
Not really—if anything evolution provides the counterexample. In general, creatures do ‘explicitly desire’ reproductive fitness, to the reasonable extent that ‘explicitly desire’ makes sense for or translates to that organism’s capability set. Humans absolutely do desire reproductive fitness explicitly—to varying degrees, and have since forever. The claim is only arguably true if you pedantically define “explicitly desire reproductive fitness” as something useless like “express the detailed, highly specific, currently accepted concept from evolutionary biology in English”, which is ridiculous.
A Sumerian in 1000 BC who lived for the purpose of ensuring the success of their progeny and kin absolutely is an example of explicitly desiring and optimizing for the genetic utility function of “be fruitful and multiply”.
I think the jury is still out about what will happen in the future. Maybe future humans will all upload their brains into computers, which would presumably count as zero IGF, if we define IGF via literal DNA molecules (which we don’t have to, but that’s another story). Whatever, I dunno.
Of course—and we’d hope that there is some decoupling eventually! Otherwise it’s just be fruitful and multiply, forever.
This “we’d hope” is misalignment with evolution, right?
Naturally the fact that bio evolution has been largely successful at alignment so far doesn’t mean that will continue indefinitely into the future after further metasystems transitions.
But that is future speculation, and it’s obviously tautological to argue “bio evolution will fail at alignment in the future” as part of your future model and argument for why alignment is hard in general—we haven’t collected that evidence yet!
Moreover, it’s not so much that we are misaligned with evolution, and more that bio evolution is being usurped by memetic/info evolution—the future replicator patterns of merit are increasingly no longer just physical genes.
I’m saying that the fact that you, an organism built by the evolutionary process, hope to step outside the evolutionary process and do stuff that the evolutionary process wouldn’t do, is misalignment with the evolutionary process.
I’m saying you didn’t seem to grasp my distinction between systemic vs bio evolution. I do not “hope to step outside the evolutionary process”. The decoupling is only with bio-evolution and genes. The posthuman goal is to move beyond biology, become substrate independent etc, but that is hardly the end of evolution.
A different way to maybe triangulate here: Is misalignment possible, on your view? Like does it ever make sense to say something like “A created B, but failed at alignment and B was misaligned with A”? I ask because I could imagine a position, that sort of sounds a little like what you’re saying, which goes:
There’s no such thing as misalignment. There’s one overarching process, call it evolution or whatever you like, and this process goes through stages of creating new things along new dimensions, but all the stages are part of the overall process. Anything called “misalignment” is describing the relationship of two parts or stages that are contained in the overarching process. The overarching process is at a higher level than that misalignment relationship, and the misalignment helps compute the overarching process.
I literally provided examples of what misalignment with bio-evolution would look like:
Due to observational selection effects, we naturally wouldn’t be here if mesaoptimization failure during brain evolution was too common across the multiverse.[3] But we could have found ourselves in a world with many archaeological examples of species achieving human general technocultural intelligence and then going extinct—not due to AGI of course, but simply due to becoming too intelligent to reproduce.
If we haven’t seen such an extinction in the archaeological record, it can mean one of several things:
1. misalignment is rare, or
2. misalignment is not rare once the species becomes intelligent, but intelligence is rare, or
3. intelligence usually results in transcendence, so there’s only one transition before the bio becomes irrelevant in the lightcone (and we are it).
We don’t know which. I think it’s a combination of 2 and 3.
The original argument that your OP is responding to is about “bio evolution”. I understand the distinction, but why is it relevant? Indeed, in the OP you say:
For the evolution of human intelligence, the optimizer is just evolution: biological natural selection. The utility function is fitness: gene replication count (of the human defining genes).
So we’re talking about bio evolution, right?
The OP is talking about history and thus bio evolution, and this thread shifted into the future (where info-evolution dominates) here:
I think the jury is still out about what will happen in the future. Maybe future humans will all upload their brains into computers, which would presumably count as zero IGF, if we define IGF via literal DNA molecules (which we don’t have to, but that’s another story). Whatever, I dunno.
Of course—and we’d hope that there is some decoupling eventually! Otherwise it’s just be fruitful and multiply, forever.
I’m saying that you, a bio-evolved thing, are saying that you hope something happens, and that something is not what bio-evolution wants. So you’re a misaligned optimizer from bio-evolution’s perspective.
If you narrowly define the utility function as “IGF via literal DNA molecules” (which obviously is the relevant context for my statement “hope that there is some decoupling eventually”), then obviously I’m somewhat misaligned to that utility function (but not completely, I am having children). And I’m increasingly aligned with the more general utility functions.
None of this is especially relevant, because I am not a species.
Well, there is at least one human ever whose explicit desires are strongly counter to inclusive genetic fitness (e.g. a person who murders their whole family and then themselves), and there is at least one human ever whose explicit desires are essentially a perfect match to inclusive genetic fitness. I hope we can at least agree about that extremely weak claim. :)
Anyway, I think you raise another interesting question along the lines of:
Claim 4: If we train an AGI using RL with reward function X, and the trained model winds up explicitly wanting Y which is not exactly X but has a lot to do with X and correlates with X, then the trained model ultimately winds up doing lots of X.
You seem to believe this claim, right?
E.g. your comment is sending the vibe that, if we notice that an ancient Sumerian explicitly wanted success of their progeny and kin—which is not exactly IGF, but about as closely related as one could reasonably hope for under the circumstances—then we should be happy, i.e. we should treat that fact as a reason to be optimistic. If so, that vibe seems to flow through the validity of claim 4, right?
I.e., if Claim 4 is false, we would instead have to say “Well yeah sure, the Sumerian’s motivations are as good as one could reasonably hope for under the circumstances…but that’s still not good enough!”.
Anyway, what opinion should we have about Claim 4? Mine is: I mostly want to know more about X & Y (and the exact nature of the learning algorithm) before saying more. I can think of X & Y pairs where it’s plausibly OK. But I strongly believe there are also lots of cases where Claim 4 spectacularly fails by Goodhart’s law.
Basically, the more powerful the AGI gets, and the more it can do things like “invent exotic new technology, reshape itself, reshape humans, reshape the world, etc.”, the more room there is for the AGI to get into way-out-of-distribution situations where X & Y strongly come apart (cf. the last few paragraphs here).
If humans eventually stop using literal DNA molecules because we start uploading our brains or find some other high-tech alternative, that would be an example of Claim 4 failing, I think.
Well, there is at least one human ever whose explicit desires are strongly counter to inclusive genetic fitness (e.g. a person who murders their whole family and then themselves), and there is at least one human ever whose explicit desires are essentially a perfect match to inclusive genetic fitness.
Obviously true, but also weird that anyone would think that’s relevant? Even for a population which is perfectly aligned in expectation, you wouldn’t expect any individual to be perfectly aligned.
E.g. your comment is sending the vibe that, if we notice that an ancient Sumerian explicitly wanted success of their progeny and kin—which is not exactly IGF, but about as closely related as one could reasonably hope for under the circumstances—then we should be happy, i.e. we should treat that fact as a reason to be optimistic.
How could the Sumerian literally be optimizing explicitly for IGF when the term/concept didn’t exist until 1964? IGF is also almost certainly a crude approximation of the ‘true’ utility function of genetic evolution.
So given that we have uncertainty over the utility function of evolution, and moreover that explicitly encoding a very complex utility function precisely into an (extremely resource-constrained) brain would be a stupid engineering idea regardless—clearly suboptimal for actually optimizing said utility function—the analogy argument doesn’t depend on the explicit form of evolution’s utility function at all.
For any reasonable choice of a genetic utility function, Homo sapiens is successful. Thus the evolution of human brains is an example of successful alignment: the exploding power of our new inner optimizers (brains) did not result in species failure. It is completely irrelevant whether individual humans are aligned or not, and ridiculous to expect them to be. The success criterion is a measure over the species, not individuals.
Anyway, what opinion should we have about Claim 4? Mine is: I mostly want to know more about X & Y (and the exact nature of the learning algorithm) before saying more. I can think of X & Y pairs where it’s plausibly OK. But I strongly believe there are also lots of cases where Claim 4 spectacularly fails by Goodhart’s law.
Mostly agree, yes, it depends on many details, but this post was specifically about updating on the fact that we do not observe evolution having trouble with alignment—and again there is only a singular correct way to measure that (species-level success/failure).
If humans eventually stop using literal DNA molecules because we start uploading our brains or find some other high-tech alternative, that would be an example of Claim 4 failing, I think.
Maybe? But that hasn’t happened yet. Also, what if all the DNA information still exists, but stored more compactly on digital computers, which then replicate enormously in simulation, and the simulations themselves expand as our compute spreads a bit into space and advances further…
The argument “but Homo sapiens obviously will go extinct in the future, cuz doom” obviously has no room as part of an argument for doom in the first place.
Obviously true, but also weird that anyone would think that’s relevant? Even for a population which is perfectly aligned in expectation, you wouldn’t expect any individual to be perfectly aligned.
Some people think that misalignment in RL is fundamentally impossible. I’m pretty sure you’re not in that category. But for such people, it’s nice to have an example of “RL algorithm with reward function R → at least one trained model that scores catastrophically poorly on R as soon as it goes out of distribution”.
The CoinRun thing could also work in this context, but is subject to the (misguided) rebuttal “oh but that’s just because the model’s not smart enough to understand R”. So human examples can also be useful in the course of such conversations.
For any reasonable choice of a genetic utility function …
Also, what if all the DNA information still exists, but stored more compactly on digital computers, which then replicate enormously in simulation, and the simulations themselves expand as our compute spreads a bit into space and advances further…
Here are a bunch of claims that are equally correct:
- The history of evolution to date is exactly what you’d get from an RL algorithm optimizing genes for how many future copies are encoded in literal DNA molecules in basement reality.
- The history of evolution to date is exactly what you’d get from an RL algorithm optimizing genes for how many future copies are encoded in literal DNA molecules in either basement reality or accurate simulations.
- The history of evolution to date is exactly what you’d get from an RL algorithm optimizing genes for how many future copies are encoded in DNA molecules, or any other format that resembles it functionally (regardless of whether it resembles it chemically or mechanistically).
- [infinitely many more things like that]
(Related to “goal misgeneralization”). If humans switch from DNA to XNA or upload themselves into simulations or whatever, then the future would be high-reward according to some of those RL algorithms and the future would be zero-reward according to others of those RL algorithms. In other words, one “experiment” is simultaneously providing evidence about what the results look like for infinitely many different RL algorithms. Lucky us.
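A toy sketch of this point in code, with hypothetical world-states and made-up numbers (purely illustrative, not anyone’s actual model): two reward functions that agree on all of history collected so far, yet diverge completely on an “uploaded” future.

```python
# Toy illustration: many reward functions agree on history-to-date
# but come apart out of distribution. All quantities are made up.

def reward_dna_only(world):
    # counts only literal DNA molecules in basement reality
    return world["dna_copies"]

def reward_any_substrate(world):
    # counts any substrate that functionally encodes the same genomes
    return world["dna_copies"] + world["digital_genome_copies"]

history = {"dna_copies": 8_000_000_000, "digital_genome_copies": 0}
uploaded_future = {"dna_copies": 0, "digital_genome_copies": 10**15}

# Both reward functions agree on all evidence collected so far...
assert reward_dna_only(history) == reward_any_substrate(history)

# ...but diverge wildly on the out-of-distribution future.
print(reward_dna_only(uploaded_future))       # 0
print(reward_any_substrate(uploaded_future))  # 1000000000000000
```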
You’re welcome to argue that the evidence from some of these “experiments” is more relevant to AI alignment than the evidence from others of these experiments. (I figure that’s what you mean when you say “for any reasonable choice of a genetic utility function…”?) But to make that argument, I think you’d have to start talking more specifically about what you expect AI alignment to look like and thus why some of those bullet-point RL algorithms are more disanalogous to AI alignment than others. You seem not to be doing that (unless I missed it); instead you seem to be just talking intuitively about what’s “reasonable” (not what’s a tight analogy).
(Again, my own position is that all of those bullet points are sufficiently disanalogous to AI alignment that it’s not really a useful example, with the possible exception of arguing with someone who takes the extreme position that misalignment is impossible in RL as discussed at the top of this comment.)
Some people think that misalignment in RL is fundamentally impossible. … For such people, it’s nice to have an example of “RL algorithm with reward function R → at least one trained model that scores catastrophically poorly on R as soon as it goes out of distribution”.
I don’t actually believe that, but I also suspect that for those who do, the example isn’t relevant, because evolution is not an RL algorithm.
Evolution is an outer optimizer that uses RL as the inner optimizer. Evolution proceeds by performing many experiments in parallel, all of which are imperfect/flawed, and the flaws are essential to the algorithm’s progress. The individual inner RL optimizers can’t be perfectly aligned (if they were, the outer optimizer wouldn’t have actually worked in the first place), and the outer optimizer can be perfectly aligned even though every single instance of the inner optimizer is at least somewhat misaligned. This is perhaps less commonly understood than I initially assumed—but true nonetheless.
Technology evolves according to an evolutionary process that is more similar/related to genetic evolution (but with greatly improved update operators). The analogy is between genetic evolution over RL brains <-> tech evolution over RL AGIs.
If humans switch from DNA to XNA or upload themselves into simulations or whatever, then the future would be high-reward according to some of those RL algorithms and the future would be zero-reward according to others of those RL algorithms. … You’re welcome to argue that the evidence from some of these “experiments” is more relevant to AI alignment than the evidence from others of these experiments.
Again none of those future scenarios have played out, they aren’t evidence yet, just speculation.
I made this diagram recently—I just colored some things red to flag areas where I think we’re disagreeing.
The first thing is you think “evolution is not an RL algorithm”. I guess you think it’s a learning algorithm but not a reinforcement learning algorithm? Or do you not even think it’s a learning algorithm? I’m pretty confused by your perspective. Everyone calls PPO an RL algorithm, and I think of evolution as “kinda like PPO but with a probably-much-less-efficient learning rule because it’s not properly calculating gradients”. On the other hand, I’m not sure exactly what the definition of RL is—I don’t think it’s perfectly standardized.
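(For what it’s worth, the intuition that population search approximates a gradient can be made precise. For a fitness function $F$ with parameters $\theta$, the standard evolution-strategies identity for the Gaussian-smoothed objective is

$$\nabla_\theta \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}\big[F(\theta + \sigma\epsilon)\big] \;=\; \frac{1}{\sigma}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}\big[F(\theta + \sigma\epsilon)\,\epsilon\big],$$

i.e. averaging fitness-weighted random perturbations over a population yields an unbiased, if noisy, gradient estimate. This is a fact about Gaussian-smoothed objectives, not a claim that biology literally computes this.)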
(We both agree that, as a learning algorithm, evolution is very weird compared to the learning algorithms that people have been programming in the past and will continue programming in the future. For example, ML runs normally involve updating the trained model every second or sub-second or whatever, not “one little trained model update step every 20 years”. In other words, I think we both agree that the right column is a much much closer analogy for future AGI training than the middle column.)
(But as weird a learning algorithm as the middle column is, it’s still a learning algorithm. So if we want to make and discuss extremely general statements that apply to every possible learning algorithm, then it’s fair game to talk about the middle column.)
The second thing is, on the bottom row, I get the impression that:
- you want to emphasize that the misalignment could be much higher (e.g. if nobody cared about biological children & kin),
- I want to emphasize that the misalignment could be much less (e.g. if men were paying their life savings for the privilege of being a sperm donor).
But that’s not really a disagreement, because both of those things are true. :)
The first thing is you think “evolution is not an RL algorithm”. I guess you think it’s a learning algorithm but not a reinforcement learning algorithm?
To a certain extent this is taxonomy/definitions, but there is an entire literature on genetic/evolutionary optimization algorithms and it does seem odd to categorize them under RL—I’ve never heard/seen those authors categorize it that way, and RL is a somewhat distant branch of ML. In my mental taxonomy (which I do believe is more standard/canonical), the genetic/evolutionary search family of algorithms are combinatoric/discrete algorithms that evaluate many candidate solutions in parallel, use separate heuristics for updating params (typically not directly related to the fitness function), and instead use fitness function evaluation to select the replication factor for the next generation. They approximate the full Bayesian posterior with large-scale but crude sampling.
This has advantages over gradient methods for exploring highly compressed/combinatoric parameter spaces, but scales poorly to higher dimensions as it doesn’t solve fine-grained credit assignment at all.
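A minimal sketch of that evaluate-select-replicate loop, with a toy fitness function and made-up hyperparameters (not any particular library’s API):

```python
import random

def fitness(genome):
    # toy objective: maximize fitness by driving the genome toward zero
    return -sum(x * x for x in genome)

def mutate(genome, rate=0.1):
    # the update heuristic is blind to the fitness function: random perturbation
    return [x + random.gauss(0, rate) for x in genome]

population = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(100)]
for generation in range(50):
    ranked = sorted(population, key=fitness, reverse=True)
    # fitness only sets the replication factor: the top half replicates twice,
    # the bottom half not at all -- no per-parameter credit assignment anywhere
    population = [mutate(g) for g in ranked[:50] for _ in range(2)]

print(max(fitness(g) for g in population))
```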
RL on the other hand is just a slight variation of UL/SL, and is usually (almost always?) a continuous/gradient method.
So anyway the useful analogy is not between genetic evolution and within-lifetime learning, it’s between (genetic evolution, brain within-lifetime learning) and (tech/memetic evolution, ML training).
Random @nostalgebraist blog post: “‘Reinforcement learning’ (RL) is not a technique. It’s a problem statement, i.e. a way of framing a task as an optimization problem, so you can hand it over to a mechanical optimizer. … Life itself can be described as an RL problem”
Russell & Norvig 3rd edition: “The task of reinforcement learning is to use observed rewards to learn an optimal (or nearly optimal) policy for the environment. …Reinforcement learning might be considered to encompass all of AI.” (emphasis in original)
If we take the Russell & Norvig definition, and plug in “reward = IGF” and “policy = a genome that (in the right environment) unfolds into a complete brain and body”, then evolution by natural selection is “using observed rewards to learn an optimal (or nearly optimal) policy for the environment”, thus qualifying as RL according to that definition. Right?
Obviously I agree that if we pick a paper on arxiv that has “RL” in its title, it is almost certainly not talking about evolutionary biology :-P
I’ll also reiterate that I think we both agree that, even if the center column of that chart picture above is “an RL algorithm” by definition, it’s mechanistically pretty far outside the distribution of RL algorithms that human AI programmers are using now and are likely to use in the future, whereas within-lifetime learning is a more central example in almost every way. I’m definitely not disagreeing about that.
RL on the other hand is just a slight variation of UL/SL, and is usually (almost always?) a continuous/gradient method.
One important difference between RL and SL / SSL is that in RL, after you take an action, there’s no ground truth about what action you counterfactually should have taken instead. In other words, SL / SSL gives you a full error gradient “for free” with each query, whereas RL doesn’t. (You can still get an error gradient in policy-optimization, but it’s more expensive—you need to query the policy a bunch of times to get one gradient, IIUC.) Thus, RL algorithms involve random exploration then exploitation, whereas things like ImageNet training and LLM pretraining do not involve any random exploration.
(Notice here that evolution-by-natural-selection does involve a version of explore-then-exploit: mutations are random, and are initially trialed in a small number of individual organisms, and only if they’re helpful are they deployed more and more widely, eventually across the whole species.)
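To make the “more expensive” point concrete with the standard score-function (REINFORCE) estimator: the reward only scores the sampled action, so one gradient estimate costs $N$ policy samples,

$$\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\big[R(a)\big] = \mathbb{E}_{a \sim \pi_\theta}\big[R(a)\,\nabla_\theta \log \pi_\theta(a)\big] \approx \frac{1}{N}\sum_{i=1}^{N} R(a_i)\,\nabla_\theta \log \pi_\theta(a_i),$$

whereas a supervised loss against a known label yields its full error gradient in a single backward pass.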
The task of reinforcement learning is to use observed rewards to learn an optimal (or nearly optimal) policy for the environment
If we define RL broadly as “learning/optimizing an algorithm/policy towards optimality according to some arbitrary reward/objective function”, then sure it becomes indistinguishable from general optimization.
However to me it is specifically the notion of reward in RL that distinguishes it, as you say here:
One important difference between RL and SL / SSL is that in RL, after you take an action, there’s no ground truth about what action you counterfactually should have taken instead.
Which already makes it less general than general optimization. A reward function is inherently a proxy—designed by human engineers to approximate the more complex unknown utility, or evolved by natural selection as a practical approximate proxy for something like IGF.
Evolution by natural selection doesn’t have any proxy reward function, so it lacks that distinguishing feature of RL. The ‘optimization objective’ of biological evolution is simply something like an emergent telic arrow of physics: “replicators tend to replicate”, similar to “net entropy always increases”, etc. When humans use evolutionary algorithms as evolutionary simulations, the fitness function is more or less approximating the fitness a genotype would have if the physics were simulated out in more detail.
So anyway I think of RL as always an inner optimizer, using a proxy reward function that approximates the outer optimizer’s objective function.
I don’t think that, in order for an algorithm to be RL, its reward function must by definition be a proxy for something else more complicated. For example, the RL reward function of AlphaZero is not an approximation of a more complex thing—the reward function is just “did you win the game or not?”, and winning the game is a complete and perfect description of what DeepMind programmers wanted the algorithm to do. And everyone agrees that AlphaZero is an RL algorithm, indeed a central example. Anyway, AlphaZero would be an RL algorithm regardless of the motivations of the DeepMind programmers, right?
True, in games the environment itself can directly provide a reward channel, such that the perfect ‘proxy’ simplifies to the trivial identity connection on that channel. But that’s hardly an interesting case, right? A human ultimately designed the reward channel for that engineered environment, often as a proxy for some human concept.
The types of games/sims that are actually interesting for AGI, or even just general robots or self driving cars, are open ended where designing the correct reward function (as a proxy for true utility) is much of the challenge.
Again none of those future scenarios have played out, they aren’t evidence yet, just speculation
It’s a very weird notion of what constitutes evidence. If you built an AGI and your interpretability tools showed you that the AGI is plotting to kill you, it would be pretty hard evidence in favor of a sharp left turn, even if you were still alive.
Not at all—just look at the definition of Solomonoff induction: the distribution over world models/theories is updated strictly on new historical observation bits, and never on future predicted observations. If you observe mental states inside the AGI, that is naturally valid observational evidence from your POV. But that is very different from you predicting that the AGI is going to kill you, and then updating your world model based on those internal predictions—those feedback loops rapidly diverge from reality (and may be related to schizophrenia).
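For reference, the update in question, in the deterministic-program formulation of Solomonoff induction: hypotheses $h$ are weighted by description length $K(h)$ and conditioned only on the observation bits $x_{1:t}$ actually seen so far, so predicted-but-unobserved bits never enter the conditional:

$$P(h) \propto 2^{-K(h)}, \qquad P(h \mid x_{1:t}) = \frac{P(h)\,\mathbb{1}\big[h \text{ outputs } x_{1:t}\big]}{\sum_{h'} P(h')\,\mathbb{1}\big[h' \text{ outputs } x_{1:t}\big]}.$$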
I’m talking about observable evidence, like transhumanists claiming they will drop their biological bodies at the first opportunity.