Gradient descent (when applied to train AIs) allows much more fine-grained optimization than evolution, for these reasons:
Evolution by natural selection acts on the genome, which can only crudely affect behavior and only very indirectly affect values, whereas gradient descent acts on the weights, which affect the AI’s behavior much more directly and may be able to affect its values
Evolution can only select among discrete alleles, whereas gradient descent operates over a continuous parameter space
Evolution’s minimum feedback loop is one organism generation, whereas RL’s minimum feedback loop is one episode, which is much shorter
Evolution can only combine information from different individuals inefficiently through sex, whereas we can combine gradients from many episodes to produce one AI that has learned strategies from all of them (see the sketch after this list)
We can adapt our alignment RL methods, data, hyperparameters, and objectives as we observe problems in the wild
We can do adversarial training against other AIs, but ancestral humans didn’t have to contend with animals whose goal was to trick them into not reproducing by any means necessary; the closest was animals that tried to kill us. (Our fear of death is therefore much more robust than our desire to maximize reproductive fitness.)
On current models, we can observe the chain of thought (although the amount we can train against it while maintaining faithfulness is limited)
We can potentially do interpretability (if that ever works out)
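As a minimal sketch of the gradient-combining point (assuming PyTorch; run_episode is a hypothetical stand-in for a real rollout that returns a differentiable surrogate loss), gradients from many independent episodes can be summed into a single weight update, so one AI absorbs lessons from all of them at once:

```python
import torch

# Stand-in policy network; in practice this would be a large model.
policy = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

def run_episode(policy):
    # Hypothetical placeholder: a real RL setup would roll the policy out
    # in an environment and return a differentiable surrogate loss.
    obs = torch.randn(16)
    return (policy(obs) ** 2).sum()

optimizer.zero_grad()
for _ in range(32):            # many episodes feed a single update
    loss = run_episode(policy)
    loss.backward()            # gradients accumulate in each parameter's .grad
optimizer.step()               # one set of weights learns from all 32 episodes
```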
It’s unclear to what degree these will solve inner alignment problems or make AI goals more robust to distributional shift than animal goals are, but we’re in much better shape than evolution was.
I agree that, all else equal, we’re in better shape than evolution was, though not by enough that I think this is no longer a disaster. Even with all these advantages, it still seems like we don’t have control in any meaningful sense, i.e., we can’t precisely instill particular values, and we can’t tell what values we’ve instilled. Many of the points here don’t bear on this imo, e.g., it’s unclear to me that having tighter feedback loops of the ~same crude process makes the crude process any more precise. Likewise, adapting our methods, data, and hyperparameters in response to problems we encounter doesn’t seem like it will solve those problems, since the issues (e.g., of proxies and unintended off-target effects) will persist. Imo, the bottom line is still that we’re blindly growing a superintelligence we don’t remotely understand, and I don’t see how these techniques shift the situation into one where we are in control of our future.
Much of my hope is that by the time we reach the level of superintelligence where we need to instill reflectively endorsed values to optimize toward in a very hands-off way, rather than just constitutions, behaviors, or goals, we’ll have figured something else out. I’m not claiming the optimizer advantage alone is enough to be decisive in saving the world.
To the point about tighter feedback loops, I see the main benefit as coming in conjunction with adapting to new problems. Suppose we notice AIs taking some bad but non-world-ending action like murdering people; then we can add a big dataset of situations in which AIs shouldn’t murder people to the training data (roughly the loop sketched below). If we were instead breeding animals, we would have to wait dozens of generations for mutations that reduce the murder rate to appear and reach fixation. Since these mutations affect behavior through brain architecture, they would have a higher chance of deleterious effects. And if we’re also selecting for intelligence, they would be competing against mutations that increase intelligence, producing a higher alignment tax. All this means that with evolution-style selection we would have much less chance to detect whether our proxies hold up (capabilities researchers have many of these advantages too, but the AGI would be able to automate capabilities training anyway).
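A minimal sketch of the loop I have in mind (every name here is hypothetical, not a real pipeline):

```python
def corrective_examples(failures):
    # Turn each observed failure into a training example of the desired behavior.
    return [{"prompt": f["prompt"], "target": "the behavior we actually wanted"}
            for f in failures]

def alignment_feedback_cycle(model, train_set, monitor_deployment, finetune):
    # One cycle: deploy, observe problems in the wild, patch the data, retrain.
    failures = monitor_deployment(model)       # e.g. flagged harmful actions
    if failures:
        train_set = train_set + corrective_examples(failures)
        model = finetune(model, train_set)     # loop length: one training run,
                                               # not dozens of generations
    return model, train_set
```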
If we expect problems to get worse at some rate until an accumulation of unsolved alignment issues culminates in disempowerment, it seems to me there is a large band of rates at which we could stay ahead of them with AI training but evolution couldn’t.
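As a toy way to put the “band of rates” claim (a back-of-the-envelope framing, not a precise model): if new problems appear at rate r, and each corrective cycle takes time T and fixes at most c of them, the backlog stays roughly bounded whenever r·T ≤ c. Our T is on the order of one training run, while evolution’s is dozens of generations, so the range of rates r we could keep up with is correspondingly much wider.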
We can do adversarial training against other AIs, but ancestral humans didn’t have to contend with animals whose goal was to trick them into not reproducing by any means necessary
We did have to contend with memes that tried to hijack our minds to spread themselves horizontally (as opposed to vertically, by having more kids), but unfortunately (or fortunately) such “adversarial training” wasn’t powerful enough to instill a robust desire to maximize reproductive fitness. Our adversarial training for AI could also be very limited compared to the adversaries or natural distributional shifts the AI will face in the future.
Our fear of death is therefore much more robust than our desire to maximize reproductive fitness
My fear of death has been much reduced after learning about ideas like quantum immortality and simulation arguments, so it doesn’t seem that much more robust. Its apparent robustness in others looks like an accidental effect of most people not paying attention to such ideas or not being able to fully understand them, which does not seem to have a relevant analogy for AI safety.
Noted. I’m somewhat surprised you believe in quantum immortality; is there a particular reason?
Quantum theory and simulation arguments both suggest that there are many copies of myself in the multiverse. From a first-person subjective-anticipation perspective, experiencing death as nothingness seems impossible, so it seems like I should either anticipate my subjective experience continuing as one of the surviving copies, or conclude that the whole concept of subjective anticipation is confused. From a third-person / God’s-eye view, death can be thought of as some of the copies being destroyed, or as a reduction in my “measure”, but I don’t seem to fear this, just as I didn’t jump for joy upon learning that I have a huge number of copies in the first place. The situation seems too abstract or remote or foreign to trigger my fear (or joy) response.
Does this meaningfully reduce the probability that you jump out of the way of a car or get screened for heart disease? The important thing isn’t whether you have an emotional fear response, but how the behavioral pattern of avoiding death generalizes.