I really don’t want to get into gory details here. I strongly agree that things can go wrong. We would need to be discussing questions like: What exactly is the RL algorithm, and what’s the nature of the exploring and exploiting that it does? What’s the inductive bias? What’s the environment? All of these are great questions! I often bring them up myself. (E.g. §7.2 here.)
I’m really trying to make a weak point here, which is that we should at least listen to arguments in this genre rather than dismissing them out of hand. After all, many humans, given the option, would not want to enter a state of perpetual bliss while their friends and family get tortured. Likewise, as I mentioned, I have never done cocaine, and don’t plan to, and would go out of my way to avoid it, even though it’s a very pleasurable experience. I think I can explain these two facts (and others like them) in terms of RL algorithms (albeit probably a different type of RL algorithm than you normally have in mind). But even if we couldn’t explain it, whatever, we can still observe that it’s a thing that really happens. Right?
And I’m not mainly thinking of complete plans but rather one ingredient in a plan. For example, I’m much more open-minded to a story that includes “the specific failure mode of wireheading doesn’t happen thanks to miraculous cancellation of inner and outer misalignment” than to a story that sounds like “the alignment problem is solved completely thanks to miraculous cancellation of inner and outer misalignment”.
The problem is, even if the argument that wireheading doesn’t happen is valid, it is not a cancellation. It just means that the wireheading failure mode is replaced by some equally bad or worse failure mode. This argument is basically saying “your model that predicted this extremely specific bad outcome is inaccurate, therefore this bad outcome probably won’t happen, good news!” But it’s not meaningful good news, because it does literally nothing to select for the extremely specific good outcome that we want.
If a rocket were launched towards the Moon with a faulty navigation system, that would not mean the rocket is likely to land on Mars instead. Observing that the navigation system is faulty is not even an “ingredient in a plan” to get the rocket to Mars.
I also don’t think that the drug analogy is especially strong evidence about anything. If you assumed that the human brain is a simple RL algorithm trained on something like pleasure vs. pain, then not doing drugs would indeed be an example of not-wireheading. But why should we assume that? I think that the human brain is likely to be at least something like a metacognitive agent, in which case you can come to model drugs as a “trap” (and there can be many additional complications).
I don’t think this part of the conversation is going anywhere useful. I don’t personally claim to have any plan for AGI alignment right now. If I ever do, and if “miraculous [partial] cancellation” plays some role in that plan, I guess we can talk then. :)
I also don’t think that the drug analogy is especially strong evidence…
I guess you’re saying that humans are “metacognitive agents” not “simple RL algorithms”, and therefore the drug thing provides little evidence about future AI. But that step assumes that the future AI will be a “simple RL algorithm”, right? It would provide some evidence if the future AI were similarly a “metacognitive agent”, right? Isn’t a “metacognitive agent” a kind of RL algorithm? (That’s not rhetorical, I don’t know.)
A metacognitive agent is not really an RL algorithm in the usual sense. To a first approximation, you can think of it as metalearning a policy that approximates (infra-)Bayes-optimality on a simplicity prior. The actual thing is more sophisticated and less RL-ish, but this is enough to understand how it can avoid wireheading.
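As a rough sketch (my own gloss, leaving out most of the actual construction), “Bayes-optimality on a simplicity prior” means something like

$$\pi^{*} \in \arg\max_{\pi}\; \mathbb{E}_{\mu \sim \zeta}\!\left[\mathbb{E}_{\mu}^{\pi}\!\left[\sum_{t \ge 0} \gamma^{t} r_{t}\right]\right], \qquad \zeta(\mu) \propto 2^{-K(\mu)},$$

i.e. a policy that maximizes expected return averaged over environments $\mu$, with weight $\zeta(\mu)$ falling off in the description length $K(\mu)$. The infra-Bayesian version roughly replaces the single prior with a convex set of imprecise beliefs and optimizes the worst case over that set.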
Many alignment failure modes (wireheading, self-modification, inner misalignment...) can be framed as “traps”, since they hinge on the non-asymptotic properties of generalization, i.e. on the “prior” or “inductive bias”. Therefore frameworks that explicitly impose a prior (such as metacognitive agents) are useful for understanding and avoiding these failure modes. (But this has little to do with the OP, the way I see it.)
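For concreteness, here is a rough working definition of a “trap” (my gloss, not a formal statement): a state which, once entered, forecloses value that was attainable from the start, e.g.

$$\sup_{\pi} V_{\mu}^{\pi}(s_{\mathrm{trap}}) \;<\; V_{\mu}^{*}(s_{0}) - \epsilon \quad \text{for some fixed } \epsilon > 0,$$

so that no continuation policy recovers the loss, and the agent cannot learn about the trap by trial and error without paying an unrecoverable cost. That is why any guarantee about such failure modes has to come from the prior rather than from asymptotic learning.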