The problem is, even if the argument that wireheading doesn’t happen is valid, it is not a cancellation. It just means that the wireheading failure mode is replaced by some equally bad or worse failure mode. This argument is basically saying “your model that predicted this extremely specific bad outcome is inaccurate, therefore this bad outcome probably won’t happen, good news!” But it’s not meaningful good news, because it does literally nothing to select for the extremely specific good outcome that we want.
If a rocket was launched towards the Moon with a faulty navigation system, that does not mean the rocket is likely to land on Mars instead. Observing that the navigation system is faulty is not even an “ingredient in a plan” to get the rocket to Mars.
I also don’t think that the drug analogy is especially strong evidence about anything. If you assumed that the human brain is a simple RL algorithm trained on something like pleasure vs. pain then not doing drugs would indeed be an example of not-wireheading. But why should we assume that? I think that the human brain is likely to be at least something like a metacognitive agent in which case you can come to model drugs as a “trap” (and there can be many additional complications).
I don’t think this part of the conversation is going anywhere useful. I don’t personally claim to have any plan for AGI alignment right now. If I ever do, and if “miraculous [partial] cancellation” plays some role in that plan, I guess we can talk then. :)
I also don’t thing that the drug analogy is especially strong evidence…
I guess you’re saying that humans are “metacognitive agents” not “simple RL algorithms”, and therefore the drug thing provides little evidence about future AI. But that step assumes that the future AI will be a “simple RL algorithm”, right? It would provide some evidence if the future AI were similarly a “metacognitive agent”, right? Isn’t a “metacognitive agent” a kind of RL algorithm? (That’s not rhetorical, I don’t know.)
A metacognitive agent is not really an RL algorithm is in the usual sense. To first approximation, you can think of it as metalearning a policy that approximates (infra-)Bayes-optimality on a simplicity prior. The actual thing is more sophisticated and less RL-ish, but this is enough to understand how it can avoid wireheading.
Many alignment failure modes (wireheading, self-modification, inner misalignment...) can be framed as “traps”, since they hinge of the non-asymptotic properties of generalization, i.e. on the “prior” or “inductive bias”. Therefore frameworks that explicitly impose a prior (such as a metacognitive agents) are useful for understanding and avoiding these failure modes. (But, this has little to do with the OP, the way I see it.)
The problem is, even if the argument that wireheading doesn’t happen is valid, it is not a cancellation. It just means that the wireheading failure mode is replaced by some equally bad or worse failure mode. This argument is basically saying “your model that predicted this extremely specific bad outcome is inaccurate, therefore this bad outcome probably won’t happen, good news!” But it’s not meaningful good news, because it does literally nothing to select for the extremely specific good outcome that we want.
If a rocket was launched towards the Moon with a faulty navigation system, that does not mean the rocket is likely to land on Mars instead. Observing that the navigation system is faulty is not even an “ingredient in a plan” to get the rocket to Mars.
I also don’t think that the drug analogy is especially strong evidence about anything. If you assumed that the human brain is a simple RL algorithm trained on something like pleasure vs. pain then not doing drugs would indeed be an example of not-wireheading. But why should we assume that? I think that the human brain is likely to be at least something like a metacognitive agent in which case you can come to model drugs as a “trap” (and there can be many additional complications).
I don’t think this part of the conversation is going anywhere useful. I don’t personally claim to have any plan for AGI alignment right now. If I ever do, and if “miraculous [partial] cancellation” plays some role in that plan, I guess we can talk then. :)
I guess you’re saying that humans are “metacognitive agents” not “simple RL algorithms”, and therefore the drug thing provides little evidence about future AI. But that step assumes that the future AI will be a “simple RL algorithm”, right? It would provide some evidence if the future AI were similarly a “metacognitive agent”, right? Isn’t a “metacognitive agent” a kind of RL algorithm? (That’s not rhetorical, I don’t know.)
A metacognitive agent is not really an RL algorithm is in the usual sense. To first approximation, you can think of it as metalearning a policy that approximates (infra-)Bayes-optimality on a simplicity prior. The actual thing is more sophisticated and less RL-ish, but this is enough to understand how it can avoid wireheading.
Many alignment failure modes (wireheading, self-modification, inner misalignment...) can be framed as “traps”, since they hinge of the non-asymptotic properties of generalization, i.e. on the “prior” or “inductive bias”. Therefore frameworks that explicitly impose a prior (such as a metacognitive agents) are useful for understanding and avoiding these failure modes. (But, this has little to do with the OP, the way I see it.)