This is a deeply confused post.

In this post, Turner sets out to debunk what he perceives as “fundamentally confused ideas” that are common in the AI alignment field. I strongly disagree with his claims.
In section 1, Turner quotes a passage from “Superintelligence”, in which Bostrom talks about the problem of wireheading. Turner declares this to be “nonsense” since, according to Turner, RL systems don’t seek to maximize a reward.
First, Bostrom (AFAICT) is describing a system which (i) learns online and (ii) maximizes long-term consequences. There are good reasons to focus on such a system: these are properties that are desirable in an AI defense system, if the system is aligned. Now, the LLM+RLHF paradigm which Turner puts at the center is, at least superficially, not like that. However, this is no argument against Bostrom: today’s systems have already gone beyond LLM+RLHF (introducing RL over chain-of-thought), and tomorrow’s systems are likely to be even more different. And if a given AI design does not somehow acquire properties i+ii, even indirectly (e.g. via in-context learning), then it’s not clear how it would be useful for creating a defense system.
Second, Turner might argue that even granted i+ii, the AI would still not maximize reward, because the properties of deep learning would cause it to converge to some different, reward-suboptimal model. While this is often true, it is hardly an argument for not worrying.
While deep learning is not known to guarantee convergence to the reward-optimal policy (we hardly know how to prove any guarantees about deep learning), RL algorithms are certainly designed with reward maximization in mind. If your AI is unaligned even under best-case assumptions about learning convergence, it seems very unlikely that deviating from these assumptions would somehow cause it to be aligned (while remaining highly capable). To argue otherwise is akin to hoping that the rocket will reach the Moon because of the errors our equations of orbital mechanics don’t account for, rather than despite them.
After this argument, Turner adds that “as a point of further fact, RL approaches constitute humanity’s current best tools for aligning AI systems today”. This observation seems completely irrelevant. It was indeed expected that RL would be useful in the subhuman regime, where the system cannot fail catastrophically simply because it lacks the capabilities. (Even when it convinces some vulnerable person to commit suicide, OpenAI’s legal department can handle it.) I would expect this to have been obvious to Bostrom even back then, and it doesn’t invalidate his conclusions in the slightest.
In section 3, Turner proceeds to attack the so-called “counting argument” for misalignment. The counting argument goes: since there are many more misaligned minds/goals than aligned minds/goals, even conditional on “good” behavior in training, it seems unlikely that current methods will produce an aligned mind. Turner (quoting Belrose and Pope) counters this argument by way of analogy: deep learning successfully generalizes even though most models that perform well on the training data don’t perform well on the test data. Hence, they argue, the counting argument must be fallacious.
The major error that Turner, Belrose and Pope are making is that of confusing aleatoric and epistemic uncertainty. There is also a minor error of being careless about what measure the counting is performed over.
If we did not know anything about some algorithm except that it performs well on the training data, we would indeed have at most a weak expectation of it performing well on the test data. However, deep learning is far from random in this regard: it was selected by decades of research to be the sort of algorithm that does generalize well. Hence, even in this case the counting argument gives us a perfectly reasonable prior; it is this additional knowledge about deep learning that updates us away from that prior, not some flaw in the counting.
(The minor point is that, with respect to a simplicity prior, even a random algorithm has some bounded-from-below probability of generalizing well.)
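One way to make that parenthetical precise, as a sketch of my own (not spelled out in the original) under the assumption that programs are weighted by a prefix-free simplicity prior: take $\mu(p) \propto 2^{-\ell(p)}$ over programs $p$, where $\ell(p)$ is the length of $p$ in bits (prefix-free, so $\sum_p 2^{-\ell(p)} \le 1$). If some program $p^{*}$ of length $k$ both fits the training data and generalizes, then

$$
\Pr\big[\text{generalizes} \mid \text{fits training data}\big] \;\ge\; \frac{2^{-k}}{\sum_{p \,:\, p \text{ fits}} 2^{-\ell(p)}} \;\ge\; 2^{-k},
$$

no matter how many non-generalizing programs also fit, because their total weight is at most 1. So the “most models that fit don’t generalize” count depends entirely on the measure: damning under a uniform count, bounded from below under a simplicity-weighted one.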
The counting argument is not premised on a deep understanding of how deep learning works (which at present doesn’t exist), but on a reasonable prior about what we should expect from our vantage point of ignorance. It describes our epistemic uncertainty, not the aleatoric uncertainty of deep learning. We can imagine that, if we knew how deep learning really works in the context of typical LLM training data etc., we would be able to confidently conclude that, say, RLHF has a high probability of eventually producing agents that primarily want to build astronomical superstructures in the shape of English letters, or whatnot. (It is of course also possible we would conclude that LLM+RLHF will never produce anything powerful enough to be dangerous or useful-as-defense-system.) That would not be inconsistent with the counting argument as applied from our current state of knowledge.
The real question is then: conditional on our knowledge that deep learning often generalizes well, how confident are we that it will generalize aligned behavior from training to deployment, when scaled up to highly capable systems? Unfortunately, I don’t think this update is strong enough to make us remotely safe. The fact that deep learning generalizes implies that it implements some form of Occam’s razor, but Occam’s razor doesn’t strongly select for alignment, as far as we can tell. Our current (more or less) best model of Occam’s razor is Solomonoff induction, which Turner dismisses as irrelevant to neural networks: but here again, the fact that our understanding is flawed just pushes us back toward the counting-argument prior, not toward safety.
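For reference, the standard formalization alluded to here: the Solomonoff prior assigns to a finite string $x$ the weight

$$
M(x) \;=\; \sum_{p \,:\, U(p) \text{ outputs a string beginning with } x} 2^{-\ell(p)},
$$

where $U$ is a universal prefix Turing machine and $\ell(p)$ is the length of program $p$ in bits. Hypotheses are penalized exponentially in description length, which is the sense in which Solomonoff induction is a model of Occam’s razor.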
Also, we should keep in mind that deep learning doesn’t always generalize well empirically; it’s just that when it fails, we add more data until it starts generalizing. But if the failure is “kill all humans”, there is nobody left to add more data.
Turner’s conclusion is: “it becomes far easier to just use the AIs as tools which do things we ask”. The extent to which I agree with this depends on the interpretation of the vague term “tools”. Certainly modern AI is a tool that does approximately what we ask (even though, when using AI for math, I’m already often annoyed at its attempts to cheat and hide the flaws of its arguments). However, I don’t think we know how to safely create “tools” that are powerful enough to, e.g., nearly autonomously do alignment research or otherwise make substantial steps toward building an AI defense system.
Second, Turner might argue that even granted i+ii, the AI would still not maximize reward, because the properties of deep learning would cause it to converge to some different, reward-suboptimal model. While this is often true, it is hardly an argument for not worrying.
While deep learning is not known to guarantee convergence to the reward-optimal policy (we hardly know how to prove any guarantees about deep learning), RL algorithms are certainly designed with reward maximization in mind. If your AI is unaligned even under best-case assumptions about learning convergence, it seems very unlikely that deviating from these assumptions would somehow cause it to be aligned (while remaining highly capable). To argue otherwise is akin to hoping that the rocket will reach the Moon because of the errors our equations of orbital mechanics don’t account for, rather than despite them.
I partly agree, but think you take this point too far. I would say:
It is possible in principle for us to be in a situation where the reward-optimal policy is bad (by human lights), but a specific learning algorithm winds up with a reward-suboptimal policy which is better (by human lights). In other words, it’s possible in principle for a learning algorithm to have an inner misalignment (a.k.a. goal misgeneralization) that cancels out an equal-and-opposite outer misalignment (a.k.a. specification gaming).
We obviously should not breezily assume without argument that this kind of miraculous cancellation will happen.
…But if someone wants to make a specific argument that this miraculous cancellation will happen in a particular case, then OK, cool, we should listen to that argument with an open mind.
I do in fact think there’s at least one case where there’s a reasonable prima facie argument for a miraculous cancellation of this type, and it’s one that TurnTrout has often brought up. Namely, the case of wireheading (and similar).
Suppose there’s a sequence of actions A, which is astronomically unlikely to occur by chance, and which leads to a maximally high reward in a way that humans don’t like. E.g., A might involve the AI hacking into its own RAM. It might be the case that normal explore-exploit RL techniques will never randomly come upon A. And it might further be the case that, even if the RL agent winds up foresighted and self-aware and aware of A, it’s motivated to avoid taking action sequence A, for the same reason that I’m not motivated to try addictive drugs (i.e., instrumentally convergent goal-guarding). So it never does A, not even once. And thus TD learning (or whatever) never makes the RL agent “want” A. This would be outer misalignment (because A is high-reward even though humans don’t like it) that gets cancelled out by inner misalignment (because the agent avoids A despite it being high-reward).
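As a toy illustration of the “never randomly comes upon A” step (a sketch of my own, with a made-up environment and made-up numbers, not anything from the post or this thread):

```python
import random

# Toy sketch (hypothetical numbers): the "wirehead" sequence A is reached only by
# taking action 1 at every one of 40 steps, with no intermediate reward shaping.
# A purely exploratory agent hits A with probability (1/2)**40 per episode, so in
# practice TD learning never observes the huge reward and never learns to "want" it.

SEQ_LEN = 40
EPISODES = 1_000_000

def run_episode() -> float:
    for _ in range(SEQ_LEN):
        if random.randint(0, 1) != 1:   # any deviation from A ends the attempt
            return 0.0
    return 1e6                          # the wirehead reward, essentially never seen

hits = sum(run_episode() > 0 for _ in range(EPISODES))
print(f"found A in {hits} of {EPISODES:,} episodes "
      f"(expected ≈ {EPISODES * 0.5 ** SEQ_LEN:.1e})")
```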
That’s not a bulletproof scenario; there are lots of ways it can go wrong. But I think it’s an existence proof that “miraculous cancellation” proposals should at least be seriously considered rather than dismissed out of hand.
What do you mean “randomly come upon A”? RL is not random. Why wouldn’t it find A?
Let the proxy reward function we use to train the AI be r_p, and the “true” reward function that we intend the AI to follow be r_t. Supposedly, these functions agree on some domain D but catastrophically come apart outside of it. Then, if all the training data lies inside D, which reward function gets selected depends on the algorithm’s inductive bias (and possibly also on luck). The “cancellation” hope is then that the inductive bias favors r_t over r_p.
But why would that be the case? Realistically, the inductive bias is something like “simplicity”. And human preferences are very complex. On the other hand, something like “the reward is such-and-such bits in the input” is very simple. So instead of cancelling out, the problem is only aggravated.
And that’s under the assumption that r_p and r_t actually agree on D, which is in itself wildly optimistic.
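To make the r_p vs. r_t point concrete, here is a minimal toy sketch (my own, with a hypothetical observation format and a stand-in r_t; the real intended reward would be far more complex, which is exactly the asymmetry at issue):

```python
def r_p(obs: dict) -> float:
    """Proxy reward: 'the reward is such-and-such bits in the input' -- very simple."""
    return float(obs["reward_register"])

def r_t(obs: dict) -> float:
    """Stand-in for the intended reward; here it just penalizes tampering."""
    return -1.0 if obs["tampered"] else float(obs["reward_register"])

# Training domain D: no tampering ever occurs, so the two rewards agree everywhere,
# and the training data cannot distinguish them -- only inductive bias decides.
D = [{"reward_register": r, "tampered": False} for r in (0.0, 0.3, 1.0)]
assert all(r_p(o) == r_t(o) for o in D)

# Outside D (e.g. the agent has hacked its reward register) they come apart.
outside = {"reward_register": 1.0, "tampered": True}
print(r_p(outside), r_t(outside))   # 1.0 vs -1.0
```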
I really don’t want to get into gory details here. I strongly agree that things can go wrong. We would need to be discussing questions like: What exactly is the RL algorithm, and what’s the nature of the exploring and exploiting that it does? What’s the inductive bias? What’s the environment? All of these are great questions! I often bring them up myself. (E.g. §7.2 here.)
I’m really trying to make a weak point here, which is that we should at least listen to arguments in this genre rather than dismissing them out of hand. After all, many humans, given the option, would not want to enter a state of perpetual bliss while their friends and family get tortured. Likewise, as I mentioned, I have never done cocaine, and don’t plan to, and would go out of my way to avoid it, even though it’s a very pleasurable experience. I think I can explain these two facts (and others like them) in terms of RL algorithms (albeit probably a different type of RL algorithm than you normally have in mind). But even if we couldn’t explain it, whatever, we can still observe that it’s a thing that really happens. Right?
And I’m not mainly thinking of complete plans but rather one ingredient in a plan. For example, I’m much more open-minded to a story that includes “the specific failure mode of wireheading doesn’t happen thanks to miraculous cancellation of inner and outer misalignment” than to a story that sounds like “the alignment problem is solved completely thanks to miraculous cancellation of inner and outer misalignment”.
The problem is, even if the argument that wireheading doesn’t happen is valid, it is not a cancellation. It just means that the wireheading failure mode is replaced by some equally bad or worse failure mode. This argument is basically saying “your model that predicted this extremely specific bad outcome is inaccurate, therefore this bad outcome probably won’t happen, good news!” But it’s not meaningful good news, because it does literally nothing to select for the extremely specific good outcome that we want.
If a rocket was launched towards the Moon with a faulty navigation system, that does not mean the rocket is likely to land on Mars instead. Observing that the navigation system is faulty is not even an “ingredient in a plan” to get the rocket to Mars.
I also don’t think that the drug analogy is especially strong evidence about anything. If you assumed that the human brain is a simple RL algorithm trained on something like pleasure vs. pain, then not doing drugs would indeed be an example of not-wireheading. But why should we assume that? I think that the human brain is likely to be at least something like a metacognitive agent, in which case you can come to model drugs as a “trap” (and there can be many additional complications).
I don’t think this part of the conversation is going anywhere useful. I don’t personally claim to have any plan for AGI alignment right now. If I ever do, and if “miraculous [partial] cancellation” plays some role in that plan, I guess we can talk then. :)
I also don’t think that the drug analogy is especially strong evidence…
I guess you’re saying that humans are “metacognitive agents” not “simple RL algorithms”, and therefore the drug thing provides little evidence about future AI. But that step assumes that the future AI will be a “simple RL algorithm”, right? It would provide some evidence if the future AI were similarly a “metacognitive agent”, right? Isn’t a “metacognitive agent” a kind of RL algorithm? (That’s not rhetorical, I don’t know.)
A metacognitive agent is not really an RL algorithm in the usual sense. To a first approximation, you can think of it as metalearning a policy that approximates (infra-)Bayes-optimality with respect to a simplicity prior. The actual thing is more sophisticated and less RL-ish, but this is enough to understand how it can avoid wireheading.
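(A rough gloss of my own, ignoring the metalearning and the “infra” refinement, not the actual formalism: the plain Bayes-optimality criterion over a simplicity prior would be

$$
\pi^{*} \;\in\; \arg\max_{\pi} \; \mathbb{E}_{\mu \sim \zeta}\big[V^{\pi}_{\mu}\big],
\qquad \zeta(\mu) \propto 2^{-K(\mu)},
$$

where $\zeta$ is a simplicity prior over environments $\mu$, $K(\mu)$ is a description-length measure, and $V^{\pi}_{\mu}$ is the expected utility of policy $\pi$ in environment $\mu$. The objective is defined through the prior and a utility function, not through a reward signal computed inside the environment, which is presumably the sense in which such an agent can treat wireheading as a trap rather than a goal.)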
Many alignment failure modes (wireheading, self-modification, inner misalignment...) can be framed as “traps”, since they hinge on the non-asymptotic properties of generalization, i.e. on the “prior” or “inductive bias”. Therefore, frameworks that explicitly impose a prior (such as metacognitive agents) are useful for understanding and avoiding these failure modes. (But this has little to do with the OP, the way I see it.)