(Warning, very long post, partly thinking out loud. But I endorse the summary. I would be most interested in Eliezer’s response.)
Here is my understanding of the argument:
Something vaguely “consequentialist” is an important part of how humans reason about hard cognitive problems of all kinds (e.g. we must decide what cognitive strategy to use, what to focus our attention on, what topics to explore and which to ignore).
It’s not clear which prediction problems require this kind of consequentialism and which can be solved directly by a brute force search for predictors. (I think Ilya has suggested that the cutoff is something like “anything a human can do in 100ms, you can train directly.”)
However, the behavior of an intelligent agent is in some sense a “universal” example of a hard-to-predict-without-consequentialism phenomenon.
If someone claims to have a solution that “just” requires a predictor, then they haven’t necessarily reduced the complexity of the problem, given that a good predictor depends on something consequentialist. If the predictor only needs to apply in some domain, then maybe the domain is easy and you can attack it more directly. But if that domain includes predicting intelligent agents, then it’s obviously not easy.
Actually building an agent that solves these hard prediction problems will probably require building some kind of consequentialism. So it offers just as much opportunity to kill yourself as the general AI problem.
And if you don’t explicitly build in consequentialism, then you’ve just made the situation even worse. There is still probably consequentialism somewhere inside your model, you just don’t even understand how it works because it was produced by a brute force search.
I think that this argument is mostly right. I also think that many thoughtful ML researchers would agree with the substantive claims, though they might disagree about language. We aren’t going to be able to directly train a simple model to solve all of the cognitive problems a human can solve, but there is a real hope that we could train a simple model to control computational machinery in a way that solves hard cognitive problems. And those policies will be “consequentialist” in the sense that their behavior is optimized to achieve a desired consequence. (A Neural Turing Machine is a simple, mostly theoretical example of this; there are more practical instantiations as well. Moreover, I think it is clear that you can’t actually use full differentiability forever, and at least some of the system is going to have to be trained by RL.)
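To make that picture slightly more concrete, here is a minimal toy sketch of the shape I have in mind: a tiny policy, trained by plain REINFORCE, that learns to control fixed computational machinery (two hand-written “tools”) so as to achieve a desired consequence. The task, the tools, and the update rule are all invented purely for illustration and stand in for much more elaborate systems.

```python
# Toy sketch: a policy trained by RL to *control* external computational machinery.
# The controller is trivial; the point is that its behavior is selected for its
# consequences rather than specified directly.
import numpy as np

rng = np.random.default_rng(0)

# The "computational machinery" the policy controls: fixed, hand-written tools.
TOOLS = [lambda a, b: a + b, lambda a, b: a * b]

def sample_task():
    """Observation: two operands plus a flag saying which result we want."""
    a, b = rng.integers(0, 10, size=2)
    want_product = int(rng.integers(0, 2))      # 0 -> want a+b, 1 -> want a*b
    target = TOOLS[want_product](a, b)
    return np.array([1.0, float(want_product)]), (a, b), target

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)           # logistic policy over which tool to invoke
lr, baseline = 0.5, 0.0

for step in range(2000):
    x, (a, b), target = sample_task()
    p = sigmoid(w @ x)                          # P(choose tool 1 | observation)
    action = int(rng.random() < p)
    result = TOOLS[action](a, b)                # run the chosen subroutine
    reward = 1.0 if result == target else 0.0   # judged only by the consequence
    baseline = 0.95 * baseline + 0.05 * reward
    w += lr * (reward - baseline) * (action - p) * x   # REINFORCE update

correct = sum(
    TOOLS[int(sigmoid(w @ x) > 0.5)](a, b) == target
    for x, (a, b), target in (sample_task() for _ in range(200))
)
print(f"accuracy after training: {correct / 200:.2f}")
```

Nothing about “how to decide” is written down explicitly here; the controller’s routing behavior emerges only because it was optimized for the outcome.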
I get off the boat once we start drawing inferences about what AI control research should look like—at this point I think Eliezer’s argument becomes quite weak.
If Eliezer or Nate were to lay out a precise argument, I think it would be easy to find the exact point where I object. Unfortunately no one is really in a position to be making precise arguments, so everything is going to be a bit blurrier. But here are some of the observations that seem to lead me to a very different conclusion:
A.
Many decision theories, priors, etc. are reflectively consistent. Eliezer imagines an agent which uses the “right” settings in the long run because it started out with the right settings (or some as-yet-unknown framework for handling its uncertainty) and so stuck with them. I imagine an agent which uses the “right” settings in the long run because it defers to humans, and which may in the short term use incorrect decision theory/priors/etc. This is a central advantage of the act-based approach, and in my view no one has really offered a strong response.
The most natural response would be that using a wrong decision theory/prior/etc. in the short term would lead to bad outcomes, even if one had appropriate deference to humans. The strongest version of this argument goes something like “we’ve encountered some surprises in the past (the simulation argument, blackmail by future superintelligences, Pascal’s mugging, some weird stuff with aliens, and so on), and it’s hard to know that we won’t encounter more surprises unless we figure out many of these philosophical issues.”
I think this argument has some merit, but these issues seem to be completely orthogonal to the development of AI (humans seem about as likely to mess these things up as an act-based AI would), and so they should be evaluated separately. I think they look a lot less urgent than AI control—I think the only way you end up with MIRI’s level of interest is if you see our decisions about AI as involving a long-term commitment.
B.
I think that Eliezer at least does not yet understand, or has not yet thought deeply about, the situation where we use RL to train agents how to think. He repeatedly makes remarks about how AI control research targeted at deep learning will not generalize to extremely powerful AI systems, while consistently avoiding engagement with the most plausible scenario, in which deep learning is a central ingredient of powerful AI.
C.
There appears to be a serious methodological disagreement about how AI control research should work.
For existing RL systems, the alignment problem is open. Without solving this problem, it is hard to see how we could build an aligned system which used existing techniques in any substantive way.
Future AI systems may involve new AI techniques that present new difficulties.
I think that we should first resolve, or try our best to resolve, the difficulties posed by existing techniques—whether or not we believe that new techniques will emerge. Once we resolve that problem, we can think about how new techniques will complicate the alignment problem, and try to produce new solutions that will scale to accommodate a wider range of future developments.
Part of my view is that it is much easier to work on problems for which we have a concrete model. Another part is that our work on the alignment problem matters radically more if AI is developed soon. There are a bunch of other issues at play; I discuss a subset here.
I think that Eliezer’s view is something like: we know that future techniques will introduce some qualitatively new difficulties, and those are most likely to be the real big ones. If we understand how to handle those difficulties, then we will be in a radically better position with respect to value alignment. And if we don’t, we are screwed. So we should focus on those difficulties.
Eliezer also believes that the alignment problem is most likely to be very difficult or impossible for systems of the kind that we currently build, such that some new AI techniques are necessary before anyone can build an aligned AI, and such that it is particularly futile to try to solve the alignment problem for existing techniques.
Thanks, Paul—I missed this response earlier, and I think you’ve pointed out some of the major disagreements here.
I agree that there’s something somewhat consequentialist going on during all kinds of complex computation. I’m skeptical that we need better decision theory to do this reliably—are there reasons or intuition-pumps you know of that have a bearing on this?
I mentioned two (which I don’t find persuasive):
Different decision theories / priors / etc. are reflectively consistent, so you may want to make sure to choose the right ones the first time. (I think that the act-based approach basically avoids this.)
We have encountered some surprising possible failure modes, like blackmail by distant superintelligences, and might be concerned that we will run into new surprises if we don’t understand consequentialism well.
I guess there is one more:
If we want to understand what our agents are doing, we need to have a pretty good understanding of how effective decision-making ought to work. Otherwise algorithms whose consequentialism we understand will tend to be beaten out by algorithms whose consequentialism we don’t understand. This may make alignment way harder.