“Additionally, the fact that the predictor uses consequentialist reasoning indicates that you probably need to understand consequentialist reasoning to build the predictor in the first place.”
I’ve had this conversation with Nate before, and I don’t understand why I should think it’s true. Presumably we think we will eventually be able to make predictors that predict a wide variety of systems without us understanding every interesting subset ahead of time, right? Why are consequentialists different?
(Warning, very long post, partly thinking out loud. But I endorse the summary. I would be most interested in Eliezer’s response.)
Here is my understanding of the argument:
Something vaguely “consequentialist” is an important part of how humans reason about hard cognitive problems of all kinds (e.g. we must decide what cognitive strategy to use, what to focus our attention on, what topics to explore and which to ignore).
It’s not clear what prediction problems require this kind of consequentialism and what kinds of prediction problems can be solved directly by a brute force search for predictors. (I think Ilya has suggested that the cutoff is something like “anything a human can do in 100ms, you can train directly.”)
However, the behavior of an intelligent agent is in some sense a “universal” example of a hard-to-predict-without-consequentialism phenomenon.
If someone claims to have a solution that “just” requires a predictor, then they haven’t necessarily reduced the complexity of the problem, given that a good predictor depends on something consequentialist. If the predictor only needs to apply in some domain, then maybe the domain is easy and you can attack it more directly. But if that domain includes predicting intelligent agents, then it’s obviously not easy.
Actually building an agent that solves these hard prediction problems will probably require building some kind of consequentialism. So it offers just as much opportunity to kill yourself as the general AI problem.
And if you don’t explicitly build in consequentialism, then you’ve just made the situation even worse. There is still probably consequentialism somewhere inside your model, you just don’t even understand how it works because it was produced by a brute force search.
I think that this argument is mostly right. I also think that many thoughtful ML researchers would agree with the substantive claims, though they might disagree about language. We aren’t going to be able to directly train a simple model to solve all of the cognitive problems a human can solve, but there is a real hope that we could train a simple model to control computational machinery in a way that solves hard cognitive problems. And those policies will be “consequentialist” in the sense that their behavior is optimized to achieve a desired consequence. (An NTM is a simple mostly theoretical example of this; there are more practical instantiations as well, and moreover I think it is clear that you can’t actually use full differentiability forever and at least some of the system is going to have to be trained by RL.)
I get off the boat once we start drawing inferences about what AI control research should look like—at this point I think Eliezer’s argument becomes quite weak.
If Eliezer or Nate were to lay out a precise argument I think it would be easy to find the precise point where I object. Unfortunately no one is really in the position to be making precise arguments, so everything is going to be a bit blurrier. But here are some of the observations that seem to lead me to a very different conclusion:
A.
Many decision theories, priors, etc. are reflectively consistent. Eliezer imagines an agent which uses the “right” settings in the long run because it started out with the right settings (or some as-yet-unknown framework for handling its uncertainty) and so stuck with them. I imagine an agent which uses the “right” settings in the long run because it defers to humans, and which may in the short term use incorrect decision theory/priors/etc. This is a central advantage of the act-based approach, and in my view no one has really offered a strong response.
The most natural response would be that using a wrong decision theory/prior/etc. in the short term would lead to bad outcomes, even if one had appropriate deference to humans. The strongest version of this argument goes something like “we’ve encountered some surprises in the past, like the simulation argument, blackmail by future superintelligences, some weird stuff with aliens, Pascal’s mugging, etc., and it’s hard to know that we won’t encounter more surprises unless we figure out many of these philosophical issues.”
I think this argument has some merit, but these issues seem to be completely orthogonal to the development of AI (humans might mess these things up about as well as an act-based AI), and so they should be evaluated separately. I think they look a lot less urgent than AI control—I think the only way you end up with MIRI’s level of interest is if you see our decisions about AI as involving a long-term commitment.
B.
I think that Eliezer at least does not yet understand, or has not yet thought deeply about, the situation where we use RL to train agents how to think. He repeatedly makes remarks about how AI control research targeted at deep learning will not generalize to extremely powerful AI systems, while consistently avoiding engagement with the most plausible scenario where deep learning is a central ingredient of powerful AI.
C.
There appears to be a serious methodological disagreement about how AI control research should work.
For existing RL systems, the alignment problem is open. Without solving this problem, it is hard to see how we could build an aligned system which used existing techniques in any substantive way.
Future AI systems may involve new AI techniques that present new difficulties.
I think that we should first resolve, or try our best to resolve, the difficulties posed by existing techniques—whether or not we believe that new techniques will emerge. Once we resolve that problem, we can think about how new techniques will complicate the alignment problem, and try to produce new solutions that will scale to accommodate a wider range of future developments.
Part of my view is that it is much easier to work on problems for which we have a concrete model. Another part is that our work on the alignment problem matters radically more if AI is developed soon. There are a bunch of other issues at play, I discuss a subset here.
I think that Eliezer’s view is something like: we know that future techniques will introduce some qualitatively new difficulties, and those are most likely to be the real big ones. If we understand how to handle those difficulties, then we will be in a radically better position with respect to value alignment. And if we don’t, we are screwed. So we should focus on those difficulties.
Eliezer also believes that the alignment problem is most likely to be very difficult or impossible for systems of the kind that we currently build, such that some new AI techniques are necessary before anyone can build an aligned AI, and such that it is particularly futile to try to solve the alignment problem for existing techniques.
Thanks, Paul—I missed this response earlier, and I think you’ve pointed out some of the major disagreements here.
I agree that there’s something somewhat consequentialist going on during all kinds of complex computation. I’m skeptical that we need better decision theory to do this reliably—are there reasons or intuition-pumps you know of that have a bearing on this?
I mentioned two (which I don’t find persuasive):
Different decision theories / priors / etc. are reflectively consistent, so you may want to make sure to choose the right ones the first time. (I think that the act-based approach basically avoids this.)
We have encountered some surprising possible failure modes, like blackmail by distant superintelligences, and might be concerned that we will run into new surprises if we don’t understand consequentialism well.
I guess there is one more:
If we want to understand what our agents are doing, we need to have a pretty good understanding of how effective decision-making ought to work. Otherwise algorithms whose consequentialism we understand will tend to be beaten out by algorithms whose consequentialism we don’t understand. This may make alignment way harder.
Here’s the argument as I understand it (paraphrasing Nate). If we have a system predict a human making plans, then we need some story for why it can do this effectively. One story is that, like Solomonoff induction, it’s learning a physical model of the human and simulating the human this way. However, in practice, this is unlikely to be the reason an actual prediction engine predicts humans well (it’s too computationally difficult).
So we need some other argument for why the predictor might work. Here’s one argument: perhaps it’s looking at a human making plans, figuring out what humans are planning towards, and using its own planning capabilities (towards the same goal) to predict what plans the human will make. But it seems like, to be confident that this will work, you need to have some understanding of how the predictor’s planning capabilities work. In particular, humans trying to study correct planning run into some theoretical problems including decision theory, and it seems like a system would need to answer some of these same questions in order to predict humans well.
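A minimal sketch of this “infer the goal, then plan toward it” prediction strategy may make it concrete. Everything here is a toy assumption (the candidate goals, the line-world dynamics, the soft-rationality noise model); it only illustrates the shape of the argument, not any actual proposal:

```python
# Toy "predict a planner by inferring its goal": the predictor watches an
# agent walk on a line toward one of several candidate goals, maintains a
# posterior over goals, and predicts the next move by planning toward the
# most probable goal.
import math

GOALS = [0, 10]          # hypothetical candidate goal positions on a line
BETA = 2.0               # how reliably the agent moves toward its goal

def step_prob(pos, move, goal):
    """Soft-rational agent: moves that reduce distance to goal are likelier."""
    gain = abs(pos - goal) - abs(pos + move - goal)
    return math.exp(BETA * gain)

def posterior(trajectory):
    """Bayesian update over goals given observed (position, move) pairs."""
    logp = {g: 0.0 for g in GOALS}
    for pos, move in trajectory:
        for g in GOALS:
            num = step_prob(pos, move, g)
            den = sum(step_prob(pos, m, g) for m in (-1, +1))
            logp[g] += math.log(num / den)
    z = sum(math.exp(v) for v in logp.values())
    return {g: math.exp(v) / z for g, v in logp.items()}

def predict_next_move(trajectory, pos):
    """Plan toward the most probable goal to predict the agent's next move."""
    post = posterior(trajectory)
    g = max(post, key=post.get)
    return +1 if g > pos else -1

# After watching the agent step 5 -> 6 -> 7, the predictor infers goal 10
# and predicts another step to the right.
print(predict_next_move([(5, +1), (6, +1)], 7))
```

Note that the predictor’s accuracy here depends entirely on its own planner (`predict_next_move`) matching the agent’s; a mismatch in how either side plans is exactly the failure mode under discussion.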
I’m not sure what to think of this argument. Paul’s current proposal contains reinforcement learning agents who plan towards an objective defined by a more powerful agent, so it is leaning on the reinforcement learner’s ability to plan towards desirable goals. Rather than understand how the reinforcement learner works internally, Paul proposes giving the reinforcement learner a good enough objective (defined by the powerful agent) such that optimizing this objective is equivalent to optimizing what the humans want. This raises some problems, so probably some additional ingredient is necessary. I suspect I’ll have better opinions on this after thinking about the informed oversight problem some more.
I also asked Nate about the analogy between computer vision and learning to predict a human making plans. It seems like computer vision is an easier problem for a few reasons: it doesn’t require serial thought (so it can be done by e.g. a neural network with a fixed number of layers), humans solve the problem using something similar to neural networks anyway, and planning towards the wrong goal is much more dangerous than recognizing objects incorrectly.
Thanks, Jessica. This argument still doesn’t seem right to me—let me try to explain why.
It seems to me like something more tractable than Solomonoff induction, like an approximate cognitive-level model of a human or the other kinds of models that are being produced now (or will be produced in the future) in machine learning (neural nets, NTMs, other etc.), could be used to approximately predict the actions of humans making plans. This is how I expect most kinds of modeling and inference to work, about humans and about other systems of interest in the world, and it seems like most of my behaviors are approximately predictable using a model of me that falls far short of modeling my full brain. This makes me think that an AI won’t need to have hand-made planning faculties to learn to predict planners (human or otherwise), any more than it’ll need weather faculties to predict weather or physics faculties to predict physical systems. Does that make sense?
(I think the analogy to computer vision points toward the learnability of planning; humans use neural nets to plan, after all!)
It seems like an important part of how humans make plans is that we use some computations to decide what other computations are worth performing. Roughly, we use shallow pattern recognition on a question to determine what strategy to use to think further thoughts, and after thinking those thoughts use shallow pattern recognition to figure out what thought to have after that, eventually leading to answering the question. (I expect the brain’s actual algorithm to be much more complicated than this simple model, but to share some aspects of it).
A system predicting what a human would do would presumably also have to figure out which further thoughts are worth thinking, upon being asked to predict how a human answers a question. For example, if I’m answering a complex math question that I have to break into parts to solve, then for the system to predict my (presumably correct) answer, it might also break the problem into pieces and solve each piece. If it’s bad at determining which thoughts are worth thinking to predict the human’s answer (e.g. it chooses to break the problem into unhelpful pieces), then it will think thoughts that are not very useful for predicting the answer, so it will not be very effective without a huge amount of hardware. I think this is clear when the human is thinking for a long time (e.g. 2 weeks) and less clear for much shorter time periods (e.g. 1 minute, which you might be able to do with shallow pattern recognition in some cases?).
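The “shallow pattern recognition decides which computation to run next” loop can be sketched in miniature. The dispatch table here stands in for learned pattern recognition, and the problem class (splitting an addition) is deliberately trivial; this is not a claim about how brains or NTMs actually do it:

```python
# Toy meta-reasoning loop: a cheap "pattern recognizer" looks at the current
# question and picks which computation to run next (solve directly, or break
# the problem into pieces and recurse on each piece).

def choose_strategy(question):
    """Shallow pattern recognition: dispatch on surface features only."""
    if "+" in question:
        return "split_on_plus"
    return "solve_directly"

def think(question):
    strategy = choose_strategy(question)        # decide what to think next
    if strategy == "split_on_plus":
        left, right = question.split("+", 1)    # break problem into pieces
        return think(left) + think(right)       # solve each piece
    return int(question)                        # base case: direct answer

print(think("1+2+30"))  # the controller decides to split twice, then adds
```

A predictor with a bad `choose_strategy` (say, splitting at unhelpful places) would waste all its computation on useless sub-thoughts, which is the point of the paragraph above.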
At the point where the system is able to figure out what thoughts to think in order to predict the human well, its planning to determine which thoughts to think looks at least as competent as a human’s planning to answer the question, without necessarily using similar intermediate steps in the plan.
It seems like ordinary neural nets can’t decide what to think about (they can only recognize shallow patterns), and perhaps NTMs can. But if an NTM could predict how I answer some questions well (because it’s able to plan out what thoughts to think), I would be scared to ask it to predict my answer to future questions. It seems to be a competent planner, and not one that internally looks like my own thinking or anything I could easily understand. I see the internal approval-direction approach as trying to make systems whose internal planning looks more like planning understood by humans (by supervising the intermediate steps of planning); without internal supervision, we would be running a system capable of making complex plans in a way humans do not understand, which seems dangerous. As an example of a potential problem (not necessarily the most likely one), perhaps the system is very good at planning towards objectives but mediocre at figuring out what objective humans are planning towards, so it predicts plans well during training but occasionally outputs plans optimized for the wrong objective during testing.
It seems likely that very competent physics or weather predictions would also require at least some primitive form of planning what thoughts to think (e.g. maybe the system decides to actually simulate the clouds in an important region). But it seems like you can get decent performance on these tasks with only primitive planning, whereas I don’t expect this for e.g. predicting a human doing novel research over >1-hour timescales.
Did this help explain the argument better? (I still haven’t thought about this argument enough to be that confident that it goes through, but I don’t see any obvious problems at the moment).
I agree with paragraphs 1, 2, and 3. To recap, the question we’re discussing is “do you need to understand consequentialist reasoning to build a predictor that can predict consequentialist reasoners?”
A couple of notes on paragraph 4:
I’m not claiming that neural nets or NTMs are sufficient, just that they represent the kind of thing I expect to increasingly succeed at modeling human decisions (and many other things of interest): model classes that are efficiently learnable, and that don’t include built-in planning faculties.
You are bringing up understandability of an NTM-based human-decision-predictor. I think that’s a fine thing to talk about, but it’s different from the question we were talking about.
You’re also bringing up the danger of consequentialist hypotheses hijacking the overall system. This is fine to talk about as well, but it is also different from the question we were talking about.
In paragraph 5, you seem to be proposing that to make any competent predictor, we’ll need to understand planning. This is a broader assertion, and the argument in favor of it is different from the original argument (“predicting planners requires planning faculties so that you can emulate the planner” vs “predicting anything requires some amount of prioritization and decision-making”). In these cases, I’m more skeptical that a deep theoretical understanding of decision-making is important, but I’m open to talking about it—it just seems different from the original question.
Overall, I feel like this response is out-of-scope for the current question—does that make sense, or do I seem off-base?
I see more now what you’re saying about NTMs. In some sense NTMs don’t have “built-in” planning capabilities; to the extent that they plan well, it’s because they learned transition functions that make plans, since those work better for predicting some things. I think it’s likely that you can get planning capabilities in this manner, without actually understanding how the planning works internally. So it seems like there isn’t actually disagreement on this point (sorry for misinterpreting the question). The more controversial point is that you need to understand planning to train safe predictors of humans making plans.
I don’t think I was bringing up consequentialist hypotheses hijacking the system in this paragraph. I was noting the danger of having a system (which is in some sense just trying to predict humans well) output a plan it thinks a human would produce after thinking a very long time, given that it is good at predicting plans toward an objective but bad at predicting the humans’ objective.
Regarding paragraph 5: I was trying to say that you probably only need primitive planning abilities for a lot of prediction tasks, in some cases ones we already understand today. For example, you might use a deep neural net for deciding which weather simulations are worth running, and reinforce the deep neural net on the extent to which running the weather simulation changed the system’s accuracy. This is probably sufficient for a lot of applications.
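A sketch of that weather example may help, with a simple epsilon-greedy bandit standing in for the deep neural net (and all the accuracy numbers invented for illustration): a learned controller picks which expensive simulation to run and is reinforced by how much the simulation improved the system’s accuracy.

```python
# Toy version of "learn which expensive computations are worth running":
# an epsilon-greedy bandit picks which of several simulations to run, and
# is rewarded by the (noisy) improvement in prediction accuracy.
import random

random.seed(0)
# Hypothetical true accuracy gain from running each simulation.
ACCURACY_GAIN = {"sim_clouds": 0.30, "sim_ocean": 0.05, "sim_wind": 0.10}

values = {a: 0.0 for a in ACCURACY_GAIN}   # estimated value of each action
counts = {a: 0 for a in ACCURACY_GAIN}

def pick_action(eps=0.1):
    if random.random() < eps:
        return random.choice(list(values))  # explore
    return max(values, key=values.get)      # exploit current best estimate

for _ in range(500):
    a = pick_action()
    reward = ACCURACY_GAIN[a] + random.gauss(0, 0.05)  # noisy accuracy delta
    counts[a] += 1
    values[a] += (reward - values[a]) / counts[a]      # running average

# The controller learns that simulating the clouds is the most useful.
print(max(values, key=values.get))
```

This is “primitive planning” in the sense of the paragraph above: the controller only ranks a fixed menu of computations, with no multi-step plan structure.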
Thanks Jessica—sorry I misunderstood about hijacking. A couple of questions:
Is there a difference between “safe” and “accurate” predictors? I’m now thinking that you’re worried about NTMs basically making inaccurate predictions, and that accurate predictors of planning will require us to understand planning.
My feeling is that today’s understanding of planning—if I run this computation, I will get the result, and if I run it again, I’ll get the same one—is sufficient for harder prediction tasks. Are there particular aspects of planning that we don’t yet understand well that you expect to be important for planning computation during prediction?
A very accurate predictor will be safe. A predictor that is somewhat accurate but not very accurate could be unsafe. So yes, I’m concerned that with a realistic amount of computing resources, NTMs might make dangerous partially-accurate predictions, even though they would make safe accurate predictions with a very large amount of computing resources. This seems like it will be true if the NTM is predicting the human’s actions by trying to infer the human’s goal and then outputting a plan towards this goal, though perhaps there are other strategies for efficiently predicting a human. (I think some of the things I said previously were confusing—I said that it seems likely that an NTM can learn to plan well unsafely, which seems confusing since it can only be unsafe by making bad predictions. As an extreme example, perhaps the NTM essentially implements a consequentialist utility maximizer that decides what predictions to output; these predictions will be correct sometimes and incorrect whenever it is in the consequentialist utility maximizer’s interest).
It seems like current understanding of planning is already running into bottlenecks—e.g. see the people working on attention for neural networks. If the NTM is predicting a human making plans by inferring the human’s goal and planning towards this goal, then there needs to be some story for what e.g. decision theory and logical uncertainty it’s using in making these plans. For it to be the right decision theory, it must have this decision theory in its hypothesis space somewhere. In situations where the problem of planning out what computations to do to predict a human is as complicated as the human’s actual planning, and the human’s planning involves complex decision theory (e.g. the human is writing a paper on decision theory), this might be a problem. So you might need to understand some amount of decision theory / logical uncertainty to make this predictor.
(note that I’m not completely sold on this argument; I’m attempting to steelman it)
Thanks Jessica, I think we’re on similar pages—I’m also interested in how to ensure that predictions of humans are accurate and non-adversarial, and I think there are probably a lot of interesting problems there.
If the NTMs get to look at the predictions of the other NTMs when making their own predictions (there’s probably a fixed-point way to do this), then maybe there’s one out there that copies one of the versions of 3 but makes adjustments for 3’s bad decision theory.
Why not say “If X is a model using a bad decision theory, there is a closely related model X’ that uses a better decision theory and makes better predictions. So once we have some examples that distinguish the two cases, we will use X’ rather than X.”
Sometimes this kind of argument doesn’t work and you can get tighter guarantees by considering the space of modifications (by coincidence this exact situation arises here), but I don’t see why this case in particular would bring up that issue.
Suppose there are N binary dimensions that predictors can vary on. Then we’d need 2^N predictors to cover every possibility. On the other hand, we would only need to consider N possible modifications to a predictor. Of course, if the dimensions factor that nicely, then you can probably make enough assumptions about the hypothesis class that you can learn from the 2^N experts efficiently.
Overall it seems nicer to have a guarantee of the form “if there is a predictable bias in the predictions, then the system will correct this bias” rather than “if there is a strictly better predictor than a bad predictor, then the system will listen to the good predictor”, since it allows capabilities to be distributed among predictors instead of needing to be concentrated in a single predictor. But maybe things work anyway for the reason you gave.
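The “listen to the better predictor” guarantee in its simplest form: run multiplicative weights over a biased base predictor and a modified copy that corrects the bias, and the weight concentrates on the corrected one once distinguishing examples arrive. The specific bias (predicting 0.2 too low) and learning rate are made up for illustration:

```python
# Toy multiplicative-weights illustration: a biased expert vs. a corrected
# copy of it. After repeated examples that distinguish them, the aggregate
# listens almost entirely to the corrected expert.

TRUTH = 0.7
experts = {"biased": TRUTH - 0.2, "corrected": TRUTH}  # constant predictions
weights = {name: 1.0 for name in experts}
ETA = 1.0  # learning rate

for _ in range(50):
    for name, pred in experts.items():
        loss = (pred - TRUTH) ** 2
        weights[name] *= (1 - ETA * loss)     # multiplicative weights update
    z = sum(weights.values())
    weights = {n: w / z for n, w in weights.items()}

print(round(weights["corrected"], 3))
```

This is the weaker of the two guarantees contrasted above: it only works because the corrected expert exists as a single predictor in the pool, rather than the bias-correction being applied compositionally.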
(The discussion seems to apply without modification to any predictor.)
It seems like “gets the wrong decision theory” is a really mild failure mode. If you can’t cope with that, there is no way you are going to cope with actually malignant failures.
Maybe the designer wasn’t counting on dealing with malignant failures at all, and this is an extra reminder that there can be subtle errors that don’t manifest most of the time. But I don’t think it’s much of an argument for understanding philosophy in particular.
“Additionally, the fact that the predictor uses consequentialist reasoning indicates that you probably need to understand consequentialist reasoning to build the predictor in the first place.”
I’ve had this conversation with Nate before, and I don’t understand why I should think it’s true. Presumably we think we will eventually be able to make predictors that predict a wide variety of systems without us understanding every interesting subset ahead of time, right? Why are consequentialists different?
Here is my understanding of the argument:
(Warning, very long post, partly thinking out loud. But I endorse the summary. I would be most interested in Eliezer’s response.)
Something vaguely “consequentialist” is an important part of how humans reason about hard cognitive problems of all kinds (e.g. we must decide what cognitive strategy to use, what to focus our attention on, what topics to explore and which to ignore).
It’s not clear what prediction problems require this kind of consequentialism and what kinds of prediction problems can be solved directly by a brute force search for predictors. (I think Ilya has suggested that the cutoff is something like “anything a human can do in 100ms, you can train directly.”)
However, the behavior of an intelligent agent is in some sense a “universal” example of a hard-to-predict-without-consequentialism phenomenon.
If someone claims to have a solution that “just” requires a predictor, then they haven’t necessarily reduced the complexity of the problem, given that a good predictor depend on something consequentialist. If the predictor only needs to apply in some domain, then maybe the domain is easy and you can attack it more directly. But if that domain includes predicting intelligent agents, then it’s obviously not easy.
Actually building an agent that solves these hard prediction problems will probably require building some kind of consequentialism. So it offers just as much opportunity to kill yourself as the general AI problem.
And if you don’t explicitly build in consequentialism, then you’ve just made the situation even worse. There is still probably consequentialism somewhere inside your model, you just don’t even understand how it works because it was produced by a brute force search.
I think that this argument is mostly right. I also think that many thoughtful ML researchers would agree with the substantive claims, though they might disagree about language. We aren’t going to be able to directly train a simple model to solve all of the cognitive problems a human can solve, but there is a real hope that we could train a simple model to control computational machinery in a way that solves hard cognitive problems. And those policies will be “consequentialist” in the sense that their behavior is optimized to achieve a desired consequence. (An NTM is a simple mostly theoretical example of this; there are more practical instantiations as well, and moreover I think it is clear that you can’t actually use full differentiability forever and at least some of the system is going to have to be trained by RL.)
I get off the boat once we start drawing inferences about what AI control research should look like—at this point I think Eliezer’s argument becomes quite weak.
If Eliezer or Nate were to lay out a precise argument I think it would be easy to find the precise point where I object. Unfortunately no one is really in the position to be making precise arguments, so everything is going to be a bit blurrier. But here are some of the observations that seem to lead me to a very different conclusion:
A.
Many decision theories, priors, etc. are reflectively consistent. Eliezer imagines an agent which uses the “right” settings in the long run because it started out with the right settings (or some as-yet-unknown framework for handling its uncertainty) and so stuck with them. I imagine an agent which uses the “right” settings in the long run because it defers to humans, and which may in the short term use incorrect decision theory/priors/etc. This is a central advantage of the act-based approach, and in my view no one has really offered a strong response.
The most natural response would be that using a wrong decision theory/prior/etc. in the short term would lead to bad outcomes, even if one had appropriate deference to humans. The strongest version of this argument goes something like “we’ve encountered some surprises in the past, like the simulation argument, blackmail by future superintelligences, some weird stuff with aliens etc., Pascal’s mugging, and it’s hard to know that we won’t encounter more surprises unless we figure out many of these philosophical issues.”
I think this argument has some merit, but these issues seem to be completely orthogonal to the development of AI (humans might mess these things up about as well as an act-based AI), and so they should be evaluated separately. I think they look a lot less urgent than AI control—I think the only way you end up with MIRI’s level of interest is if you see our decisions about AI as involving a long-term commitment.
B.
I think that Eliezer at least does not yet understand, or has not yet thought deeply about, the situation where we use RL to train agents how to think. He repeatedly makes remarks about how AI control research targeted at deep learning will not generalize to extremely powerful AI systems, while consistently avoiding engagement with the most plausible scenario where deep learning is a central ingredient of powerful AI.
C.
There appears to be a serious methodological disagreement about how AI control research should work.
For existing RL systems, the alignment problem is open. Without solving this problem, it is hard to see how we could build an aligned system which used existing techniques in any substantive way.
Future AI systems may involve new AI techniques that present new difficulties.
I think that we should first resolve, or try our best to resolve, the difficulties posed by existing techniques—whether or not we believe that new techniques will emerge. Once we resolve that problem, we can think about how new techniques will complicate the alignment problem, and try to produce new solutions that will scale to accommodate a wider range of future developments.
Part of my view is that it is much easier to work on problems for which we have a concrete model. Another part is that our work on the alignment problem matters radically more if AI is developed soon. There are a bunch of other issues at play, I discuss a subset here.
I think that Eliezer’s view is something like: we know that future techniques will introduce some qualitatively new difficulties, and those are most likely to be the real big ones. If we understand how to handle those difficulties, then we will be in a radically better position with respect to value alignment. And if we don’t, we are screwed. So we should focus on those difficulties.
Eliezer also believes that the alignment problem is most likely to be very difficult or impossible for systems of the kind that we currently build, such that some new AI techniques are necessary before anyone can build an aligned AI, and such that it is particularly futile to try to solve the alignment problem for existing techniques.
Thanks, Paul—I missed this response earlier, and I think you’ve pointed out some of the major disagreements here.
I agree that there’s something somewhat consequentialist going on during all kinds of complex computation. I’m skeptical that we need better decision theory to do this reliably—are there reasons or intuition-pumps you know of that have a bearing on this?
I mentioned two (which I don’t find persuasive):
Different decision theories / priors / etc. are reflectively consistent, so you may want to make sure to choose the right ones the first time. (I think that the act-based approach basically avoids this.)
We have encountered some surprising possible failure modes, like blackmail by distant superintelligences, and might be concerned that we will run into new surprises if we don’t understand consequentialism well.
I guess there is one more:
If we want to understand what our agents are doing, we need to have a pretty good understanding of how effective decision-making ought to work. Otherwise algorithms whose consequentialism we understand will tend to be beaten out by algorithms whose consequentialism we don’t understand. This may make alignment way harder.
Here’s the argument as I understand it (paraphrasing Nate). If we have a system predict a human making plans, then we need some story for why it can do this effectively. One story is that, like Solomonoff induction, it’s learning a physical model of the human and simulating the human this way. However, in practice, this is unlikely to be the reason an actual prediction engine predicts humans well (it’s too computationally difficult).
So we need some other argument for why the predictor might work. Here’s one argument: perhaps it’s looking at a human making plans, figuring out what the human is planning towards, and using its own planning capabilities (towards the same goal) to predict what plans the human will make. But it seems like, to be confident that this will work, you need to have some understanding of how the predictor’s planning capabilities work. In particular, humans trying to study correct planning run into some theoretical problems including decision theory, and it seems like a system would need to answer some of these same questions in order to predict humans well.
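The infer-the-goal-then-plan story can be made concrete with a toy sketch (all goals, actions, and numbers here are invented for illustration): the "human" is softmax-rational over two candidate goals, and the predictor infers a posterior over goals from observed actions, then reuses its own planner to predict the next action.

```python
import math

# Hypothetical toy setup: the "human" acts softmax-rationally toward one
# of a small set of goals. The predictor (a) infers the goal from observed
# actions, then (b) predicts the next action by planning toward each
# candidate goal, weighted by its posterior.

GOALS = ["A", "B"]
ACTIONS = ["left", "right"]

# UTILITY[goal][action]: how much each action advances each goal (invented)
UTILITY = {
    "A": {"left": 1.0, "right": 0.0},
    "B": {"left": 0.0, "right": 1.0},
}

def policy(goal, beta=2.0):
    """Softmax-rational action distribution for a given goal."""
    z = sum(math.exp(beta * UTILITY[goal][a]) for a in ACTIONS)
    return {a: math.exp(beta * UTILITY[goal][a]) / z for a in ACTIONS}

def infer_goal(observed_actions):
    """Posterior over goals given observed actions (uniform prior)."""
    post = {g: 1.0 / len(GOALS) for g in GOALS}
    for act in observed_actions:
        for g in GOALS:
            post[g] *= policy(g)[act]
    total = sum(post.values())
    return {g: p / total for g, p in post.items()}

def predict_next_action(observed_actions):
    """Predict by planning toward each inferred goal, weighted by posterior."""
    post = infer_goal(observed_actions)
    return {a: sum(post[g] * policy(g)[a] for g in GOALS) for a in ACTIONS}

# After two observed "left" moves, goal A dominates the posterior,
# so the predictor's own planning toward A drives the prediction.
pred = predict_next_action(["left", "left"])
```

The point of the sketch is that step (b) is the predictor's own planning machinery: if that machinery answers the decision-theoretic questions differently than the human does, the predictions come apart exactly where it matters.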
I’m not sure what to think of this argument. Paul’s current proposal contains reinforcement learning agents who plan towards an objective defined by a more powerful agent, so it is leaning on the reinforcement learner’s ability to plan towards desirable goals. Rather than understand how the reinforcement learner works internally, Paul proposes giving the reinforcement learner a good enough objective (defined by the powerful agent) such that optimizing this objective is equivalent to optimizing what the humans want. This raises some problems, so probably some additional ingredient is necessary. I suspect I’ll have better opinions on this after thinking about the informed oversight problem some more.
I also asked Nate about the analogy between computer vision and learning to predict a human making plans. It seems like computer vision is an easier problem for a few reasons: it doesn’t require serial thought (so it can be done by e.g. a neural network with a fixed number of layers), humans solve the problem using something similar to neural networks anyway, and planning towards the wrong goal is much more dangerous than recognizing objects incorrectly.
Thanks, Jessica. This argument still doesn’t seem right to me—let me try to explain why.
It seems to me like something more tractable than Solomonoff induction, like an approximate cognitive-level model of a human or the other kinds of models that are being produced now (or will be produced in the future) in machine learning (neural nets, neural Turing machines (NTMs), etc.), could be used to approximately predict the actions of humans making plans. This is how I expect most kinds of modeling and inference to work, about humans and about other systems of interest in the world, and it seems like most of my behaviors are approximately predictable using a model of me that falls far short of modeling my full brain. This makes me think that an AI won’t need to have hand-made planning faculties to learn to predict planners (human or otherwise), any more than it’ll need weather faculties to predict weather or physics faculties to predict physical systems. Does that make sense?
(I think the analogy to computer vision points toward the learnability of planning; humans use neural nets to plan, after all!)
It seems like an important part of how humans make plans is that we use some computations to decide what other computations are worth performing. Roughly, we use shallow pattern recognition on a question to determine what strategy to use to think further thoughts, and after thinking those thoughts use shallow pattern recognition to figure out what thought to have after that, eventually leading to answering the question. (I expect the brain’s actual algorithm to be much more complicated than this simple model, but to share some aspects of it).
A system predicting what a human would do would presumably also have to figure out which further thoughts are worth thinking, upon being asked to predict how a human answers a question. For example, if I’m answering a complex math question that I have to break into parts to solve, then for the system to predict my (presumably correct) answer, it might also break the problem into pieces and solve each piece. If it’s bad at determining which thoughts are worth thinking to predict the human’s answer (e.g. it chooses to break the problem into unhelpful pieces), then it will think thoughts that are not very useful for predicting the answer, so it will not be very effective without a huge amount of hardware. I think this is clear when the human is thinking for a long time (e.g. 2 weeks) and less clear for much shorter time periods (e.g. 1 minute, which you might be able to do with shallow pattern recognition in some cases?).
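A minimal sketch of this decide-what-to-compute loop (the problem format and the heuristic are invented for illustration): a cheap "shallow pattern" check decides whether to answer a question directly or break it into pieces, and each piece is handled the same way.

```python
# Hypothetical sketch: a controller uses a cheap shallow-pattern check to
# decide whether to answer directly or decompose, mirroring the
# break-into-parts story in the text. Here the "question" is just a list
# of numbers to sum, standing in for a genuinely hard problem.

def looks_easy(problem):
    """Shallow pattern recognition: small problems get answered directly."""
    return len(problem) <= 2

def solve(problem):
    """Answer directly if the problem looks easy; otherwise decompose."""
    if looks_easy(problem):
        return sum(problem)          # direct, cheap computation
    mid = len(problem) // 2
    # The controller chose to decompose; each piece is solved recursively.
    return solve(problem[:mid]) + solve(problem[mid:])

answer = solve([3, 1, 4, 1, 5, 9])
```

In this toy version a bad split merely wastes recursion; in the scenario above, a predictor that splits the problem into unhelpful pieces burns its hardware budget on thoughts that don't help predict the human's answer.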
At the point where the system is able to figure out what thoughts to think in order to predict the human well, its planning to determine which thoughts to think looks at least as competent as a human’s planning to answer the question, without necessarily using similar intermediate steps in the plan.
It seems like ordinary neural nets can’t decide what to think about (they can only recognize shallow patterns), whereas perhaps NTMs can. But if an NTM could predict how I answer some questions well (because it’s able to plan out what thoughts to think), I would be scared to ask it to predict my answer to future questions. It seems to be a competent planner, and not one that internally looks like my own thinking or anything I could easily understand. I see the internal approval-directed approach as trying to make systems whose internal planning looks more like planning understood by humans (by supervising the intermediate steps of planning); without internal supervision, we would be running a system capable of making complex plans in a way humans do not understand, which seems dangerous. As an example of a potential problem (not necessarily the most likely one), perhaps the system is very good at planning towards objectives but mediocre at figuring out what objective humans are planning towards, so it predicts plans well during training but occasionally outputs plans optimized for the wrong objective during testing.
It seems likely that very competent physics or weather predictions would also require at least some primitive form of planning what thoughts to think (e.g. maybe the system decides to actually simulate the clouds in an important region). But it seems like you can get decent performance on these tasks with only primitive planning, whereas I don’t expect this for e.g. predicting a human doing novel research over >1-hour timescales.
Did this help explain the argument better? (I still haven’t thought about this argument enough to be that confident that it goes through, but I don’t see any obvious problems at the moment).
I agree with paragraphs 1, 2, and 3. To recap, the question we’re discussing is “do you need to understand consequentialist reasoning to build a predictor that can predict consequentialist reasoners?”
A couple of notes on paragraph 4:
I’m not claiming that neural nets or NTMs are sufficient, just that they represent the kind of thing I expect to increasingly succeed at modeling human decisions (and many other things of interest): model classes that are efficiently learnable, and that don’t include built-in planning faculties.
You are bringing up understandability of an NTM-based human-decision-predictor. I think that’s a fine thing to talk about, but it’s different from the question we were talking about.
You’re also bringing up the danger of consequentialist hypotheses hijacking the overall system. This is fine to talk about as well, but it is also different from the question we were talking about.
In paragraph 5, you seem to be proposing that to make any competent predictor, we’ll need to understand planning. This is a broader assertion, and the argument in favor of it is different from the original argument (“predicting planners requires planning faculties so that you can emulate the planner” vs “predicting anything requires some amount of prioritization and decision-making”). In these cases, I’m more skeptical that a deep theoretical understanding of decision-making is important, but I’m open to talking about it—it just seems different from the original question.
Overall, I feel like this response is out-of-scope for the current question—does that make sense, or do I seem off-base?
Regarding paragraph 4:
I see more now what you’re saying about NTMs. In some sense NTMs don’t have “built-in” planning capabilities; to the extent that they plan well, it’s because transition functions that make plans turn out to predict some things better. I think it’s likely that you can get planning capabilities in this manner, without actually understanding how the planning works internally. So it seems like there isn’t actually disagreement on this point (sorry for misinterpreting the question). The more controversial point is that you need to understand planning to train safe predictors of humans making plans.
I don’t think I was bringing up consequentialist hypotheses hijacking the system in this paragraph. I was noting the danger of having a system (which is in some sense just trying to predict humans well) output a plan it thinks a human would produce after thinking a very long time, given that it is good at predicting plans toward an objective but bad at predicting the humans’ objective.
Regarding paragraph 5: I was trying to say that you probably only need primitive planning abilities for a lot of prediction tasks, in some cases ones we already understand today. For example, you might use a deep neural net for deciding which weather simulations are worth running, and reinforce the deep neural net on the extent to which running the weather simulation changed the system’s accuracy. This is probably sufficient for a lot of applications.
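One hedged sketch of this suggestion (simulation names and payoffs are invented stand-ins): an epsilon-greedy learner chooses which simulation to run and is reinforced on the measured accuracy gain from running it.

```python
import random

# Hypothetical sketch of the suggestion above: a simple learner decides
# which of several candidate simulations to run, and is reinforced on how
# much running that simulation improved the system's accuracy. The
# simulation names and "true gain" numbers are invented for illustration.

random.seed(0)

SIMULATIONS = ["coarse_global", "fine_regional", "cloud_detail"]
TRUE_GAIN = {"coarse_global": 0.1, "fine_regional": 0.5, "cloud_detail": 0.3}

value = {s: 0.0 for s in SIMULATIONS}   # estimated accuracy gain per simulation
counts = {s: 0 for s in SIMULATIONS}

def run_simulation(sim):
    """Stand-in for running a simulation and measuring the accuracy gain."""
    return TRUE_GAIN[sim] + random.gauss(0, 0.05)

for step in range(500):
    if random.random() < 0.1:            # explore a random simulation
        sim = random.choice(SIMULATIONS)
    else:                                # exploit the current best estimate
        sim = max(SIMULATIONS, key=lambda s: value[s])
    gain = run_simulation(sim)
    counts[sim] += 1
    value[sim] += (gain - value[sim]) / counts[sim]   # incremental mean

best = max(SIMULATIONS, key=lambda s: value[s])
```

This is "primitive planning" in the sense above: the learner allocates computation without any explicit model of why one simulation helps, which is plausibly enough for weather but not for predicting a researcher over long timescales.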
Thanks Jessica—sorry I misunderstood about hijacking. A couple of questions:
Is there a difference between “safe” and “accurate” predictors? I’m now thinking that you’re worried about NTMs basically making inaccurate predictions, and that accurate predictors of planning will require us to understand planning.
My feeling is that today’s understanding of planning (if I run this computation, I will get the result, and if I run it again, I’ll get the same one) is sufficient for harder prediction tasks. Are there particular aspects of planning that we don’t yet understand well that you expect to be important for planning computation during prediction?
A very accurate predictor will be safe. A predictor that is somewhat accurate but not very accurate could be unsafe. So yes, I’m concerned that with a realistic amount of computing resources, NTMs might make dangerous partially-accurate predictions, even though they would make safe accurate predictions with a very large amount of computing resources. This seems like it will be true if the NTM is predicting the human’s actions by trying to infer the human’s goal and then outputting a plan towards this goal, though perhaps there are other strategies for efficiently predicting a human. (I think some of the things I said previously were confusing—I said that it seems likely that an NTM can learn to plan well unsafely, which seems confusing since it can only be unsafe by making bad predictions. As an extreme example, perhaps the NTM essentially implements a consequentialist utility maximizer that decides what predictions to output; these predictions will be correct sometimes and incorrect whenever it is in the consequentialist utility maximizer’s interest).
It seems like current understanding of planning is already running into bottlenecks—e.g. see the people working on attention for neural networks. If the NTM is predicting a human making plans by inferring the human’s goal and planning towards this goal, then there needs to be some story for what e.g. decision theory and logical uncertainty it’s using in making these plans. For it to be the right decision theory, it must have this decision theory in its hypothesis space somewhere. In situations where the problem of planning out what computations to do to predict a human is as complicated as the human’s actual planning, and the human’s planning involves complex decision theory (e.g. the human is writing a paper on decision theory), this might be a problem. So you might need to understand some amount of decision theory / logical uncertainty to make this predictor.
(note that I’m not completely sold on this argument; I’m attempting to steelman it)
Thanks Jessica, I think we’re on similar pages—I’m also interested in how to ensure that predictions of humans are accurate and non-adversarial, and I think there are probably a lot of interesting problems there.
Why not say “If X is a model using a bad decision theory, there is a closely related model X’ that uses a better decision theory and makes better predictions. So once we have some examples that distinguish the two cases, we will use X’ rather than X.”
Sometimes this kind of argument doesn’t work and you can get tighter guarantees by considering the space of modifications (by coincidence this exact situation arises here), but I don’t see why this case in particular would bring up that issue.
Suppose there are N binary dimensions that predictors can vary on. Then we’d need 2^N predictors to cover every possibility. On the other hand, we would only need to consider N possible modifications to a predictor. Of course, if the dimensions factor that nicely, then you can probably make enough assumptions about the hypothesis class that you can learn from the 2^N experts efficiently.
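The counting can be illustrated with the standard Hedge regret bound, which scales as sqrt(T ln K) for K experts over T rounds: covering all 2^N combined predictors costs only a factor of roughly N in regret relative to tracking the N single modifications, though actually simulating 2^N experts stays intractable unless the dimensions factor.

```python
import math

# Illustrative arithmetic for the counting argument above. With N binary
# dimensions, running an experts algorithm over all 2^N combined predictors
# has regret scaling with ln(2^N) = N ln 2, versus ln(N) for the N
# single-modification predictors. sqrt(T ln K) is the standard Hedge rate;
# N and T below are arbitrary example values.

def hedge_regret_bound(T, K):
    """Standard O(sqrt(T ln K)) regret bound for K experts over T rounds."""
    return math.sqrt(T * math.log(K))

N, T = 20, 10_000
full_cover = hedge_regret_bound(T, 2 ** N)    # all 2^N combined predictors
modifications = hedge_regret_bound(T, N)      # only N modifications

# The exponential blowup in the number of experts costs only a polynomial
# (roughly factor-sqrt(N)) blowup in the regret bound; the computational
# cost of running 2^N experts is the real obstacle.
```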
Overall it seems nicer to have a guarantee of the form “if there is a predictable bias in the predictions, then the system will correct this bias” rather than “if there is a strictly better predictor than a bad predictor, then the system will listen to the good predictor”, since it allows capabilities to be distributed among predictors instead of needing to be concentrated in a single predictor. But maybe things work anyway for the reason you gave.
(The discussion seems to apply without modification to any predictor.)
It seems like “gets the wrong decision theory” is a really mild failure mode. If you can’t cope with that, there is no way you are going to cope with actually malignant failures.
Maybe the designer wasn’t counting on dealing with malignant failures at all, and this is an extra reminder that there can be subtle errors that don’t manifest most of the time. But I don’t think it’s much of an argument for understanding philosophy in particular.
I agree with this.