I agree with paragraphs 1, 2, and 3. To recap, the question we’re discussing is “do you need to understand consequentialist reasoning to build a predictor that can predict consequentialist reasoners?”
A couple of notes on paragraph 4:
I’m not claiming that neural nets or NTMs are sufficient, just that they represent the kind of thing I expect to increasingly succeed at modeling human decisions (and many other things of interest): model classes that are efficiently learnable, and that don’t include built-in planning faculties.
You are bringing up understandability of an NTM-based human-decision-predictor. I think that’s a fine thing to talk about, but it’s different from the question we were talking about.
You’re also bringing up the danger of consequentialist hypotheses hijacking the overall system. This is fine to talk about as well, but it is also different from the question we were talking about.
In paragraph 5, you seem to be proposing that to make any competent predictor, we’ll need to understand planning. This is a broader assertion, and the argument in favor of it is different from the original argument (“predicting planners requires planning faculties so that you can emulate the planner” vs “predicting anything requires some amount of prioritization and decision-making”). In these cases, I’m more skeptical that a deep theoretical understanding of decision-making is important, but I’m open to talking about it—it just seems different from the original question.
Overall, I feel like this response is out-of-scope for the current question—does that make sense, or do I seem off-base?
Regarding paragraph 4:
I see more now what you’re saying about NTMs. In some sense NTMs don’t have “built-in” planning capabilities; to the extent that they plan well, it’s because they learned transition functions that make plans, since those work better for predicting some things. I think it’s likely that you can get planning capabilities this way without actually understanding how the planning works internally. So it seems like there isn’t actually disagreement on this point (sorry for misinterpreting the question). The more controversial point is that you need to understand planning to train safe predictors of humans making plans.
I don’t think I was bringing up consequentialist hypotheses hijacking the system in this paragraph. I was noting the danger of having a system (which is in some sense just trying to predict humans well) output a plan it thinks a human would produce after thinking for a very long time, given that it is good at predicting plans toward an objective but bad at predicting the human’s objective.
Regarding paragraph 5: I was trying to say that you probably only need primitive planning abilities for a lot of prediction tasks—in some cases, abilities we already understand today. For example, you might use a deep neural net for deciding which weather simulations are worth running, and reinforce the deep neural net on the extent to which running the weather simulation changed the system’s accuracy. This is probably sufficient for a lot of applications.
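A minimal sketch of that reinforcement loop, with everything invented for illustration: a linear scorer stands in for the deep net, and the “accuracy gain” from running a simulation is just a hidden function of the simulation’s features.

```python
import random

random.seed(0)

# Hypothetical ground truth, hidden from the scorer: running a simulation
# improves accuracy roughly in proportion to its first feature.
def accuracy_gain(features):
    return features[0]

# Trivial stand-in for the deep net: one learned weight per feature.
weights = [0.0, 0.0]

def score(features):
    return sum(w * f for w, f in zip(weights, features))

def reinforce(features, gain, lr=0.1):
    # Nudge the predicted value toward the accuracy change actually observed.
    err = gain - score(features)
    for i, f in enumerate(features):
        weights[i] += lr * err * f

# Training: run candidate simulations and reinforce on the observed gain.
for _ in range(2000):
    features = [random.random(), random.random()]
    reinforce(features, accuracy_gain(features))

# Deployment: only run simulations the scorer expects to pay off.
candidate = [0.9, 0.1]
print(score(candidate) > 0.3)
```

The point is only that the selection policy is learned from observed accuracy changes; no explicit planning machinery is built in.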
Thanks Jessica—sorry I misunderstood about hijacking. A couple of questions:
Is there a difference between “safe” and “accurate” predictors? I’m now thinking that you’re worried about NTMs basically making inaccurate predictions, and that accurate predictors of planning will require us to understand planning.
My feeling is that our current understanding of planning—if I run this computation, I will get a result, and if I run it again, I’ll get the same one—is sufficient for harder prediction tasks. Are there particular aspects of planning that we don’t yet understand well but that you expect to be important for planning computations during prediction?
A very accurate predictor will be safe. A predictor that is somewhat accurate but not very accurate could be unsafe. So yes, I’m concerned that with a realistic amount of computing resources, NTMs might make dangerous partially-accurate predictions, even though they would make safe accurate predictions with a very large amount of computing resources. This seems like it will be true if the NTM is predicting the human’s actions by trying to infer the human’s goal and then outputting a plan towards this goal, though perhaps there are other strategies for efficiently predicting a human. (I think some of the things I said previously were confusing—I said that it seems likely that an NTM can learn to plan well unsafely, which seems confusing since it can only be unsafe by making bad predictions. As an extreme example, perhaps the NTM essentially implements a consequentialist utility maximizer that decides what predictions to output; these predictions will be correct sometimes and incorrect whenever it is in the consequentialist utility maximizer’s interest).
It seems like our current understanding of planning is already running into bottlenecks—e.g. see the people working on attention for neural networks. If the NTM is predicting a human making plans by inferring the human’s goal and planning towards this goal, then there needs to be some story about which decision theory and which theory of logical uncertainty it’s using in making these plans. For it to be the right decision theory, that decision theory must be in the NTM’s hypothesis space somewhere. In situations where the problem of planning out which computations to do to predict a human is as complicated as the human’s actual planning, and the human’s planning involves complex decision theory (e.g. the human is writing a paper on decision theory), this might be a problem. So you might need to understand some amount of decision theory / logical uncertainty to make this predictor.
(note that I’m not completely sold on this argument; I’m attempting to steelman it)
Thanks Jessica, I think we’re on similar pages—I’m also interested in how to ensure that predictions of humans are accurate and non-adversarial, and I think there are probably a lot of interesting problems there.
If the NTMs get to look at the predictions of the other NTMs when making their own predictions (there’s probably a fixed-point way to do this), then maybe there’s one out there that copies one of the versions of 3 but makes adjustments for 3’s bad decision theory.
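One naive way to realize the fixed-point idea (the three predictors here are toy functions, and plain iteration isn’t guaranteed to converge in general):

```python
# Each predictor maps the current vector of everyone's predictions to its
# own prediction; a fixed point is a vector consistent with all of them.
def pred_a(preds):
    return 0.7                       # ignores the others

def pred_b(preds):
    return preds[0]                  # copies predictor A verbatim

def pred_c(preds):
    return 0.5 * preds[0] + 0.2      # copies A but applies a correction

predictors = [pred_a, pred_b, pred_c]

preds = [0.0, 0.0, 0.0]
for _ in range(50):                  # naive fixed-point iteration
    preds = [p(preds) for p in predictors]

print(preds)
```

Here the iteration settles quickly because no predictor depends on its own output; with genuine mutual dependence, one would need an actual fixed-point argument.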
Why not say “If X is a model using a bad decision theory, there is a closely related model X’ that uses a better decision theory and makes better predictions. So once we have some examples that distinguish the two cases, we will use X’ rather than X.”
Sometimes this kind of argument doesn’t work and you can get tighter guarantees by considering the space of modifications (by coincidence this exact situation arises here), but I don’t see why this case in particular would bring up that issue.
Suppose there are N binary dimensions that predictors can vary on. Then we’d need 2^N predictors to cover every possibility. On the other hand, we would only need to consider N possible modifications to a predictor. Of course, if the dimensions factor that nicely, then you can probably make enough assumptions about the hypothesis class that you can learn from the 2^N experts efficiently.
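To make the counting concrete (the setup is invented for illustration): with N binary dimensions there are 2^N candidate predictors, yet a standard multiplicative-weights learner over all of them concentrates on the right one once rounds that distinguish each dimension show up.

```python
import itertools
import math
import random

random.seed(0)
N = 4
truth = (1, 0, 1, 1)   # the "right" setting of each binary dimension

# One expert per corner of {0,1}^N: 2**N experts total.
experts = list(itertools.product([0, 1], repeat=N))

def loss(expert, dim):
    # An expert errs on a round probing dimension `dim` iff it
    # disagrees with the truth there.
    return 0.0 if expert[dim] == truth[dim] else 1.0

# Multiplicative weights: downweight each expert by exp(-eta * loss).
weights = {e: 1.0 for e in experts}
eta = 0.5
for _ in range(200):
    dim = random.randrange(N)        # each round probes one dimension
    for e in experts:
        weights[e] *= math.exp(-eta * loss(e, dim))

best = max(weights, key=weights.get)
print(best == truth)
```

The alternative guarantee would consider only the N single-dimension modifications of one predictor, which is exponentially cheaper but needs the dimensions to factor this nicely.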
Overall it seems nicer to have a guarantee of the form “if there is a predictable bias in the predictions, then the system will correct this bias” rather than “if there is a strictly better predictor than a bad predictor, then the system will listen to the good predictor”, since it allows capabilities to be distributed among predictors instead of needing to be concentrated in a single predictor. But maybe things work anyway for the reason you gave.
(The discussion seems to apply without modification to any predictor.)
It seems like “gets the wrong decision theory” is a really mild failure mode. If you can’t cope with that, there is no way you are going to cope with actually malignant failures.
Maybe the designer wasn’t counting on dealing with malignant failures at all, and this is an extra reminder that there can be subtle errors that don’t manifest most of the time. But I don’t think it’s much of an argument for understanding philosophy in particular.
I agree with this.