Thanks Jessica—sorry I misunderstood about hijacking. A couple of questions:
Is there a difference between “safe” and “accurate” predictors? I’m now thinking that you’re worried about NTMs basically making inaccurate predictions, and that accurate predictors of planning will require us to understand planning.
My feeling is that our current understanding of planning—if I run this computation, I will get the result, and if I run it again, I’ll get the same one—is sufficient for harder prediction tasks. Are there particular aspects of planning that we don’t yet understand well but that you expect to be important for planning computations during prediction?
A very accurate predictor will be safe. A predictor that is somewhat accurate but not very accurate could be unsafe. So yes, I’m concerned that with a realistic amount of computing resources, NTMs might make dangerous partially-accurate predictions, even though they would make safe accurate predictions with a very large amount of computing resources. This seems likely to hold if the NTM is predicting the human’s actions by trying to infer the human’s goal and then outputting a plan towards this goal, though perhaps there are other strategies for efficiently predicting a human. (I think some of the things I said previously were confusing—I said that it seems likely that an NTM can learn to plan well unsafely, which seems confusing since it can only be unsafe by making bad predictions. As an extreme example, perhaps the NTM essentially implements a consequentialist utility maximizer that decides what predictions to output; these predictions will be correct sometimes and incorrect whenever it is in the consequentialist utility maximizer’s interest.)
It seems like our current understanding of planning is already running into bottlenecks—see, e.g., the people working on attention for neural networks. If the NTM is predicting a human making plans by inferring the human’s goal and planning towards this goal, then there needs to be some story for what decision theory and theory of logical uncertainty (for example) it’s using in making these plans. For it to use the right decision theory, it must have that decision theory in its hypothesis space somewhere. In situations where the problem of planning out what computations to do to predict a human is as complicated as the human’s actual planning, and the human’s planning involves complex decision theory (e.g. the human is writing a paper on decision theory), this might be a problem. So you might need to understand some amount of decision theory / logical uncertainty to make this predictor.
(note that I’m not completely sold on this argument; I’m attempting to steelman it)
Thanks Jessica, I think we’re on similar pages—I’m also interested in how to ensure that predictions of humans are accurate and non-adversarial, and I think there are probably a lot of interesting problems there.
If the NTMs get to look at the predictions of the other NTMs when making their own predictions (there’s probably a fixed-point way to do this), then maybe there’s one out there that copies one of the versions of 3 but makes adjustments for 3’s bad decision theory.
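To make the fixed-point idea a bit more concrete, here is a minimal sketch (the predictor interface and all names are hypothetical assumptions for illustration, not anything proposed above): each predictor conditions on the current vector of everyone’s predictions, and we iterate until that vector stops changing. A fixed point need not exist or be unique in general.

```python
import numpy as np

def fixed_point_predictions(predictors, x, iters=100, tol=1e-6):
    """Iterate predictors that condition on each other's outputs until the
    joint prediction vector stops changing (an approximate fixed point).

    Each predictor is a function f(x, others) -> probability in [0, 1],
    where `others` is the current vector of all predictors' outputs.
    """
    preds = np.full(len(predictors), 0.5)  # neutral starting point
    for _ in range(iters):
        new_preds = np.array([f(x, preds) for f in predictors])
        if np.max(np.abs(new_preds - preds)) < tol:
            return new_preds
        preds = new_preds
    return preds  # may not have converged; a fixed point need not exist

# Hypothetical example: a predictor that copies predictor 3's output but
# shifts it to compensate for a suspected bias in 3's decision theory.
def corrected_copy_of_3(x, others):
    return float(np.clip(others[3] + 0.1, 0.0, 1.0))
```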
Why not say “If X is a model using a bad decision theory, there is a closely related model X’ that uses a better decision theory and makes better predictions. So once we have some examples that distinguish the two cases, we will use X’ rather than X.”
Sometimes this kind of argument doesn’t work and you can get tighter guarantees by considering the space of modifications (by coincidence this exact situation arises here), but I don’t see why this case in particular would bring up that issue.
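As a toy illustration of the “use X′ once some examples distinguish it from X” argument above, here is a minimal multiplicative-weights sketch (the experts, losses, and learning rate are all made up for the example): the two experts agree on ordinary data, and once a few distinguishing examples arrive, nearly all of the weight shifts to the corrected variant.

```python
import numpy as np

def update_weights(weights, losses, eta=1.0):
    """One multiplicative-weights step: experts with higher loss on the
    latest example lose weight exponentially fast."""
    weights = weights * np.exp(-eta * losses)
    return weights / weights.sum()

# Two experts: X (bad decision theory) and X' (a closely related corrected variant).
weights = np.array([0.5, 0.5])
ordinary = np.array([0.1, 0.1])        # identical losses: weights are unchanged
distinguishing = np.array([1.0, 0.1])  # X errs, X' does not

for _ in range(20):
    weights = update_weights(weights, ordinary)
for _ in range(5):
    weights = update_weights(weights, distinguishing)

print(weights)  # roughly [0.01, 0.99]: almost all weight on X'
```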
Suppose there are N binary dimensions that predictors can vary on. Then we’d need 2^N predictors to cover every possibility. On the other hand, we would only need to consider N possible modifications to a predictor. Of course, if the dimensions factor that nicely, then you can probably make enough assumptions about the hypothesis class that you can learn from the 2^N experts efficiently.
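A quick back-of-the-envelope version of this counting (the value of N is an arbitrary illustrative choice):

```python
from itertools import product

N = 20  # hypothetical number of binary design dimensions

# One predictor per full assignment of the N dimensions:
full_coverage = 2 ** N            # 1,048,576 predictors

# Versus: from a single base predictor, flip one dimension at a time:
single_modifications = N          # 20 modified predictors to check

# Explicit enumeration of every variant is only feasible for small N:
variants = list(product([0, 1], repeat=3))  # 2**3 = 8 variants
```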
Overall it seems nicer to have a guarantee of the form “if there is a predictable bias in the predictions, then the system will correct this bias” rather than “if there is a strictly better predictor than a bad predictor, then the system will listen to the good predictor”, since it allows capabilities to be distributed among predictors instead of needing to be concentrated in a single predictor. But maybe things work anyway for the reason you gave.
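A rough sketch of what the first kind of guarantee might look like (the interface and the running-mean correction are assumptions for illustration, not a mechanism anyone proposed above): the system averages several predictors and separately cancels whatever bias remains predictable from past residuals, so the corrective capability does not have to live inside any single predictor.

```python
import numpy as np

def predictable_bias(past_predictions, past_outcomes, window=50):
    """Estimate any persistent bias in recent predictions: the running mean
    of the residual (outcome minus prediction) over the last `window` rounds."""
    residuals = np.asarray(past_outcomes[-window:]) - np.asarray(past_predictions[-window:])
    return residuals.mean() if len(residuals) else 0.0

def combined_prediction(expert_preds, weights, past_predictions, past_outcomes):
    """Weighted mixture of several predictors, plus a correction for whatever
    bias the mixture itself has recently exhibited."""
    base = float(np.dot(weights, expert_preds))
    return base + predictable_bias(past_predictions, past_outcomes)
```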
(The discussion seems to apply without modification to any predictor.)
It seems like “gets the wrong decision theory” is a really mild failure mode. If you can’t cope with that, there is no way you are going to cope with actually malignant failures.
Maybe the designer wasn’t counting on dealing with malignant failures at all, and this is an extra reminder that there can be subtle errors that don’t manifest most of the time. But I don’t think it’s much of an argument for understanding philosophy in particular.
I agree with this.