A few things that I found helpful in reading this post:
I mentally replaced D with “the past” and D’ with “the future”.
I mentally replaced z with “a guide to reasoning about the future”.
This gives us a summary something like:
We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the past, plus how well humans expect it to generalise to the future, plus immense amounts of interpretability work. (Note that this summary was originally incorrect, and has been modified in response to Lanrian’s corrections below.)
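As a check on my own understanding, here is a minimal sketch of the setup as summarised above; every name in it (human_prior, human_predict_with, and so on) is a stand-in I made up rather than anything from the post:

```python
import math

# Hypothetical sketch of the training setup summarised above.
# All of the function names are made-up stand-ins, not from the post.

def score_z(z, past_examples, human_prior, human_predict_with):
    """Score a candidate z (a "guide to reasoning").

    human_prior(z)           -> log of how plausible z sounds to (amplified) humans a priori
    human_predict_with(z, x) -> dict of answer -> probability, as guessed by a human who has read z
    past_examples            -> list of (x, y) pairs where the true answer y is known
    """
    log_prior = human_prior(z)
    log_likelihood = sum(math.log(human_predict_with(z, x)[y]) for x, y in past_examples)
    return log_prior + log_likelihood

# Step 1: search for z_best = argmax_z score_z(z, ...), in practice by gradient
#         descent / amplified-human feedback rather than brute force.
# Step 2: to "predict the future", train a second model to imitate
#         human_predict_with(z_best, x) and run it on future inputs x.
```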
Some concerns that arise from my understanding of this proposal:
It seems like the only thing stopping z from primarily containing object-level knowledge about the world is the human prior about the unlikelihood of object-level knowledge. But humans are really bad at assigning priors even to relatively simple statements—this is the main reason that we need science.
z will consist of a large number of claims, but I have no idea how to assign a prior to the conjunction of many big claims about the world, even in theory. That prior can’t be calculated recursively, because there may be arbitrarily-complicated interactions between different components of z. (I try to make this concrete in a short note after this list of concerns.)
Consider the following proposal: “train an oracle to predict the future, along with an explanation of its reasoning. Reward it for predicting correctly, and penalise it for explanations that sound fishy”. Is there an important difference between this and imitative generalisation?
An agent can “generalise badly” because it’s not very robust, or because it’s actively pursuing goals that are misaligned with those of humans. It doesn’t seem like this proposal distinguishes between these types of failures. Is this distinction important in motivating the proposal?
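To make the worry about conjunctions concrete (the notation here is mine, not the post’s): if z is a conjunction of claims z_1, …, z_n, the only exact way to build the prior up piece by piece is the chain rule,

```latex
P(z) = P(z_1 \wedge \dots \wedge z_n) = \prod_{i=1}^{n} P(z_i \mid z_1, \dots, z_{i-1})
```

and the conditional factors are exactly where the arbitrarily-complicated interactions between components live, so evaluating each claim in isolation and multiplying the individual P(z_i) together can be badly wrong.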
We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the future
I don’t think this is right. I’ve put my proposed modifications in italics:
We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the past [we don’t have ground-truth for the future, so we can’t test how well humans can reason about it] and how well humans think it would generalise to the future. Then, we train a separate network to predict what humans with access to the previous network would predict about the future.
(It might be a good idea to share some parameters between the second and first network.)

Ooops, yes, this seems correct. I’ll edit mine accordingly.
Consider the following proposal: “train an oracle to predict the future, along with an explanation of its reasoning. Reward it for predicting correctly, and penalise it for explanations that sound fishy”. Is there an important difference between this and imitative generalisation?
As I understand it, there are two separate oracles. Neither oracle is rewarded for predicting the truth. One oracle is rewarded for coming up with good explanations. The other oracle is rewarded for correctly predicting the human’s guess, not the truth.
How do we predict the future with these two oracles? First, we search for the best explanation of the past. The best explanation of the past is (1) a good explanation, and (2) one such that, when a human guesses using it, they guess correctly. Then we use the human-guess oracle to predict what the human would guess about the future, given the best explanation of the past.
Let’s say we are predicting the winner of a war given the number of soldiers on each side. In the past, 50 soldiers won in 50 vs 5, and 150 won in 150 vs 15. In the future, there will be 50 vs 10.
Three explanations of the past are suggested: (1) the side with more soldiers wins, otherwise the outcome is random; (2) a side with an even number of soldiers beats a side with an odd number, otherwise random; (3) if one side is exactly ten times as numerous as the other, it wins, otherwise random. All three explanations score perfectly against the past. Explanation 1 predicts that the side with 50 soldiers wins in the future; 2 and 3 predict a random outcome. IG prefers 1, because explanations 2 and 3 are crazy, even though they are predictive of the past.
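To check that I’m applying the procedure correctly, here is a toy sketch of this example; the numerical prior scores are made up and only encode “explanation 1 sounds sane to a human, 2 and 3 sound crazy”:

```python
# Toy sketch of the war example above. The prior scores are made up and only
# encode "explanation 1 sounds sane to a human, 2 and 3 sound crazy".

past = [((50, 5), 0), ((150, 15), 0)]   # ((side A, side B), index of the winning side)
future = (50, 10)

def more_soldiers(a, b):    # explanation 1: the larger side wins, otherwise random
    return 0 if a > b else 1 if b > a else None          # None = "random"

def even_beats_odd(a, b):   # explanation 2: an even-sized side beats an odd-sized one
    return 0 if a % 2 == 0 and b % 2 == 1 else 1 if b % 2 == 0 and a % 2 == 1 else None

def ten_times(a, b):        # explanation 3: a side exactly ten times as numerous wins
    return 0 if a == 10 * b else 1 if b == 10 * a else None

prior = {more_soldiers: 0.9, even_beats_odd: 0.01, ten_times: 0.05}   # made-up "human prior"

def past_accuracy(explanation):
    # 1 point for a correct call, 0.5 for "random"; all three explanations score 1.0
    # here because each is consistent with both past battles.
    return sum(1.0 if explanation(*sides) == winner else 0.5 if explanation(*sides) is None else 0.0
               for sides, winner in past) / len(past)

best = max(prior, key=lambda e: prior[e] * past_accuracy(e))
print(best.__name__, "predicts the winner of 50 vs 10:", best(*future))   # more_soldiers predicts side 0 (the 50)
```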
I think these are the important differences: (1) the oracle is not trained to predict the future; (2) the explanation must be useful to a human, because the oracle predicts the human’s use of the explanation and can’t use the explanation directly; (3) the predicting oracle does not generate the explanation itself and has no control over it.
It seems like the only thing stopping z from primarily containing object-level knowledge about the world is the human prior about the unlikelihood of object-level knowledge. But humans are really bad at assigning priors even to relatively simple statements—this is the main reason that we need science.
Agree that humans are not necessarily great at assigning priors. The main response to this is that we don’t have a way to get better priors than an amplified human’s best prior. If amplified humans think the NN prior is better than their prior, they can always just use this prior. So in theory this should be both strictly better than the alternative, and the best possible prior we can use.
Science seems like it’s about collecting more data and measuring the likelihood, not changing the prior. We still need to use our prior—there are infinitely many scientific theories that fit the data, but we prefer ones that are simple and elegant.
z will consist of a large number of claims, but I have no idea how to assign a prior to the conjunction of many big claims about the world, even in theory. That prior can’t be calculated recursively, because there may be arbitrarily-complicated interactions between different components of z.
One thing that helps a bit here is that we can use an amplified human. We also don’t need the human to calculate the prior directly, just to do things like assess whether some change makes the prior better or worse. But I’m not sure how much of a roadblock this is in practice, or what Paul thinks about this problem.
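For what it’s worth, here is the kind of loop I imagine when I hear “assess whether some change makes the prior better or worse”; every callable in it (propose_edit, human_prefers, past_log_likelihood) is a made-up stand-in:

```python
# Hypothetical sketch: rather than asking the (amplified) human for an absolute
# prior over z, only ask for comparative judgments and hill-climb.

def improve_z(z, steps, propose_edit, human_prefers, past_log_likelihood):
    """propose_edit(z)       -> a candidate modification of z
    human_prefers(a, b)      -> True if the human finds a at least as plausible as b
    past_log_likelihood(z)   -> how well humans using z predict the known past
    """
    for _ in range(steps):
        candidate = propose_edit(z)
        # Accept the edit only if it doesn't look less plausible to the human
        # and doesn't hurt predictions on the past.
        if human_prefers(candidate, z) and past_log_likelihood(candidate) >= past_log_likelihood(z):
            z = candidate
    return z
```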
Consider the following proposal: “train an oracle to predict the future, along with an explanation of its reasoning. Reward it for predicting correctly, and penalise it for explanations that sound fishy”. Is there an important difference between this and imitative generalisation?
Yeah, the important difference is that in this case there’s nothing that constrains the explanations to be the same as the actual reasoning the oracle is using, so the explanations you’re getting are not necessarily predictive of the kind of generalisation that will happen. In IG it’s important that the quality of z is measured by having humans use it to make predictions.
An agent can “generalise badly” because it’s not very robust, or because it’s actively pursuing goals that are misaligned with those of humans. It doesn’t seem like this proposal distinguishes between these types of failures. Is this distinction important in motivating the proposal?
I’m not sure exactly what you’re asking. I think the proposal is motivated by something like: having the task be IID/being able to check arbitrary outputs from our model to make sure it’s generalising correctly buys us a lot of safety properties. If we have this guarantee, we only have to worry about rare or probabilistic defection, not that the model might be giving us misleading answers for every question we can’t check.