Cool, thanks for the clarifications. To be clear, overall I’m much more sympathetic to the argument as I currently understand it than I was when I originally thought you were trying to draw a distinction between “new forms of reasoning honed by trial-and-error” in part 1 (which I interpreted as talking about systems lacking sufficiently good models of the world to find solutions in any other way than trial and error) and “systems that have a detailed understanding of the world” in part 2.
Let me try to sum up the disagreement. The key questions are:
1. What training data will we realistically be able to train our agents on?
2. What types of generalisation should we expect from that training data?
3. How well will we be able to tell that these agents are doing the wrong thing?
On 1: you think long-horizon real-world data will play a significant role in training, because we’ll need it to teach agents to do the most valuable tasks. This seems plausible to me; but I think that in order for this type of training to be useful, the agents will need to already have robust motivations (else they won’t be able to find rewards that are given over long time horizons). And I don’t think that this training will be extensive enough to reshape those motivations to a large degree (whereas I recall that in an earlier discussion on amplification, you argued that small amounts of training could potentially reshape motivations significantly). Our disagreement about question 1 affects questions 2 and 3, but it affects question 2 less than I previously thought, as I’ll discuss.
On 2: previously I thought you were arguing that we should expect very task-specific generalisations like being trained on “reduce crime” and learning “reduce reported crime”, which I was calling underspecified. However, based on your last comment it seems that you’re actually mainly talking about broader generalisations, like being trained on “follow instructions” and learning “do things that the instruction-giver would rate highly”. This seems more plausible, because it’s a generalisation that you can learn in many different types of training; and so our disagreement on 1 becomes less consequential.
I don’t have a strong opinion on the likelihood of this type of generalisation. I guess your argument is that, because we’re doing a lot of trial and error, we’ll keep iterating until we either get something aligned with our instructions, or something which optimises for high ratings directly. But it seems to me by default, during early training periods the AI won’t have much information about either the overseer’s knowledge (or the overseer’s existence), and may not even have the concept of rewards, making alignment with instructions much more natural. Above, you disagree; in either case my concern is that this underlying concept of “natural generalisation” is doing a lot of work, despite not having been explored in your original post (or anywhere else, to my knowledge). We could go back and forth about where the burden of proof is, but it seems more important to develop a better characterisation of natural generalisation; I might try to do this in a separate post.
On 3: it seems to me that the resources which we’ll put into evaluating a single deployment are several orders of magnitude higher than the resources we’ll put into evaluating each training data point—e.g. we’ll likely have whole academic disciplines containing thousands of people working full-time for many years on analysing the effects of the most powerful AIs’ behaviour.
You say that you expect people to work to design training procedures that get good performance on type 2 measures. I agree with this, but if you design an AI that gets good performance on type 2 measurements despite never being trained on them, then that rules out the most straightforward versions of the “do things that the instruction-giver would rate highly” motivation. And since the trial and error to find strategies which fool type 2 measurements will be carried out over years, the direct optimisation for fooling type 2 measurements will be weak.
I guess the earlier disagreement about question 1 is also relevant here. If you’re an AI trained primarily on data and feedback which are very different from real-world long-term evaluations, then there are very few motivations which lead you to do well on real-world long-term evaluations. “Follow instructions” is one of them; some version of “do things that the instruction-giver would rate highly” is another, but it would need to be quite a specific version. In other words, the greater the disparity between the training regime and the evaluation regime, the fewer ways there are for an AI’s motivations to score well on both, but also score badly on our idealised preferences.
In another comment, you give a bunch of ways in which models might generalise successfully to longer horizons, and then argue that “many of these would end up pursuing goals that are closely related to the goals they pursue over short horizons”. I agree with this, but note that “aligned goals” are also closely related to the goals pursued over short time horizons. So it comes back to whether motivations will generalise in a way which prioritises the “obedience” aspect or the “produces high scores” aspect of the short-term goals.
I agree that the core question is about how generalization occurs. My two stories involve two kinds of bad generalization, and I think there are also ways generalization could work that would lead to good behavior.
It is important to my intuition that not only can we never train for the “good” generalization, we can’t even evaluate techniques to figure out which ones generalize “well” (since both of the bad generalizations would lead to behavior that looks good over long horizons).
If there is a disagreement, it is probably that I assign a much higher probability to the kind of generalization in story 1. I’m not sure whether there’s actually a big quantitative disagreement, though, rather than a communication problem.
I also think it’s quite likely that the story in my post is unrealistic in a bunch of ways and I’m currently thinking more about what I think would actually happen.
Some more detailed responses that feel more in-the-weeds:
you think long-horizon real-world data will play a significant role in training, because we’ll need it to teach agents to do the most valuable tasks. This seems plausible to me; but I think that in order for this type of training to be useful, the agents will need to already have robust motivations (else they won’t be able to find rewards that are given over long time horizons)
I might not understand this point. For example, suppose I’m training a 1-day predictor to make good predictions over 10 or 100 days. I expect such predictors to initially fail over long horizons, but to potentially be greatly improved with moderate amounts of fine-tuning. It seems to me that if this model has “robust motivations” then they would most likely be to predict accurately, but I’m not sure why the model would necessarily have robust motivations.
I feel similarly about goals like “plan to get high reward (defined as signals on channel X, you can learn how the channel works).” But even if prediction were a special case, if you learn a model then you can use it for planning/RL in simulation.
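To make the predictor example concrete, here is a toy sketch of a 1-day predictor that is rolled out to a 10-day horizon and then fine-tuned on a small amount of long-horizon data. Everything in it is an assumption made for illustration: the linear dynamics, the values of `TRUE_A` and `a_hat`, and the helper names. It only shows that a slightly miscalibrated short-horizon model can compound its error over a long horizon and be corrected by modest fine-tuning; it says nothing about motivations.

```python
TRUE_A = 0.9   # assumed true one-day dynamics: x[t+1] = TRUE_A * x[t]
HORIZON = 10   # the long horizon, in days


def rollout_coefficient(a_one_day, horizon):
    """Predict `horizon` days ahead by applying the 1-day predictor repeatedly."""
    return a_one_day ** horizon


def mse(coef, data):
    """Mean squared error of the long-horizon prediction coef * x."""
    return sum((coef * x - y) ** 2 for x, y in data) / len(data)


def fine_tune(coef, data, lr=0.05, epochs=200):
    """Plain gradient descent on squared error over long-horizon examples."""
    for _ in range(epochs):
        grad = sum(2 * (coef * x - y) * x for x, y in data) / len(data)
        coef -= lr * grad
    return coef


# A slightly miscalibrated 1-day predictor (hypothetical learned value).
a_hat = 0.95

# A small long-horizon dataset: (state today, state HORIZON days later).
xs = [0.5, 1.0, 1.5, 2.0, -0.5, -1.0, -1.5, -2.0]
long_data = [(x, TRUE_A ** HORIZON * x) for x in xs]

initial = rollout_coefficient(a_hat, HORIZON)  # small 1-day error, compounded
tuned = fine_tune(initial, long_data)          # corrected on long-horizon data

error_before = mse(initial, long_data)
error_after = mse(tuned, long_data)
print(f"10-day MSE before fine-tuning: {error_before:.4f}, after: {error_after:.2e}")
```

The fine-tuning step starts from the rolled-out 1-day predictor rather than from scratch, which is the sense in which the short-horizon model is being "improved" rather than replaced.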
But it seems to me by default, during early training periods the AI won’t have much information about either the overseer’s knowledge (or the overseer’s existence), and may not even have the concept of rewards, making alignment with instructions much more natural.
It feels to me like our models are already getting to the point where they respond to quirks of the labeling or evaluation process, and are basically able to build simple models of the oversight process.
my concern is that this underlying concept of “natural generalisation” is doing a lot of work, despite not having been explored in your original post
Definitely, I think it’s critical to what happens and not really explored in the post (which is mostly intended to provide some color for what failure might look like).
That said, a major part of my view is that it’s pretty likely that we get either arbitrary motivations or reward-maximization (or something in between), and it’s not a big deal which one we get, since both seem bad and both seem to be averted in the same way.
I think the really key question is how likely it is that we get some kind of “intended” generalization like friendliness. I’m frequently on the opposite side of this disagreement, arguing that the probability that people will get some nice generalization if they really try is at least 25% or 50%, but I’m also happy being on the pessimistic side and saying that the probability we can get nice generalizations is at most 50% or 75%.
(or anywhere else, to my knowledge)
“Two kinds of generalization” is an old post on this question (though I wish it had used more tasteful examples).
“Turning reflection up to 11” touches on the issue as well, though coming from a very different place than you.
I think there are a bunch of Arbital posts where Eliezer tries to articulate some of his opinions on this but I don’t know pointers offhand. I think most of my sense is
I haven’t written that much about why I think generalizations like “just be helpful” aren’t that likely. I agree with the point that these issues are underexplored by people working on alignment, and even more underdiscussed, given how important they are.
There are some Google Doc comment threads with MIRI where I’ve written about why I think those are plausible (namely that it is plausible-but-challenging for breeding of animals, and that seems like one of our best anchors overall, suggesting that plausible-but-challenging is a good anchor). I think in those cases the key argument was about whether you need this to generalize far. Both MIRI and I think it’s implausible for a generalization to extend out to infinity rather than becoming distorted at some point along the way, but I am more optimistic about making a series of “short hops” where models generalize helpfully to being moderately smarter and then they can carry out the next step of training for you.
What does this actually mean, in terms of the details of how you’d train a model to do this?
Take a big language model like GPT-3, and then train it via RL on tasks where it gets given a language instruction from a human, and then it gets reward if the human thinks it’s done the task successfully.
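As a very rough sketch of the shape of that training loop (not a real implementation), the toy below replaces the language model with a softmax policy over three canned responses and replaces the human with a hard-coded rating function. The names `RESPONSES`, `human_thinks_success` and `train` are all hypothetical, and the update rule is plain REINFORCE, chosen only because it is the simplest policy-gradient method. Note that the reward is whatever the rater believes happened, which is exactly the gap the discussion above is about.

```python
import math
import random

# Hypothetical stand-in for a language model's possible outputs.
RESPONSES = ["ignore the instruction", "partially follow it", "follow it fully"]


def human_thinks_success(response):
    """Simulated rater: reward 1.0 if the human *thinks* the task was done.

    In a real setup this is a person's judgement, which may differ from
    whether the task was actually done well.
    """
    return 1.0 if response == "follow it fully" else 0.0


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, lr=0.1, seed=0):
    """REINFORCE on a softmax policy; returns the final response probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(RESPONSES)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(RESPONSES)), weights=probs)[0]
        reward = human_thinks_success(RESPONSES[action])
        # Gradient of log pi(action) w.r.t. logit i is 1[i == action] - probs[i].
        for i in range(len(logits)):
            indicator = 1.0 if i == action else 0.0
            logits[i] += lr * reward * (indicator - probs[i])
    return softmax(logits)


if __name__ == "__main__":
    for response, p in zip(RESPONSES, train()):
        print(f"{p:.3f}  {response}")
```

With a real model the policy would be the network's distribution over text and the reward would come from human ratings of each episode, but the structure (sample a response, get a rating, reinforce rated-successful behaviour) is the same.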
Makes sense, thanks!