To clarify your position: if I train a system that makes good predictions over 1 minute and 10 minutes and 100 minutes, is your position that there’s not much reason that this system would make a good prediction over 1000 minutes? Analogously, if I train a system by meta-learning to get high rewards over a wide range of simulated environments, is your position that there’s not much reason to think it will try to get high rewards when deployed in the real world?
In most of the cases you’ve discussed, trying to do tasks over much longer time horizons involves doing a very different task. Reducing reported crime over 10 minutes and reducing reported crime over 100 minutes have very little to do with reducing reported crime over a year or 10 years. The same is true for increasing my wealth, or increasing my knowledge (which over 10 minutes involves telling me things, but over a year might involve doing novel scientific research). I tend to be pretty optimistic about AI motivations generalising, but this type of generalisation seems far too underspecified. “Making predictions” is perhaps an exception, insofar as it’s a very natural concept, and also one which transfers very straightforwardly from simulations to reality. But it probably depends a lot on what type of predictions we’re talking about.
On meta-learning: it doesn’t seem realistic to think about an AI “trying to get high rewards” on tasks where the time horizon is measured in months or years. Instead it’ll try to achieve some generalisation of the goals it learned during training. But as I already argued, we’re not going to be able to train on single tasks which are similar enough to real-world long-term tasks that motivations will transfer directly in any recognisable way.
Insofar as ML researchers think about this, I think their most common position is something like “we’ll train an AI to follow a wide range of instructions, and then it’ll generalise to following new instructions over longer time horizons”. This makes a lot of sense to me, because I expect we’ll be able to provide enough datapoints (mainly simulated datapoints, plus language pre-training) to pin down the concept “follow instructions” reasonably well, whereas I don’t expect we can provide enough datapoints to pin down a motivation like “reduce reports of crime”. (Note that I also think that we’ll be able to provide enough datapoints to incentivise influence-seeking behaviour, so this isn’t a general argument against AI risk, but rather an argument against the particular type of task-specific generalisation you describe.)
In other words, we should expect generalisation to long-term tasks to occur via a general motivation to follow our instructions, rather than on a task-specific basis, because the latter is so underspecified. But generalisation via following instructions doesn’t have a strong bias towards easily-measurable goals.
I agree that it’s only us who are operating by trial and error—the system understands what it’s doing. I don’t think that undermines my argument. The point is that we pick the system, and so determine what it’s doing, by trial and error, because we have no understanding of what it’s doing (under the current paradigm). For some kinds of goals we may be able to pick systems that achieve those goals by trial and error (modulo empirical uncertainty about generalization, as discussed in the second part). For other goals there isn’t a plausible way to do that.
I think that throughout your post there’s an ambiguity between two types of measurement. Type one measurements are those which we can make easily enough to use as a feedback signal for training AIs. Type two measurements are those which we can make easily enough to tell us whether an AI we’ve deployed is doing a good job. In general many more things are type-two-measurable than type-one-measurable, because training feedback needs to be very cheap. So if we train an AI on type one measurements, we’ll usually be able to use type two measurements to evaluate whether it’s doing a good job post-deployment. And that AI won’t game those type two measurements even if it generalises its training signal to much longer time horizons, because it will never have been trained on type two measurements.
These seem like the key disagreements, so I’ll leave off here, to prevent the thread from branching too much. (Edited one out because I decided it was less important).
I feel like a very natural version of “follow instructions” is “Do things that the instruction-giver would rate highly.” (Which is the generalization I’m talking about.) I don’t think any of the arguments about “long horizon versions of tasks are different from short versions” tell us anything about which of these generalizations would be learnt (since they are both equally alien over long horizons).
Other versions like “Follow instructions (without regard to what the training process cares about)” seem quite likely to perform significantly worse on the training set. It’s also not clear to me that “follow the spirit of the instructions” is better-specified than “do things the instruction-giver would rate highly if we asked them”—informally I would say the latter is better-specified, and it seems like the argument here is resting crucially on some other sense of well-specification.
On meta-learning: it doesn’t seem realistic to think about an AI “trying to get high rewards” on tasks where the time horizon is measured in months or years.
I’ve trained in simulation on tasks where I face a wide variety of environments, each with a reward signal, and I am taught to learn the dynamics of the environment and the reward and then take actions that lead to a lot of reward. In simulation my tasks can have reasonably long time horizons (as measured by how long I think), though that depends on open questions about scaling behavior. I don’t agree with the claim that it’s unrealistic to imagine such models generalizing to reality by wanting something-like-reward.
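This isn’t the training setup under discussion, but a toy sketch of the shape of the claim: an agent whose learned procedure is “model this environment’s reward and act to get a lot of it” collects high reward even in environments it has never seen. All names and parameters here are illustrative stand-ins.

```python
import random

def run_episode(arm_means, steps=200, explore=0.1):
    """A simple learner dropped into a fresh environment (a bandit task):
    it estimates each action's reward online and mostly exploits the best."""
    counts = [0] * len(arm_means)
    totals = [0.0] * len(arm_means)
    reward = 0.0
    for _ in range(steps):
        if 0 in counts or random.random() < explore:
            arm = random.randrange(len(arm_means))
        else:
            arm = max(range(len(arm_means)),
                      key=lambda a: totals[a] / counts[a])
        r = random.gauss(arm_means[arm], 0.1)  # noisy reward signal
        counts[arm] += 1
        totals[arm] += r
        reward += r
    return reward / steps

random.seed(0)
# A freshly sampled reward structure each episode: the learned *procedure*
# (estimate rewards, then exploit) transfers, even though no single
# environment repeats.
scores = [run_episode([random.random() for _ in range(5)]) for _ in range(100)]
avg = sum(scores) / len(scores)
print(f"average per-step reward across novel environments: {avg:.2f}")
```

Random action selection would average about 0.5 here; the learner does substantially better across environments it has never encountered, which is the sense of “generalizing by wanting something-like-reward” at issue.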
In most of the cases you’ve discussed, trying to do tasks over much longer time horizons involves doing a very different task [...]
Trying to maximize wealth over 100 minutes is indeed very different from maximizing wealth over 1 year, and is also almost completely useless for basically the same reason (except in domains like day trading where mark to market acts as a strong value function).
My take is that people will be pushed to optimizing over longer horizons because these qualitatively different tasks over short horizons aren’t useful. The useful tasks in fact do involve preparing for the future and acquiring flexible influence, and so time horizons long enough to be useful will also be long enough to be relevantly similar to yet longer horizons.
Developers will be incentivized to find any way to get good behavior over long horizons, and it seems like we have many candidates that I regard as plausible and which all seem reasonably likely to lead to the kind of behavior I discuss. To me it feels like you are quite opinionated about how that generalization will work.
It seems like your take is “consequences over long enough horizons to be useful will be way too expensive to use for training,” which seems close to 50⁄50 to me.
I think that throughout your post there’s an ambiguity between two types of measurement. Type one measurements are those which we can make easily enough to use as a feedback signal for training AIs. Type two measurements are those which we can make easily enough to tell us whether an AI we’ve deployed is doing a good job. In general many more things are type-two-measurable than type-one-measurable, because training feedback needs to be very cheap.
I agree that this is a useful distinction and there will be some gap. I think that quantitatively I expect the gap to be much smaller than you do (e.g. getting 10k historical examples of 1-year plans seems quite realistic), and I expect people to work to design training procedures that get good performance on type two measures (roughly by definition), and I guess I’m significantly more agnostic about the likelihood of generalization from the longest type one measures to type two measures.
In other words, we should expect generalisation to long-term tasks to occur via a general motivation to follow our instructions, rather than on a task-specific basis, because the latter is so underspecified. But generalisation via following instructions doesn’t have a strong bias towards easily-measurable goals.
I’m imagining systems generalizing much more narrowly to the evaluation process used during training. This is still underspecified in some sense (are you trying to optimize the data that goes into SGD, or the data that goes into the dataset, or the data that goes into the sensors?) and in the limit that basically leads to influence-maximization and continuously fades into scenario 2. It’s also true that e.g. I may be able to confirm at test-time that there is no training process holding me accountable, and for some of these generalizations that would lead to a kind of existential crisis (where I’ve never encountered anything like this during training and it’s no longer clear what I’m even aiming at). It doesn’t feel like these are the kinds of underspecification you are referring to.
The type 1 vs. type 2 feedback distinction here seems really central. I’m interested if this seems like a fair characterization to both of you.
Type 1: Feedback which we use for training (via gradient descent)
Type 2: Feedback which we use to decide whether to deploy a trained agent.
(There’s a bit of gray between Type 1 and 2, since choosing whether to deploy is another form of selection, but I’m assuming we’re okay stating that gradient descent and model selection operate in qualitatively distinct regimes.)
The key disagreement is whether we expect type 1 feedback will be closer to type 2 feedback, or whether type 2 feedback will be closer to our true goals. If the former, our agents generalizing from type 1 to type 2 is relatively uninformative, and we still have Goodhart. In the latter case, the agent is only very weakly optimizing the type 2 feedback, and so we don’t need to worry much about Goodhart, and should expect type 2 feedback to continue to track our true goals well.
Main argument for type 1 ~ type 2: by definition, we design type 1 feedback (+associated learning algorithm) so that resulting agents perform well under type 2.
Main argument for type 1 !~ type 2: type 2 feedback can be something like 1000-10000x more expensive, since we only have to evaluate it once, rather than enough times to be useful for gradient descent.
I’d also be interested to discuss this disagreement in particular, since I could definitely go either way on it. (I plan to think about it more myself.)
Type 2: Feedback which we use to decide whether to deploy trained agent.
Let’s also include feedback which we can use to decide whether to stop deploying an agent; the central example in my head is an agent which has been deployed for some time before we discover that it’s doing bad things.
Relatedly, another argument for type 1 !~ type 2 which seems important to me: type 2 feedback can look at long time horizons, which I expect to be very useful. (Maybe you included this in the cost estimate, but idk how to translate between longer times and higher cost directly.)
“by definition, we design type 1 feedback (+associated learning algorithm) so that resulting agents perform well under type 2”
This doesn’t seem right. We design type 1 feedback so that resulting agents perform well on our true goals. This only matches up with type 2 feedback insofar as type 2 feedback is closely related to our true goals. But if that’s the case, then it would be strange for agents to learn the motivation of doing well on type 2 feedback without learning the motivation of doing well on our true goals.
In practice, I expect that misaligned agents which perform well on type 2 feedback will do so primarily by deception, for instrumental purposes. But it’s hard to picture agents which carry out this type of deception, but which don’t also decide to take over the world directly.
This doesn’t seem right. We design type 1 feedback so that resulting agents perform well on our true goals. This only matches up with type 2 feedback insofar as type 2 feedback is closely related to our true goals.
But type 2 feedback is (by definition) our best attempt to estimate how well the model is doing what we really care about. So in practice any results-based selection for “does what we care about” goes via selecting based on type 2 feedback. The difference only comes up when we reason mechanically about the behavior of our agents and how they are likely to generalize, but it’s not clear that’s an important part of the default plan (whereas I think we will clearly extensively leverage “try several strategies and see what works”).
But if that’s the case, then it would be strange for agents to learn the motivation of doing well on type 2 feedback without learning the motivation of doing well on our true goals.
“Do things that look to a human like you are achieving X” is closely related to X, but that doesn’t mean that learning to do the one implies that you will learn to do the other.
Maybe it’s helpful to imagine the world where type 1 feedback is “human evals after 1 week horizon”, type 2 feedback is “human evals after 1 year horizon,” and “what we really care about” is the “human evals after a 100 year horizon.” I think that’s much better than the actual situation, but even in that case I’d have a significant probability on getting systems that work on the 1 year horizon without working indefinitely (especially if we do selection for working on 2 years + are able to use a small amount of 2 year data). Do you feel pretty confident that something that generalizes from 1 week to 1 year will go indefinitely, or is your intuition predicated on something about the nature of “be helpful” and how that’s a natural motivation for a mind? (Or maybe that we will be able to identify some other similar “natural” motivation and design our training process to be aligned with that?) In the former case, it seems like we can have an empirical discussion about how generalization tends to work. In the latter case, it seems like we need to be getting into more details about why “be helpful” is a particularly natural (or else why we should be able to pick out something else like that). In the other cases I think I haven’t fully internalized your view.
I agree with the two questions you’ve identified as the core issues, although I’d slightly rephrase the former. It’s hard to think about something being aligned indefinitely. But it seems like, if we have primarily used a given system for carrying out individual tasks, it would take quite a lot of misalignment for it to carry out a systematic plan to deceive us. So I’d rephrase the first option you mention as “feeling pretty confident that something that generalises from 1 week to 1 year won’t become misaligned enough to cause disasters”. This point seems more important than the second point (the nature of “be helpful” and how that’s a natural motivation for a mind), but I’ll discuss both.
I think the main disagreement about the former is over the relative strength of “results-based selection” versus “intentional design”. When I said above that “we design type 1 feedback so that resulting agents perform well on our true goals”, I was primarily talking about “design” as us reasoning about our agents, and the training process they undergo, not the process of running them for a long time and picking the ones that do best. The latter is a very weak force! Almost all of the optimisation done by humans comes from intentional design plus rapid trial and error (on the timeframe of days or weeks). Very little of the optimisation comes from long-term trial and error (on the timeframe of a year) - by necessity, because it’s just so slow.
So, conditional on our agents generalising from “one week” to “one year”, we should expect that it’s because we somehow designed a training procedure that produces scalable alignment (or at least scalable non-misalignment), or because they’re deceptively aligned (as in your influence-seeking agents scenario), but not because long-term trial and error was responsible for steering us towards getting what we can measure.
Then there’s the second question, of whether “do things that look to a human like you’re achieving X” is a plausible generalisation. My intuitions on this question are very fuzzy, so I wouldn’t be surprised if they’re wrong. But, tentatively, here’s one argument. Consider a policy which receives instructions from a human, talks to the human to clarify the concepts involved, then gets rewarded and updated based on how well it carries out those instructions. From the policy’s perspective, the thing it interacts with, and which its actions are based on, is human instructions. Indeed, for most of the training process the policy plausibly won’t even have the concept of “reward” (in the same way that humans didn’t evolve a concept of fitness). But it will have this concept of human intentions, which is a very good proxy for reward. And so it seems much more natural for the policy’s goals to be formulated in terms of human intentions and desires, which are the observable quantities that it responds to; rather than human feedback, which is the unobservable quantity that it is optimised with respect to. (Rewards can be passed as observations to the policy, but I claim that it’s both safer and more useful if rewards are unobservable by the policy during training.)
This argument is weakened by the fact that, when there’s a conflict between them (e.g. in cases where it’s possible to fool the humans), agents aiming to “look like you’re doing X” will receive more reward. But during most of training the agent won’t be very good at fooling humans, and so I am optimistic that its core motivations will still be more like “do what the human says” than “look like you’re doing what the human says”.
I think that by default we will search for ways to build systems that do well on type 2 feedback. We do likely have a large dataset of type-2-bad behaviors from the real world, across many applications, and can make related data in simulation. It also seems quite plausible that this is a very tiny delta, if we are dealing with models that have already learned everything they would need to know about the world and this is just a matter of selecting a motivation, so that you can potentially get good type 2 behavior using a very small amount of data. Relatedly, it seems like really all you need is to train predictors for type 2 feedback (in order to use those predictions for training/planning), and that the relevant prediction problems often seem much easier than the actual sophisticated behaviors we are interested in.
Another important part of my view about type 1 ~ type 2 is that if gradient descent handles the scale from [1 second, 1 month] then it’s not actually very far to get from [1 month, 2 years]. It seems like we’ve already come 6 orders of magnitude and now we are talking about generalizing 1 more order of magnitude.
At a higher level, I feel like the important thing is that type 1 and type 2 feedback are going to be basically the same kind of thing but with a quantitative difference (or at least we can set up type 1 feedback so that this is true). On the other hand “what we really want” is a completely different thing (that we basically can’t even define cleanly). So prima facie it feels to me like if models generalize “well” then we can get them to generalize from type 1 to type 2, whereas no such thing is true for “what we really care about.”
Cool, thanks for the clarifications. To be clear, overall I’m much more sympathetic to the argument as I currently understand it, than when I originally thought you were trying to draw a distinction between “new forms of reasoning honed by trial-and-error” in part 1 (which I interpreted as talking about systems lacking sufficiently good models of the world to find solutions in any other way than trial and error) and “systems that have a detailed understanding of the world” in part 2.
Let me try to sum up the disagreement. The key questions are:
1. What training data will we realistically be able to train our agents on?
2. What types of generalisation should we expect from that training data?
3. How well will we be able to tell that these agents are doing the wrong thing?
On 1: you think long-horizon real-world data will play a significant role in training, because we’ll need it to teach agents to do the most valuable tasks. This seems plausible to me; but I think that in order for this type of training to be useful, the agents will need to already have robust motivations (else they won’t be able to find rewards that are given over long time horizons). And I don’t think that this training will be extensive enough to reshape those motivations to a large degree (whereas I recall that in an earlier discussion on amplification, you argued that small amounts of training could potentially reshape motivations significantly). Our disagreement about question 1 affects questions 2 and 3, but it affects question 2 less than I previously thought, as I’ll discuss.
On 2: previously I thought you were arguing that we should expect very task-specific generalisations like being trained on “reduce crime” and learning “reduce reported crime”, which I was calling underspecified. However, based on your last comment it seems that you’re actually mainly talking about broader generalisations, like being trained on “follow instructions” and learning “do things that the instruction-giver would rate highly”. This seems more plausible, because it’s a generalisation that you can learn in many different types of training; and so our disagreement on 1 becomes less consequential.
I don’t have a strong opinion on the likelihood of this type of generalisation. I guess your argument is that, because we’re doing a lot of trial and error, we’ll keep iterating until we either get something aligned with our instructions, or something which optimises for high ratings directly. But it seems to me that, by default, during early training periods the AI won’t have much information about either the overseer’s knowledge (or the overseer’s existence), and may not even have the concept of rewards, making alignment with instructions much more natural. Above, you disagree; in either case my concern is that this underlying concept of “natural generalisation” is doing a lot of work, despite not having been explored in your original post (or anywhere else, to my knowledge). We could go back and forth about where the burden of proof is, but it seems more important to develop a better characterisation of natural generalisation; I might try to do this in a separate post.
On 3: it seems to me that the resources which we’ll put into evaluating a single deployment are several orders of magnitude higher than the resources we’ll put into evaluating each training data point—e.g. we’ll likely have whole academic disciplines containing thousands of people working full-time for many years on analysing the effects of the most powerful AIs’ behaviour.
You say that you expect people to work to design training procedures that get good performance on type two measures. I agree with this—but if you design an AI that gets good performance on type 2 measurements despite never being trained on them, then that rules out the most straightforward versions of the “do things that the instruction-giver would rate highly” motivation. And since the trial and error to find strategies which fool type 2 measurements will be carried out over years, the direct optimisation for fooling type 2 measurements will be weak.
I guess the earlier disagreement about question 1 is also relevant here. If you’re an AI trained primarily on data and feedback which are very different from real-world long-term evaluations, then there are very few motivations which lead you to do well on real-world long-term evaluations. “Follow instructions” is one of them; some version of “do things that the instruction-giver would rate highly” is another, but it would need to be quite a specific version. In other words, the greater the disparity between the training regime and the evaluation regime, the fewer ways there are for an AI’s motivations to score well on both, but also score badly on our idealised preferences.
In another comment, you give a bunch of ways in which models might generalise successfully to longer horizons, and then argue that “many of these would end up pursuing goals that are closely related to the goals they pursue over short horizons”. I agree with this, but note that “aligned goals” are also closely related to the goals pursued over short time horizons. So it comes back to whether motivations will generalise in a way which prioritises the “obedience” aspect or the “produces high scores” aspect of the short-term goals.
I agree that the core question is about how generalization occurs. My two stories involve kinds of generalization, and I think there are also ways generalization could work that could lead to good behavior.
It is important to my intuition that not only can we never train for the “good” generalization, we can’t even evaluate techniques to figure out which ones generalize “well” (since both of the bad generalizations would lead to behavior that looks good over long horizons).
If there is a disagreement it is probably that I have a much higher probability of the kind of generalization in story 1. I’m not sure if there’s actually a big quantitative disagreement though rather than a communication problem.
I also think it’s quite likely that the story in my post is unrealistic in a bunch of ways and I’m currently thinking more about what I think would actually happen.
Some more detailed responses that feel more in-the-weeds:
you think long-horizon real-world data will play a significant role in training, because we’ll need it to teach agents to do the most valuable tasks. This seems plausible to me; but I think that in order for this type of training to be useful, the agents will need to already have robust motivations (else they won’t be able to find rewards that are given over long time horizons
I might not understand this point. For example, suppose I’m training a 1-day predictor to make good predictions over 10 or 100 days. I expect such predictors to initially fail over long horizons, but to potentially be greatly improved with moderate amounts of fine-tuning. It seems to me that if this model has “robust motivations” then they would most likely be to predict accurately, but I’m not sure about why the model necessarily has robust motivations.
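A minimal numerical sketch of the predictor point above (toy dynamics and parameters are my own stand-ins, not anything from the discussion): a model fit only on short-horizon data accumulates compounding error when rolled out over long horizons, but a very small amount of long-horizon data can correct it.

```python
import random

random.seed(1)
TRUE_A = 1.0  # true one-step dynamics: the quantity is actually stationary

# "Train" a one-parameter 1-step predictor on noisy short-horizon data
# (least-squares fit of a in x_next = a * x).
xs = [random.uniform(1.0, 2.0) for _ in range(20)]
pairs = [(x, TRUE_A * x + random.gauss(0.0, 0.05)) for x in xs]
a_hat = sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)

def predict(x0, a, steps):
    """Roll the 1-step model forward over a longer horizon."""
    x = x0
    for _ in range(steps):
        x = a * x
    return x

# The small 1-step error compounds as the horizon grows...
err = {h: abs(predict(1.0, a_hat, h) - TRUE_A ** h) for h in (1, 10, 100)}

# ...but here a single observed 100-step outcome suffices to re-fit the
# parameter — the "moderate amounts of fine-tuning" in the comment above.
observed_100 = predict(1.0, TRUE_A, 100)  # the true 100-step outcome
a_tuned = observed_100 ** (1 / 100)
err_tuned = abs(predict(1.0, a_tuned, 100) - TRUE_A ** 100)
print(err, err_tuned)
```

The point of the sketch is just the shape: the model “initially fails over long horizons” in a smooth, correctable way, rather than needing robust long-horizon motivations from the start.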
I feel similarly about goals like “plan to get high reward (defined as signals on channel X, you can learn how the channel works).” But even if prediction was a special case, if you learn a model then you can use it for planning/RL in simulation.
But it seems to me by default, during early training periods the AI won’t have much information about either the overseer’s knowledge (or the overseer’s existence), and may not even have the concept of rewards, making alignment with instructions much more natural.
It feels to me like our models are already getting to the point where they respond to quirks of the labeling or evaluation process, and are basically able to build simple models of the oversight process.
my concern is that this underlying concept of “natural generalisation” is doing a lot of work, despite not having been explored in your original post
Definitely, I think it’s critical to what happens and not really explored in the post (which is mostly intended to provide some color for what failure might look like).
That said, a major part of my view is that it’s pretty likely that we get either arbitrary motivations or reward-maximization (or something in between), and it’s not a big deal which since they both seem bad and seem averted in the same way.
I think the really key question is how likely it is that we get some kind of “intended” generalization like friendliness. I’m frequently on the opposite side of this disagreement, arguing that the probability that people will get some nice generalization if they really try is at least 25% or 50%, but I’m also happy being on the pessimistic side and saying that the probability we can get nice generalizations is at most 50% or 75%.
(or anywhere else, to my knowledge)
Two kinds of generalization is an old post on this question (though I wish it had used more tasteful examples).
Turning reflection up to 11 touches on the issue as well, though coming from a very different place than you.
I think there are a bunch of Arbital posts where Eliezer tries to articulate some of his opinions on this, but I don’t know pointers offhand.
I haven’t written that much about why I think generalizations like “just be helpful” aren’t that likely. I agree with the point that these issues are underexplored by people working on alignment, and even more underdiscussed, given how important they are.
There are some google doc comment threads with MIRI where I’ve written about why I think those are plausible (namely that it is plausible-but-challenging in the case of breeding animals, and that seems like one of our best anchors overall, suggesting that plausible-but-challenging is a good anchor). I think in those cases the key argument was about whether you need this to generalize far, since both me and MIRI think it’s a kind of implausible generalization to go out to infinity rather than becoming distorted at some point along the way, but I am more optimistic about making a series of “short hops” where models generalize helpfully to being moderately smarter and then they can carry out the next step of training for you.
Take a big language model like GPT-3, and then train it via RL on tasks where it gets given a language instruction from a human, and then it gets reward if the human thinks it’s done the task successfully.
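A skeleton of the loop being described, with stand-ins for every component: a one-parameter toy “policy” in place of a language model, a keyword check in place of the human rating, and a crude REINFORCE-style update in place of a real RL algorithm. Nothing here is a real implementation; it just makes the training signal concrete.

```python
import random

random.seed(2)

instruction = "write a poem about autumn"
topic = instruction.split()[-1]

def human_rating(response):
    # Stand-in for the human's judgment of whether the task was done.
    return 1.0 if topic in response else 0.0

p_on_task = 0.1  # probability the toy "model" follows the instruction
lr = 0.1
for _ in range(200):
    on_task = random.random() < p_on_task
    response = f"a poem about {topic}" if on_task else "something else"
    reward = human_rating(response)
    # Push the policy toward whatever behaviour the rater rewarded.
    direction = 1.0 if on_task else -1.0
    p_on_task = min(1.0, max(0.01, p_on_task + lr * reward * direction))

print(f"p(follow instruction) after training: {p_on_task:.2f}")
```

Note that the loop only ever sees the rating, not the instruction-giver’s intentions, which is exactly why “do things the rater would rate highly” and “follow instructions” are both consistent with the training signal.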
In most of the cases you’ve discussed, trying to do tasks over much longer time horizons involves doing a very different task. Reducing reported crime over 10 minutes and reducing reported crime over 100 minutes have very little to do with reducing reported crime over a year or 10 years. The same is true for increasing my wealth, or increasing my knowledge (which over 10 minutes involves telling me things, but over a year might involve doing novel scientific research). I tend to be pretty optimistic about AI motivations generalising, but this type of generalisation seems far too underspecified. “Making predictions” is perhaps an exception, insofar as it’s a very natural concept, and also one which transfers very straightforwardly from simulations to reality. But it probably depends a lot on what type of predictions we’re talking about.
On meta-learning: it doesn’t seem realistic to think about an AI “trying to get high rewards” on tasks where the time horizon is measured in months or years. Instead it’ll try to achieve some generalisation of the goals it learned during training. But as I already argued, we’re not going to be able to train on single tasks which are similar enough to real-world long-term tasks that motivations will transfer directly in any recognisable way.
Insofar as ML researchers think about this, I think their most common position is something like “we’ll train an AI to follow a wide range of instructions, and then it’ll generalise to following new instructions over longer time horizons”. This makes a lot of sense to me, because I expect we’ll be able to provide enough datapoints (mainly simulated datapoints, plus language pre-training) to pin down the concept “follow instructions” reasonably well, whereas I don’t expect we can provide enough datapoints to pin down a motivation like “reduce reports of crime”. (Note that I also think that we’ll be able to provide enough datapoints to incentivise influence-seeking behaviour, so this isn’t a general argument against AI risk, but rather an argument against the particular type of task-specific generalisation you describe.)
In other words, we should expect generalisation to long-term tasks to occur via a general motivation to follow our instructions, rather than on a task-specific basis, because the latter is so underspecified. But generalisation via following instructions doesn’t have a strong bias towards easily-measurable goals.
I think that throughout your post there’s an ambiguity between two types of measurement. Type one measurements are those which we can make easily enough to use as a feedback signal for training AIs. Type two measurements are those which we can make easily enough to tell us whether an AI we’ve deployed is doing a good job. In general many more things are type-two-measurable than type-one-measurable, because training feedback needs to be very cheap. So if we train an AI on type one measurements, we’ll usually be able to use type two measurements to evaluate whether it’s doing a good job post-deployment. And that AI won’t game those type two measurements even if it generalises its training signal to much longer time horizons, because it will never have been trained on type two measurements.
These seem like the key disagreements, so I’ll leave off here, to prevent the thread from branching too much. (Edited one out because I decided it was less important).
I feel like a very natural version of “follow instructions” is “Do things that the instruction-giver would rate highly.” (Which is the generalization I’m talking about.) I don’t think any of the arguments about “long horizon versions of tasks are different from short versions” tell us anything about which of these generalizations would be learnt (since they are both equally alien over long horizons).
Other versions like “Follow instructions (without regard to what the training process cares about)” seem quite likely to perform significantly worse on the training set. It’s also not clear to me that “follow the spirit of the instructions” is better-specified than “do things the instruction-giver would rate highly if we asked them”—informally I would say the latter is better-specified, and it seems like the argument here is resting crucially on some other sense of well-specification.
I’ve trained in simulation on tasks where I face a wide variety of environments, each with a reward signal, and I am taught to learn the dynamics of the environment and the reward and then take actions that lead to a lot of reward. In simulation my tasks can have reasonably long time horizons (as measured by how long I think), though that depends on open questions about scaling behavior. I don’t agree with the claim that it’s unrealistic to imagine such models generalizing to reality by wanting something-like-reward.
Trying to maximize wealth over 100 minutes is indeed very different from maximizing wealth over 1 year, and is also almost completely useless for basically the same reason (except in domains like day trading where mark to market acts as a strong value function).
My take is that people will be pushed to optimizing over longer horizons because these qualitatively different tasks over short horizons aren’t useful. The useful tasks in fact do involve preparing for the future and acquiring flexible influence, and so time horizons long enough to be useful will also be long enough to be relevantly similar to yet longer horizons.
Developers will be incentivized to find any way to get good behavior over long horizons, and it seems like we have many candidates that I regard as plausible and which all seem reasonably likely to lead to the kind of behavior I discuss. To me it feels like you are quite opinionated about how that generalization will work.
It seems like your take is “consequences over long enough horizons to be useful will be way too expensive to use for training,” which seems close to 50/50 to me.
I agree that this is a useful distinction and there will be some gap. I think that quantitatively I expect the gap to be much smaller than you do (e.g. getting 10k historical examples of 1-year plans seems quite realistic), and I expect people to work to design training procedures that get good performance on type two measures (roughly by definition), and I guess I’m significantly more agnostic about the likelihood of generalization from the longest type one measures to type two measures.
I’m imagining systems generalizing much more narrowly to the evaluation process used during training. This is still underspecified in some sense (are you trying to optimize the data that goes into SGD, or the data that goes into the dataset, or the data that goes into the sensors?) and in the limit that basically leads to influence-maximization and continuously fades into scenario 2. It’s also true that e.g. I may be able to confirm at test-time that there is no training process holding me accountable, and for some of these generalizations that would lead to a kind of existential crisis (where I’ve never encountered anything like this during training and it’s no longer clear what I’m even aiming at). It doesn’t feel like these are the kinds of underspecification you are referring to.
The type 1 vs. type 2 feedback distinction here seems really central. I’m interested if this seems like a fair characterization to both of you.
Type 1: Feedback which we use for training (via gradient descent)
Type 2: Feedback which we use to decide whether to deploy trained agent.
(There’s a bit of gray between Type 1 and 2, since choosing whether to deploy is another form of selection, but I’m assuming we’re okay stating that gradient descent and model selection operate in qualitatively distinct regimes.)
The key disagreement is whether we expect type 1 feedback will be closer to type 2 feedback, or whether type 2 feedback will be closer to our true goals. If the former, our agents generalizing from type 1 to type 2 is relatively uninformative, and we still have Goodhart. In the latter case, the agent is only very weakly optimizing the type 2 feedback, and so we don’t need to worry much about Goodhart, and should expect type 2 feedback to continue to track our true goals well.
Main argument for type 1 ~ type 2: by definition, we design type 1 feedback (+associated learning algorithm) so that resulting agents perform well under type 2
Main argument for type 1 !~ type 2: type 2 feedback can be something like 1000-10000x more expensive, since we only have to evaluate it once, rather than enough times to be useful for gradient descent
I’d also be interested to discuss this disagreement in particular, since I could definitely go either way on it. (I plan to think about it more myself.)
A couple of clarifications:
Let’s also include feedback which we can use to decide whether to stop deploying an agent; the central example in my head is an agent which has been deployed for some time before we discover that it’s doing bad things.
Relatedly, another argument for type 1 !~ type 2 which seems important to me: type 2 feedback can look at long time horizons, which I expect to be very useful. (Maybe you included this in the cost estimate, but idk how to translate between longer times and higher cost directly.)
This doesn’t seem right. We design type 1 feedback so that resulting agents perform well on our true goals. This only matches up with type 2 feedback insofar as type 2 feedback is closely related to our true goals. But if that’s the case, then it would be strange for agents to learn the motivation of doing well on type 2 feedback without learning the motivation of doing well on our true goals.
In practice, I expect that misaligned agents which perform well on type 2 feedback will do so primarily by deception, for instrumental purposes. But it’s hard to picture agents which carry out this type of deception, but which don’t also decide to take over the world directly.
But type 2 feedback is (by definition) our best attempt to estimate how well the model is doing what we really care about. So in practice any results-based selection for “does what we care about” goes via selecting based on type 2 feedback. The difference only comes up when we reason mechanically about the behavior of our agents and how they are likely to generalize, but it’s not clear that’s an important part of the default plan (whereas I think we will clearly extensively leverage “try several strategies and see what works”).
“Do things that look to a human like you are achieving X” is closely related to X, but that doesn’t mean that learning to do the one implies that you will learn to do the other.
Maybe it’s helpful to imagine the world where type 1 feedback is “human evals after 1 week horizon”, type 2 feedback is “human evals after 1 year horizon,” and “what we really care about” is the “human evals after a 100 year horizon.” I think that’s much better than the actual situation, but even in that case I’d have a significant probability on getting systems that work on the 1 year horizon without working indefinitely (especially if we do selection for working on 2 years + are able to use a small amount of 2 year data). Do you feel pretty confident that something that generalizes from 1 week to 1 year will go indefinitely, or is your intuition predicated on something about the nature of “be helpful” and how that’s a natural motivation for a mind? (Or maybe that we will be able to identify some other similar “natural” motivation and design our training process to be aligned with that?) In the former case, it seems like we can have an empirical discussion about how generalization tends to work. In the latter case, it seems like we need to be getting into more details about why “be helpful” is a particularly natural (or else why we should be able to pick out something else like that). In the other cases I think I haven’t fully internalized your view.
I agree with the two questions you’ve identified as the core issues, although I’d slightly rephrase the former. It’s hard to think about something being aligned indefinitely. But it seems like, if we have primarily used a given system for carrying out individual tasks, it would take quite a lot of misalignment for it to carry out a systematic plan to deceive us. So I’d rephrase the first option you mention as “feeling pretty confident that something that generalises from 1 week to 1 year won’t become misaligned enough to cause disasters”. This point seems more important than the second point (the nature of “be helpful” and how that’s a natural motivation for a mind), but I’ll discuss both.
I think the main disagreement about the former is over the relative strength of “results-based selection” versus “intentional design”. When I said above that “we design type 1 feedback so that resulting agents perform well on our true goals”, I was primarily talking about “design” as us reasoning about our agents, and the training process they undergo, not the process of running them for a long time and picking the ones that do best. The latter is a very weak force! Almost all of the optimisation done by humans comes from intentional design plus rapid trial and error (on the timeframe of days or weeks). Very little of the optimisation comes from long-term trial and error (on the timeframe of a year) - by necessity, because it’s just so slow.
So, conditional on our agents generalising from “one week” to “one year”, we should expect that it’s because we somehow designed a training procedure that produces scalable alignment (or at least scalable non-misalignment), or because they’re deceptively aligned (as in your influence-seeking agents scenario), but not because long-term trial and error was responsible for steering us towards getting what we can measure.
Then there’s the second question, of whether “do things that look to a human like you’re achieving X” is a plausible generalisation. My intuitions on this question are very fuzzy, so I wouldn’t be surprised if they’re wrong. But, tentatively, here’s one argument. Consider a policy which receives instructions from a human, talks to the human to clarify the concepts involved, then gets rewarded and updated based on how well it carries out those instructions. From the policy’s perspective, the thing it interacts with, and which its actions are based on, is human instructions. Indeed, for most of the training process the policy plausibly won’t even have the concept of “reward” (in the same way that humans didn’t evolve a concept of fitness). But it will have this concept of human intentions, which is a very good proxy for reward. And so it seems much more natural for the policy’s goals to be formulated in terms of human intentions and desires, which are the observable quantities that it responds to; rather than human feedback, which is the unobservable quantity that it is optimised with respect to. (Rewards can be passed as observations to the policy, but I claim that it’s both safer and more useful if rewards are unobservable by the policy during training.)
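To make the setup in the argument above concrete, here is a deliberately minimal toy sketch (not a claim about real training pipelines) of a loop in which the policy only ever observes instructions, while reward enters solely through the update rule. All names here (`Policy`, `overseer_reward`, the action set) are illustrative assumptions:

```python
import random

class Policy:
    """Toy tabular policy: maps an instruction to a distribution over actions."""
    def __init__(self):
        self.prefs = {}  # instruction -> {action: weight}

    def act(self, instruction, actions):
        # The observation is only the instruction; the reward channel is
        # deliberately not part of the policy's inputs.
        weights = self.prefs.setdefault(instruction, {a: 1.0 for a in actions})
        total = sum(weights.values())
        x = random.uniform(0, total)
        for a, w in weights.items():
            x -= w
            if x <= 0:
                return a
        return actions[-1]

    def update(self, instruction, action, reward):
        # Reward enters only through the update rule, never the observation.
        self.prefs[instruction][action] *= (1.0 + reward)

def overseer_reward(instruction, action):
    # Stand-in for human feedback: reward actions that match the instruction.
    return 1.0 if action == instruction else 0.0

random.seed(0)
policy = Policy()
actions = ["fetch", "clean", "report"]
for _ in range(200):
    instruction = random.choice(actions)
    chosen = policy.act(instruction, actions)
    policy.update(instruction, chosen, overseer_reward(instruction, chosen))
# After training, the weight on the instructed action dominates for each
# instruction, even though the policy never saw a reward as an observation.
```

The design choice being illustrated is that the reward signal never appears in `act`’s inputs, matching the claim that rewards can be kept unobservable by the policy during training.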
This argument is weakened by the fact that, when there’s a conflict between them (e.g. in cases where it’s possible to fool the humans), agents aiming to “look like you’re doing X” will receive more reward. But during most of training the agent won’t be very good at fooling humans, and so I am optimistic that its core motivations will still be more like “do what the human says” than “look like you’re doing what the human says”.
I think that by default we will search for ways to build systems that do well on type 2 feedback. We do likely have a large dataset of type-2-bad behaviors from the real world, across many applications, and can make related data in simulation. It also seems quite plausible that this is a very tiny delta, if we are dealing with models that have already learned everything they would need to know about the world and this is just a matter of selecting a motivation, so that you can potentially get good type 2 behavior using a very small amount of data. Relatedly, it seems like really all you need is to train predictors for type 2 feedback (in order to use those predictions for training/planning), and that the relevant prediction problems often seem much easier than the actual sophisticated behaviors we are interested in.
Another important part of my view about type 1 ~ type 2 is that if gradient descent handles the scale from [1 second, 1 month] then it’s not actually very far to get to [1 month, 2 years]. It seems like we’ve already come 6 orders of magnitude and now we are talking about generalizing 1 more order of magnitude.
At a higher level, I feel like the important thing is that type 1 and type 2 feedback are going to be basically the same kind of thing but with a quantitative difference (or at least we can set up type 1 feedback so that this is true). On the other hand “what we really want” is a completely different thing (that we basically can’t even define cleanly). So prima facie it feels to me like if models generalize “well” then we can get them to generalize from type 1 to type 2, whereas no such thing is true for “what we really care about.”
Cool, thanks for the clarifications. To be clear, overall I’m much more sympathetic to the argument as I currently understand it, than when I originally thought you were trying to draw a distinction between “new forms of reasoning honed by trial-and-error” in part 1 (which I interpreted as talking about systems lacking sufficiently good models of the world to find solutions in any other way than trial and error) and “systems that have a detailed understanding of the world” in part 2.
Let me try to sum up the disagreement. The key questions are:
What training data will we realistically be able to train our agents on?
What types of generalisation should we expect from that training data?
How well will we be able to tell that these agents are doing the wrong thing?
On 1: you think long-horizon real-world data will play a significant role in training, because we’ll need it to teach agents to do the most valuable tasks. This seems plausible to me; but I think that in order for this type of training to be useful, the agents will need to already have robust motivations (else they won’t be able to find rewards that are given over long time horizons). And I don’t think that this training will be extensive enough to reshape those motivations to a large degree (whereas I recall that in an earlier discussion on amplification, you argued that small amounts of training could potentially reshape motivations significantly). Our disagreement about question 1 affects questions 2 and 3, but it affects question 2 less than I previously thought, as I’ll discuss.
On 2: previously I thought you were arguing that we should expect very task-specific generalisations like being trained on “reduce crime” and learning “reduce reported crime”, which I was calling underspecified. However, based on your last comment it seems that you’re actually mainly talking about broader generalisations, like being trained on “follow instructions” and learning “do things that the instruction-giver would rate highly”. This seems more plausible, because it’s a generalisation that you can learn in many different types of training; and so our disagreement on 1 becomes less consequential.
I don’t have a strong opinion on the likelihood of this type of generalisation. I guess your argument is that, because we’re doing a lot of trial and error, we’ll keep iterating until we either get something aligned with our instructions, or something which optimises for high ratings directly. But it seems to me that by default, during early training periods the AI won’t have much information about the overseer’s knowledge (or even the overseer’s existence), and may not even have the concept of rewards, making alignment with instructions much more natural. Above, you disagree; in either case my concern is that this underlying concept of “natural generalisation” is doing a lot of work, despite not having been explored in your original post (or anywhere else, to my knowledge). We could go back and forth about where the burden of proof is, but it seems more important to develop a better characterisation of natural generalisation; I might try to do this in a separate post.
On 3: it seems to me that the resources which we’ll put into evaluating a single deployment are several orders of magnitude higher than the resources we’ll put into evaluating each training data point—e.g. we’ll likely have whole academic disciplines containing thousands of people working full-time for many years on analysing the effects of the most powerful AIs’ behaviour.
You say that you expect people to work to design training procedures that get good performance on type two measures. I agree with this—but if you design an AI that gets good performance on type 2 measurements despite never being trained on them, then that rules out the most straightforward versions of the “do things that the instruction-giver would rate highly” motivation. And since the trial and error to find strategies which fool type 2 measurements will be carried out over years, the direct optimisation for fooling type 2 measurements will be weak.
I guess the earlier disagreement about question 1 is also relevant here. If you’re an AI trained primarily on data and feedback which are very different from real-world long-term evaluations, then there are very few motivations which lead you to do well on real-world long-term evaluations. “Follow instructions” is one of them; some version of “do things that the instruction-giver would rate highly” is another, but it would need to be quite a specific version. In other words, the greater the disparity between the training regime and the evaluation regime, the fewer ways there are for an AI’s motivations to score well on both, but also score badly on our idealised preferences.
In another comment, you give a bunch of ways in which models might generalise successfully to longer horizons, and then argue that “many of these would end up pursuing goals that are closely related to the goals they pursue over short horizons”. I agree with this, but note that “aligned goals” are also closely related to the goals pursued over short time horizons. So it comes back to whether motivations will generalise in a way which prioritises the “obedience” aspect or the “produces high scores” aspect of the short-term goals.
I agree that the core question is about how generalization occurs. My two stories involve particular kinds of generalization, and I think there are also ways generalization could work that would lead to good behavior.
It is important to my intuition that not only can we never train for the “good” generalization, we can’t even evaluate techniques to figure out which ones generalize “well” (since both of the bad generalizations would lead to behavior that looks good over long horizons).
If there is a disagreement it is probably that I have a much higher probability of the kind of generalization in story 1. I’m not sure if there’s actually a big quantitative disagreement though rather than a communication problem.
I also think it’s quite likely that the story in my post is unrealistic in a bunch of ways and I’m currently thinking more about what I think would actually happen.
Some more detailed responses that feel more in-the-weeds:
I might not understand this point. For example, suppose I’m training a 1-day predictor to make good predictions over 10 or 100 days. I expect such predictors to initially fail over long horizons, but to potentially be greatly improved with moderate amounts of fine-tuning. It seems to me that if this model has “robust motivations” then they would most likely be to predict accurately, but I’m not sure about why the model necessarily has robust motivations.
I feel similarly about goals like “plan to get high reward (defined as signals on channel X, you can learn how the channel works).” But even if prediction was a special case, if you learn a model then you can use it for planning/RL in simulation.
It feels to me like our models are already getting to the point where they respond to quirks of the labeling or evaluation process, and are basically able to build simple models of the oversight process.
Definitely, I think it’s critical to what happens and not really explored in the post (which is mostly intended to provide some color for what failure might look like).
That said, a major part of my view is that it’s pretty likely that we get either arbitrary motivations or reward-maximization (or something in between), and it’s not a big deal which since they both seem bad and seem averted in the same way.
I think the really key question is how likely it is that we get some kind of “intended” generalization like friendliness. I’m frequently on the opposite side of this disagreement, arguing that the probability that people will get some nice generalization if they really try is at least 25% or 50%, but I’m also happy being on the pessimistic side and saying that the probability we can get nice generalizations is at most 50% or 75%.
Two kinds of generalization is an old post on this question (though I wish it had used more tasteful examples).
Turning reflection up to 11 touches on the issue as well, though coming from a very different place than you.
I think there are a bunch of Arbital posts where Eliezer tries to articulate some of his opinions on this but I don’t know pointers offhand. I think most of my sense is
I haven’t written that much about why I think generalizations like “just be helpful” aren’t that likely. I agree with the point that these issues are underexplored by people working on alignment, and even more underdiscussed, given how important they are.
There are some google doc comment threads with MIRI where I’ve written about why I think those are plausible (namely that it is plausible-but-challenging for breeding of animals, and that seems like one of our best anchors overall, suggesting that plausible-but-challenging is a good anchor). I think in those cases the key argument was about whether you need this to generalize far, since both me and MIRI think it’s a kind of implausible generalization to go out to infinity rather than becoming distorted at some point along the way, but I am more optimistic about making a series of “short hops” where models generalize helpfully to being moderately smarter and then they can carry out the next step of training for you.
What does this actually mean, in terms of the details of how you’d train a model to do this?
Take a big language model like GPT-3, and then train it via RL on tasks where it gets given a language instruction from a human, and then it gets reward if the human thinks it’s done the task successfully.
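As a hedged illustration of that recipe, the loop below replaces the language model with a tiny softmax policy over canned replies and the human with a keyword-matching stub, applying a REINFORCE-style update on the binary approval signal. Everything here (the task names, the rater stub, the fixed 0.5 baseline) is an assumption made for the sketch, not a description of any real system:

```python
import math
import random

def softmax(logit_dict):
    m = max(logit_dict.values())
    exps = {k: math.exp(v - m) for k, v in logit_dict.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

instructions = ["summarise", "translate"]
replies = ["summarise the text", "translate the text", "do something else"]
# One logit per (instruction, reply) pair stands in for the pretrained model.
logits = {i: {r: 0.0 for r in replies} for i in instructions}

def human_approves(instruction, reply):
    # Stub for the human rater: approve replies matching the instruction.
    return reply.startswith(instruction)

random.seed(1)
lr = 0.5
for _ in range(500):
    instruction = random.choice(instructions)
    pi = softmax(logits[instruction])
    # Sample a reply from the current policy.
    x, reply = random.random(), replies[-1]
    for r, p in pi.items():
        x -= p
        if x <= 0:
            reply = r
            break
    reward = 1.0 if human_approves(instruction, reply) else 0.0
    advantage = reward - 0.5  # fixed baseline, chosen arbitrarily here
    # REINFORCE for a softmax: d log pi(reply) / d logit(r) = 1[r == reply] - pi(r)
    for r in replies:
        grad = (1.0 if r == reply else 0.0) - pi[r]
        logits[instruction][r] += lr * advantage * grad
```

After enough rounds the policy concentrates on whichever replies the (stub) human approves of, which is exactly the ambiguity discussed upthread: the same loop is consistent with learning “follow instructions” and with learning “do what the rater approves of”.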
Makes sense, thanks!
I agree that this is probably the key point; my other comment (“I think this is the key point and it’s glossed over...”) feels very relevant to me.