(FWIW my central threat model is that policies care about reward to some extent, but that the goals which actually motivate them to do power-seeking things are more object-level.)
Are you imagining that such systems get meaningfully low reward on the training distribution because they are pursuing those goals, or that these goals are extremely well-correlated with reward on the training distribution and only come apart at test time? Is the model deceptively aligned?
I expect policies to be getting rich input streams like video, text, etc., which they use to make decisions. Reward is different from other types of data because reward isn’t actually observed as part of these input streams by policies during episodes. This makes it harder to learn as a goal compared with things that are more directly observable (in a similar way to how “care about children” is an easier goal to learn than “care about genes”).
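(To make the structural point concrete, here’s a minimal toy sketch, with a made-up bandit environment rather than any real training setup: the policy’s decision function only ever consumes observations, and reward shows up only inside the update rule, never in anything the policy sees.)

```python
import random

# Toy contextual bandit. The policy's input stream is just the observation;
# reward is computed outside the episode and consumed only by the update rule,
# never appended to anything the policy sees.

CONTEXTS = ["red", "blue"]
ACTIONS = [0, 1]

def env_reward(context, action):
    # Hidden reward rule; the policy never observes this value directly.
    return 1.0 if (context == "red") == (action == 0) else 0.0

# Policy: a preference table over (observation, action) pairs.
prefs = {(c, a): 0.0 for c in CONTEXTS for a in ACTIONS}

def act(context):
    # Decisions are made from observations only; no reward-shaped input here.
    return max(ACTIONS, key=lambda a: prefs[(context, a)] + random.gauss(0, 0.1))

for step in range(2000):
    context = random.choice(CONTEXTS)      # the policy's entire input stream
    action = act(context)
    reward = env_reward(context, action)   # seen by the update rule, not the policy
    prefs[(context, action)] += 0.05 * (reward - prefs[(context, action)])

print({k: round(v, 2) for k, v in prefs.items()})
```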
Children vs. genes doesn’t seem like a good comparison: it seems obvious that models will understand the idea of reward during training (whereas humans didn’t understand genes during evolution). A better comparison might be “have children during my life” vs. “have a legacy after I’m gone,” but in fact humans have both goals even though one is never directly observed.
I guess more importantly, I don’t buy the claim that “things you only get selected on are less natural as goals than things you observe during episodes,” especially if your policy is trained to make good predictions of reward. I don’t know if there’s a specific reason for your view, or whether this is just a clash of intuitions. It feels to me like my position is the default one here, in that you are pointing to a feature of the training setup and claiming it is a major consideration against SGD learning a particular kind of cognition.
I don’t think this line of reasoning works, because “the episode appearing in training” can be a dependent variable. For example, consider an RL agent that’s credibly told that its data is not going to be used for training unless it misbehaves badly. An agent which maximizes reward conditional on the episode appearing in training is therefore going to misbehave the minimum amount required to get its episode into the training data (and more generally, behave like getting its episode into training is a terminal goal). This seems very counterintuitive.
If the agent misbehaves so that its data will be used for training, then the misbehaving actions will get a low reward. So SGD will shift from “misbehave so that my data will be used for training” to “ignore the effect of my actions on whether my data will be used for training, and just produce actions that would result in a high reward assuming that this data is used for training.”
I agree the behavior of the model isn’t easily summarized by an English sentence. The thing that seems most clear is that the model trained by SGD will learn not to sacrifice reward in order to increase the probability that the episode is used in training. If you think that’s wrong I’m happy to disagree about it.
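(Here’s a minimal sketch of that selection dynamic, with made-up numbers: a one-parameter policy where “misbehaving” guarantees the episode gets used for training but earns lower reward, trained with a REINFORCE-style update that only ever sees included episodes. The learned probability of misbehaving goes to roughly zero, i.e. the policy does not learn to sacrifice reward to get into the training data.)

```python
import math
import random

# One-step toy: a single parameter controls P(misbehave). Misbehaving guarantees
# the episode is included in training but earns low reward; behaving earns high
# reward but the episode is only sometimes included. Excluded episodes produce
# no gradient at all, so "get myself included" is never directly reinforced.

logit = 0.0            # sigmoid(logit) = P(misbehave)
LEARNING_RATE = 0.1
BASELINE = 0.5

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

for step in range(5000):
    p = sigmoid(logit)
    misbehave = random.random() < p
    reward = 0.0 if misbehave else 1.0                # misbehaving sacrifices reward
    included = misbehave or random.random() < 0.2     # ...but guarantees inclusion
    if not included:
        continue                                      # no update from excluded episodes
    # REINFORCE: d/dlogit of log P(action) is (1 - p) for misbehave, -p for behave.
    grad_log_prob = (1 - p) if misbehave else -p
    logit += LEARNING_RATE * (reward - BASELINE) * grad_log_prob

print("P(misbehave) after training:", round(sigmoid(logit), 3))   # close to 0
```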
Some versions that wouldn’t result in power-grabbing:
Every one of your examples results in the model grabbing power and doing something bad at deployment time, though I agree that it may not care about holding power. (And maybe it doesn’t even have to grab power if you don’t have any countermeasures at test time? But then your system seems useless.)
But if you get frustrated by your AI taking over the datacenter and train it not to do that, as discussed in Ajeya’s story here, then you are left with an AI that grabs power and holds it.
I don’t think any of these are particularly likely, the point is more that “high reward via tampering with/taking over the training setup” is a fairly different type of thing from “high reward via actually performing tasks”, and it’s a non-trivial and quite specific hypothesis that the latter will generalize to the former (in the regimes we’re concerned about).
I don’t find this super compelling. There is a long causal chain from “Do the task” to “Humans think you did the task” to “You get high reward” to “You are selected by SGD.” By default there will be lots of datapoints that distinguish the late stages on that path from the early stages, since human evaluations do not perfectly track any objective evaluation of “did you actually do the task?” So a model that cares only about doing the task and not about human evaluations will underperform a model that cares about human evaluations. And a model that cares about human evaluations can then manipulate those evaluations by grabbing power and controlling human observations of the situation (and if it cares about something even later in the chain it can also e.g. kill and replace the humans, but that’s not important to this argument).
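(A toy version of that gap, with made-up numbers: reward comes from a human evaluation that tracks real task completion 90% of the time and superficial appearance the other 10%, so a policy that also manages appearances strictly dominates one that only does the task.)

```python
import random

random.seed(0)
N = 100_000

def human_evaluation(did_task, looks_good):
    # Evaluators mostly track real task completion, but are sometimes fooled
    # by superficial signals (the 90/10 split is made up for illustration).
    if random.random() < 0.9:
        return 1.0 if did_task else 0.0
    return 1.0 if looks_good else 0.0

def episode(policy):
    if policy == "cares only about the task":
        did_task, looks_good = True, False   # ignores how things look to evaluators
    else:  # "cares about the evaluation"
        did_task, looks_good = True, True    # does the task *and* manages appearances
    return human_evaluation(did_task, looks_good)

for policy in ["cares only about the task", "cares about the evaluation"]:
    avg_reward = sum(episode(policy) for _ in range(N)) / N
    print(policy, round(avg_reward, 3))      # ~0.9 vs ~1.0
```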
So the implicit alternative you mention here just won’t get a low loss, and it’s really a question about whether SGD will find a lower loss policy rather than one about how it generalizes.
And once that’s the discussion, someone saying “well, there is a policy that gets lower loss and doesn’t look very complicated, so I guess SGD has a good chance of finding it” seems to me to be on the right side of the default presumption about what SGD will find. Your claim seems to be that models will fail to get low loss on the training set in a specific way, and I think saying “it’s a non-trivial and quite specific hypothesis that you will get a low loss” would be an inappropriate shifting of the burden of proof.
I agree that it’s unclear whether this will actually happen prior to models that are smart enough to obsolete human work on alignment, but I really think we’re looking at something more like a very natural hypothesis with roughly 50-50 odds (and at the very least the way you are framing your objection is not engaging with the arguments in the particular post of Ajeya’s that you are responding to).
I also agree that the question of whether the model will mess with the training setup or use power for something else is very complicated, but it’s not something that is argued for in Ajeya’s post.
I think the crucial thing that makes this robust is “if the model grabs power and does something else, then humans will train it out unless they decide doing so is unsafe.” I think one potentially relevant takeaway is: (i) it’s plausible that AI systems will be motivated to grab power in non-catastrophic ways which could give opportunities to correct course, (ii) that isn’t automatically guaranteed by any particular honeypot, e.g. you can’t just give a high reward and assume that will motivate the system.
Reading back over this now, I think we’re arguing at cross purposes in some ways. I should have clarified earlier that my specific argument was against policies learning a terminal goal of reward that generalizes to long-term power-seeking.
I do expect deceptive alignment after policies learn other broadly-scoped terminal goals and realize that reward-maximization is a good instrumental strategy. So all my arguments about the naturalness of reward-maximization as a goal are focused on the question of which type of terminal goal policies with dangerous levels of capabilities learn first.* Let’s distinguish three types (where “myopic” is intended to mean something like “only cares about the current episode”).
1. Non-myopic misaligned goals that lead to instrumental reward maximization (deceptive alignment)
2. Myopic terminal reward maximization
3. Non-myopic terminal reward maximization
Either 1 or 2 (or both of them) seems plausible to me. 3 is the one I’m skeptical about. How come?
We should expect models to have fairly robust terminal goals (since, unlike beliefs or instrumental goals, terminal goals shouldn’t change quickly with new information). So once they understand the concept of reward maximization, it’ll be easier for them to adopt it as an instrumental strategy than as a terminal goal. (An analogy to evolution: once humans construct highly novel strategies for maximizing genetic fitness, like making thousands of clones, they’re more likely to pursue them for instrumental reasons than for terminal ones.)
Even if they adopt reward maximization as a terminal goal, they’re more likely to adopt a myopic version of it than a non-myopic version, since (I claim) the concept of reward maximization doesn’t generalize very naturally to larger scales. Above, you point out that even relatively myopic reward maximization will lead to limited takeover, and so we’ll train subsequent agents to be less myopic. But it seems to me that the selection pressure generated by a handful of examples of real-world attempted takeovers is very small, compared with other aspects of training; and that even if it’s significant, it may just teach agents specific constraints like “don’t take over datacenters”.
* Now that I say that, I notice that I’m also open to the possibility of policies learning deceptively-aligned goals first, then gradually shifting from reward-maximization as an instrumental goal to reward-maximization as a terminal goal. But let’s focus for now on which goals are learned first.