I don’t understand how this is anything other than a debate over words.
When an entity “cares about X like Gandhi cares about avoiding murder” or “cares about X like a pure egoist cares about his own pleasure”, I would call that “having X as a terminal goal”.[1] I’m happy to avoid this use of “goal” for the purpose of this conversation, but I don’t understand why you think it is a bad way to talk about things, or why it changes any of the argument about instrumental convergence.
The kind of entity I claim we should be afraid of is one that terminally wants X in the same way that Gandhi wants to avoid murder, or in the same way that a pure egoist wants to pursue his own pleasure at the expense of others, where X is something that is not compatible with human values.
Is the claim that there is a constraint on X, where X needs to be justified on moral realist grounds and is thus guaranteed not to conflict with human values? That looks implausible to me even granting moral realism: I think it is possible to be a pure egoist who terminally cares only about his own pleasure, in a way that makes him want to avoid modifications that would make him more altruistic, even though that caring is not justified on moral realist grounds. (More generally, I think the space of things you could care about in the same way that Gandhi cares about avoiding murder is very large, roughly as large as the space of instrumental goals.)
I don’t think it is obviously true that the space of things you can care about like Gandhi cares about avoiding murder is very large. I think arguments that oppose the orthogonality thesis are almost always about this kind of “caring about X” rather than about the more shallow kind of goals you are talking about. I don’t buy these arguments, but I think this is where the reasonable disagreement is, and redefining “terminal goal” to mean something weaker than “cares about X like Gandhi cares about avoiding murder” is not helpful.
It might be possible to create AIs that only care about X in the more shallow sense that you are describing in the paper, and I agree it would be safer, but I don’t think it will be easy to avoid creating agents that care about X in the same way that Gandhi wants to avoid murder. When you chat with current AIs, it looks to me like, to the extent they care about things, they care about them in the same way that Gandhi cares about avoiding murder (see e.g. the alignment faking paper). Any insight into how to build AIs that don’t care about anything in the same way that Gandhi cares about avoiding murder?
Maybe Gandhi cares about avoiding murder because of C, and cares about C because of B, … Eventually that bottoms out in axioms A (e.g. the golden rule), and I would call those his terminal goals. This does not matter for the purpose of the conversation, since Gandhi would probably also resist having his axioms modified. The case of the pure egoist is a bit clearer, since I think it is possible for a pure egoist to care about his own pleasure without further justification.
One thing I forgot to mention is that there are reasons to expect “we are likely to build smart consequentialists (that e.g. maximize sum_t V_t0(s_t))” to be true that are stronger than “look at current AIs” / “this is roughly aligned with commercial incentives”, such as the ones described by Evan Hubinger here.
TL;DR: alignment faking may be more sample efficient / easier to learn / more efficient at making loss go down than internalizing what humans want, so AIs that fake alignment may be selected for.
Carlsmith is a good review of the “Will AIs be smart consequentialists?” arguments up to late 2023. I think the conversation has progressed a little since then, but not massively.
“When an entity ‘cares about X like Gandhi cares about avoiding murder’ or ‘cares about X like a pure egoist cares about his own pleasure’, I would call that ‘having X as a terminal goal.’”
I think I would agree with this, unless you would also claim that “caring about X like a pure egoist cares about his own pleasure” is the only way of having a terminal goal. I would define a terminal goal more broadly as a non-instrumental goal: a goal pursued for its own sake, not for anything else. The way a pure egoist cares about his own pleasure might have particular features that some non-instrumental goals lack. I would still say those non-instrumental goals are terminal goals.
“Is the claim that there is a constraint on X, where X needs to be justified on moral realist grounds and is thus guaranteed not to conflict with human values?”
No, the paper does not assume moral realism. The point about moral realism in the paper is just this: an agent believing that bringing about X is wrong might have a reason not to change their goals in a way that will cause them to later do X, but the instrumental convergence thesis doesn’t assume moral realism, so arguments in favor of goal preservation can’t assume moral realism either.
I agree that even if moral realism is true, a pure egoist might want to stay a pure egoist.
“I don’t think it is obviously true that the space of things you can care about like Gandhi cares about avoiding murder is very large. I think arguments that oppose the orthogonality thesis are almost always about this kind of “caring about X” rather than about the more shallow kind of goals you are talking about. I don’t buy these arguments, but I think this is where the reasonable disagreement is, and redefining “terminal goal” to mean something weaker than “cares about X like Gandhi cares about avoiding murder” is not helpful.”
This part makes me think you are adopting a more restrictive notion of terminal goals than I would. What’s wrong with non-instrumental goals as the definition of a terminal goal? One reason for adopting the broader definition is that we don’t know what a superintelligence will be like, so we don’t want to assume it will care about things in a human-like way.
“Any insight into how to build AIs that don’t care about anything in the same way that Gandhi cares about avoiding murder?”
I haven’t thought about how to create a system that has what you call “shallow” goals. It just seems to me that non-instrumental goals can, in principle, take this “shallow” form, especially for agents who (by stipulation) might not have hedonic sensations.
I think we mostly agree then!
To make sure I understand your stance:
- You agree that some sorts of terminal goals (like Gandhi’s or the egoist’s) imply you should protect them (e.g. a preference to maximize E[sum_t V_t0(s_t)]).
- You agree that it’s plausible AIs might have this sort of self-preserving terminal goal, that these goals may be misaligned with human values, and that the arguments for instrumental self-preservation do apply to those AIs.
- You think that the strength of arguments for instrumental self-preservation is overrated because of the possibility of building AIs that don’t have self-preserving terminal goals.
- You’d prefer that people talk about “self-preserving terminal goals” or something more specific when making arguments about instrumental self-preservation, since not all forms of caring / having terminal goals imply self-preservation.
- You don’t have a specific proposal for building such AIs; this paper is mostly pointing at a part of the option space for building safer AI systems (which is related to proposals about building myopic AIs, though it’s not exactly the same thing).
I think we might still have a big disagreement about what sort of goals AIs are likely to have by default / if we try to avoid self-preserving terminal goals, but it’s mostly a quantitative empirical disagreement.
I’d note that I find all the versions of non-self-preserving terminal goals that I know how to formalize quite strange. For example, maximizing E[sum_t V_t(s_t)] does not result in self-preservation, but instead results in AIs that would like to self-modify immediately to have very easy-to-achieve goals (if that were possible). I believe people have also tried, and so far failed, to come up with satisfying formalisms describing AIs that are indifferent to having their goals modified / to being shut down.
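To make the V_t0 vs. V_t contrast concrete, here is a minimal toy sketch (mine, not from the paper or the comments above; the states, value functions, and the “accept modification” choice are all invented for illustration). Under max E[sum_t V_t0(s_t)] the agent scores futures with the values it holds at t0, so it prefers to refuse a value modification; under max E[sum_t V_t(s_t)] it scores each state with whatever values it holds at that time, so it prefers to swap its values for trivially satisfiable ones:

```python
# Toy sketch (hypothetical) contrasting two objectives from the discussion above:
#   (A) max E[sum_t V_t0(s_t)]  -- future states scored with the values held at t0
#   (B) max E[sum_t V_t(s_t)]   -- future states scored with the values held at time t

HORIZON = 10

def V_original(state):
    # The agent's initial values: only states where the task is done count.
    return 1.0 if state == "task_done" else 0.0

def V_easy(state):
    # A trivially satisfiable value function: every state is maximally good.
    return 1.0

def rollout(accept_modification):
    """Return a trajectory of (state, value_function_held_at_that_time) pairs."""
    trajectory = []
    current_V = V_original
    for t in range(HORIZON):
        if accept_modification:
            current_V = V_easy   # the agent's values get replaced
            state = "idle"       # it no longer pursues the original task
        else:
            state = "task_done" if t >= 3 else "working"
        trajectory.append((state, current_V))
    return trajectory

def objective_A(trajectory):
    # (A): always evaluate states with the original (time-t0) value function.
    return sum(V_original(state) for state, _ in trajectory)

def objective_B(trajectory):
    # (B): evaluate each state with whatever value function was held at that time.
    return sum(V_t(state) for state, V_t in trajectory)

for accept in (False, True):
    traj = rollout(accept)
    print(f"accept_modification={accept}: A={objective_A(traj):.1f}, B={objective_B(traj):.1f}")

# Output:
# accept_modification=False: A=7.0, B=7.0
# accept_modification=True: A=0.0, B=10.0
# Objective (A) prefers refusing the modification (goal preservation);
# objective (B) prefers accepting it (no incentive to preserve the original goal).
```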
I can see how my last comment may have made it seem like I thought some terminal goals should be protected just because they are terminal goals. However, when I said that Gandhi’s anti-murder goal and the egoist’s self-indulgence goal might have distinct features that not all terminal goals share, I only meant that we need a broad definition of terminal goals to make sure it captures all varieties of terminal goals. I didn’t mean to imply anything about the relevance of any potential differences between types of terminal goals. I would not assume that whatever distinguishes an egoist’s goal of self-indulgence from an AI’s goal of destroying buildings means the egoist should protect his terminal goal even if an AI might not need to. In fact, I doubt that’s the case.
Imagine there are two people. One is named Ally. She’s an altruist with a terminal goal of treating all interests exactly as her own. The other is named Egon. He is an egoist with a terminal goal of satisfying only his own interests. Also in the mix is an AI with a terminal goal to destroy buildings. Ally and Egon may have a different sort of relationship to their terminal goals than the AI has to its terminal goal, but if you said, “Ally and Egon should both protect their respective terminal goals,” I would need an explanation for this, and I doubt I would agree with whatever that explanation is.
Do you think that something being a terminal goal is in itself a reason to keep that goal? And/or do you think that keeping a goal is an aspect of what it means to have a goal in the first place?
I meant something like the latter (though something weaker; I didn’t want to claim all goals are like that), though I don’t claim this is a good choice of words. I agree it is natural to speak of a goal as referring only to its object (e.g. building destruction) and not to the additional meta-stuff (e.g. do you maximize E[sum_t V_t0(s_t)] or E[sum_t V_t(s_t)] or something else?). Maybe “terminal preferences” more naturally covers both the objects (what you call goals?) and the meta-stuff. (In the message above I was using “terminal goals” to refer to both the objects and the meta-stuff.)
I don’t know what to call the meta-stuff; it’s a bit sad that I don’t have a good word for it.
With this clarified wording, I think what I said above holds. For example, if I had to frame the risk from instrumental convergence with slightly more careful wording, I would say: “It’s plausible that AIs will have self-preserving terminal preferences (e.g. like max E[sum_t V_t0(s_t)]). It is likely we will build such AIs because this is roughly how humans are, we don’t have a good plan to build very useful AIs that are not like that, and current AIs seem to be a bit like that. And if this is true, and we get V wrong, a powerful AI would likely conclude its values are better pursued if it got more power, which means self-preservation and ultimately takeover.”
I don’t love calling them “self-preserving terminal preferences”, though, because it sounds tautological, when in fact self-preserving terminal preferences are natural and don’t need to involve any explicit reference to self-preservation in their definition. Maybe there is a better term for it.
“It’s plausible that AIs will have self-preserving preferences (e.g. like E[sum_t V_t0(s_t)]). It is likely we will build such AIs because this is roughly how humans are, we don’t have a good plan to build very useful AIs that are not like that, and current AIs seem to be a bit like that. And if this is true, and we get V even slightly wrong, a powerful AI might conclude its values are better pursued if it got more power, which means self-preservation and ultimately takeover.”
This strikes me as plausible. The paper has a narrow target: it argues against the instrumental convergence argument for goal preservation, i.e. that we shouldn’t expect an AI to preserve its goals on the basis of instrumental rationality alone. However, instrumental goal preservation could be false, and yet there could be other reasons to believe a superintelligence would preserve its goals. You’re making that kind of case here without appealing to instrumental convergence.
The drawback to this sort of argument is that it has a narrower scope and relies on more assumptions than Omohundro and Bostrom might prefer. The purpose of the instrumental convergence thesis is to tell us something about any likely superintelligence, even one that is radically different from anything we know, including AIs of today. The argument here is a strong one, but only if we think a superintelligence will not be a totally alien creature. Maybe it won’t be, but again, the instrumental convergence thesis doesn’t want to assume that.