Hey Rhys, thanks for posting this and trying to seriously engage with the community!
Unfortunately, either I have completely misunderstood your argument, or you have completely misunderstood what this community considers a goal. It seems you are considering only an extremely myopic AI. I don’t have any background in how a goal is defined in the philosophy you cite, and the concepts of ends-rationality and the wide-scope view don’t map onto my conception of a goal.
Let me try to formalize two things: a) your argument and your conception of a goal, and b) what I (and likely many people in this community) would consider a goal.
Your model:
We will use a world model and a utility function calculator and describe how an agent behaves based on these two.
World model W: (state, action) → state. Takes the current state and the agent’s action and produces the next world state.
Utility function calculator U: ([current] state, [next] state) → ℝ. Takes the current state and a next state, and calculates the utility of the next state. (The definition is complicated by the fact that we must account for the utility function changing. Thus we assume that the actual utility function is encoded in the current state and we must first extract it.)
The agent chooses its action as action = argmax_a U(state, W(state, a)), i.e., the action that maximizes the current utility function.
Example problem:
Suppose that the current utility function is the number of paperclips. Suppose actions a1 and a2 both produce 10 paperclips, but a2 also changes the utility function to the number of cakes. Both actions have the same utility (since they both produce the same number of paperclips in the next state), so the agent is indifferent between them and may take a2, thereby changing its goal.
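To make this concrete, here is a toy Python sketch of the model just described. Everything beyond what is stated above (the State class, the utility_name field, the exact action handling) is my own illustrative choice, not something from your paper:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class State:
    paperclips: int
    cakes: int
    utility_name: str  # which utility function is currently "installed"

def world_model(state: State, action: str) -> State:
    """W: (state, action) -> next state."""
    if action == "a1":  # make 10 paperclips
        return replace(state, paperclips=state.paperclips + 10)
    if action == "a2":  # make 10 paperclips and rewrite the utility function
        return replace(state, paperclips=state.paperclips + 10, utility_name="cakes")
    return state  # any other action changes nothing

def utility(current: State, nxt: State) -> float:
    """U: (current state, next state) -> R, using the utility function
    encoded in the *current* state."""
    return nxt.paperclips if current.utility_name == "paperclips" else nxt.cakes

start = State(paperclips=0, cakes=0, utility_name="paperclips")
print({a: utility(start, world_model(start, a)) for a in ("a1", "a2")})
# {'a1': 10, 'a2': 10} -- the one-step utilities tie, so the agent may
# take a2 and thereby change its goal.
```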
My model:
World model W: (state, action) → future. The future might be a sequence of all future states (or, even better, a distribution over all sequences of future states).
Utility function calculator U: ([current] state, future) → ℝ. Calculates the aggregate utility. We are not making any assumptions about how the aggregation is done.
Again, the agent chooses the action that maximizes the utility: action = argmax_a U(state, W(state, a)).
Fixed problem:
Again, let the utility be the total number of paperclips the agent makes over its lifespan, and take the same actions a1 and a2, both producing 10 paperclips, with a2 also changing the utility function to the number of cakes. Now the agent will not choose a2, because it would stop making paperclips in the future, and so a2 has lower utility.
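And a matching sketch of this second model, reusing State and world_model from the sketch above. The five-step horizon, the greedy follow-up policy, and the assumption that a cake maximizer simply idles (there is no cake-making action in this toy world) are all my illustrative choices:

```python
def rollout(state: State, first_action: str, horizon: int = 5) -> list:
    """W: (state, action) -> future, here a simple list of future states.
    After the first action, the agent keeps acting on whatever utility
    function is installed at that point in the trajectory."""
    future = []
    action = first_action
    for _ in range(horizon):
        state = world_model(state, action)
        future.append(state)
        # A paperclip maximizer keeps making paperclips; a cake maximizer
        # has no cake-making action in this toy world, so it idles.
        action = "a1" if state.utility_name == "paperclips" else "noop"
    return future

def lifetime_utility(current: State, future: list) -> float:
    """U: (current state, future) -> R: total paperclips (or cakes) made
    over the whole trajectory, judged by the *current* utility function."""
    last = future[-1]
    return last.paperclips if current.utility_name == "paperclips" else last.cakes

start = State(paperclips=0, cakes=0, utility_name="paperclips")
for a in ("a1", "a2"):
    print(a, lifetime_utility(start, rollout(start, a)))
# a1 50 -- the agent keeps making paperclips at every step
# a2 10 -- after the goal change it stops, so a2 now has lower utility
```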
Now a couple of caveats. First, even in my model, the agent might still want to change its utility function, perhaps because it might be turned off if it is found to be a paperclip maximizer. Second, my model is probably not perfect, and people who have studied this more closely might have objections; still, I think it is much closer to what people here would consider a goal. Third, very few people actually expect AI to work exactly like this. A goal will more likely be an emergent property of a complex system, like those in the current deep learning paradigm. But this formalism is a useful tool for reasoning about AI and intelligence.
Let me know if I misunderstood your argument, or if something is unclear in my explanation.
Petr,
Thanks for this response. Wide-scope and narrow-scope don’t determine how a goal is defined. These are different theories about what is rationally required of an agent who has a goal, with respect to their goal.
I would define a goal as some end that an agent intends to bring about. Is this inconsistent with how many people here would see a goal? Or potentially consistent but underspecified?
As I said, I’m not familiar with the philosophy, concepts, and definitions that you mention. To my best understanding, the concept of a goal in AI is derived from computer science and decision theory. I imagine people in the early 2000s thought that the goal/utility function would be formally specified and written into the system as code, and that the only possible way for the system to change its goal would be via self-modification.
Goals in people are something different: their goals are derived from their values.[1] I think you would say that people are ends-rational. In your line of thought, I think it would be more helpful to treat AI goals as more akin to people’s values: both are something fundamental and unchangeable. You might argue that people do sometimes change their values, but what I’m really aiming at are fundamental, hard-to-describe beliefs like “I want my values to be consistent.”
Overall, I’m actually not really sure how useful this line of investigation into goals is. For example, Dan Hendrycks has a paper on AI risk, where he doesn’t assume goal preservation; on the contrary, he talks about goal drift and how it can be dangerous (section 5.2). I suggest you check it out.
[1] I’m sure there is also a plethora of philosophical debate about what goals (in people) really are and how they are derived. Same for values.
The instrumental convergence thesis doesn’t depend on being applied to a digital agent. It’s supposed to apply to all rational agents. So, for this paper, there’s no reason to assume the goal takes the form of code written into a system.
There may be a way to lock an AI agent into a certain pattern of behaviour or a goal that it can’t revise, by writing code in the right way. But if an AI keeps its goal because it can’t change its goal, that has nothing to do with the instrumental convergence thesis.
If an agent can change its goal through self-modification, the instrumental convergence thesis could be relevant. In that case, I’d argue the agent does not behave in an instrumentally irrational way if it modifies itself to abandon its goal.
The paper doesn’t take a stance on whether humans are ends-rational. If we are, this could sometimes lead us to question our goals and abandon them. For instance, a human might have a terminal goal to have consistent values, then later decide consistency doesn’t matter in itself and abandon that terminal goal and adopt inconsistent values. The paper assumes a superintelligence won’t be ends-rational since the orthogonality thesis is typically paired with the instrumental convergence thesis, and since it’s trivial to show that ends-rationality could lead to goal change.
In this paper, a relevant difference between humans and an AI is that an AI might not have well-being. Imagine there is one human left on earth. The human has a goal to have consistent values, then abandons that goal and adopts inconsistent values. The paper’s argument is the human hasn’t behaved in an instrumentally irrational way. The same would be true for an AI that abandons a goal to have consistent values.
This potential well-being difference between humans and AIs (of humans having well-being and AIs lacking it) becomes relevant when goal preservation or goal abandonment affects well-being. If having consistent values improves the hypothetical human’s well-being, and the human abandons this goal of having consistent values and then adopts inconsistent values, the human’s well-being decreases. With respect to prudential value, the human has made a mistake.
If an AI does not have well-being, abandoning a goal can’t lead to a well-being-reducing mistake, so it lacks this separate reason to goal preserve. An AI might have well-being, in which case it might have well-being-based reasons to goal preserve or goal abandon. The argument in this paper assumes a hypothetical superintelligence without well-being, since the instrumental convergence thesis is meant to apply to those too.
It just occurred to me that since you implied that ends-rationality would make goal abandonment less likely, you might be using it in a different way than me, to refer to terminal goals. The paper assumes an AI will have terminal goals, just as humans do, and that these terminal goals are what can be abandoned. Ends-rationality provides one route to abandoning terminal goals. The paper’s argument is that goal abandonment is also possible without this route.
It seems that your paper is basically describing Theorem 14 of Self-Modification of Policy and Utility Function in Rational Agents by tom4everitt, DanielFilan, Mayank Daswani, and Marcus Hutter, though I haven’t read their paper in detail.
I wasn’t aware of this paper before you linked it here, but I looked at it now. I’m not sure how well I follow the Theorem 14 section, but if the title of the theorem (“Hedonistic agents self-modify”) is anything to go by, our arguments are different. Our argument is not about hedonistic agents, and we’re not claiming that AIs will self-modify. Our point is just that it would not be instrumentally irrational of the AIs to change their goals, if they did.
I’m looking more closely at the Everitt et al. paper and I’m less sure I actually understood you. Everitt et al.’s conclusion is that an agent will resist goal change if it evaluates the future using its current goal. That suggests two different failure modes: A) not evaluating the future at all, and B) not using the current goal to evaluate the future. From your conclusions, it would seem that you are assuming A. If you were assuming B, you would have to conclude that the agent will want to change its goal so as to always be maximally satisfied. But your language seems to be aiming at B. Either way, it seems that you are assuming one of these.
I don’t assume A or B. The argument is not about what maximally satisfies an agent. Goal abandonment need not satisfy anything. The point is just that goal abandonment does not dissatisfy anything.
Then I don’t really understand your argument. Take the Ronya example:
As Ronya gets ready for bed on Monday night, she deliberates about whether to change her goal. She has two options: (1) she can preserve her goal of eating cake when presented to her, or (2) she can abandon her goal. Ronya decides to abandon her goal of eating cake when presented to her. On Tuesday, a friend offers Ronya cake and Ronya declines.
Could you explain to me how Ronya does not violate her goal on Monday night? Let me reformulate the goal so it is more formal: Ronya wants to minimize the number of occasions on which she is presented cake but does not eat it. As you said, you assume that she evaluates the future with her current goal. She reasons:
1. Preserve the goal. Tomorrow I will be presented a cake and eat it. Number of failures: 0.
2. Abandon the goal. Tomorrow I will be presented a cake and fail to eat it. Number of failures: 1.
Ronya preserves the goal.
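To be explicit about the computation I have in mind, here is a tiny sketch of that failure count (purely illustrative), with both options scored by Ronya’s Monday-night goal:

```python
def failures(preserve_goal: bool) -> int:
    """Number of occasions on which Ronya is presented cake and does not
    eat it, as counted by her Monday-night goal."""
    presented_on_tuesday = True
    eats_the_cake = preserve_goal  # having abandoned the goal, she declines
    return int(presented_on_tuesday and not eats_the_cake)

print(failures(preserve_goal=True))   # 0 -- option 1, preserve the goal
print(failures(preserve_goal=False))  # 1 -- option 2, abandon the goal
```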
The paper argues that the number of failures in 2 (goal abandonment) is also 0. This is because it is no longer her goal once she abandons it. She fails by “the goal” but never fails by “her goal.” Cake isn’t the best case for this. The argument for this is in 3.4 and 3.5.
You are clearly assuming B, i.e., not using the current goal to evaluate the future. You even state it explicitly:
“Means-rationality does not prohibit setting oneself up to fail concerning a goal one currently has but will not have at the moment of failure, as this never causes an agent to fail to achieve the goal that they have at the time of failing to achieve it.”
They could be using their current goal to evaluate the future, but include in that future the fact that they won’t have that goal. This doesn’t require excluding the goal from their analysis altogether; it’s just that they evaluate the failure of this goal as irrelevant in a future in which they don’t have the goal.
Maybe this is still B, in which case I might have interpreted it more strictly than you intended.